BiomarkerKB Resource Integration
Latest revision as of 16:08, 20 March 2026
BiomarkerKB collects data from a wide range of resources. Not all collected data are directly integrated into the core data model; some are included as contextual annotations or cross-references to enrich existing entries.
Resources for Exploration
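- CADSR Cancer (https://cadsr.cancer.gov/onedata/Home.jsp)
- Marker Database (https://themarker.idrblab.cn/)
- biomarker.org
- ResMarkerDB
- SalivaDB
- GlycanAge Publications (https://glycanage.com/publications)
- Cancer Genome Interpreter (Biomarkers) (https://www.cancergenomeinterpreter.org/biomarkers)
- Glycan Biomarkers (https://github.com/issues/assigned?issue=clinical-biomarkers%7Cbiomarker-issue-repo%7C248; code: https://github.com/glygener/CarboCurator)
- Alliance Genome (https://www.alliancegenome.org/)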
For suggestions of additional biomarker data resources, please contact: mazumder_lab@gwu.edu
Data Sources
GWAS
Status: Direct integration into data model
- Genome-wide association studies (GWAS) provide biomarkers in the form of SNPs.
- The GWAS Catalog includes SNPs associated with a wide range of diseases.
  - Preliminary curation focused only on cancer.
  - As of 12/11/2026, biomarkers for all available conditions in the GWAS Catalog have been integrated.
- License: CC BY-NC 4.0
MetaKB
Status: Direct integration into data model
- Provides harmonized associations between cancer genomic variants, diseases, and therapeutic evidence.
- Aggregates and standardizes variant interpretation data from six major knowledgebases:
  - Clinical Interpretation of Variants in Cancer (CIViC) (integrated)
  - OncoKB (restricted from commercial use)
  - The Jackson Laboratory Clinical Knowledgebase (JAX-CKB) (restricted from commercial use and has share-alike requirements for non-commercial use)
  - MolecularMatch (restricted from commercial use)
  - Precision Medicine Knowledgebase (PMKB) (pending integration)
  - Cancer Genome Interpreter (CGI) – through its Cancer Biomarkers Database component (integrated)
- Enables mapping of:
  - Variant → Disease → Drug relationships
  - Evidence levels and citations
  - Ontology-aligned entities (genes, variants, diseases, drugs)
- Notes:
  - Requires validation of entity mappings against the BiomarkerKB schema
- Focused on somatic variant–based biomarkers; contextual attributes such as tissue type, therapy response, or evidence type can be inferred or imputed where not directly specified.
- Manual curation may be required for entries with incomplete evidence annotation or lacking standard ontology references.
- Integration approach: direct mapping of variant, condition, and evidence entities; cross-references retained to original data sources.
- License: Aggregated data are available for non-commercial, research use only, respecting constituent licenses:
  - CIViC – CC0 (Public Domain)
  - PMKB – CC-BY 4.0
  - CGI – CC0 for biomarkers database, CC-BY-NC 4.0 for tool
  - JAX-CKB – CC-BY-NC-SA 4.0
  - OncoKB – custom non-commercial license
  - MolecularMatch – restricted commercial use
  - MetaKB codebase – MIT license
- Overall usage requires adherence to non-commercial research terms; commercial use needs separate permissions from individual data providers.
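The Variant → Disease → Drug mapping with retained source cross-references can be sketched roughly as follows. This is a minimal illustration only: the field names (`variant`, `disease`, `drug`, `source`), the `build_index` helper, and all record values are placeholders invented for this example, not the MetaKB or BiomarkerKB schema.

```python
from collections import defaultdict

# Placeholder records; every value here is made up for illustration.
RECORDS = [
    {"variant": "variant_A", "disease": "disease_X", "drug": "drug_1", "source": "CIViC"},
    {"variant": "variant_A", "disease": "disease_X", "drug": "drug_1", "source": "CGI"},
    {"variant": "variant_B", "disease": "disease_Y", "drug": "drug_2", "source": "CIViC"},
]


def build_index(records):
    """Nest records as variant -> disease -> drug -> sorted list of sources."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for rec in records:
        index[rec["variant"]][rec["disease"]][rec["drug"]].add(rec["source"])
    # Convert to plain dicts and sorted lists so the result is easy to print or serialize.
    return {
        variant: {
            disease: {drug: sorted(sources) for drug, sources in drugs.items()}
            for disease, drugs in diseases.items()
        }
        for variant, diseases in index.items()
    }


index = build_index(RECORDS)
print(index["variant_A"]["disease_X"]["drug_1"])
```

Keeping the source names on every leaf means each relationship can still be traced back to the knowledgebase, and therefore the license, it came from.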
Glycan LLM Biomarkers
Status: Direct integration into data model
- LangChain LLM method used to collect biomarkers from PubMed Central abstracts
- The method identifies glycan entities, and the changes in them, that the abstracts associate with disease
Note: only biomarkers with assessed_entity_type: protein were integrated, with the goal of expanding to glycan entity types once the Glycan Structure Dictionary is finalized.
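The restriction described in the note amounts to a filter on `assessed_entity_type`. A minimal sketch, assuming each extracted biomarker is a dictionary carrying that key; the `select_for_integration` helper and the sample records are hypothetical:

```python
def select_for_integration(records, allowed_types=("protein",)):
    """Keep only records whose assessed_entity_type is in allowed_types."""
    return [r for r in records if r.get("assessed_entity_type") in allowed_types]


# Made-up sample records for illustration.
candidates = [
    {"biomarker": "example change 1", "assessed_entity_type": "protein"},
    {"biomarker": "example change 2", "assessed_entity_type": "glycan"},
]

integrated = select_for_integration(candidates)
print(len(integrated))
```

Passing `allowed_types=("protein", "glycan")` would be one way to widen the filter later without changing the function.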
Top 50 Biomarkers
Status: Direct integration into data model
- Biomarkers collected during Summer Volunteership
- Volunteers identified top 50 biomarker entities from BiomarkerKB
- These top 50 biomarker entities were then searched in PubMed
- 100 biomarkers were manually curated
EDRN
Status: Sample integration into data model
- Cancer biomarkers.
- Sample of EDRN biomarkers produced by the EDRN LLM method
- Biomarkers are extracted from the free text of EDRN's publicly available biomarker entries
LOINC
Status: Direct integration into data model
- Metabolite data only
- We are currently working with the Metabolomics Workbench group to get the complete data
OncoKB
Status: Cross-Reference
- Provides useful information on drugs and therapy options for different biomarker entities.
- Also provides information on the condition each entity is related to.
- License: A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
- Paid license is required
- Cross-reference from biomarkers in BiomarkerKB to the appropriate drug information and therapy information is the best solution.
HPO
Status: Cross-Reference
- HPO provides disease and entity associations.
- Does not provide a change within the entity, so we cannot collect biomarker data from here.
- However, we can use it as a cross-reference within our cross-referencing section.
- Provides cross-reference to OMIM, SNOMED, and MONDO.
UniProtKB
Status: Direct integration into data model
- Can provide biomarker (change in entity), entity, condition, and sampling data.
- This data is in a text file that must be reviewed fully to make sure it can be extracted automatically.
- Contextual information can be imputed if necessary.
- UniProt contains both found_in entries and entries that are actual biomarkers:
  - found_in entries will get a cross-reference;
  - actual biomarkers will be directly integrated.
- Manual curation of 56 reviewed entries with mention of "biomarker" in flat text file.
- License: Creative Commons Attribution 4.0 International (CC BY 4.0).
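Finding reviewed entries that mention "biomarker" in a flat text file could be approached along these lines. This is a sketch under assumptions, not the script actually used: it relies only on the standard UniProtKB flat-file conventions (entries terminated by `//`, an `ID` line carrying `Reviewed;` or `Unreviewed;`, an `AC` line listing accessions), and the `biomarker_candidates` helper and the sample entries are made up.

```python
# Made-up entries in UniProtKB flat-file layout; names and accessions are not real.
SAMPLE = """\
ID   TEST1_HUMAN             Reviewed;         100 AA.
AC   T00001;
CC   -!- MISCELLANEOUS: Proposed as a biomarker for an example condition.
//
ID   TEST2_HUMAN             Unreviewed;       200 AA.
AC   T00002;
CC   -!- MISCELLANEOUS: Also mentions a biomarker.
//
ID   TEST3_HUMAN             Reviewed;         300 AA.
AC   T00003; T00004;
CC   -!- FUNCTION: No relevant mention here.
//
"""


def biomarker_candidates(flat_text, keyword="biomarker"):
    """Return primary accessions of reviewed entries whose text mentions keyword."""
    hits = []
    for block in flat_text.split("\n//"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        id_line = next((l for l in lines if l.startswith("ID ")), "")
        ac_line = next((l for l in lines if l.startswith("AC ")), "")
        # "Unreviewed;" does not contain the capitalized "Reviewed;".
        if "Reviewed;" not in id_line or keyword.lower() not in block.lower():
            continue
        # The first accession on the AC line is the primary one.
        hits.append(ac_line[2:].strip().split(";")[0].strip())
    return hits


print(biomarker_candidates(SAMPLE))
```

A keyword match only produces candidates; deciding whether a hit is a found_in entry or an actual biomarker would still need the manual review described above.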
CIViC
Status: Direct integration into data model
- Clinical Interpretation of Variants in Cancer (CIViC).
- Provides cancer biomarkers in the form of DNA mutations (dbSNPs).
- The platform provides clinicians with treatment options for patients based on their unique tumor profile.
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
ClinVar
Status: Direct integration into data model
- Public archive of reports of human variations classified for diseases and drug responses.
- Provides biomarkers for all diseases, but we have only curated cancer biomarkers for now.
  - dbSNPs
  - The file is very large; the plan is to go back and use the existing script to map all of its biomarkers into the data model.
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
Note: Only biomarkers from "cancer" and "carcinoma" tags were pulled. Pending integration of biomarkers for all diseases.
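The tag-based selection in the note can be illustrated with a simple keyword filter. How the tags were actually matched is not stated here, so the case-insensitive substring match below is an assumption, and `is_cancer_condition` and the sample names are hypothetical:

```python
# Keywords taken from the note above.
KEYWORDS = ("cancer", "carcinoma")


def is_cancer_condition(name):
    """True if the condition name contains any keyword, ignoring case."""
    lowered = name.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


# Generic condition names used only to exercise the filter.
conditions = ["Breast cancer", "Hepatocellular carcinoma", "Cystic fibrosis"]
selected = [c for c in conditions if is_cancer_condition(c)]
print(selected)
```

A filter of this kind would not match cancer types whose names contain neither keyword (for example "melanoma"), which is one reason integrating all diseases avoids having to maintain a keyword list.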
MarkerDB
Status: Cross-Reference
- Provides a lot of useful biomarker data and cross-references other resources as well.
- Information includes: panel information, abnormal levels of biomarkers by disease, structural information, etc.
- Annotations that can be cross-referenced include the above.
- By cross-referencing, BiomarkerKB will allow users to find more information for specific biomarkers and move towards the goal of being a comprehensive resource for biomarkers.
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
Metabolomics Workbench
Status: Direct integration into data model
Data provided by Metabolomics Workbench
- Metabolite biomarkers utilized in the uniform newborn screening program.
- Used to detect treatable disorders that are life-threatening or cause long-term morbidity before they become symptomatic.
OncoMX
Status: Direct integration into data model
- Integrated cancer mutation and expression resource for exploring cancer biomarkers
- Manual curation effort by GWU and JPL
- Over 600 single and panel biomarkers
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
OpenTargets
Status: Direct integration into data model
- Collects potential drug targets and therapeutic targets.
- Some effort was required to find the correct biomarker data.
- 1200 biomarkers collected.
  - dbSNPs related to cancer and other diseases
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
Note: Only cancer data was integrated.
PubMed Central Biomarker Gene Set
Status: Direct integration into data model
Data provided by Avi Ma'ayan's LINCS group
- This data set was created through manual curation of biomarker gene sets on PubMed Central, using the gene sets returned by Rummagene.
- Using the search results from the Rummagene web server, we manually identified publications that associated different conditions and environmental exposures with biomarker gene sets.
- The biomarker gene sets were retrieved by validating the genes mentioned in each publication.
- The primary use case for this data is to identify biomarker panels/gene sets associated with conditions.
SenNet
Status: Direct integration into data model
- Cell senescence biomarkers from SenNet group
- Biomarker data was collected and incorporated; however, the biomarker field was incomplete, so the integrated data was given a score of -2
- The data is still valuable as contextual data and can be revisited to complete the biomarker field in the future
For information about cross-references and annotations in BiomarkerKB, please visit Xrefs and annotations