BiomarkerKB Resource Integration
Latest revision as of 02:37, 18 April 2026
BiomarkerKB collects data from a wide range of resources. Not all collected data are directly integrated into the core data model; some are included as contextual annotations or cross-references to enrich existing entries.
Resources for Exploration
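- Marker Database (https://themarker.idrblab.cn/)
- ResMarkerDB
- SalivaDB
- GlycanAge Publications (https://glycanage.com/publications)
- Cancer Genome Interpreter (Biomarkers) (https://www.cancergenomeinterpreter.org/biomarkers)
- Glycan Biomarkers (https://github.com/issues/assigned?issue=clinical-biomarkers%7Cbiomarker-issue-repo%7C248; code: https://github.com/glygener/CarboCurator)
- Alliance Genome (https://www.alliancegenome.org/)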
For suggestions of additional biomarker data resources, please contact: mazumder_lab@gwu.edu
Data Sources
GWAS
Status: Direct integration into data model
- Genome-wide association studies (GWAS) provide biomarkers in the form of SNPs.
- The GWAS Catalog includes SNPs associated with a wide range of diseases.
  - Preliminary curation has only focused on cancer.
  - As of 12/11/2026, biomarkers for all available conditions in the GWAS Catalog have been integrated.
- License: CC BY-NC 4.0
MetaKB
Status: Direct integration into data model
- Provides harmonized associations between cancer genomic variants, diseases, and therapeutic evidence.
- Aggregates and standardizes variant interpretation data from six major knowledgebases:
  - Clinical Interpretation of Variants in Cancer (CIViC) (integrated)
  - OncoKB (restricted from commercial use)
  - The Jackson Laboratory Clinical Knowledgebase (JAX-CKB) (restricted from commercial use and has share-alike requirements for non-commercial use)
  - MolecularMatch (restricted from commercial use)
  - Precision Medicine Knowledgebase (PMKB) (pending integration)
  - Cancer Genome Interpreter (CGI) – through its Cancer Biomarkers Database component (integrated)
- Enables mapping of:
  - Variant → Disease → Drug relationships
  - Evidence levels and citations
  - Ontology-aligned entities (genes, variants, diseases, drugs)
- Notes:
  - Requires validation of entity mappings against the BiomarkerKB schema
- Focused on somatic variant–based biomarkers; contextual attributes such as tissue type, therapy response, or evidence type can be inferred or imputed where not directly specified.
- Manual curation may be required for entries that have incomplete evidence annotation or lack standard ontology references.
- Integration approach: direct mapping of variant, condition, and evidence entities; cross-references retained to original data sources.
- License: Aggregated data are available for non-commercial, research use only, respecting constituent licenses:
  - CIViC – CC0 (Public Domain)
  - PMKB – CC-BY 4.0
  - CGI – CC0 for biomarkers database, CC-BY-NC 4.0 for tool
  - JAX-CKB – CC-BY-NC-SA 4.0
  - OncoKB – custom non-commercial license
  - MolecularMatch – restricted commercial use
  - MetaKB codebase – MIT license
- Overall usage requires adherence to non-commercial research terms; commercial use needs separate permissions from individual data providers.
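The constituent licenses listed above can be used to group aggregated records by whether their source carries a usage restriction. The following is a minimal sketch, not the BiomarkerKB implementation: the record layout (a dictionary with a `source` key), the function name `split_by_restriction`, and the decision to treat unlisted sources as restricted are all assumptions made for illustration.

```python
# License terms as listed in this section; everything else here is hypothetical.
SOURCE_LICENSES = {
    "CIViC": "CC0",
    "PMKB": "CC-BY 4.0",
    "CGI": "CC0",  # biomarkers database only; the CGI tool is CC-BY-NC 4.0
    "JAX-CKB": "CC-BY-NC-SA 4.0",
    "OncoKB": "custom non-commercial",
    "MolecularMatch": "restricted commercial use",
}

# Licenses from the list above that carry no non-commercial clause.
UNRESTRICTED = {"CC0", "CC-BY 4.0"}


def split_by_restriction(records):
    """Split records into (unrestricted, restricted) lists by source license.

    Sources missing from SOURCE_LICENSES are treated as restricted
    (an assumption chosen to err on the side of caution).
    """
    unrestricted, restricted = [], []
    for record in records:
        license_name = SOURCE_LICENSES.get(record["source"])
        if license_name in UNRESTRICTED:
            unrestricted.append(record)
        else:
            restricted.append(record)
    return unrestricted, restricted


# Placeholder records, not real MetaKB data.
records = [
    {"source": "CIViC", "variant": "variant-A"},
    {"source": "OncoKB", "variant": "variant-B"},
    {"source": "PMKB", "variant": "variant-C"},
]
unrestricted, restricted = split_by_restriction(records)
print([r["source"] for r in unrestricted])
print([r["source"] for r in restricted])
```

Keeping the license table in one place means a change to a constituent license only needs to be recorded once.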
Glycan LLM Biomarkers
Status: Direct integration into data model
- A LangChain LLM method was used to collect biomarkers from PubMed Central abstracts.
- The method identifies glycan entities, and the changes in those entities, that are associated with disease.
Note: only biomarkers with assessed_entity_type: protein were integrated, with the goal of expanding to glycan entity types once the Glycan Structure Dictionary is finalized.
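The restriction described in the note above amounts to a filter on the `assessed_entity_type` field. A minimal sketch follows, assuming each extracted biomarker is a dictionary; the `biomarker` field and the function name `select_protein_biomarkers` are hypothetical and used only for illustration.

```python
def select_protein_biomarkers(records):
    """Keep only records whose assessed_entity_type is 'protein'."""
    return [r for r in records if r.get("assessed_entity_type") == "protein"]


# Placeholder records, not real extraction output.
records = [
    {"biomarker": "example change 1", "assessed_entity_type": "protein"},
    {"biomarker": "example change 2", "assessed_entity_type": "glycan"},
    {"biomarker": "example change 3"},  # missing type: excluded
]
for record in select_protein_biomarkers(records):
    print(record["biomarker"])
```

Records with glycan entity types are simply set aside here, so they could be revisited once glycan entities are supported.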
Top 50 Biomarkers
Status: Direct integration into data model
- Biomarkers were collected during the Summer Volunteership.
- Volunteers identified the top 50 biomarker entities in BiomarkerKB.
- These top 50 biomarker entities were then searched in PubMed.
- 100 biomarkers were manually curated.
EDRN
Status: Sample integration into data model
- Cancer biomarkers.
- A sample of EDRN biomarkers was produced by the EDRN LLM method.
- Biomarkers are extracted from free text in EDRN's publicly available biomarker entries.
LOINC
Status: Direct integration into data model
- Metabolite data only
- We are currently working with the Metabolomics Workbench group to get the complete data
OncoKB
Status: Cross-Reference
- Provides useful information on drugs and therapy options for different biomarker entities.
- Also provides information based on what condition the entity is related to.
- License: A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
- A paid license is required.
- Cross-referencing from biomarkers in BiomarkerKB to the corresponding drug and therapy information is therefore the preferred approach.
HPO
Status: Cross-Reference
- HPO provides disease and entity associations.
- It does not describe a change within the entity, so biomarker data cannot be collected from it.
- However, it can be used as a cross-reference in the cross-referencing section.
- Provides cross-references to OMIM, SNOMED, and MONDO.
UniProtKB
Status: Direct integration into data model
- Can provide biomarker (change in entity), entity, condition, and sampling data.
- These data are in a flat text file that must be reviewed in full to confirm that they can be extracted automatically.
- Contextual information can be imputed if necessary.
- UniProt contains both found_in entries and entries that are actual biomarkers:
  - found_in entries will get a cross-reference;
  - actual biomarkers will be directly integrated.
- Manual curation of the 56 reviewed entries that mention "biomarker" in the flat text file.
- License: Creative Commons Attribution 4.0 International (CC BY 4.0).
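The routing rule above (found_in entries receive a cross-reference, actual biomarkers are integrated directly) can be sketched as a small lookup. This is an illustrative sketch only: the `kind` field, its two values, the route labels, and the placeholder accessions are assumptions, not the actual UniProt flat-file layout or BiomarkerKB code.

```python
# Hypothetical routing table; keys and labels are assumptions for illustration.
ROUTES = {
    "found_in": "cross_reference",      # found_in entries get a cross-reference
    "biomarker": "direct_integration",  # actual biomarkers are integrated directly
}


def route_uniprot_entry(entry):
    """Return the integration route for one entry, based on its kind."""
    kind = entry.get("kind")
    if kind not in ROUTES:
        raise ValueError(f"unrecognized entry kind: {kind!r}")
    return ROUTES[kind]


# Placeholder entries, not real UniProt accessions.
entries = [
    {"accession": "EXAMPLE1", "kind": "found_in"},
    {"accession": "EXAMPLE2", "kind": "biomarker"},
]
for entry in entries:
    print(entry["accession"], "->", route_uniprot_entry(entry))
```

Raising an error for unrecognized kinds keeps ambiguous entries out of both paths so they can be sent to manual curation instead.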
CIViC
Status: Direct integration into data model
- Clinical Interpretation of Variants in Cancer (CIViC).
- Provides cancer biomarkers in the form of DNA mutations (dbSNPs).
- The platform gives clinicians treatment options for patients based on their unique tumor profile.
- License: CC0 1.0 Universal (Public Domain Dedication).
ClinVar
Status: Direct integration into data model
- Public archive of reports of human variations classified for diseases and drug responses.
- Provides biomarkers for all diseases, but only cancer biomarkers have been curated so far.
  - dbSNPs
  - The source file is very large; the existing script will be reused to map the remaining biomarkers into the data model.
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
Note: Only biomarkers from "cancer" and "carcinoma" tags were pulled. Pending integration of biomarkers for all diseases.
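The tag-based selection described in the note above can be sketched as a case-insensitive keyword match. This is a minimal sketch under stated assumptions: the `condition` field name, the `rsid` field, and the placeholder values are hypothetical, and the script actually used for ClinVar may match tags differently.

```python
# Keywords taken from the note above; the record fields are hypothetical.
KEYWORDS = ("cancer", "carcinoma")


def is_cancer_record(record):
    """Return True if the record's condition text mentions any keyword."""
    text = record.get("condition", "").lower()
    return any(keyword in text for keyword in KEYWORDS)


# Placeholder records, not real ClinVar entries.
records = [
    {"rsid": "rs-example-1", "condition": "Hereditary breast Cancer"},
    {"rsid": "rs-example-2", "condition": "Hepatocellular carcinoma"},
    {"rsid": "rs-example-3", "condition": "Cystic fibrosis"},
]
selected = [r["rsid"] for r in records if is_cancer_record(r)]
print(selected)
```

Extending the integration to all diseases would then mean dropping this filter rather than changing the mapping step.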
MarkerDB
Status: Cross-Reference
- Provides extensive biomarker data and cross-references to other resources.
- Information includes panel information, abnormal biomarker levels by disease, structural information, etc.
- All of the annotations above can be cross-referenced.
- Cross-referencing lets BiomarkerKB users find more information on specific biomarkers and moves BiomarkerKB towards its goal of being a comprehensive biomarker resource.
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
Metabolomics Workbench
Status: Direct integration into data model
Data provided by Metabolomics Workbench
- Metabolite biomarkers used in the uniform newborn screening program.
- These biomarkers detect treatable disorders that are life-threatening or cause long-term morbidity before the disorders become symptomatic.
OncoMX
Status: Direct integration into data model
- Integrated cancer mutation and expression resource for exploring cancer biomarkers
- Manual curation effort by GWU and JPL
- Over 600 single and panel biomarkers
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
OpenTargets
Status: Direct integration into data model
- Collects potential drug targets and therapeutic targets.
- Some effort was required to find the correct biomarker data.
- 1,200 biomarkers collected.
  - dbSNPs related to cancer and other diseases
- License: Creative Commons Attribution-NonCommercial 4.0 International License.
Note: Only cancer data was integrated.
PubMed Central Biomarker Gene Set
Status: Direct integration into data model
Data provided by Avi Ma'ayan's LINCS group
- This data set was created through manual curation of biomarker gene sets in PubMed Central, using the gene sets returned by Rummagene.
- Using the search results from the Rummagene web server, we manually identified publications that associate different conditions and environmental exposures with biomarker gene sets.
- The biomarker gene sets were retrieved by validating the genes mentioned in each publication.
- The primary use case for this data is to identify biomarker panels or gene sets associated with conditions.
SenNet
Status: Direct integration into data model
- Cell senescence biomarkers from the SenNet group.
- Biomarker data were collected and incorporated; however, the biomarker field was incomplete, so the integrated data were given a score of -2.
- The data are still valuable as contextual data and can be revisited to complete the biomarker field in the future.
For information about cross-references and annotations in BiomarkerKB, please visit the Xrefs and annotations page.
Pending Resources
biomarker.org
Reached out on March 17th, 2026 regarding data access and sent follow-up communications; however, no response was received.
caDSR
The Cancer Data Standards Registry and Repository (caDSR) is a metadata registry, not a biomarker knowledge source. It defines Common Data Elements (CDEs), including field names, definitions, and controlled value sets, but does not contain biomarker-condition relationships or evidence.
For BiomarkerKB, it could be valuable for schema standardization rather than data ingestion. For example, fields such as condition, specimen, and entity type could be aligned to controlled vocabularies (via NCI Thesaurus).
Recommendation: not ingestible as a data source.