BiomarkerKB Resource Integration: Difference between revisions

From BiomarkerKB Wiki
MariaKim (talk | contribs)
MariaKim (talk | contribs)
No edit summary
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
BiomarkerKB collects data from many different resources. The data that is collected is not always directly integrated into the data model and data from a resource is sometimes just added as valuable contextual annotations or cross references.
BiomarkerKB collects data from a wide range of resources. Not all collected data are directly integrated into the core data model; some are included as contextual annotations or cross-references to enrich existing entries.


Other resources to be explored: [https://cadsr.cancer.gov/onedata/Home.jsp CADSR Cancer], https://themarker.idrblab.cn/, biomarker.org, ResMarkerDB, SalivaDB, https://glycanage.com/publications, [https://www.cancergenomeinterpreter.org/biomarkers https://www.c], [https://github.com/issues/assigned?issue=clinical-biomarkers%7Cbiomarker-issue-repo%7C248 Glycan Biomarkers] ([https://github.com/glygener/CarboCurator code]), [https://www.alliancegenome.org/ Alliance Genome]
= Resources for Exploration =
*[https://cadsr.cancer.gov/onedata/Home.jsp CADSR Cancer]
*[https://themarker.idrblab.cn/ Marker Database]
*biomarker.org
*ResMarkerDB
*SalivaDB
*[https://glycanage.com/publications GlycanAge Publications]
*[https://www.cancergenomeinterpreter.org/biomarkers Cancer Genome Interpreter (Biomarkers)]
*[https://github.com/issues/assigned?issue=clinical-biomarkers%7Cbiomarker-issue-repo%7C248 Glycan Biomarkers] ([https://github.com/glygener/CarboCurator code])
*[https://www.alliancegenome.org/ Alliance Genome]


For suggestions of additional biomarker data resources, please contact: mazumder_lab@gwu.edu


= Data Sources =
== GWAS ==
'''Status''': Direct integration into data model
* Genome-wide association studies (GWAS) provide biomarkers in the form of SNPs.
* The GWAS Catalog includes SNPs associated with a wide range of diseases.
** Preliminary curation has only focused on cancer.
** As of 12/11/2026, biomarkers for all available conditions in the GWAS Catalog have been integrated.
* '''License''': CC BY-NC 4.0


Please contact us at mazumder_lab@gwu.edu and daniallmasood@gwu.edu if you have any other resources that may contain biomarker data
== MetaKB ==
 
'''Status''': Direct integration into data model
= GWAS =
Status: Direct Integration into Data Model
 
* Published genome-wide association studies (GWAS).
* Provides biomarkers in form of SNPs.
* GWAS Catalog contains SNPs for a vast amount of diseases.
** Preliminary curation only focused on cancer.
** All available biomarkers for conditions in GWAS Catalog are integrated 12/11/2026.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.
 
= MetaKB =
Status: Direct Integration into Data Model
 
* Provides harmonized associations between cancer genomic variants, diseases, and therapeutic evidence.
* Provides harmonized associations between cancer genomic variants, diseases, and therapeutic evidence.
* Aggregates and standardizes variant interpretation data from six major knowledgebases:
* Aggregates and standardizes variant interpretation data from six major knowledgebases:
** CIViC (Clinical Interpretation of Variants in Cancer) [Already Integrated Directly]
** Clinical Interpretation of Variants in Cancer (CIViC) ''(integrated)''
** OncoKB [Yet to be integrated]
** OncoKB ''(restricted from commercial use)''
** JAX-CKB (The Jackson Laboratory Clinical Knowledgebase) [Yet to be integrated]
** The Jackson Laboratory Clinical Knowledgebase (JAX-CKB) ''(restricted from commercial use and has share-alike requirements for non-commercial use)''
** MolecularMatch [Yet to be integrated]
** MolecularMatch ''(restricted from commercial use)''
** PMKB (Precision Medicine Knowledgebase) [Yet to be integrated]
** Precision Medicine Knowledgebase (PMKB) ''(pending integration)''
** Cancer Genome Interpreter (CGI) – through its ''Cancer Biomarkers Database'' component .[Integrated]
** Cancer Genome Interpreter (CGI) – through its ''Cancer Biomarkers Database'' component ''(integrated)''
* Enables mapping of variant–disease–drug relationships with supporting evidence levels, citations, and ontology alignment (e.g., genes, variants, diseases, and drugs).
* Enables mapping of:
* Data integration requires review to ensure harmonized entity mappings consistent with the BiomarkerKB data model.
** Variant → Disease → Drug relationships
** Evidence levels and citations
** Ontology-aligned entities (genes, variants, diseases, drugs)
* Notes:
** Requires validation of entity mappings against BiomarkerKB schema
* Focused on somatic variant–based biomarkers; contextual attributes such as tissue type, therapy response, or evidence type can be inferred or imputed where not directly specified.
* Focused on somatic variant–based biomarkers; contextual attributes such as tissue type, therapy response, or evidence type can be inferred or imputed where not directly specified.
* Manual curation may be required for entries with incomplete evidence annotation or lacking standard ontology references.
* Manual curation may be required for entries with incomplete evidence annotation or lacking standard ontology references.
Line 43: Line 52:
* Overall usage requires adherence to non-commercial research terms; commercial use needs separate permissions from individual data providers.
* Overall usage requires adherence to non-commercial research terms; commercial use needs separate permissions from individual data providers.


= Glycan LLM Biomarkers =
== Glycan LLM Biomarkers ==
'''Status''': Direct integration into data model
* LangChain LLM method used to collect biomarkers from PubMed Central abstracts
* LangChain LLM method used to collect biomarkers from PubMed Central abstracts
* Method identifies glycan entities and changes mentioned in them associated to disease
* Method identifies glycan entities and changes mentioned in them associated to disease
'''Note''': only biomarkers with <code>assessed_entity_type: protein</code> were integrated, with the goal of expanding to glycan entity types once the Glycan Structure Dictionary is finalized.


= Top 50 Biomarkers =
== Top 50 Biomarkers ==
Status: Direct Integration into Data Model
'''Status''': Direct integration into data model
* Biomarkers collected during Summer Volunteership
* Biomarkers collected during Summer Volunteership
* Volunteers identified top 50 biomarker entities from BiomarkerKB
* Volunteers identified top 50 biomarker entities from BiomarkerKB
Line 54: Line 65:
* 100 biomarkers were manually curated
* 100 biomarkers were manually curated


*
== EDRN ==
 
'''Status''': Sample integration into data model
= EDRN =
Status: Sample Integration into Data Model
 
* Cancer biomarkers.
* Cancer biomarkers.
* Sample of EDRN Biomarkers provided from EDRN LLM method
* Sample of EDRN Biomarkers provided from EDRN LLM method
* Biomarkers are extracted from free text in EDRN publicly available biomarkers
* Biomarkers are extracted from free text in EDRN publicly available biomarkers


= LOINC =
== LOINC ==
Status: Cross-Reference
'''Status''': Direct integration into data model
 
* Metabolite data only
''Data provided by Metabolomics Workbench''
* We are currently working with the Metabolomics Workbench group to get the complete data
 
= OncoKB =
Status: Cross-Reference


== OncoKB ==
'''Status''': Cross-Reference
* Provides useful information on drugs and therapy options for different biomarker entities.
* Provides useful information on drugs and therapy options for different biomarker entities.
* Also provides information based on what condition the entity is related to.
* Also provides information based on what condition the entity is related to.
* License: A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
* '''License''': A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
* Paid license is required
* Paid license is required
* Cross-reference from biomarkers in BiomarkerKB to the appropriate drug information and therapy information is the best solution.
* Cross-reference from biomarkers in BiomarkerKB to the appropriate drug information and therapy information is the best solution.


=HPO=
== HPO ==
 
'''Status''': Cross-Reference
Status: Cross-Reference
 
* HPO provides disease and entity associations.
* HPO provides disease and entity associations.
* Does not provide a change within the entity so we cannot collect biomarker data from here.
* Does not provide a change within the entity so we cannot collect biomarker data from here.
Line 86: Line 91:
* Provides cross-reference to OMIM, SNOMED, and MONDO.
* Provides cross-reference to OMIM, SNOMED, and MONDO.


= UniProtKB =
== UniProtKB ==
Status: Direct Integration into Data Model
'''Status''': Direct integration into data model
 
* Can provide biomarker (change in entity), entity, condition, and sampling data.
* Can provide biomarker (change in entity), entity, condition, and sampling data.
* This data is in a text file that has to be reviewed fully and to make sure it will be able to be automatically extracted.
* This data is in a text file that has to be reviewed fully and to make sure it will be able to be automatically extracted.
Line 96: Line 100:
** actual biomarkers will be directly integrated.
** actual biomarkers will be directly integrated.
* Manual curation of 56 reviewed entries with mention of "biomarker" in flat text file.
* Manual curation of 56 reviewed entries with mention of "biomarker" in flat text file.
* License is Creative Commons Attribution 4.0 International (CC BY 4.0).
* '''License''': Creative Commons Attribution 4.0 International (CC BY 4.0).
 
= CIViC =
Status: Direct Integration into Data Model


== CIViC ==
'''Status''': Direct integration into data model
* Clinical Interpretation of Variants in Cancer (CIViC).
* Clinical Interpretation of Variants in Cancer (CIViC).
* Provides cancer biomarkers in form of DNA mutations (dbSNPs).
* Provides cancer biomarkers in form of DNA mutations (dbSNPs).
* Platform provides clinicians treatment options for patients based on unique tumor profile.
* Platform provides clinicians treatment options for patients based on unique tumor profile.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.
 
=ClinVar=
Status: Direct Integration into Data Model


== ClinVar ==
'''Status''': Direct integration into data model
* Public archive of reports of human variations classified for diseases and drug responses.
* Public archive of reports of human variations classified for diseases and drug responses.
* Provides biomarkers for all disease, but we have only curated cancer biomarkers for now.
* Provides biomarkers for all disease, but we have only curated cancer biomarkers for now.
** dbSNPs
** dbSNPs
** File is really big but will go back and use existing script to map all biomarkers from here into the data model.
** File is really big but will go back and use existing script to map all biomarkers from here into the data model.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.
 
'''Note''': Only biomarkers from "cancer" and "carcinoma" tags were pulled. Pending integration of biomarkers for all diseases.
= MarkerDB =
Status: Direct Integration into Data Model


== MarkerDB ==
'''Status''': Cross-Reference
* Provides a lot of useful biomarker data and cross-references other resources as well.
* Provides a lot of useful biomarker data and cross-references other resources as well.
* Information includes: panel information, abnormal levels of biomarkers by disease, structural information, etc.
* Information includes: panel information, abnormal levels of biomarkers by disease, structural information, etc.
* Annotations that can be cross-referenced include the above.
* Annotations that can be cross-referenced include the above.
* By cross-referencing, BiomarkerKB will allow users to find more information for specific biomarkers and move towards the goal of being a comprehensive resource for biomarkers.
* By cross-referencing, BiomarkerKB will allow users to find more information for specific biomarkers and move towards the goal of being a comprehensive resource for biomarkers.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.


=Metabolomics Workbench=
== Metabolomics Workbench ==
Status: Direct Integration into Data Model
'''Status''': Direct integration into data model


''Data provided by Metabolomics Workbench''
''Data provided by Metabolomics Workbench''
* Metabolite biomarkers utilized in the uniform newborn screening program.
* Metabolite biomarkers utilized in the uniform newborn screening program.
* Detect treatable disorders that are life threatening or having long-term morbidity, before they become symptomatic.
* Detect treatable disorders that are life threatening or having long-term morbidity, before they become symptomatic.


=OncoMX=
== OncoMX ==
Status: Direct Integration into Data Model
'''Status''': Direct integration into data model
 
* Integrated cancer mutation and expression resource for exploring cancer biomarkers
* integrated cancer mutation and expression resource for exploring cancer biomarkers
* Manual curation effort by GWU and JPL
* Manual curation effort by GWU and JPL
* Over 600 single and panel biomarkers
* Over 600 single and panel biomarkers
* License: Creative Commons Attribution-NonCommercial 4.0 International License.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.
 
=OpenTargets=
Status: Direct Integration into Data Model


== OpenTargets ==
'''Status''': Direct integration into data model
* Collects potential drug targets and therapeutic targets.
* Collects potential drug targets and therapeutic targets.
* Some effort was required to find the correct biomarker data.
* Some effort was required to find the correct biomarker data.
* 1200 biomarkers collected.
* 1200 biomarkers collected.
** dbSNPs related to cancer and other disease
** dbSNPs related to cancer and other disease
* License: Creative Commons Attribution-NonCommercial 4.0 International License.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.
'''Note''': Only cancer data was integrated.


=PubMed Central Biomarker Gene Set Curation=
== PubMed Central Biomarker Gene Set ==
Status: Direct Integration into Data Model
'''Status''': Direct integration into data model


''Data provided by Avi Ma'ayan's LINCS group''
''Data provided by Avi Ma'ayan's LINCS group''
* This data set was created through manual curation of biomarker gene sets on Pubmed Central using the results of gene sets returned from Rummagene.  
* This data set was created through manual curation of biomarker gene sets on Pubmed Central using the results of gene sets returned from Rummagene.  
* Using the outputted search results within the Rummagene web server, we manually identified publications that associated different conditions and environmental exposures to biomarker gene sets.  
* Using the outputted search results within the Rummagene web server, we manually identified publications that associated different conditions and environmental exposures to biomarker gene sets.  
Line 159: Line 158:
* The primary use case for this data is to identify biomarker panels/ gene sets associated with conditions.
* The primary use case for this data is to identify biomarker panels/ gene sets associated with conditions.


= SenNet Biomarker Data =
== SenNet ==
Status: Direction Integration Into Data Model
'''Status''': Direct integration into data model
 
* Cell senescence biomarkers from SenNet group
* Cell senescence biomarkers from SenNet group
* Biomarker data was collected and incorporated however biomarker field was incomplete and data integrated was given a score of -2
* Biomarker data was collected and incorporated however biomarker field was incomplete and data integrated was given a score of -2
* Data is still valuable as contextual data and can be revisited to complete biomarker field in future
* Data is still valuable as contextual data and can be revisited to complete biomarker field in future
For information about cross-references and annotations in BiomarkerKB, please visit [[Xrefs and annotations]]
For information about cross-references and annotations in BiomarkerKB, please visit [[Xrefs and annotations]]

Latest revision as of 16:08, 20 March 2026

BiomarkerKB collects data from a wide range of resources. Not all collected data are directly integrated into the core data model; some are included as contextual annotations or cross-references to enrich existing entries.

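Resources for Exploration

  • CADSR Cancer
  • Marker Database
  • biomarker.org
  • ResMarkerDB
  • SalivaDB
  • GlycanAge Publications
  • Cancer Genome Interpreter (Biomarkers)
  • Glycan Biomarkers (code)
  • Alliance Genome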

For suggestions of additional biomarker data resources, please contact: mazumder_lab@gwu.edu

Data Sources

GWAS

Status: Direct integration into data model

  • Genome-wide association studies (GWAS) provide biomarkers in the form of SNPs.
  • The GWAS Catalog includes SNPs associated with a wide range of diseases.
    • Preliminary curation has only focused on cancer.
    • As of 12/11/2026, biomarkers for all available conditions in the GWAS Catalog have been integrated.
  • License: CC BY-NC 4.0

MetaKB

Status: Direct integration into data model

  • Provides harmonized associations between cancer genomic variants, diseases, and therapeutic evidence.
  • Aggregates and standardizes variant interpretation data from six major knowledgebases:
    • Clinical Interpretation of Variants in Cancer (CIViC) (integrated)
    • OncoKB (restricted from commercial use)
    • The Jackson Laboratory Clinical Knowledgebase (JAX-CKB) (restricted from commercial use and has share-alike requirements for non-commercial use)
    • MolecularMatch (restricted from commercial use)
    • Precision Medicine Knowledgebase (PMKB) (pending integration)
    • Cancer Genome Interpreter (CGI) – through its Cancer Biomarkers Database component (integrated)
  • Enables mapping of:
    • Variant → Disease → Drug relationships
    • Evidence levels and citations
    • Ontology-aligned entities (genes, variants, diseases, drugs)
  • Notes:
    • Requires validation of entity mappings against BiomarkerKB schema
  • Focused on somatic variant–based biomarkers; contextual attributes such as tissue type, therapy response, or evidence type can be inferred or imputed where not directly specified.
  • Manual curation may be required for entries with incomplete evidence annotation or lacking standard ontology references.
  • Integration approach: direct mapping of variant, condition, and evidence entities; cross-references retained to original data sources.
  • License: Aggregated data are available for non-commercial, research use only, respecting constituent licenses:
    • CIViC – CC0 (Public Domain)
    • PMKB – CC-BY 4.0
    • CGI – CC0 for biomarkers database, CC-BY-NC 4.0 for tool
    • JAX-CKB – CC-BY-NC-SA 4.0
    • OncoKB – custom non-commercial license
    • MolecularMatch – restricted commercial use
    • MetaKB codebase – MIT license
  • Overall usage requires adherence to non-commercial research terms; commercial use needs separate permissions from individual data providers.
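
The constituent licenses listed above can be captured in a small lookup to flag which sources would need separate permission for commercial use. This is a minimal sketch, not part of the BiomarkerKB codebase: the dictionary keys, the `PERMISSIVE` set, and the function name are assumptions made for illustration, and the license strings are copied from the list above.

```python
# License strings are taken from the MetaKB list above; names are illustrative only.
CONSTITUENT_LICENSES = {
    "CIViC": "CC0",
    "PMKB": "CC-BY 4.0",
    "CGI (biomarkers database)": "CC0",
    "JAX-CKB": "CC-BY-NC-SA 4.0",
    "OncoKB": "custom non-commercial",
    "MolecularMatch": "restricted commercial use",
}

# Assumption: only CC0 and CC-BY are treated as allowing commercial reuse.
PERMISSIVE = {"CC0", "CC-BY 4.0"}


def needs_commercial_permission(licenses=CONSTITUENT_LICENSES):
    """Return the sources whose listed license is not in PERMISSIVE."""
    return sorted(src for src, lic in licenses.items() if lic not in PERMISSIVE)


restricted_sources = needs_commercial_permission()
```

Keeping the licenses in one table makes it easy to re-check the aggregate terms when a provider changes its license.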

Glycan LLM Biomarkers

Status: Direct integration into data model

  • LangChain LLM method used to collect biomarkers from PubMed Central abstracts
  • The method identifies glycan entities and the disease-associated changes reported for them

Note: only biomarkers with assessed_entity_type: protein were integrated, with the goal of expanding to glycan entity types once the Glycan Structure Dictionary is finalized.
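
The filter described in the note can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual pipeline: only the `assessed_entity_type` field name comes from the note, while the record layout, the placeholder values, and the function name are hypothetical.

```python
# Hypothetical records; placeholder values stand in for real curated data.
records = [
    {"biomarker": "change A", "assessed_entity": "entity A", "assessed_entity_type": "protein"},
    {"biomarker": "change B", "assessed_entity": "entity B", "assessed_entity_type": "glycan"},
    {"biomarker": "change C", "assessed_entity": "entity C", "assessed_entity_type": "protein"},
]


def select_integrable(records, allowed_types=("protein",)):
    """Keep only records whose assessed_entity_type is in allowed_types."""
    return [r for r in records if r.get("assessed_entity_type") in allowed_types]


integrated = select_integrable(records)
```

Passing `allowed_types=("protein", "glycan")` would admit glycan entities once they are ready to be integrated.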

Top 50 Biomarkers

Status: Direct integration into data model

  • Biomarkers collected during Summer Volunteership
  • Volunteers identified top 50 biomarker entities from BiomarkerKB
  • Using this information, PubMed was searched for the top 50 biomarker entities
  • 100 biomarkers were manually curated

EDRN

Status: Sample integration into data model

  • Cancer biomarkers.
  • Sample of EDRN biomarkers produced by the EDRN LLM method
  • Biomarkers are extracted from the free text of publicly available EDRN biomarker entries

LOINC

Status: Direct integration into data model

  • Metabolite data only
  • We are currently working with the Metabolomics Workbench group to obtain the complete data set

OncoKB

Status: Cross-Reference

  • Provides useful information on drugs and therapy options for different biomarker entities.
  • Also provides information specific to the condition with which the entity is associated.
  • License: A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
  • Paid license is required
  • Cross-reference from biomarkers in BiomarkerKB to the appropriate drug information and therapy information is the best solution.

HPO

Status: Cross-Reference

  • HPO provides disease and entity associations.
  • Does not describe a change in the entity, so biomarker data cannot be collected from it.
  • However, it can be used as a cross-reference in the cross-referencing section.
  • Provides cross-reference to OMIM, SNOMED, and MONDO.

UniProtKB

Status: Direct integration into data model

  • Can provide biomarker (change in entity), entity, condition, and sampling data.
  • The data are in a text file that must be fully reviewed to confirm that they can be extracted automatically.
  • Contextual information can be imputed if necessary.
  • UniProt contains found_in entries and entries that are actual biomarkers:
    • found_in will get a cross-reference;
    • actual biomarkers will be directly integrated.
  • Manual curation of 56 reviewed entries with mention of "biomarker" in flat text file.
  • License: Creative Commons Attribution 4.0 International (CC BY 4.0).
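
The routing rule above (found_in entries become cross-references, actual biomarkers are integrated directly) could be expressed as a simple lookup. This is a hedged sketch: the label strings and the function name are hypothetical, and the wiki does not describe how entries are classified in practice.

```python
# Hypothetical labels; only the routing rule itself comes from the list above.
ROUTING = {
    "found_in": "cross_reference",
    "biomarker": "direct_integration",
}


def route_entry(kind: str) -> str:
    """Return how a UniProtKB entry of the given kind is handled."""
    try:
        return ROUTING[kind]
    except KeyError:
        # Unknown kinds are rejected so they get reviewed instead of silently dropped.
        raise ValueError(f"unrecognized entry kind: {kind!r}") from None
```

Raising on unknown kinds is a design choice for the sketch: it forces a manual decision for anything that falls outside the two documented cases.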

CIViC

Status: Direct integration into data model

  • Clinical Interpretation of Variants in Cancer (CIViC).
  • Provides cancer biomarkers in the form of DNA mutations (dbSNPs).
  • The platform provides clinicians with treatment options based on each patient's unique tumor profile.
  • License: Creative Commons Attribution-NonCommercial 4.0 International License.

ClinVar

Status: Direct integration into data model

  • Public archive of reports of human variations classified for diseases and drug responses.
  • Provides biomarkers for all diseases, but only cancer biomarkers have been curated so far.
    • dbSNPs
    • The source file is very large; the existing script will be reused to map all remaining biomarkers into the data model.
  • License: Creative Commons Attribution-NonCommercial 4.0 International License.

Note: Only biomarkers from "cancer" and "carcinoma" tags were pulled. Pending integration of biomarkers for all diseases.
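
A keyword filter of the kind described in the note might look like the sketch below. The two keywords come from the note; matching them as case-insensitive substrings of a condition name is an assumption, as are the function and constant names.

```python
# Keywords are taken from the note above; the matching strategy is an assumption.
CANCER_KEYWORDS = ("cancer", "carcinoma")


def is_cancer_condition(condition_name: str) -> bool:
    """Return True if the condition name contains one of the cancer keywords."""
    name = condition_name.lower()
    return any(keyword in name for keyword in CANCER_KEYWORDS)


def filter_cancer(conditions):
    """Keep only the condition names that match a cancer keyword."""
    return [c for c in conditions if is_cancer_condition(c)]
```

A filter this narrow misses cancer terms that contain neither keyword, such as "melanoma", so extending to all diseases would mean dropping it rather than growing the keyword list.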

MarkerDB

Status: Cross-Reference

  • Provides extensive biomarker data and cross-references to other resources.
  • Information includes: panel information, abnormal levels of biomarkers by disease, structural information, etc.
  • Annotations that can be cross-referenced include the above.
  • By cross-referencing, BiomarkerKB will allow users to find more information for specific biomarkers and move towards the goal of being a comprehensive resource for biomarkers.
  • License: Creative Commons Attribution-NonCommercial 4.0 International License.

Metabolomics Workbench

Status: Direct integration into data model

Data provided by Metabolomics Workbench

  • Metabolite biomarkers utilized in the uniform newborn screening program.
  • Used to detect treatable disorders that are life-threatening or cause long-term morbidity before they become symptomatic.

OncoMX

Status: Direct integration into data model

  • Integrated cancer mutation and expression resource for exploring cancer biomarkers
  • Manual curation effort by GWU and JPL
  • Over 600 single and panel biomarkers
  • License: Creative Commons Attribution-NonCommercial 4.0 International License.

OpenTargets

Status: Direct integration into data model

  • Collects potential drug targets and therapeutic targets.
  • Some effort was required to find the correct biomarker data.
  • 1200 biomarkers collected.
    • dbSNPs related to cancer and other diseases
  • License: Creative Commons Attribution-NonCommercial 4.0 International License.

Note: Only cancer data was integrated.

PubMed Central Biomarker Gene Set

Status: Direct integration into data model

Data provided by Avi Ma'ayan's LINCS group

  • This data set was created through manual curation of biomarker gene sets in PubMed Central, using the gene sets returned by Rummagene.
  • Using the search results returned by the Rummagene web server, we manually identified publications that associated different conditions and environmental exposures with biomarker gene sets.
  • The biomarker gene sets were retrieved by validating the genes mentioned in each publication.
  • The primary use case for this data is to identify biomarker panels/gene sets associated with conditions.

SenNet

Status: Direct integration into data model

  • Cell senescence biomarkers from SenNet group
  • Biomarker data were collected and incorporated; however, the biomarker field was incomplete, so the integrated data were given a score of -2
  • The data are still valuable as contextual data and can be revisited to complete the biomarker field in the future

For information about cross-references and annotations in BiomarkerKB, please visit Xrefs and annotations