BiomarkerKB Wiki - User contributions [en]

Data Release Notes

2025-12-15T17:41:14Z

RajaMazumder: /* Versioning Format */

== Versioning Format ==
The version number follows a three-part structure: X.Y.Z.
* The first number (X) changes when a major update is introduced, such as a change to the data model.
* The second number (Y) increments when new data is added.
* The third number (Z) increments for bug fixes or minor changes.
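The rules above can be sketched in a few lines of Python. This is only an illustration, not the project's release tooling: the names <code>Change</code> and <code>bump_version</code> are hypothetical, and resetting the lower parts to zero is an assumption based on the release history below (for example, 2.0.2 is followed by 2.1.0).

```python
from enum import Enum


class Change(Enum):
    """Kinds of change described in the versioning rules (hypothetical names)."""
    MAJOR = "major"  # e.g. a change to the data model
    DATA = "data"    # new data added
    FIX = "fix"      # bug fix or minor change


def bump_version(version: str, change: Change) -> str:
    """Return the next X.Y.Z version string for the given kind of change."""
    x, y, z = (int(part) for part in version.split("."))
    if change is Change.MAJOR:
        # Assumption: lower parts reset to zero on a major update.
        return f"{x + 1}.0.0"
    if change is Change.DATA:
        # Assumption: the third part resets when new data is added.
        return f"{x}.{y + 1}.0"
    return f"{x}.{y}.{z + 1}"
```

For example, <code>bump_version("2.0.2", Change.DATA)</code> gives <code>"2.1.0"</code>.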

== Version 2.1.0 ==
Date: December 11, 2025
=== Data Updates ===
* Added the LLM-extracted glycan biomarker dataset provided by Cyrus Chun Hong Au Yeung.
=== Backend and Infrastructure Updates ===
* The incorrect download links on the [https://data.biomarkerkb.org Data Portal] have been fixed.
* LOINC codes are no longer tied to specimen IDs.

== Version 2.0.2 ==
Date: December 4, 2025
=== Bug Fixes ===
* LOINC codes are no longer tied to specimen (UBERON) IDs.
* For biomarkers that could not be mapped to the [[Controlled Vocabulary and Keywords|Controlled Vocabulary]], the original biomarker name is displayed, followed by "in review".

== Version 2.0.1 ==
=== Data Updates ===
* Added cross-references to the Common Fund Data Ecosystem ([https://commonfund.nih.gov/dataecosystem CFDE]) Data Coordinating Centers and other resources:
** [https://www.gtexportal.org/home/ GTEx]
** [https://pharos.nih.gov/ Pharos]
** [https://reactome.org/ Reactome]
** [https://undiagnosed.hms.harvard.edu/ Undiagnosed Diseases Network]
** [https://idg.reactome.org/ Illuminating the Druggable Genome (IDG) Reactome Portal]
** [https://www.metabolomicsworkbench.org/ Metabolomics Workbench]
** [https://maayanlab.cloud/sigcom-lincs SigCom LINCS]

== Version 2.0.0 ==
=== Data Updates ===
* The biomarker field is now standardized using controlled vocabulary terms.
* Added metabolite as an <code>assessed_entity_type</code> to <code>mw_loinc_biomarkers.tsv</code>.
* Added [https://rnacentral.org/ RNAcentral] cross-reference support.
* Added Electronic Health Records normal range data from Oracle Health, using Troponin I as an example.

== Version 1.0.6 ==
=== Data Updates ===
* Added a new dataset: MW LOINC biomarkers (<code>mw_loinc_biomarkers.tsv</code>).
* Added [https://ncithesaurus.nci.nih.gov/ National Cancer Institute Thesaurus] and [https://www.rcsb.org/ Protein Data Bank] cross-references.
=== Backend and Infrastructure Updates ===
* Added the <code>display_name</code> field to the <code>format-converter</code> so data source names appear with correct casing.

== Version 1.0.5 ==
=== Data Updates ===
* Updated the <code>assessed_biomarker_entity</code> value of the Troponin biomarkers for consistency.
* Added normal ranges from Electronic Health Records provided by the University of New Mexico for Troponin biomarkers.
* Added Cell Ontology and Protein Ontology cross-references.
=== Backend and Infrastructure Updates ===
* Updated all script paths to use <code>data_source.conf</code> and validated data source names.

== Version 1.0.4 ==
This release introduces new datasets, cross-references, and bug fixes.
=== Data Updates ===
* Added Cancer Genome Interpreter data on cancer biomarkers from MetaKB.
* Added Metabolomics Workbench LOINC data on metabolite biomarkers.
* Added Cell Ontology and Protein Ontology cross-references.
=== Bug Fixes ===
* Fixed an issue where cookie preferences were not saved when selecting "Allow".

== Version 1.0.3 ==
This release introduces new cross-references and updates to ensure compatibility with external resources.
=== Data Updates ===
* Added NCBI cross-references across gene biomarker entries.
* Integrated ChEBI cross-references for small molecules and metabolites.
=== Backend and Infrastructure Updates ===
* ChEBI API migration: Updated all programmatic links from the legacy SOAP services to the new REST API endpoints, following ChEBI’s platform migration.
** Old services retired 1 September 2025.
** New stable API: [https://www.ebi.ac.uk/chebi/backend/api/docs ChEBI REST API docs]
** New data products and beta interface available at [https://www.ebi.ac.uk/chebi/beta/ ChEBI 2.0].

== Version 1.0.2 ==
=== Data Updates ===
* Published updated [https://www.metabolomicsworkbench.org/ Metabolomics Workbench] data.
* Published sample data from the [https://edrn.nci.nih.gov/ Early Detection Research Network].
=== Backend and Infrastructure Updates ===
* <code>evidence_source</code> database names now retain their original casing for accuracy and consistency.
* EDRN identifiers were added to the [https://github.com/clinical-biomarkers/format-converter/blob/main/mapping_data/namespace_map.json namespace map].
* [https://www.genenames.org/ HUGO Gene Nomenclature Committee] (HGNC) was added to the cross-reference JSON file.
* Fixed an issue where <code>evidence_source</code> values without tags were previously dropped; these are now preserved.
* Added a user-guided spelling correction function to improve data entry quality.
* The TSV-to-JSON converter now automatically checks for header spelling errors.
* Introduced <code>_suggest_header_corrections</code> to flag and propose fixes for misspelled headers.
* Enhanced <code>_stream_tsv</code> with a call to <code>_check_header_spelling</code> to prevent invalid headers from being processed.
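As a rough illustration of the header spelling check described above, the sketch below uses Python's standard <code>difflib</code> module to propose the closest known header for each unrecognized one. It is not the converter's actual code: the function name <code>suggest_header_corrections</code>, the <code>EXPECTED_HEADERS</code> list, and the similarity cutoff are all assumptions made for this example.

```python
import difflib

# Assumed header set for illustration only; the real converter defines its own.
EXPECTED_HEADERS = [
    "biomarker",
    "assessed_biomarker_entity",
    "assessed_entity_type",
    "evidence_source",
]


def suggest_header_corrections(headers, expected=EXPECTED_HEADERS, cutoff=0.8):
    """Map each unrecognized header to its closest expected header, or None."""
    suggestions = {}
    for header in headers:
        if header in expected:
            continue  # header is already valid
        matches = difflib.get_close_matches(header, expected, n=1, cutoff=cutoff)
        suggestions[header] = matches[0] if matches else None
    return suggestions
```

A caller could stop processing when the returned dictionary is non-empty and show the suggestions to the user, which is one way to keep invalid headers from being processed.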

== Version 1.0.1 ==
=== Data Updates ===
* Added <code>xrefs.tsv</code> to the list of datasets.
=== Backend and Infrastructure Updates ===
* Fixed ID formatting issues in NCBI and UniProt references within <code>oncomx.tsv</code>, removing erroneous spaces (e.g., <code>NCBI: 3288</code> → <code>NCBI:3288</code>) and extraneous text (e.g., <code>"(composition)"</code>). Affected biomarkers included AN6295-1, AN6756-1, AN6728-1, and others.
* Merged assessed entity type synonyms.
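The kind of ID cleanup described above can be sketched with two regular expressions. This is a minimal example under stated assumptions, not the script that was actually used: the function name <code>normalize_xref_id</code> is hypothetical, and treating every parenthesized fragment as extraneous text is an assumption.

```python
import re


def normalize_xref_id(raw: str) -> str:
    """Clean a namespace-prefixed ID such as 'NCBI: 3288' (hypothetical helper)."""
    # Assumption: any parenthesized fragment, e.g. "(composition)", is extraneous.
    cleaned = re.sub(r"\s*\([^)]*\)", "", raw)
    # Remove whitespace around the first colon between namespace and accession.
    cleaned = re.sub(r"\s*:\s*", ":", cleaned, count=1)
    return cleaned.strip()
```

For example, <code>normalize_xref_id("NCBI: 3288")</code> returns <code>"NCBI:3288"</code>.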

== Version 1.0.0 ==
* The BiomarkerKB data portal became available with biomarker data from OncoMX, OpenTargets, MarkerDB, ClinVar, PubMed Central Biomarker Gene Set Curation, Metabolomics Workbench (MW), UniProtKB, GWAS, and CIViC.

BiomarkerKB Resource Integration

2025-10-23T19:25:46Z

RajaMazumder:

BiomarkerKB collects data from many different resources. Collected data is not always integrated directly into the data model; in some cases, data from a resource is added only as contextual annotations or cross-references.

Other resources to be explored: [https://search.cancervariants.org/ MetaKB], [https://cadsr.cancer.gov/onedata/Home.jsp CADSR Cancer], https://themarker.idrblab.cn/, biomarker.org, ResMarkerDB, SalivaDB, https://glycanage.com/publications, https://www.cancergenomeinterpreter.org/biomarkers, [https://github.com/issues/assigned?issue=clinical-biomarkers%7Cbiomarker-issue-repo%7C248 Glycan Biomarkers] ([https://github.com/glygener/CarboCurator code])

Please contact us at mazumder_lab@gwu.edu and daniallmasood@gwu.edu if you know of any other resources that may contain biomarker data.

=CIViC=
Status: Direct Integration into Data Model

* Clinical Interpretation of Variants in Cancer (CIViC).
* Provides cancer biomarkers in the form of DNA mutations (dbSNPs).
* The platform provides clinicians with treatment options for patients based on their unique tumor profiles.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.

=ClinVar=
Status: Direct Integration into Data Model

* Public archive of reports of human variations classified for diseases and drug responses.
* Provides biomarkers for all diseases, but we have only curated cancer biomarkers for now.
** dbSNPs
** The source file is very large; we plan to go back and use the existing script to map all of its biomarkers into the data model.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.

=EDRN=
Status: Sample Integration into Data Model

* Cancer biomarkers.

=GWAS=
Status: Direct Integration into Data Model

* Published genome-wide association studies (GWAS).
* Provides biomarkers in the form of SNPs.
* The GWAS Catalog contains SNPs for a large number of diseases.
** Preliminary curation focused only on cancer.
** We will use the existing script to map all biomarkers into the data model.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.

=HPO=

Status: Cross-Reference

* The Human Phenotype Ontology (HPO) provides disease and entity associations.
* It does not describe a change within the entity, so we cannot collect biomarker data from it.
* However, we can use it as a cross-reference within our cross-referencing section.
* Provides cross-references to OMIM, SNOMED, and MONDO.

=LOINC=
Status: Cross-Reference

''Data provided by Metabolomics Workbench''

=MarkerDB=
Status: Direct Integration into Data Model

* Provides extensive biomarker data and also cross-references other resources.
* Information includes panel details, abnormal biomarker levels by disease, and structural information.
* Annotations that can be cross-referenced include the above.
* By cross-referencing, BiomarkerKB will allow users to find more information for specific biomarkers and move towards the goal of being a comprehensive resource for biomarkers.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.

=Metabolomics Workbench=
Status: Direct Integration into Data Model

''Data provided by Metabolomics Workbench''

* Metabolite biomarkers used in the uniform newborn screening program.
* These biomarkers detect treatable disorders that are life-threatening or cause long-term morbidity before they become symptomatic.

=OncoKB=
Status: Cross-Reference

* Provides useful information on drugs and therapy options for different biomarker entities.
* Also provides information based on what condition the entity is related to.
* License: A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
* A paid license is required.
* Cross-referencing from biomarkers in BiomarkerKB to the appropriate drug and therapy information is the best solution.

=OncoMX=
Status: Direct Integration into Data Model

* Integrated cancer mutation and expression resource for exploring cancer biomarkers.
* Manual curation effort by GWU and JPL.
* Over 600 single and panel biomarkers.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.

=OpenTargets=
Status: Direct Integration into Data Model

* Collects potential drug targets and therapeutic targets.
* Some effort was required to find the correct biomarker data.
* 1,200 biomarkers collected.
** dbSNPs related to cancer and other diseases.
* License: Creative Commons Attribution-NonCommercial 4.0 International License.

=PubMed Central Biomarker Gene Set Curation=
Status: Direct Integration into Data Model

''Data provided by Avi Ma'ayan's LINCS group''

* This data set was created through manual curation of biomarker gene sets in PubMed Central, using the gene sets returned by Rummagene.
* Using the search results from the Rummagene web server, we manually identified publications that associated different conditions and environmental exposures with biomarker gene sets.
* The biomarker gene sets were retrieved by validating the genes mentioned in each publication.
* The primary use case for this data is to identify biomarker panels or gene sets associated with conditions.

=UniProtKB=

Status: Direct Integration into Data Model

* Can provide biomarker (change in entity), entity, condition, and sampling data.
* This data is in a text file that must be fully reviewed to make sure it can be extracted automatically.
* Contextual information can be imputed if necessary.
* UniProt contains both found_in entries and entries that are actual biomarkers:
** found_in will get a cross-reference;
** actual biomarkers will be directly integrated.
* Manually curated 56 reviewed entries that mention "biomarker" in the flat text file.
* License: Creative Commons Attribution 4.0 International (CC BY 4.0).