BiomarkerKB Wiki - User contributions [en]

Data Submission/Data Upload

2026-05-26T01:24:59Z

MariaKim: /* Instructions to submit Biomarker Data */

==Instructions to submit Biomarker Data==
Data submitted to the BiomarkerKB Portal must follow the biomarker data model. The instructions below cover how to format the data, where to send it, and how to create a BioCompute Object (BCO) for the submitted data.

# Biomarker data collected should follow the biomarker data model.
# "Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
## <code>biomarker</code>
## <code>assessed_biomarker_entity</code> and <code>assessed_biomarker_entity_id</code>
## <code>condition</code> and <code>condition_id</code> OR <code>exposure_agent</code> and <code>exposure_agent_id</code>
## <code>component_group</code> containing integers (1, 2, 3...) from 1 to N where N is the number of components. Normally N would simply be equal to the number of rows, unless your data contains multicomponent biomarkers. A multicomponent biomarker must have the same integer in all rows related to that biomarker.
# Other fields and annotations may also be collected from the data source; if data is missing, it can be inferred or mapped from other sources.
## <code>evidence</code> is one or more exact citations from the evidence source (in most cases, it will be the PubMed publication).
# Apply the following standards to the data when possible:
## <code>condition_id</code>: a Disease Ontology ID, for example <code>DOID:0080600</code>. Refer to https://disease-ontology.org/do/.
## <code>specimen_id</code>: an UBERON ID, for example <code>UBERON:0000178</code>. Refer to https://www.ebi.ac.uk/ols4/ontologies/uberon.
## <code>loinc_code</code>: a LOINC code, for example <code>LOINC:100153-6</code>. Refer to https://loinc.org/ (you may need to create an account to access the search functionality).
## <code>evidence_source</code>: <code>SOURCE:ID</code>, for example <code>PubMed:32677844</code>.
## For <code>assessed_biomarker_entity_id</code>, refer to the [https://github.com/clinical-biomarkers/biomarker-controlled-vocabulary GitHub documentation] for which standards to follow.
# Provide extra annotations from your DCC/data using the agreed-upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
## For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
# Create a tsv/json file with the agreed-upon fields, which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
## The preferred submission format is json, because it can be ingested into the existing data more efficiently; tsv submissions are also accepted. In the GitHub repository, the <code>data_conversion.py</code> script in the Data Conversion folder handles both tsv to json and json to tsv conversion.
## The [https://data.biomarkerkb.org/ BiomarkerKB data page] has examples of tsv data submissions that show how the data should be formatted with the appropriate biomarker fields.
# For panel biomarkers, give every row in the same panel the same <code>biomarker_id</code>. The value can be any string that uniquely identifies which rows belong to that panel.
# If curating data in tsv format: when rows belong to the same biomarker entry but differ in specimen, evidence, or role, give each of those rows the same <code>biomarker_id</code>. The value can be any string that uniquely identifies which rows belong to that biomarker.
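
The rules in the list above can be checked before a file is sent. The following is a minimal sketch, not the project's own validator. It assumes each row has already been read into a dictionary keyed by column name, and the helper names <code>missing_core_fields</code>, <code>check_id_prefixes</code>, and <code>group_by_biomarker_id</code> are hypothetical. Only the two prefixes shown in the examples above are checked.

```python
from collections import defaultdict

# Core fields named in the instructions. Every row also needs either the
# condition pair or the exposure_agent pair.
ALWAYS_REQUIRED = ["biomarker", "assessed_biomarker_entity",
                   "assessed_biomarker_entity_id", "component_group"]
CONDITION_PAIR = ["condition", "condition_id"]
EXPOSURE_PAIR = ["exposure_agent", "exposure_agent_id"]

# Assumption: only these two prefixes are checked in this sketch.
EXPECTED_PREFIX = {"specimen_id": "UBERON:", "loinc_code": "LOINC:"}
EITHER_PAIR = "condition/condition_id or exposure_agent/exposure_agent_id"


def _filled(row, field):
    """True when the field is present and not blank."""
    return bool((row.get(field) or "").strip())


def missing_core_fields(row):
    """Return the core fields that are empty or absent in one row."""
    missing = [f for f in ALWAYS_REQUIRED if not _filled(row, f)]
    has_condition = all(_filled(row, f) for f in CONDITION_PAIR)
    has_exposure = all(_filled(row, f) for f in EXPOSURE_PAIR)
    if not (has_condition or has_exposure):
        missing.append(EITHER_PAIR)
    return missing


def check_id_prefixes(row):
    """Return the fields whose non-empty value lacks the expected prefix."""
    return [f for f, prefix in EXPECTED_PREFIX.items()
            if _filled(row, f) and not row[f].startswith(prefix)]


def group_by_biomarker_id(rows):
    """Collect rows that share a biomarker_id (same biomarker or panel)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["biomarker_id"]].append(row)
    return dict(groups)
```

Rows in this shape can be produced from a tsv file with Python's <code>csv.DictReader(f, delimiter="\t")</code>.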

=== Submission ===
Once data is formatted and cleaned, please send it to mazumder_lab@gwu.edu.
# When submitting data, also fill out the BCO Information: Biomarker Data Google Form.
## The form captures metadata and a description of how the biomarker data was collected, which is needed to add the submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the biomarker data page. [https://hivelab.biochemistry.gwu.edu/biomarker-partnership/data/BCO_000435 Example]
# For further questions, consult the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for contributing data or reach out using the email above.

==Standardized and Controlled Vocabulary==
There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.

=== condition ===
<code>condition</code> should be reported in all lowercase, and <code>condition_id</code> (from Disease Ontology, MONDO, or SNOMED) should be provided in the column that follows.

=== assessed_biomarker_entity ===
assessed_biomarker_entity is the entity in which the change is assessed.

The name should start with a capital letter, unless it is a gene symbol, which should remain in all capitals (e.g., Myosin-binding protein H-like or IL6).

If the entity type is anything other than a gene, the full name should be written out.

=== assessed_entity_type ===
Report in all lowercase.

=== assessed_biomarker_entity_id ===
Refer to the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for the correct resource.

=== best_biomarker_role ===
Report in all lowercase. Refer to the [https://www.ncbi.nlm.nih.gov/books/NBK326791/ BEST Resource] to infer the correct biomarker role. Accepted role terms are:
* '''diagnostic''': Detects or confirms the presence of a disease or condition, or identifies individuals with a specific disease subtype.
* '''monitoring''': Assesses the status of a disease, medical condition, or exposure to a medical product over time.
* '''predictive''': Identifies which patients are more or less likely to respond favorably or unfavorably to a specific treatment or exposure.
* '''prognostic''': Identifies the likelihood of a clinical event, disease recurrence, or progression in patients with an already established disease or condition.
* '''response''': Shows that a biological response has occurred in a patient after being exposed to a medical product or environmental agent.
* '''risk''': Indicates the potential for an individual to develop a disease or condition in the future.
* '''safety''': Measures or indicates the likelihood, nature, or severity of adverse effects, toxicity, or organ injury.
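
The accepted role terms above lend themselves to a simple check. This is a minimal sketch under the assumption that each value holds a single role; the function name <code>normalize_role</code> is hypothetical.

```python
# The seven accepted role terms listed above.
ACCEPTED_ROLES = {"diagnostic", "monitoring", "predictive", "prognostic",
                  "response", "risk", "safety"}


def normalize_role(value):
    """Lowercase and trim a role; raise ValueError if it is not accepted."""
    role = value.strip().lower()
    if role not in ACCEPTED_ROLES:
        raise ValueError(f"unrecognized best_biomarker_role: {value!r}")
    return role
```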

=== specimen ===
Report in all lowercase. The <code>specimen_id</code> in the following column should be an UBERON ID.

=== biomarker ===
The biomarker field is the most important. Its wording depends on the type of entity being reported, as described in the subsections below. The text should be in lowercase, except that gene names remain in all uppercase.

==== Cell Biomarker ====
Should be reported as either:

* '''increased *cell name* count'''
* '''decreased *cell name* count'''
* Example: increased WBC count

==== Chemical Element Biomarker ====
Should be reported as either:

* '''increased *chemical element* level'''
* '''decreased *chemical element* level'''
* Example: increased Na+ level

==== DNA/RNA Biomarker ====
Should be reported as either:

* '''increased *DNA/RNA* level'''
* '''decreased *DNA/RNA* level'''
* Example: increased cfDNA level

==== Gene Biomarker ====
If the entity is a gene, the biomarker is reported differently depending on the kind of change:

* Expression of gene:
** '''*gene symbol* overexpression'''
** '''*gene symbol* underexpression'''
** Example: EGFR overexpression
* Amplification of gene: '''*gene symbol* amplification'''
* Site-specific mutation in the protein expressed from the gene: '''*gene symbol* *site mutation* mutation'''
** Example: BRAF V600E mutation
* SNPs: '''presence of *dbSNP ID* mutation in *gene symbol*'''
** Example: presence of rs180177132 mutation in PALB2

==== Glycan Biomarker ====
Should be reported as: '''increased *glycan* level'''

* Example: increased N-glycan level

==== Metabolite Biomarker ====
Should be reported as:

* '''increased *metabolite* level'''
* '''decreased *metabolite* level'''
* Example: increased Urea level

==== Protein Biomarker ====
Should be reported as either:

* '''increased *HGNC gene symbol* level'''
* '''decreased *HGNC gene symbol* level'''
* Example: increased IL6 level

For more examples please refer to the [https://data.biomarkerkb.org/ BiomarkerKB Data Page]
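
The phrase patterns in the subsections above can be assembled programmatically. The sketch below is illustrative only. The function names and the entity-type keys are hypothetical, entity names are inserted exactly as given, and the glycan pattern is limited to the "increased" form because that is the only one listed above.

```python
# Closing noun for each entity type. The keys are hypothetical labels;
# use whatever assessed_entity_type values your data actually carries.
NOUN_BY_TYPE = {
    "cell": "count",
    "chemical element": "level",
    "dna/rna": "level",
    "glycan": "level",
    "metabolite": "level",
    "protein": "level",
}
BOTH = frozenset({"increased", "decreased"})
# The glycan section above lists only the "increased" form.
ALLOWED_CHANGES = {"glycan": frozenset({"increased"})}


def quantity_biomarker(entity_type, entity, change):
    """Build '<change> <entity> <count|level>' for non-gene entities."""
    key = entity_type.strip().lower()
    if key not in NOUN_BY_TYPE:
        raise ValueError(f"no pattern for entity type: {entity_type!r}")
    if change not in ALLOWED_CHANGES.get(key, BOTH):
        raise ValueError(f"{change!r} is not listed for {key}")
    return f"{change} {entity} {NOUN_BY_TYPE[key]}"


def gene_expression_biomarker(symbol, over=True):
    """Build '<gene symbol> overexpression' or '... underexpression'."""
    return f"{symbol} {'overexpression' if over else 'underexpression'}"


def gene_amplification_biomarker(symbol):
    """Build '<gene symbol> amplification'."""
    return f"{symbol} amplification"


def gene_site_mutation_biomarker(symbol, site):
    """Build '<gene symbol> <site mutation> mutation'."""
    return f"{symbol} {site} mutation"


def snp_biomarker(dbsnp_id, symbol):
    """Build 'presence of <dbSNP ID> mutation in <gene symbol>'."""
    return f"presence of {dbsnp_id} mutation in {symbol}"
```

For instance, <code>quantity_biomarker("cell", "WBC", "increased")</code> gives <code>increased WBC count</code>, matching the Cell Biomarker example.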

Data Submission/Data Upload

2026-05-19T14:29:18Z

MariaKim: best_biomarker_role


BiomarkerKB Resource Integration

2026-05-12T20:11:30Z

MariaKim: /* caDSR */

BiomarkerKB collects data from a wide range of resources. Not all collected data are directly integrated into the core data model; some are included as contextual annotations or cross-references to enrich existing entries.

= Resources for Exploration =
*[https://themarker.idrblab.cn/ Marker Database]
*ResMarkerDB
*SalivaDB
*[https://glycanage.com/publications GlycanAge Publications]
*[https://www.cancergenomeinterpreter.org/biomarkers Cancer Genome Interpreter (Biomarkers)]
*[https://github.com/issues/assigned?issue=clinical-biomarkers%7Cbiomarker-issue-repo%7C248 Glycan Biomarkers] ([https://github.com/glygener/CarboCurator code])
*[https://www.alliancegenome.org/ Alliance Genome]

For suggestions of additional biomarker data resources, please contact: mazumder_lab@gwu.edu

= Data Sources =
== GWAS ==
'''Status''': Direct integration into data model
* Genome-wide association studies (GWAS) provide biomarkers in the form of SNPs.
* The GWAS Catalog includes SNPs associated with a wide range of diseases.
** Preliminary curation has only focused on cancer.
** As of 12/11/2026, biomarkers for all available conditions in the GWAS Catalog have been integrated.
* '''License''': CC BY-NC 4.0

== MetaKB ==
'''Status''': Direct integration into data model
* Provides harmonized associations between cancer genomic variants, diseases, and therapeutic evidence.
* Aggregates and standardizes variant interpretation data from six major knowledgebases:
** Clinical Interpretation of Variants in Cancer (CIViC) ''(integrated)''
** OncoKB ''(restricted from commercial use)''
** The Jackson Laboratory Clinical Knowledgebase (JAX-CKB) ''(restricted from commercial use and has share-alike requirements for non-commercial use)''
** MolecularMatch ''(restricted from commercial use)''
** Precision Medicine Knowledgebase (PMKB) ''(pending integration)''
** Cancer Genome Interpreter (CGI) – through its ''Cancer Biomarkers Database'' component ''(integrated)''
* Enables mapping of:
** Variant → Disease → Drug relationships
** Evidence levels and citations
** Ontology-aligned entities (genes, variants, diseases, drugs)
* Notes:
** Requires validation of entity mappings against BiomarkerKB schema
* Focused on somatic variant–based biomarkers; contextual attributes such as tissue type, therapy response, or evidence type can be inferred or imputed where not directly specified.
* Manual curation may be required for entries with incomplete evidence annotation or lacking standard ontology references.
* Integration approach: direct mapping of variant, condition, and evidence entities; cross-references retained to original data sources.
* License: Aggregated data are available for non-commercial, research use only, respecting constituent licenses:
** CIViC – CC0 (Public Domain)
** PMKB – CC-BY 4.0
** CGI – CC0 for biomarkers database, CC-BY-NC 4.0 for tool
** JAX-CKB – CC-BY-NC-SA 4.0
** OncoKB – custom non-commercial license
** MolecularMatch – restricted commercial use
** MetaKB codebase – MIT license
* Overall usage requires adherence to non-commercial research terms; commercial use needs separate permissions from individual data providers.

== Glycan LLM Biomarkers ==
'''Status''': Direct integration into data model
* A LangChain LLM method was used to collect biomarkers from PubMed Central abstracts.
* The method identifies glycan entities, and the changes reported in them, that are associated with disease.
'''Note''': only biomarkers with <code>assessed_entity_type: protein</code> were integrated, with the goal of expanding to glycan entity types once the Glycan Structure Dictionary is finalized.

== Top 50 Biomarkers ==
'''Status''': Direct integration into data model
* Biomarkers collected during Summer Volunteership
* Volunteers identified top 50 biomarker entities from BiomarkerKB
* Using this information, the top 50 biomarker entities were searched in PubMed
* 100 biomarkers were manually curated

== EDRN ==
'''Status''': Sample integration into data model
* Cancer biomarkers.
* A sample of EDRN biomarkers was provided by the EDRN LLM method
* Biomarkers are extracted from the free text of EDRN's publicly available biomarkers

== LOINC ==
'''Status''': Direct integration into data model
* Metabolite data only
* We are currently working with the Metabolomics Workbench group to get the complete data

== OncoKB ==
'''Status''': Cross-Reference
* Provides useful information on drugs and therapy options for different biomarker entities.
* Also provides information based on what condition the entity is related to.
* '''License''': A license is required to use OncoKB for commercial and/or clinical purposes, and to access OncoKB data programmatically for academic purposes.
* Paid license is required
* Cross-reference from biomarkers in BiomarkerKB to the appropriate drug information and therapy information is the best solution.

== HPO ==
'''Status''': Cross-Reference
* HPO provides disease and entity associations.
* It does not provide a change within the entity, so biomarker data cannot be collected from it.
* However, it can be used as a cross-reference within our cross-referencing section.
* Provides cross-reference to OMIM, SNOMED, and MONDO.

== UniProtKB ==
'''Status''': Direct integration into data model
* Can provide biomarker (change in entity), entity, condition, and sampling data.
* This data is in a text file that must be reviewed in full to make sure it can be extracted automatically.
* Contextual information can be imputed if necessary.
* UniProt contains both found_in entries and entries that are actual biomarkers:
** found_in will get a cross-reference;
** actual biomarkers will be directly integrated.
* Manual curation of 56 reviewed entries with mention of "biomarker" in flat text file.
* '''License''': Creative Commons Attribution 4.0 International (CC BY 4.0).

== CIViC ==
'''Status''': Direct integration into data model
* Clinical Interpretation of Variants in Cancer (CIViC).
* Provides cancer biomarkers in the form of DNA mutations (dbSNPs).
* The platform provides clinicians with treatment options for patients based on their unique tumor profile.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.

== ClinVar ==
'''Status''': Direct integration into data model
* Public archive of reports of human variations classified for diseases and drug responses.
* Provides biomarkers for all diseases, but only cancer biomarkers have been curated so far.
** dbSNPs
** The file is very large; the existing script will be used later to map all of its biomarkers into the data model.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.
'''Note''': Only biomarkers from "cancer" and "carcinoma" tags were pulled. Pending integration of biomarkers for all diseases.

== MarkerDB ==
'''Status''': Cross-Reference
* Provides a lot of useful biomarker data and cross-references other resources as well.
* Information includes: panel information, abnormal levels of biomarkers by disease, structural information, etc.
* Annotations that can be cross-referenced include the above.
* By cross-referencing, BiomarkerKB will allow users to find more information for specific biomarkers and move towards the goal of being a comprehensive resource for biomarkers.
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.

== Metabolomics Workbench ==
'''Status''': Direct integration into data model

''Data provided by Metabolomics Workbench''
* Metabolite biomarkers utilized in the uniform newborn screening program.
* Detects treatable disorders that are life-threatening or cause long-term morbidity, before they become symptomatic.

== OncoMX ==
'''Status''': Direct integration into data model
* Integrated cancer mutation and expression resource for exploring cancer biomarkers
* Manual curation effort by GWU and JPL
* Over 600 single and panel biomarkers
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.

== OpenTargets ==
'''Status''': Direct integration into data model
* Collects potential drug targets and therapeutic targets.
* Some effort was required to find the correct biomarker data.
* 1200 biomarkers collected.
** dbSNPs related to cancer and other diseases
* '''License''': Creative Commons Attribution-NonCommercial 4.0 International License.
'''Note''': Only cancer data was integrated.

== PubMed Central Biomarker Gene Set ==
'''Status''': Direct integration into data model

''Data provided by Avi Ma'ayan's LINCS group''
* This data set was created through manual curation of biomarker gene sets on PubMed Central, using the gene sets returned by Rummagene.
* Using the search results from the Rummagene web server, we manually identified publications that associate different conditions and environmental exposures with biomarker gene sets.
* The biomarker gene sets were retrieved by validating the genes mentioned in each publication.
* The primary use case for this data is to identify biomarker panels/ gene sets associated with conditions.

== SenNet ==
'''Status''': Direct integration into data model
* Cell senescence biomarkers from SenNet group
* Biomarker data was collected and incorporated; however, the biomarker field was incomplete, so the integrated data was given a score of -2
* Data is still valuable as contextual data and can be revisited to complete biomarker field in future
For information about cross-references and annotations in BiomarkerKB, please visit [[Xrefs and annotations]].

= Pending Resources =
== biomarker.org ==
Reached out on March 17th, 2026 regarding data access and sent follow-up communications; however, no response was received.

== [https://cadsr.cancer.gov/onedata/Home.jsp caDSR] ==
The Cancer Data Standards Registry and Repository (caDSR) is a metadata registry. It defines Common Data Elements (CDEs), including field names, definitions, and controlled value sets, but does not contain biomarker-condition relationships or evidence.

Recommendation: ingest the first 100 rows; future evaluation will be needed for the rest.

Associated ticket: [https://github.com/clinical-biomarkers/biomarker-issue-repo/issues/396 Integrate CADSR Cancer #396]

Data Submission/Data Upload

2025-10-22T16:24:10Z

MariaKim: /* Protein Biomarker */

==Instructions to submit Biomarker Data==
Data submitted to the BiomarkerKB Portal must follow the biomarker data model. The instructions below cover how to format the data, where to send it, and how to create a BioCompute Object (BCO) for the submitted data.

# Biomarker data collected should follow the biomarker data model.
# "Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
## biomarker
## assessed_biomarker_entity and assessed_biomarker_entity_id
## condition and condition_id OR exposure_agent and exposure_agent_id
# Other fields and annotations may also be collected from the data source; if data is missing, it can be inferred or mapped from other sources.
# Apply the following standards to the data when possible:
## condition_id = DOID
## specimen_id = UBERON
## evidence_source = "SOURCE":"ID"
## For assessed_biomarker_entity_id please refer to this GitHub documentation for which standards to follow
# Provide extra annotations from your DCC/data with the agreed upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
## For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
# Create a tsv/json file with the agreed upon fields which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
## The preferred submission format is json, because it can be ingested into the existing data more efficiently; tsv submissions are also accepted. In the GitHub repository, the data_conversion.py script in the Data Conversion folder handles both tsv to json and json to tsv conversion.
## The biomarker data page has examples of tsv data submissions that show how the data should be formatted with the appropriate biomarker fields.
# For panel biomarkers, give every row in the same panel the same biomarker_id. The value can be any string that uniquely identifies which rows belong to that panel.
# If curating data in tsv format: when rows belong to the same biomarker entry but differ in specimen, evidence, or role, give each of those rows the same biomarker_id. The value can be any string that uniquely identifies which rows belong to that biomarker.

=== Once data is formatted and cleaned please send any data to daniallmasood@gwu.edu ===
# Concurrently with submitting data please fill out the BCO Information: Biomarker Data Google Form.
## This will give metadata and description on how biomarker data was collected and is important for adding submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and available on the biomarker data page as well. [https://hivelab.biochemistry.gwu.edu/biomarker-partnership/data/BCO_000435 Example]
# If there are any further questions please consult the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for contributing data or reach out to Daniall using the email above.

==Standardized and Controlled Vocabulary==
There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.

=== Condition ===
Condition should be reported in all lowercase, and the condition ID (a Disease Ontology ID) should be provided in the column that follows.

=== assessed_biomarker_entity ===
assessed_biomarker_entity is the entity in which the change is assessed.

The name should start with a capital letter, unless it is a gene symbol, which should remain in all capitals (e.g., Myosin-binding protein H-like or IL6).

If the entity type is anything other than a gene, the full name should be written out.

=== assessed_entity_type ===
Report in all lowercase.

=== assessed_biomarker_entity_id ===
Refer to the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for the correct resource.

=== best_biomarker_role ===
Report in all lowercase. Refer to the [https://www.ncbi.nlm.nih.gov/books/NBK326791/ BEST Resource] for the correct biomarker role.

=== specimen ===
Report in all lowercase. The specimen_id in the following column should be an UBERON ID.

=== biomarker ===
The biomarker field is the most important. There are several distinctions here and changes are made based on the entity being reported. The text should be in lowercase except when a gene name appears then it should remain all uppercase.

==== Cell Biomarker ====
Should be reported as either:

* '''increased *cell name* count'''
* '''decreased *cell name* count'''
* Example: increased WBC count

==== Chemical Element Biomarker ====
Should be reported as either:

* '''increased *chemical element* level'''
* '''decreased *chemical element* level'''
* Example: increased Na+ level

==== DNA/RNA Biomarker ====
Should be reported as either:

* '''increased *DNA/RNA* level'''
* '''decreased *DNA/RNA* level'''
* Example: increased cfDNA level

==== Gene Biomarker ====
If the entity is a gene, the biomarker is reported differently depending on the kind of change being described:

* Expression of gene:
** '''*gene symbol* overexpression'''
** '''*gene symbol* underexpression'''
** Example: EGFR overexpression
* Amplification of gene: '''*gene symbol* amplification'''
* Specific site mutation in the protein encoded by the gene: '''*gene symbol* *site mutation* mutation'''
** Example: BRAF V600E mutation
* SNPs: '''presence of *dbSNP ID* mutation in *gene symbol*'''
** Example: presence of rs180177132 mutation in PALB2
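
The gene patterns above can be expressed as a single formatting function. This is a minimal sketch: the function name and the <code>kind</code> labels (<code>site_mutation</code>, <code>snp</code>) are hypothetical and chosen only for illustration.

```python
def gene_biomarker(gene_symbol: str, kind: str, detail: str = "") -> str:
    """Build a gene biomarker string following the patterns listed above.

    kind is one of: "overexpression", "underexpression", "amplification",
    "site_mutation" (detail = the site mutation, e.g. V600E) or
    "snp" (detail = the dbSNP ID). These labels are illustrative only.
    """
    if kind in ("overexpression", "underexpression", "amplification"):
        return f"{gene_symbol} {kind}"
    if kind in ("site_mutation", "snp") and not detail:
        raise ValueError(f"kind '{kind}' needs a detail value")
    if kind == "site_mutation":
        return f"{gene_symbol} {detail} mutation"
    if kind == "snp":
        return f"presence of {detail} mutation in {gene_symbol}"
    raise ValueError(f"unknown kind: {kind}")
```

Called with the examples from the list, <code>gene_biomarker("BRAF", "site_mutation", "V600E")</code> gives "BRAF V600E mutation" and <code>gene_biomarker("PALB2", "snp", "rs180177132")</code> gives "presence of rs180177132 mutation in PALB2".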

==== Glycan Biomarker ====
Should be reported as: '''increased *glycan* level'''

* Example: increased N-glycan level

==== Metabolite Biomarker ====
Should be reported as:

* '''increased *metabolite* level'''
* '''decreased *metabolite* level'''
* Example: increased Urea level

==== Protein Biomarker ====
Should be reported as either:

* '''increased *HGNC gene symbol* level'''
* '''decreased *HGNC gene symbol* level'''
* Example: increased IL6 level

For more examples, please refer to the [https://data.biomarkerkb.org/ BiomarkerKB Data Page].
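
The count and level patterns for the non-gene entity types can be sketched in the same way. The entity-type keys below are hypothetical labels, not the controlled vocabulary for <code>assessed_entity_type</code>, and the restriction on glycans simply mirrors the fact that this page lists only the "increased" form for them.

```python
# Trailing word per entity type, as listed in the subsections above.
# The keys are illustrative labels, not an official vocabulary.
SUFFIX = {
    "cell": "count",
    "chemical element": "level",
    "dna/rna": "level",
    "glycan": "level",
    "metabolite": "level",
    "protein": "level",
}

# This page lists only "increased" for glycans; other types list both directions.
DIRECTIONS = {"glycan": ("increased",)}
DEFAULT_DIRECTIONS = ("increased", "decreased")


def quantitative_biomarker(direction: str, entity: str, entity_type: str) -> str:
    """Build a string such as 'increased IL6 level' or 'increased WBC count'."""
    key = entity_type.lower()
    if key not in SUFFIX:
        raise ValueError(f"no count/level pattern for entity type: {entity_type}")
    if direction not in DIRECTIONS.get(key, DEFAULT_DIRECTIONS):
        raise ValueError(f"direction '{direction}' is not listed for {entity_type}")
    return f"{direction} {entity} {SUFFIX[key]}"
```

With the examples above, <code>quantitative_biomarker("increased", "WBC", "cell")</code> gives "increased WBC count" and <code>quantitative_biomarker("increased", "IL6", "protein")</code> gives "increased IL6 level".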

Data Submission/Data Upload

2025-10-21T15:19:38Z

MariaKim: /* Metabolite Biomarker */ fix casing

==Instructions to submit Biomarker Data==
To submit data for the BiomarkerKB Portal, the biomarker data model must be followed. Instructions on how to format the data for submission, where to send it, and how to create a BCO for the submitted data are provided below.

# Biomarker data collected should follow the biomarker data model.
# "Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
## biomarker
## assessed_biomarker_entity and assessed_biomarker_entity_id
## condition and condition_id OR exposure_agent and exposure_agent_id
# Other fields and annotations may also be collected from the data source; if data is missing, it can be inferred or mapped from other sources.
# Apply the following standards to the data when possible:
## condition_id = DOID
## specimen_id = UBERON
## evidence_source = "SOURCE":"ID"
## For assessed_biomarker_entity_id please refer to this GitHub documentation for which standards to follow
# Provide extra annotations from your DCC/data with the agreed upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
## For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
# Create a TSV or JSON file with the agreed-upon fields, which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
## JSON is the preferred submission format because it allows the data to be ingested into the existing data efficiently; TSV submissions are also accepted. In the GitHub repository, the data_conversion.py script in the Data Conversion folder handles both TSV-to-JSON and JSON-to-TSV conversion.
## The biomarker data page has examples of TSV data submissions that show how the data should be formatted with the appropriate biomarker fields. Example
# For panel biomarkers, all biomarkers that are part of the same panel should share one biomarker_id value. It can be any string that uniquely identifies which rows belong to that panel. Documentation
# If curating data in TSV format: if biomarker rows are part of the same biomarker entry but differ in specimen, evidence, or role, they should share one biomarker_id value. It can be any string that uniquely identifies which rows belong to that biomarker.
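
Before sending a file, the core-field requirement from step 2 can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, each row is assumed to be a dictionary keyed by field name (for example, as produced by <code>csv.DictReader</code> on a TSV file), and the identifier values in the usage example are placeholders rather than real accessions.

```python
def missing_core_fields(row: dict) -> list[str]:
    """List the core fields that are empty or absent in one row.

    Hypothetical helper; the row is assumed to be a dict keyed by field name.
    """
    def filled(name: str) -> bool:
        return bool(str(row.get(name) or "").strip())

    problems = [
        name
        for name in ("biomarker", "assessed_biomarker_entity", "assessed_biomarker_entity_id")
        if not filled(name)
    ]
    # Either the condition pair or the exposure agent pair must be complete.
    has_condition = filled("condition") and filled("condition_id")
    has_exposure = filled("exposure_agent") and filled("exposure_agent_id")
    if not (has_condition or has_exposure):
        problems.append("condition + condition_id OR exposure_agent + exposure_agent_id")
    return problems


# Placeholder values for illustration only.
example_row = {
    "biomarker": "increased IL6 level",
    "assessed_biomarker_entity": "Interleukin-6",
    "assessed_biomarker_entity_id": "EXAMPLE:0001",
    "condition": "example condition",
    "condition_id": "DOID:0000000",
}
print(missing_core_fields(example_row))
```

An empty list means every core field is present; any returned names point to the columns that still need to be filled in.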

=== Once data is formatted and cleaned please send any data to daniallmasood@gwu.edu ===
# Concurrently with submitting data please fill out the BCO Information: Biomarker Data Google Form.
## The form captures metadata and a description of how the biomarker data was collected, which is needed to add the submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the Biomarker Data page. [https://hivelab.biochemistry.gwu.edu/biomarker-partnership/data/BCO_000435 Example]
# If there are any further questions, please consult the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for contributing data or reach out to Daniall using the email above.

==Standardized and Controlled Vocabulary==
There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.

=== Condition ===
Report the condition in all lowercase and provide the condition ID (from the Disease Ontology) in the following column.

=== assessed_biomarker_entity ===
assessed_biomarker_entity is the entity in which the change is assessed.

The name should start with a capital letter, but if the entity is just a gene, its symbol should remain in all capitals (e.g., Myosin-binding protein H-like or IL6).

If the entity type is anything but a gene, the whole name should be typed out.

=== assessed_entity_type ===
Report in all lowercase.

=== assessed_biomarker_entity_id ===
Refer to the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for the correct resource.

=== best_biomarker_role ===
Report in all lowercase. Refer to the [https://www.ncbi.nlm.nih.gov/books/NBK326791/ BEST Resource] for the correct biomarker role.

=== specimen ===
Report in all lowercase; the specimen_id in the following column should be from UBERON.

=== biomarker ===
The biomarker field is the most important. Its format depends on the type of entity being reported. The text should be in lowercase, except that a gene name should remain in all uppercase.

==== Cell Biomarker ====
Should be reported as either:

* '''increased *cell name* count'''
* '''decreased *cell name* count'''
* Example: increased WBC count

==== Chemical Element Biomarker ====
Should be reported as either:

* '''increased *chemical element* level'''
* '''decreased *chemical element* level'''
* Example: increased Na+ level

==== DNA/RNA Biomarker ====
Should be reported as either:

* '''increased *DNA/RNA* level'''
* '''decreased *DNA/RNA* level'''
* Example: increased cfDNA level

==== Gene Biomarker ====
If the entity is a gene, the biomarker is reported differently depending on the kind of change being described:

* Expression of gene:
** '''*gene symbol* overexpression'''
** '''*gene symbol* underexpression'''
** Example: EGFR overexpression
* Amplification of gene: '''*gene symbol* amplification'''
* Specific site mutation in the protein encoded by the gene: '''*gene symbol* *site mutation* mutation'''
** Example: BRAF V600E mutation
* SNPs: '''presence of *dbSNP ID* mutation in *gene symbol*'''
** Example: presence of rs180177132 mutation in PALB2

==== Glycan Biomarker ====
Should be reported as: '''increased *glycan* level'''

* Example: increased N-glycan level

==== Metabolite Biomarker ====
Should be reported as:

* '''increased *metabolite* level'''
* '''decreased *metabolite* level'''
* Example: increased Urea level

==== Protein Biomarker ====
Should be reported as either:

* '''increased *protein symbol* level'''
* '''decreased *protein symbol* level'''
* Example: increased IL6 level

For more examples, please refer to the [https://data.biomarkerkb.org/ BiomarkerKB Data Page].
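
Existing biomarker strings can also be screened against the patterns in this section with regular expressions. The sketch below is deliberately loose and only checks the overall shape of the string; the function name is hypothetical, and it does not verify that the gene symbol, dbSNP ID, or entity name is real.

```python
import re

# One pattern per reporting style described in this section.
PATTERNS = [
    re.compile(r"(increased|decreased) .+ (count|level)"),           # cell, element, DNA/RNA, glycan, metabolite, protein
    re.compile(r"\S+ (overexpression|underexpression|amplification)"),  # gene expression or amplification
    re.compile(r"\S+ \S+ mutation"),                                  # site mutation, e.g. BRAF V600E mutation
    re.compile(r"presence of rs\d+ mutation in \S+"),                 # SNP identified by a dbSNP ID
]


def follows_naming_convention(biomarker: str) -> bool:
    """Return True if the whole string matches one of the patterns above."""
    return any(pattern.fullmatch(biomarker) for pattern in PATTERNS)
```

A string such as "increased IL6 level" passes, while a free-text value such as "IL6 high" does not and would need to be rewritten before submission.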

Data Release Notes

2025-10-17T16:38:35Z

MariaKim: 1.0.4 release

Data Release Notes

2025-09-25T16:10:20Z

MariaKim: Data release 1.0.3

BiomarkerKB Resource Integration

2025-09-15T19:08:47Z

MariaKim: /* UniProtKB */

BiomarkerKB Resource Integration

2025-09-15T19:07:50Z

MariaKim: /* OpenTargets */

BiomarkerKB Resource Integration

2025-09-15T19:07:20Z

MariaKim: /* OncoKB */

BiomarkerKB Resource Integration

2025-09-15T19:06:48Z

MariaKim: /* MarkerDB */