Data Submission/Data Upload: Difference between revisions

Latest revision as of 21:27, 17 June 2026

Instructions to submit Biomarker Data

To submit data to the BiomarkerKB Portal, follow the biomarker data model. Instructions on how to format the data for submission, where to send it, and how to create a BCO for the submitted data are provided below.

Biomarker data collected should follow the biomarker data model.
"Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
1. biomarker
2. assessed_biomarker_entity and assessed_biomarker_entity_id
3. condition and condition_id OR exposure_agent and exposure_agent_id
4. component_group containing integers (1, 2, 3...) from 1 to N where N is the number of components. Normally N would simply be equal to the number of rows, unless your data contains multicomponent biomarkers. A multicomponent biomarker must have the same integer in all rows related to that biomarker.
Other fields and annotations may also be collected from the data source. If data is missing, it can be inferred or mapped from other sources.
1. evidence is one or more exact citations from the evidence source (in most cases, it will be the PubMed publication).
Apply the following standards to the data when possible:
1. condition_id, for example DOID:0080600. Refer to https://disease-ontology.org/do/.
2. specimen_id, for example UBERON:0000178. Refer to https://www.ebi.ac.uk/ols4/ontologies/uberon.
3. loinc_code, for example LOINC:100153-6. Refer to https://loinc.org/ (you may need to create an account to access the search functionality).
4. evidence_source = SOURCE:ID, for example PubMed:32677844
5. For assessed_biomarker_entity_id please refer to the GitHub documentation for which standards to follow
Provide extra annotations from your DCC/data using the agreed-upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file or added in the comment field.
1. For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
Create a tsv/json file with the agreed upon fields which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
1. The preferred format for data submission is a json file, as it allows the data to be ingested into the existing data efficiently. However, tsv file submissions are also accepted. In the GitHub repository, the data_conversion.py script in the Data Conversion Folder handles both tsv to json and json to tsv file conversion.
2. The [BiomarkerKB data page] has examples of tsv data submissions and how the data should be formatted with the appropriate biomarker fields. Example
For panel biomarkers, if the biomarkers are part of the same panel, the biomarker_id value for each biomarker should be any string value that can uniquely identify which rows are part of the same biomarker panel. Documentation
If curating data in tsv format: If biomarker rows are part of the same biomarker entry but differ on specimen, evidence, or role, then the biomarker_id for each row should be any string value that can uniquely identify which rows are part of the same biomarker.
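
The core-field and component_group rules above can be sketched as a quick pre-submission check. This is only an illustrative sketch: the field names come from the list above, but the function name, the exact checks, and the placeholder rows are assumptions and do not reflect how the BiomarkerKB scripts validate data.

```python
import csv
import io

# Core field names are taken from the list above; the checks themselves
# are an illustrative assumption, not the portal's actual validation.
ALWAYS_REQUIRED = ("biomarker", "assessed_biomarker_entity", "assessed_biomarker_entity_id")
CONTEXT_PAIRS = (("condition", "condition_id"), ("exposure_agent", "exposure_agent_id"))


def check_core_fields(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    groups = set()
    for number, row in enumerate(rows, start=1):
        for field in ALWAYS_REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append(f"row {number}: missing {field}")
        # At least one complete pair: condition(+id) OR exposure_agent(+id).
        if not any(all((row.get(f) or "").strip() for f in pair) for pair in CONTEXT_PAIRS):
            problems.append(f"row {number}: needs condition or exposure_agent with its id")
        group = (row.get("component_group") or "").strip()
        if group.isdecimal() and int(group) >= 1:
            groups.add(int(group))
        else:
            problems.append(f"row {number}: component_group must be a positive integer")
    # component_group values should run from 1 to N with no gaps.
    if groups and groups != set(range(1, max(groups) + 1)):
        problems.append("component_group values should cover 1..N without gaps")
    return problems


# Placeholder rows (not real biomarker data) to exercise the checks.
SAMPLE_TSV = (
    "biomarker\tassessed_biomarker_entity\tassessed_biomarker_entity_id\t"
    "condition\tcondition_id\texposure_agent\texposure_agent_id\tcomponent_group\n"
    "placeholder biomarker A\tENTITY_A\tPREFIX:0001\tplaceholder condition\tPREFIX:0002\t\t\t1\n"
    "placeholder biomarker B\tENTITY_B\tPREFIX:0003\t\t\t\t\t3\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
for problem in check_core_fields(rows):
    print(problem)
```

With the placeholder rows, the second row is flagged for lacking both a condition and an exposure agent, and the file is flagged because the component_group values skip 2.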

Submission

Once the data is formatted and cleaned, please send it to mazumder_lab@gwu.edu.

When submitting data, please also fill out the BCO Information: Biomarker Data Google Form.
1. The form captures metadata and a description of how the biomarker data was collected, which is needed to add the submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the biomarker data page. Example
If there are any further questions, please consult the GitHub Documentation for contributing data or reach out to Daniall using the email above.

BiomarkerKB dataset datamodel fields

There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.

Biomarker representation framework

A biomarker is not simply a gene, protein, metabolite, or other biological entity. A biomarker must include a defined measurement or change concept — such as presence, absence, increase, or decrease — describing what is observed. For example, EGFR alone is not a biomarker, but a specific EGFR mutation used for diagnostic, prognostic, or treatment-selection purposes is. Likewise, "IL6" alone is not a biomarker, but "increased IL6 expression" in a defined clinical context may be.

The fields below fall into two groups. Core fields directly align with the biomarker definition: biomarker, assessed_biomarker_entity, assessed_biomarker_entity_id, condition, condition_id, exposure_agent, and exposure_agent_id. Contextual fields enrich the representation: specimen, best_biomarker_role, and evidence.

In the BiomarkerKB accession model, the canonical biomarker concept represents the measured change or observation (e.g. "increased IL6 expression"), and disease- or condition-specific records are represented as child records linked to that canonical biomarker.

biomarker_id

A unique identifier assigned to each canonical biomarker concept. The canonical biomarker represents the measured change or observation (e.g. "increased IL6 expression"); disease- or condition-specific records are child records that share the same biomarker_id while differing in condition, specimen, or evidence. biomarker_id is assigned automatically by the BiomarkerKB data processing scripts, so the field can be left blank.

biomarker

The biomarker field is the most important field and follows the BiomarkerKB Controlled Vocabulary for standardized reporting. The wording differs depending on the entity being reported. The text should be in lowercase, except for gene names, which should remain in all uppercase.

Examples:

Increased level of protein SPP1/UPKB:P10451
Increased expression of RNA PCA3/HGNC:8637
Increased expression of gene B2M/NCBI:567
Increased methylation in gene VIM/NCBI:7431

For more examples please refer to the BiomarkerKB Data Page
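
The examples above appear to pair a description with the assessed entity identifier after a slash. Assuming that layout holds (an assumption drawn only from these examples, not from the controlled vocabulary itself), a hypothetical helper such as `split_example` could separate the parts:

```python
def split_example(example):
    """Split 'description/PREFIX:ACCESSION' into its three parts.

    The layout is inferred from the examples on this page and is an
    assumption, not a documented format.
    """
    description, slash, entity_id = example.rpartition("/")
    prefix, colon, accession = entity_id.partition(":")
    if not (slash and colon and description and prefix and accession):
        raise ValueError(f"unexpected layout: {example!r}")
    return description.strip(), prefix, accession


for example in (
    "Increased level of protein SPP1/UPKB:P10451",
    "Increased methylation in gene VIM/NCBI:7431",
):
    print(split_example(example))
```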

assessed_biomarker_entity

assessed_biomarker_entity is the entity in which the change is assessed. The name should start with a capital letter, unless the entity is a gene, in which case the gene symbol should remain in all capitals (e.g. Myosin-binding protein H-like or IL6). If the entity type is anything other than a gene, the whole name should be written out.

assessed_biomarker_entity_id

Assessed Entity Type	Resource (in order of preference/availability)
Carbohydrate	Chemical Entities of Biological Interest (ChEBI)
Cell	Cell Ontology (CO) -> National Cancer Institute Thesaurus (NCIt)
Chemical Element	PubChem (PCCID) -> National Cancer Institute Thesaurus (NCIt)
DNA	National Cancer Institute Thesaurus (NCIt)
Gene	NCBI
Gene (mutation)	NCBI dbSNP
Glycan	GlyTouCan Accession (GTC) -> PubChem (PCCID)
Lipoprotein	Chemical Entities of Biological Interest (ChEBI)
Metabolite	PubChem (PCCID) -> Chemical Entities of Biological Interest (ChEBI)
Peptide	Protein Ontology (PRO)
Protein	UniProt (UPKB) -> Protein Data Bank (PDB) -> Protein Ontology (PRO) -> National Cancer Institute Thesaurus (NCIt)
Protein Complex	Protein Ontology (PRO) -> Gene Ontology (GO)
RNA	HUGO Gene Nomenclature Committee (HGNC) -> RNAcentral (RNAC)
miRNA	miRBase (MRB)

Refer to the GitHub Documentation for the correct resource.
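
The preference order in the table above can be read as "use the first listed resource that has an identifier for the entity". A minimal sketch of that reading follows; the dictionary is transcribed from the table, while the function name `pick_resource` and the idea of passing in a set of available resources are assumptions made for illustration.

```python
# Preference order per entity type, transcribed from the table above.
# Abbreviations are the ones shown in parentheses in the table.
RESOURCE_PREFERENCE = {
    "carbohydrate": ["ChEBI"],
    "cell": ["CO", "NCIt"],
    "chemical element": ["PCCID", "NCIt"],
    "dna": ["NCIt"],
    "gene": ["NCBI"],
    "gene (mutation)": ["NCBI dbSNP"],
    "glycan": ["GTC", "PCCID"],
    "lipoprotein": ["ChEBI"],
    "metabolite": ["PCCID", "ChEBI"],
    "peptide": ["PRO"],
    "protein": ["UPKB", "PDB", "PRO", "NCIt"],
    "protein complex": ["PRO", "GO"],
    "rna": ["HGNC", "RNAC"],
    "mirna": ["MRB"],
}


def pick_resource(entity_type, available):
    """Return the most preferred resource that has an identifier, else None."""
    for resource in RESOURCE_PREFERENCE[entity_type.lower()]:
        if resource in available:
            return resource
    return None


# A protein with entries in PRO and PDB but not UniProt falls back to PDB.
print(pick_resource("Protein", {"PRO", "PDB"}))
```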

assessed_entity_type

Report in all lowercase. Example: gene

condition

`condition` should be reported in all lowercase. Example: colon cancer

condition_id

`condition_id` (from Disease Ontology, MONDO, SNOMED, or NCIt) should be provided in the following column. Example: DOID:219
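
Identifiers on this page such as DOID:219, UBERON:0000178, and PubMed:32677844 share a prefix, colon, identifier shape. A rough shape check might look like the sketch below; the pattern is an assumption based on those examples and does not confirm that an identifier actually exists in the named ontology.

```python
import re

# Assumed shape: an alphanumeric prefix, a colon, then a non-empty local
# identifier with no whitespace or further colons.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*:[^\s:]+")


def looks_like_prefixed_id(value):
    """Return True when the value has the PREFIX:ID shape used on this page."""
    return ID_PATTERN.fullmatch(value) is not None


for value in ("DOID:219", "PubMed:32677844", "219", "DOID: 219"):
    print(value, looks_like_prefixed_id(value))
```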

exposure_agent

Report in all lowercase. The exposure_agent documents any external stimulus, treatment, environmental factor, or intervention relevant to the biomarker's expression or activity. It provides context for biomarkers that respond to specific exposures rather than intrinsic disease processes (for example, response biomarkers). Leave blank if not applicable. Example: cisplatin

exposure_agent_id

The ontology identifier for the exposure_agent, provided in the following column. Leave blank if not applicable. Example: CHEBI:27899

best_biomarker_role

Report in all lowercase. Refer to the [BEST Resource](https://www.ncbi.nlm.nih.gov/books/NBK326791/) to infer the correct biomarker role. Accepted role terms are:

diagnostic: Detects or confirms the presence of a disease or condition, or identifies individuals with a specific disease subtype.
monitoring: Assesses the status of a disease, medical condition, or exposure to a medical product over time.
predictive: Identifies which patients are more or less likely to respond favorably or unfavorably to a specific treatment or exposure.
prognostic: Identifies the likelihood of a clinical event, disease recurrence, or progression in patients with an already established disease or condition.
response: Shows that a biological response has occurred in a patient after being exposed to a medical product or environmental agent.
risk: Indicates the potential for an individual to develop a disease or condition in the future.
safety: Measures or indicates the likelihood, nature, or severity of adverse effects, toxicity, or organ injury.

Example: diagnostic
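
Since only the seven role terms listed above are accepted and they must be lowercase, a small normalization step can catch typos before submission. The sketch below assumes a hypothetical helper name, `normalize_role`; the set of terms is taken from the list above.

```python
# Accepted role terms, as listed above.
ACCEPTED_ROLES = frozenset({
    "diagnostic", "monitoring", "predictive", "prognostic",
    "response", "risk", "safety",
})


def normalize_role(value):
    """Lowercase and trim a role term, rejecting anything not in the list."""
    role = value.strip().lower()
    if role not in ACCEPTED_ROLES:
        raise ValueError(f"not an accepted best_biomarker_role: {value!r}")
    return role


print(normalize_role(" Diagnostic "))
```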

specimen

Report in all lowercase. Leave blank if not applicable. Example: feces

specimen_id

`specimen_id` in the following column should be from UBERON. Leave blank if not applicable. Example: UBERON:0001988

loinc_code

Report the Logical Observation Identifiers Names and Codes (LOINC) code corresponding to the test or measurement (e.g. 77354-9). Leave blank if not applicable. Example: 77354-9
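
This page shows LOINC codes both bare (77354-9) and with a prefix (LOINC:100153-6). The sketch below accepts either form and returns the bare code. Which form the portal expects is not stated here, so accepting both is an assumption, as is the helper name `bare_loinc_code`. The check covers the shape only and does not verify the check digit or that the code exists.

```python
import re

# Assumed shape: digits, a hyphen, one check digit, with an optional
# "LOINC:" prefix in front.
LOINC_PATTERN = re.compile(r"(?:LOINC:)?(\d+-\d)")


def bare_loinc_code(value):
    """Return the code without any prefix, or None if the shape is unexpected."""
    match = LOINC_PATTERN.fullmatch(value.strip())
    return match.group(1) if match else None


for value in ("77354-9", "LOINC:100153-6", "77354"):
    print(value, "->", bare_loinc_code(value))
```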

@@ Line 7: / Line 7: @@
 ## <code>assessed_biomarker_entity</code> and <code>assessed_biomarker_entity_id</code>
 ## <code>condition</code> and <code>condition_id</code> OR <code>exposure_agent</code> and <code>exposure_agent_id</code>
-## <code>component_group</code> containing integers (1, 2, 3...). A multicomponent biomarker must have the same integer in all rows related to that biomarker.
+## <code>component_group</code> containing integers (1, 2, 3...) from 1 to N where N is the number of components. Normally N would simply be equal to the number of rows, unless your data contains multicomponent biomarkers. A multicomponent biomarker must have the same integer in all rows related to that biomarker.
 # Other fields and annotations may also be collected from the data source, however if data is missing it can also be inferred or mapped from other sources.
 ## <code>evidence</code> is one or more exact citations from the evidence source (in most cases, it will be the PubMed publication).
@@ Line 16: / Line 16: @@
 ## <code>evidence_source</code> = <code>SOURCE:ID</code>, for example <code>PubMed:32677844</code>
 ## For <code>assessed_biomarker_entity_id</code> please refer to the [https://github.com/clinical-biomarkers/biomarker-controlled-vocabulary GitHub documentation] for which standards to follow
-# Provide extra annotations from your DCC/data with the agreed upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
+# Provide extra annotations from your DCC/data with the agreed upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file or can be added in the <code>comment</code> field .
 ## For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
 # Create a tsv/json file with the agreed upon fields which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
@@ Line 30: / Line 30: @@
 # If there are any further questions please consult the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for contributing data or reach out to Daniall using the email above.
-==Standardized and Controlled Vocabulary==
+==BiomarkerKB dataset datamodel fields==
 There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.
+=== Biomarker representation framework ===
+A biomarker is not simply a gene, protein, metabolite, or other biological entity. A biomarker must include a defined measurement or change concept — such as presence, absence, increase, or decrease — describing what is observed. For example, EGFR alone is not a biomarker, but a specific EGFR mutation used for diagnostic, prognostic, or treatment-selection purposes is. Likewise, "IL6" alone is not a biomarker, but "increased IL6 expression" in a defined clinical context may be.
+The fields below fall into two groups. Core fields directly align with the biomarker definition: <code>biomarker</code>, <code>assessed_biomarker_entity</code>, <code>assessed_biomarker_entity_id</code>, <code>condition</code>, <code>condition_id</code>, <code>exposure_agent</code>, and <code>exposure_agent_id</code>. Contextual fields enrich the representation: <code>specimen</code>, <code>best_biomarker_role</code>, and <code>evidence</code>.
+In the BiomarkerKB accession model, the canonical biomarker concept represents the measured change or observation (e.g. "increased IL6 expression"), and disease- or condition-specific records are represented as child records linked to that canonical biomarker.
+=== biomarker_id ===
+A unique identifier assigned to each canonical biomarker concept. The canonical biomarker represents the measured change or observation (e.g. "increased IL6 expression"); disease- or condition-specific records are child records that share the same biomarker_id while differing in condition, specimen, or evidence. biomarker_id is assigned by the biomarkerKB data processing scripts automatically so the field can be left blank.
-=== condition ===
+=== biomarker ===
-<code>condition</code> should be reported in all lowercase and <code>condition_id</code> (from Disease Ontology, MONDO, or SNOMED) should be provided in the following column
+The biomarker field is the most important as follows the [https://github.com/clinical-biomarkers/biomarker-controlled-vocabulary BiomarkerKB Controlled Vocabulary] for standardized reporting. There are several distinctions here and changes are made based on the entity being reported. The text should be in lowercase except when a gene name appears then it should remain all uppercase.
+Examples
+* Increased level of protein SPP1/UPKB:P10451
+* Increased expression of RNA PCA3/HGNC:8637
+* Increased expression of gene B2M PCA3/NCBI:567
+* Increased methylation in gene VIM/NCBI:7431<br />
+For more examples please refer to the [https://data.biomarkerkb.org/ BiomarkerKB Data Page]
 === assessed_biomarker_entity ===
 assessed_biomarker_entity is the entity in which the change is assessed.
 Should start off with a capital letter but if it is just a gene then it should remain in all capitals (e.g Myosin-binding protein H-like or IL6).
 If the entity type is anything but a gene the whole name should be typed out.
-=== assessed_entity_type ===
-Report in all lowercase.
 === assessed_biomarker_entity_id ===
+{| class="wikitable"
+!Assessed Entity Type
+!Resource (in order of preference/availability)
+|-
+|Carbohydrate
+|Chemical Entities of Biological Interest (ChEBI)
+|-
+|Cell
+|Cell Ontology (CO) -> National Cancer Institute Thesaurus (NCIt)
+|-
+|Chemical Element
+|PubChem (PCCID) -> National Cancer Institute Thesaurus (NCIt)
+|-
+|DNA
+|National Cancer Institute Thesaurus (NCIt)
+|-
+|Gene
+|NCBI
+|-
+|Gene (mutation)
+|NCBI dbSNP
+|-
+|Glycan
+|GlyTouCan Accession (GTC) -> PubChem (PCCID)
+|-
+|Lipoprotein
+|Chemical Entities of Biological Interest (ChEBI)
+|-
+|Metabolite
+|PubChem (PCCID) -> Chemical Entities of Biological Interest (ChEBI)
+|-
+|Peptide
+|Protein Ontology (PRO)
+|-
+|Protein
+|Uniprot (UPKB) -> Protein Data Bank (PDB) -> Protein Ontology (PRO) -> National Cancer Institute Thesaurus (NCIt)
+|-
+|Protein Complex
+|Protein Ontology (PRO) -> Gene Ontology (GO)
+|-
+|RNA
+|HUGO Gene Nomenclature Committee (HGNC) -> RNA Central (RNAC)
+|-
+|miRNA
+|miRBase (MRB)
+|}
 Refer to the [https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md GitHub Documentation] for the correct resource.
-=== best_biomarker_role ===
+=== assessed_entity_type ===
-Report in all lowercase. Refer to the [https://www.ncbi.nlm.nih.gov/books/NBK326791/ BEST Resource] to infer the correct biomarker role. Accepted role terms are:
+Report in all lowercase.
-* diagnostic
+Example: gene
-* monitoring
-* predictive
-* prognostic
-* response
-* risk
-* safety
-=== specimen ===
+=== condition ===
-Report in all lowercase and <code>specimen_id</code> in the following column should be from UBERON.
+`condition` should be reported in all lowercase.
+Example: colon cancer
-=== biomarker ===
+=== condition_id ===
-The biomarker field is the most important. There are several distinctions here and changes are made based on the entity being reported. The text should be in lowercase except when a gene name appears then it should remain all uppercase.
+`condition_id` (from Disease Ontology, MONDO, or SNOMED or NCIt) should be provided in the following column.
+Example: DOID:219
-==== Cell Biomarker ====
+=== exposure_agent ===
-Should be reported as either:
+Report in all lowercase. The exposure_agent documents any external stimulus, treatment, environmental factor, or intervention relevant to the biomarker's expression or activity. It provides context for biomarkers that respond to specific exposures rather than intrinsic disease processes (for example, response biomarkers). Leave blank if not applicable.
+Example: cisplatin
-* '''increased *cell name* count'''
+=== exposure_agent_id ===
-* '''decreased *cell name* count'''
+The ontology identifier for the exposure_agent, provided in the following column. Leave blank if not applicable.
-* Example: increased WBC count
+Example: CHEBI:27899
-==== Chemical Element Biomarker ====
+=== best_biomarker_role ===
-Should be reported as either:
+Report in all lowercase. Refer to the [BEST Resource](https://www.ncbi.nlm.nih.gov/books/NBK326791/) to infer the correct biomarker role. Accepted role terms are:
-* '''increased *chemical element* level'''
+* diagnostic: Detects or confirms the presence of a disease or condition, or identifies individuals with a specific disease subtype.
-* '''decreased *chemical element* level'''
+* monitoring: Assesses the status of a disease, medical condition, or exposure to a medical product over time.
-* Example: increased Na+ level
+* predictive: Identifies which patients are more or less likely to respond favorably or unfavorably to a specific treatment or exposure.
+* prognostic: Identifies the likelihood of a clinical event, disease recurrence, or progression in patients with an already established disease or condition.
+* response: Shows that a biological response has occurred in a patient after being exposed to a medical product or environmental agent.
+* risk: Indicates the potential for an individual to develop a disease or condition in the future.
+* safety: Measures or indicates the likelihood, nature, or severity of adverse effects, toxicity, or organ injury.
-==== DNA/RNA Biomarker ====
+Example: diagnostic
-Should be reported as either:
-* '''increased *DNA/RNA* level'''
+=== specimen ===
-* '''decreased *DNA/RNA* level'''
+Report in all lowercase. Leave blank if not applicable.
-* Example: increased cfDNA level
+Example: feces
-==== Gene Biomarker ====
-If the entity is a gene then there are different ways to report the biomarker based on how the mutation is reported:
-* Expression of gene:
-** '''*gene symbol* overexpression'''
-** '''*gene symbol* underexpression'''
-** Example: EGFR overexpression
-* Amplification of gene: '''*gene symbol* amplification'''
-* Specific site mutation in the expressed protein that is caused by the gene: '''*gene symbol* *site mutation* mutation'''
-** Example: BRAF V600E mutation
-* SNPs: '''presence of *dbSNP ID* mutation in *gene symbol*'''
-** Example: presence of rs180177132 mutation in PALB2
-==== Glycan Biomarker ====
+=== specimen_id ===
-Should be reported as: '''increased *glycan* level'''
+`specimen_id` in the following column should be from UBERON. Leave blank if not applicable.
+Example: UBERON:0001988
-* Example: increased N-glycan level
+=== loinc_code ===
+Report the Logical Observation Identifiers Names and Codes (LOINC) code corresponding to the test or measurement (e.g. 77354-9). Leave blank if not applicable.
-==== Metabolite Biomarker ====
+Example: 77354-9
-Should be reported as:
-* '''increased *metabolite* level'''
-* '''decreased *metabolite* level'''
-* Example: increased Urea level
-==== Protein Biomarker ====
-Should be reported as either:
-* '''increased *HGNC gene symbol* level'''
-* '''decreased *HGNC gene symbol* level'''
-* Example: increased IL6 level
-For more examples please refer to the [https://data.biomarkerkb.org/ BiomarkerKB Data Page]
