Data Submission/Data Upload
Latest revision as of 21:27, 17 June 2026
Instructions to submit Biomarker Data
To submit data for the BiomarkerKB Portal, the biomarker data model must be followed. Instructions on how to format the data for submission, where to send it, and how to create a BCO for the submitted data are provided below.
- Biomarker data collected should follow the biomarker data model.
- "Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
  - `biomarker`
  - `assessed_biomarker_entity` and `assessed_biomarker_entity_id`
  - `condition` and `condition_id` OR `exposure_agent` and `exposure_agent_id`
  - `component_group` containing integers (1, 2, 3...) from 1 to N, where N is the number of components. Normally N would simply be equal to the number of rows, unless your data contains multicomponent biomarkers. A multicomponent biomarker must have the same integer in all rows related to that biomarker.
- Other fields and annotations may also be collected from the data source; if data is missing, it can be inferred or mapped from other sources.
  - `evidence` is one or more exact citations from the evidence source (in most cases, the PubMed publication).
- Apply the following standards to the data when possible:
  - `condition_id` = `DOID:0080600`. Refer to https://disease-ontology.org/do/.
  - `specimen_id` = `UBERON:0000178`. Refer to https://www.ebi.ac.uk/ols4/ontologies/uberon.
  - `loinc_code` = `LOINC:100153-6`. Refer to https://loinc.org/ (you may need to create an account to access the search functionality).
  - `evidence_source` = `SOURCE:ID`, for example `PubMed:32677844`
  - For `assessed_biomarker_entity_id`, please refer to the [GitHub documentation](https://github.com/clinical-biomarkers/biomarker-controlled-vocabulary) for which standards to follow.
- Provide extra annotations from your DCC/data with the agreed upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file or added in the `comment` field.
  - For example: relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
- Create a tsv/json file with the agreed upon fields which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
  - The preferred method for data submission is a json file, as it helps ingest the data into the existing data efficiently. However, tsv file submissions are accepted as well. In the GitHub repository, the `data_conversion.py` script in the Data Conversion Folder handles both tsv to json and json to tsv file conversion.
  - The [BiomarkerKB data page](https://data.biomarkerkb.org/) has examples of tsv data submissions and shows how the data should be formatted with the appropriate biomarker fields.
- For panel biomarkers, if the biomarkers are part of the same panel, the biomarker_id value for each biomarker should be any string value that can uniquely identify which rows are part of the same biomarker panel. Documentation
- If curating data in tsv format: If biomarker rows are part of the same biomarker entry but differ on specimen, evidence, or role, then the biomarker_id for each row should be any string value that can uniquely identify which rows are part of the same biomarker.
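The core-field and `component_group` rules above can be checked before a file is sent. The following is a minimal sketch, not the project's own validation code: the function names `filled` and `check_rows` are hypothetical, and it assumes a tab-separated file whose header row uses the field names from the data model.

```python
import csv
import io

# Core fields that every row is expected to fill in.
REQUIRED = ("biomarker", "assessed_biomarker_entity", "assessed_biomarker_entity_id")


def filled(row, name):
    """True if the named column exists in the row and is not blank."""
    return bool((row.get(name) or "").strip())


def check_rows(tsv_text):
    """Return a list of human-readable problems found in the TSV text."""
    problems = []
    groups = set()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for number, row in enumerate(reader, start=1):
        for name in REQUIRED:
            if not filled(row, name):
                problems.append(f"row {number}: missing {name}")
        # Either the condition pair or the exposure_agent pair must be present.
        has_condition = filled(row, "condition") and filled(row, "condition_id")
        has_exposure = filled(row, "exposure_agent") and filled(row, "exposure_agent_id")
        if not (has_condition or has_exposure):
            problems.append(f"row {number}: needs condition or exposure_agent with its ID")
        group = (row.get("component_group") or "").strip()
        if group.isdecimal() and int(group) >= 1:
            groups.add(int(group))
        else:
            problems.append(f"row {number}: component_group must be a positive integer")
    # component_group values should run from 1 to N without gaps.
    if groups and sorted(groups) != list(range(1, len(groups) + 1)):
        problems.append("component_group values should run from 1 to N without gaps")
    return problems
```

Calling `check_rows(open("my_data.tsv", encoding="utf-8").read())` (the file name is a placeholder) returns an empty list when none of these checks fail. It does not replace the data dictionary, which remains the reference for what each field means.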
Submission
Once the data is formatted and cleaned, please send it to mazumder_lab@gwu.edu.
- When submitting data, please also fill out the BCO Information: Biomarker Data Google Form.
  - The form collects metadata and a description of how the biomarker data was collected, which is needed to add submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the Biomarker Data page.
- If there are any further questions, please consult the [GitHub Documentation](https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md) for contributing data or reach out to Daniall using the email above.
BiomarkerKB dataset datamodel fields
There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.
Biomarker representation framework
A biomarker is not simply a gene, protein, metabolite, or other biological entity. A biomarker must include a defined measurement or change concept — such as presence, absence, increase, or decrease — describing what is observed. For example, EGFR alone is not a biomarker, but a specific EGFR mutation used for diagnostic, prognostic, or treatment-selection purposes is. Likewise, "IL6" alone is not a biomarker, but "increased IL6 expression" in a defined clinical context may be.
The fields below fall into two groups. Core fields directly align with the biomarker definition: biomarker, assessed_biomarker_entity, assessed_biomarker_entity_id, condition, condition_id, exposure_agent, and exposure_agent_id. Contextual fields enrich the representation: specimen, best_biomarker_role, and evidence.
In the BiomarkerKB accession model, the canonical biomarker concept represents the measured change or observation (e.g. "increased IL6 expression"), and disease- or condition-specific records are represented as child records linked to that canonical biomarker.
biomarker_id
A unique identifier assigned to each canonical biomarker concept. The canonical biomarker represents the measured change or observation (e.g. "increased IL6 expression"); disease- or condition-specific records are child records that share the same biomarker_id while differing in condition, specimen, or evidence. The BiomarkerKB data processing scripts assign biomarker_id automatically, so the field can be left blank.
biomarker
The biomarker field is the most important field and follows the [BiomarkerKB Controlled Vocabulary](https://github.com/clinical-biomarkers/biomarker-controlled-vocabulary) for standardized reporting. The exact wording depends on the type of entity being reported. The text should be in lowercase, except that gene names remain in all uppercase.
Examples:
- Increased level of protein SPP1/UPKB:P10451
- Increased expression of RNA PCA3/HGNC:8637
- Increased expression of gene B2M/NCBI:567
- Increased methylation in gene VIM/NCBI:7431
For more examples, please refer to the [BiomarkerKB Data Page](https://data.biomarkerkb.org/).
assessed_biomarker_entity
assessed_biomarker_entity is the entity in which the change is assessed. It should start with a capital letter, but a gene symbol on its own should remain in all capitals (e.g. Myosin-binding protein H-like or IL6). If the entity type is anything other than a gene, the whole name should be written out.
assessed_biomarker_entity_id
| Assessed Entity Type | Resource (in order of preference/availability) |
|---|---|
| Carbohydrate | Chemical Entities of Biological Interest (ChEBI) |
| Cell | Cell Ontology (CO) -> National Cancer Institute Thesaurus (NCIt) |
| Chemical Element | PubChem (PCCID) -> National Cancer Institute Thesaurus (NCIt) |
| DNA | National Cancer Institute Thesaurus (NCIt) |
| Gene | NCBI |
| Gene (mutation) | NCBI dbSNP |
| Glycan | GlyTouCan Accession (GTC) -> PubChem (PCCID) |
| Lipoprotein | Chemical Entities of Biological Interest (ChEBI) |
| Metabolite | PubChem (PCCID) -> Chemical Entities of Biological Interest (ChEBI) |
| Peptide | Protein Ontology (PRO) |
| Protein | UniProt (UPKB) -> Protein Data Bank (PDB) -> Protein Ontology (PRO) -> National Cancer Institute Thesaurus (NCIt) |
| Protein Complex | Protein Ontology (PRO) -> Gene Ontology (GO) |
| RNA | HUGO Gene Nomenclature Committee (HGNC) -> RNAcentral (RNAC) |
| miRNA | miRBase (MRB) |
Refer to the GitHub Documentation for the correct resource.
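The preference order in the table can be encoded as a lookup for checking `assessed_biomarker_entity_id` values. This is only a sketch under stated assumptions: it treats the abbreviations in parentheses as the prefixes used in IDs, it assumes IDs take the `PREFIX:accession` form seen in examples such as `UPKB:P10451`, it uses `NCBI` for genes because the examples do, and it omits "Gene (mutation)" because the table gives no abbreviation for it. The names `PREFERRED_PREFIXES` and `prefix_rank` are hypothetical; the GitHub documentation remains the authoritative source.

```python
# Entity type (lowercase) -> prefixes in order of preference, taken from the table.
PREFERRED_PREFIXES = {
    "carbohydrate": ["ChEBI"],
    "cell": ["CO", "NCIt"],
    "chemical element": ["PCCID", "NCIt"],
    "dna": ["NCIt"],
    "gene": ["NCBI"],
    "glycan": ["GTC", "PCCID"],
    "lipoprotein": ["ChEBI"],
    "metabolite": ["PCCID", "ChEBI"],
    "peptide": ["PRO"],
    "protein": ["UPKB", "PDB", "PRO", "NCIt"],
    "protein complex": ["PRO", "GO"],
    "rna": ["HGNC", "RNAC"],
    "mirna": ["MRB"],
}


def prefix_rank(entity_type, entity_id):
    """Return the 0-based preference rank of the ID's prefix, or None if not listed."""
    prefix, separator, accession = entity_id.partition(":")
    if not separator or not accession:
        return None  # not in PREFIX:accession form
    listed = [p.lower() for p in PREFERRED_PREFIXES.get(entity_type.lower(), [])]
    try:
        # Compare case-insensitively, since both ChEBI and CHEBI spellings occur.
        return listed.index(prefix.lower())
    except ValueError:
        return None
```

A rank of 0 means the ID comes from the most preferred resource for that entity type. A result of `None` flags an ID worth a second look rather than proving it is wrong.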
assessed_entity_type
Report in all lowercase. Example: gene
condition
`condition` should be reported in all lowercase. Example: colon cancer
condition_id
`condition_id` (from Disease Ontology, MONDO, SNOMED, or NCIt) should be provided in the following column. Example: DOID:219
exposure_agent
Report in all lowercase. The exposure_agent documents any external stimulus, treatment, environmental factor, or intervention relevant to the biomarker's expression or activity. It provides context for biomarkers that respond to specific exposures rather than intrinsic disease processes (for example, response biomarkers). Leave blank if not applicable. Example: cisplatin
exposure_agent_id
The ontology identifier for the exposure_agent, provided in the following column. Leave blank if not applicable. Example: CHEBI:27899
best_biomarker_role
Report in all lowercase. Refer to the [BEST Resource](https://www.ncbi.nlm.nih.gov/books/NBK326791/) to infer the correct biomarker role. Accepted role terms are:
- diagnostic: Detects or confirms the presence of a disease or condition, or identifies individuals with a specific disease subtype.
- monitoring: Assesses the status of a disease, medical condition, or exposure to a medical product over time.
- predictive: Identifies which patients are more or less likely to respond favorably or unfavorably to a specific treatment or exposure.
- prognostic: Identifies the likelihood of a clinical event, disease recurrence, or progression in patients with an already established disease or condition.
- response: Shows that a biological response has occurred in a patient after being exposed to a medical product or environmental agent.
- risk: Indicates the potential for an individual to develop a disease or condition in the future.
- safety: Measures or indicates the likelihood, nature, or severity of adverse effects, toxicity, or organ injury.
Example: diagnostic
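Because the accepted terms form a closed list, a `best_biomarker_role` value can be checked mechanically. The sketch below is illustrative only: the names `ACCEPTED_ROLES` and `normalize_role` are hypothetical, and it handles a single role per value, since how multiple roles would be separated is not specified here.

```python
# The seven accepted role terms listed above.
ACCEPTED_ROLES = frozenset({
    "diagnostic",
    "monitoring",
    "predictive",
    "prognostic",
    "response",
    "risk",
    "safety",
})


def normalize_role(value):
    """Lowercase and trim a role term, raising ValueError if it is not accepted."""
    role = value.strip().lower()
    if role not in ACCEPTED_ROLES:
        raise ValueError(f"unknown best_biomarker_role: {value!r}")
    return role
```

Lowercasing before the lookup means a value such as "Diagnostic" is repaired rather than rejected, which matches the all-lowercase reporting rule.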
specimen
Report in all lowercase. Leave blank if not applicable. Example: feces
specimen_id
`specimen_id` in the following column should be from UBERON. Leave blank if not applicable. Example: UBERON:0001988
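A quick format check can catch malformed `specimen_id` values. This sketch assumes the usual UBERON identifier shape of the prefix `UBERON:` followed by seven digits, which matches both examples on this page, and treats a blank value as "not applicable". The function name `specimen_id_problem` is hypothetical, and the check does not confirm that the term actually exists in UBERON.

```python
import re

# UBERON identifiers: the prefix "UBERON:" followed by seven digits.
UBERON_PATTERN = re.compile(r"UBERON:\d{7}")


def specimen_id_problem(value):
    """Return None if the value is blank or well-formed, else a short message."""
    text = value.strip()
    if not text:
        return None  # blank means not applicable
    if UBERON_PATTERN.fullmatch(text):
        return None
    return f"specimen_id {value!r} does not look like UBERON:0000000"
```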
loinc_code
Report the Logical Observation Identifiers Names and Codes (LOINC) code corresponding to the test or measurement. Leave blank if not applicable. Example: 77354-9