Data Submission/Data Upload

From BiomarkerKB Wiki
Revision as of 21:27, 17 June 2026 by JeetVora (talk | contribs) (→‎biomarker_id)

Instructions to submit Biomarker Data

To submit data to the BiomarkerKB Portal, follow the biomarker data model. The steps below describe how to format the data for submission, where to send it, and how to create a BCO for the submitted data.

  1. Biomarker data collected should follow the biomarker data model.
  2. "Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
    1. biomarker
    2. assessed_biomarker_entity and assessed_biomarker_entity_id
    3. condition and condition_id OR exposure_agent and exposure_agent_id
    4. component_group containing integers (1, 2, 3...) from 1 to N where N is the number of components. Normally N would simply be equal to the number of rows, unless your data contains multicomponent biomarkers. A multicomponent biomarker must have the same integer in all rows related to that biomarker.
  3. Other fields and annotations may also be collected from the data source; if data is missing, it can be inferred or mapped from other sources.
    1. evidence is one or more exact citations from the evidence source (in most cases, it will be the PubMed publication).
  4. Apply the following standards to the data when possible:
    1. condition_id: for example, DOID:0080600. Refer to https://disease-ontology.org/do/.
    2. specimen_id: for example, UBERON:0000178. Refer to https://www.ebi.ac.uk/ols4/ontologies/uberon.
    3. loinc_code: for example, LOINC:100153-6. Refer to https://loinc.org/ (you may need to create an account to access the search functionality).
    4. evidence_source: written as SOURCE:ID, for example PubMed:32677844.
    5. assessed_biomarker_entity_id: refer to the GitHub documentation for which standards to follow.
  5. Provide extra annotations from your DCC/data using the agreed-upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model; it can be submitted in a separate file or added in the comment field.
    1. For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
  6. Create a tsv or json file with the agreed-upon fields, which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
    1. The preferred submission format is json, because it can be ingested into the existing data more efficiently. tsv submissions are also accepted. The data_conversion.py script in the Data Conversion folder on GitHub converts tsv to json and json to tsv.
    2. The [BiomarkerKB data page] has examples of tsv data submissions and how the data should be formatted with the appropriate biomarker fields. Example
  7. For panel biomarkers, the biomarker_id value for each biomarker in the same panel should be any string that uniquely identifies which rows belong to that panel. Documentation
  8. If curating data in tsv format: if rows belong to the same biomarker entry but differ in specimen, evidence, or role, the biomarker_id for each row should be any string that uniquely identifies which rows belong to the same biomarker.
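The core-field and component_group rules in steps 2.1 to 2.4 can be checked before submission. The following is a minimal sketch, not part of the BiomarkerKB tooling: the function names are hypothetical, the sample rows only reuse example values from this page, and the "no gaps" test is one reading of "integers from 1 to N".

```python
import csv
import io

# Field names come from the core-field list above; the checks are illustrative.
ALWAYS_REQUIRED = ("biomarker", "assessed_biomarker_entity", "assessed_biomarker_entity_id")
CONDITION_PAIR = ("condition", "condition_id")
EXPOSURE_PAIR = ("exposure_agent", "exposure_agent_id")


def _filled(row, field):
    """True when the field is present and not blank."""
    return bool((row.get(field) or "").strip())


def check_core_fields(row):
    """Return a list of core-field problems for one row (a dict of column -> value)."""
    problems = [f"missing {f}" for f in ALWAYS_REQUIRED if not _filled(row, f)]
    has_condition = all(_filled(row, f) for f in CONDITION_PAIR)
    has_exposure = all(_filled(row, f) for f in EXPOSURE_PAIR)
    if not (has_condition or has_exposure):
        problems.append("need condition/condition_id or exposure_agent/exposure_agent_id")
    return problems


def check_component_groups(rows):
    """Check that the distinct component_group values are exactly 1..N (assumed: no gaps)."""
    values = [(row.get("component_group") or "").strip() for row in rows]
    bad = [i for i, v in enumerate(values, start=1) if not v.isdecimal() or int(v) < 1]
    if bad:
        return [f"row {i}: component_group must be a positive integer" for i in bad]
    groups = {int(v) for v in values}
    if groups != set(range(1, len(groups) + 1)):
        return ["component_group values should run from 1 to N without gaps"]
    return []


# Placeholder rows for illustration only; the second row is deliberately incomplete.
HEADER = ["biomarker", "assessed_biomarker_entity", "assessed_biomarker_entity_id",
          "condition", "condition_id", "component_group"]
SAMPLE_TSV = "\n".join([
    "\t".join(HEADER),
    "\t".join(["increased methylation in gene VIM", "VIM", "NCBI:7431", "colon cancer", "DOID:219", "1"]),
    "\t".join(["increased expression of gene B2M", "B2M", "NCBI:567", "", "", "3"]),
])

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
for number, row in enumerate(sample_rows, start=1):
    print(number, check_core_fields(row))
print(check_component_groups(sample_rows))
```

Here the first row passes, the second is flagged for lacking both a condition pair and an exposure pair, and the component_group check reports the jump from 1 to 3.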

Submission

Once the data is formatted and cleaned, send it to mazumder_lab@gwu.edu.

  1. When submitting data, also fill out the BCO Information: Biomarker Data Google Form.
    1. The form captures metadata and a description of how the biomarker data was collected, which is needed to add the submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the biomarker data page. Example
  2. For further questions, consult the GitHub Documentation for contributing data or reach out to Daniall using the email above.

BiomarkerKB dataset datamodel fields

Biomarker data is reported in a standard way. This section covers how the biomarker itself should be reported and how the other fields should be filled out.

Biomarker representation framework

A biomarker is not simply a gene, protein, metabolite, or other biological entity. A biomarker must include a defined measurement or change concept — such as presence, absence, increase, or decrease — describing what is observed. For example, EGFR alone is not a biomarker, but a specific EGFR mutation used for diagnostic, prognostic, or treatment-selection purposes is. Likewise, "IL6" alone is not a biomarker, but "increased IL6 expression" in a defined clinical context may be.

The fields below fall into two groups. Core fields directly align with the biomarker definition: biomarker, assessed_biomarker_entity, assessed_biomarker_entity_id, condition, condition_id, exposure_agent, and exposure_agent_id. Contextual fields enrich the representation: specimen, best_biomarker_role, and evidence.

In the BiomarkerKB accession model, the canonical biomarker concept represents the measured change or observation (e.g. "increased IL6 expression"), and disease- or condition-specific records are represented as child records linked to that canonical biomarker.

biomarker_id

A unique identifier assigned to each canonical biomarker concept. The canonical biomarker represents the measured change or observation (e.g. "increased IL6 expression"); disease- or condition-specific records are child records that share the same biomarker_id while differing in condition, specimen, or evidence. The BiomarkerKB data processing scripts assign biomarker_id automatically, so the field can be left blank.
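The parent/child relationship described above can be pictured as grouping rows under one canonical concept. This is a rough sketch under stated assumptions: the grouping key (biomarker text plus entity ID) is a hypothetical stand-in for the biomarker_id that the BiomarkerKB scripts assign, the function name is invented, and the rows are placeholders built from example values on this page.

```python
from collections import defaultdict

# Fields treated as child-record context in this sketch (an assumption, not the real schema).
CHILD_FIELDS = ("condition", "condition_id", "specimen", "specimen_id")


def group_by_canonical(rows):
    """Group rows under a stand-in canonical key; each row becomes one child record."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["biomarker"], row["assessed_biomarker_entity_id"])
        grouped[key].append({field: row.get(field, "") for field in CHILD_FIELDS})
    return dict(grouped)


# Placeholder rows: same measured change, different specimen.
example_rows = [
    {"biomarker": "increased methylation in gene VIM", "assessed_biomarker_entity_id": "NCBI:7431",
     "condition": "colon cancer", "condition_id": "DOID:219",
     "specimen": "feces", "specimen_id": "UBERON:0001988"},
    {"biomarker": "increased methylation in gene VIM", "assessed_biomarker_entity_id": "NCBI:7431",
     "condition": "colon cancer", "condition_id": "DOID:219",
     "specimen": "blood", "specimen_id": "UBERON:0000178"},
]

canonical = group_by_canonical(example_rows)
for key, children in canonical.items():
    print(key, len(children))
```

Both placeholder rows fall under a single key, mirroring how child records share one biomarker_id while differing in specimen.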

biomarker

The biomarker field is the most important field and follows the BiomarkerKB Controlled Vocabulary for standardized reporting. The wording varies with the type of entity being reported. The text should be in lowercase, except that gene names remain in all uppercase. Examples:

  • Increased level of protein SPP1/UPKB:P10451
  • Increased expression of RNA PCA3/HGNC:8637
  • Increased expression of gene B2M/NCBI:567
  • Increased methylation in gene VIM/NCBI:7431

For more examples, please refer to the BiomarkerKB Data Page.

assessed_biomarker_entity

assessed_biomarker_entity is the entity in which the change is assessed. The name should start with a capital letter, but a gene symbol should remain in all capitals (e.g. Myosin-binding protein H-like or IL6). If the entity type is anything other than a gene, the full name should be written out.

assessed_biomarker_entity_id

| Assessed Entity Type | Resource (in order of preference/availability) |
|---|---|
| Carbohydrate | Chemical Entities of Biological Interest (ChEBI) |
| Cell | Cell Ontology (CO) -> National Cancer Institute Thesaurus (NCIt) |
| Chemical Element | PubChem (PCCID) -> National Cancer Institute Thesaurus (NCIt) |
| DNA | National Cancer Institute Thesaurus (NCIt) |
| Gene | NCBI |
| Gene (mutation) | NCBI dbSNP |
| Glycan | GlyTouCan Accession (GTC) -> PubChem (PCCID) |
| Lipoprotein | Chemical Entities of Biological Interest (ChEBI) |
| Metabolite | PubChem (PCCID) -> Chemical Entities of Biological Interest (ChEBI) |
| Peptide | Protein Ontology (PRO) |
| Protein | UniProt (UPKB) -> Protein Data Bank (PDB) -> Protein Ontology (PRO) -> National Cancer Institute Thesaurus (NCIt) |
| Protein Complex | Protein Ontology (PRO) -> Gene Ontology (GO) |
| RNA | HUGO Gene Nomenclature Committee (HGNC) -> RNA Central (RNAC) |
| miRNA | miRBase (MRB) |

Refer to the GitHub Documentation for the correct resource.
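The preference order in the table above can be expressed as a lookup. The sketch below is illustrative only: the short labels are the abbreviations shown in the table and are not confirmed to be the exact ID prefixes BiomarkerKB expects, and `pick_preferred` is a hypothetical helper.

```python
# Preference order per entity type, transcribed from the table above.
# Keys are lowercase to match how assessed_entity_type is reported.
PREFERRED_RESOURCES = {
    "carbohydrate": ["ChEBI"],
    "cell": ["CO", "NCIt"],
    "chemical element": ["PCCID", "NCIt"],
    "dna": ["NCIt"],
    "gene": ["NCBI"],
    "gene (mutation)": ["NCBI dbSNP"],
    "glycan": ["GTC", "PCCID"],
    "lipoprotein": ["ChEBI"],
    "metabolite": ["PCCID", "ChEBI"],
    "peptide": ["PRO"],
    "protein": ["UPKB", "PDB", "PRO", "NCIt"],
    "protein complex": ["PRO", "GO"],
    "rna": ["HGNC", "RNAC"],
    "mirna": ["MRB"],
}


def pick_preferred(entity_type, available):
    """Return the most preferred resource that is available, or None if none is listed."""
    for resource in PREFERRED_RESOURCES.get(entity_type.strip().lower(), []):
        if resource in available:
            return resource
    return None


# A protein with no UniProt entry falls back to the next listed resource.
print(pick_preferred("protein", {"PRO", "PDB"}))
print(pick_preferred("RNA", {"RNAC"}))
```

Keeping the order in a list per entity type means the fallback logic is a single loop, and adding a resource only touches the table data.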

assessed_entity_type

Report in all lowercase. Example: gene

condition

`condition` should be reported in all lowercase. Example: colon cancer

condition_id

`condition_id` should be provided in the column following `condition` and should come from Disease Ontology, MONDO, SNOMED, or NCIt. Example: DOID:219

exposure_agent

Report in all lowercase. The exposure_agent documents any external stimulus, treatment, environmental factor, or intervention relevant to the biomarker's expression or activity. It provides context for biomarkers that respond to specific exposures rather than intrinsic disease processes (for example, response biomarkers). Leave blank if not applicable. Example: cisplatin

exposure_agent_id

The ontology identifier for the exposure_agent, provided in the following column. Leave blank if not applicable. Example: CHEBI:27899

best_biomarker_role

Report in all lowercase. Refer to the [BEST Resource](https://www.ncbi.nlm.nih.gov/books/NBK326791/) to infer the correct biomarker role. Accepted role terms are:

  • diagnostic: Detects or confirms the presence of a disease or condition, or identifies individuals with a specific disease subtype.
  • monitoring: Assesses the status of a disease, medical condition, or exposure to a medical product over time.
  • predictive: Identifies which patients are more or less likely to respond favorably or unfavorably to a specific treatment or exposure.
  • prognostic: Identifies the likelihood of a clinical event, disease recurrence, or progression in patients with an already established disease or condition.
  • response: Shows that a biological response has occurred in a patient after being exposed to a medical product or environmental agent.
  • risk: Indicates the potential for an individual to develop a disease or condition in the future.
  • safety: Measures or indicates the likelihood, nature, or severity of adverse effects, toxicity, or organ injury.

Example: diagnostic
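Because only the seven role terms above are accepted and the field is reported in lowercase, a small normalizer can catch stray values early. This is a minimal sketch; `normalize_role` is a hypothetical helper, not part of the BiomarkerKB scripts.

```python
# Accepted role terms, copied from the list above.
ACCEPTED_ROLES = frozenset({
    "diagnostic", "monitoring", "predictive", "prognostic",
    "response", "risk", "safety",
})


def normalize_role(value):
    """Lowercase and trim a role value, raising ValueError if it is not an accepted term."""
    role = value.strip().lower()
    if role not in ACCEPTED_ROLES:
        raise ValueError(f"unrecognized best_biomarker_role: {value!r}")
    return role


print(normalize_role(" Diagnostic "))
try:
    normalize_role("screening")
except ValueError as error:
    print(error)
```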

specimen

Report in all lowercase. Leave blank if not applicable. Example: feces

specimen_id

`specimen_id`, given in the column following `specimen`, should be a UBERON identifier. Leave blank if not applicable. Example: UBERON:0001988

loinc_code

Report the Logical Observation Identifiers Names and Codes (LOINC) code corresponding to the test or measurement. Leave blank if not applicable. Example: 77354-9
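The identifier fields on this page share a PREFIX:ID shape (DOID:219, UBERON:0001988, CHEBI:27899, PubMed:32677844), while LOINC codes appear both with and without a prefix (LOINC:100153-6 and 77354-9). The sketch below shows one way to check these shapes. The regular expressions and helper names are assumptions made for illustration; they test the format only and do not confirm that an identifier exists in its ontology.

```python
import re

# Assumed shapes: PREFIX:ID for ontology identifiers, digits-hyphen-digit for LOINC.
CURIE_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):(\S+)$")
LOINC_PATTERN = re.compile(r"^(?:LOINC:)?(\d+-\d)$")


def split_curie(value):
    """Split 'PREFIX:ID' into (prefix, local_id), raising ValueError on other shapes."""
    match = CURIE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not in PREFIX:ID form: {value!r}")
    return match.group(1), match.group(2)


def is_uberon_id(value):
    """True when the value looks like a UBERON identifier such as UBERON:0001988."""
    try:
        prefix, local_id = split_curie(value)
    except ValueError:
        return False
    return prefix == "UBERON" and local_id.isdecimal()


def bare_loinc_code(value):
    """Return the LOINC code without an optional 'LOINC:' prefix."""
    match = LOINC_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a LOINC-style code: {value!r}")
    return match.group(1)


print(split_curie("DOID:219"))
print(is_uberon_id("UBERON:0001988"))
print(bare_loinc_code("LOINC:100153-6"))
```

Which LOINC form a submission should use is not settled by this sketch; it only shows that both forms seen on this page can be reduced to the same bare code.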