Data Submission/Data Upload

From BiomarkerKB Wiki
Revision as of 18:30, 24 June 2025 by DaniallMasood

Instructions to submit Biomarker Data

To submit data to the BiomarkerKB Portal, the data must follow the biomarker data model. The steps below explain how to format the data for submission, where to send it, and how to create a BCO for the submitted data.

  1. Biomarker data collected should follow the biomarker data model.
  2. Fill in the "core" fields from the data source where the biomarker data was collected. The core fields are:
    1. biomarker
    2. assessed_biomarker_entity and assessed_biomarker_entity_id
    3. condition and condition_id OR exposure_agent and exposure_agent_id
  3. Other fields and annotations may also be collected from the data source. If data is missing, it can be inferred or mapped from other sources.
  4. Apply the following standards to the data when possible:
    1. condition_id = DOID
    2. specimen_id = UBERON
    3. evidence_source = "SOURCE":"ID"
    4. For assessed_biomarker_entity_id, refer to the GitHub documentation for the standards to follow.
  5. Provide extra annotations from your DCC or data using the agreed-upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
    1. For example, relevant EHR or LOINC data for biomarkers or biomarker entities can be included in a separate sheet.
  6. Create a TSV or JSON file whose fields are the agreed-upon fields of the biomarker data model. The data dictionary describes what each field represents.
    1. JSON is the preferred submission format because it can be ingested into the existing data more efficiently, but TSV submissions are also accepted. The data_conversion.py script in the Data Conversion folder of the GitHub repository converts TSV to JSON and JSON to TSV.
    2. The biomarker data page has examples of TSV submissions that show how the data should be formatted with the appropriate biomarker fields. (Example)
  7. For panel biomarkers that are part of the same panel, the biomarker_id of each biomarker should be a string value, shared by all rows of the panel, that uniquely identifies which rows belong to that panel. (Documentation)
  8. If curating data in TSV format: when rows belong to the same biomarker entry but differ in specimen, evidence, or role, the biomarker_id of each row should be a string value, shared by those rows, that uniquely identifies which rows belong to the same biomarker.
  9. Once the data is formatted and cleaned, please send it to daniallmasood@gwu.edu.
  10. When you submit the data, please also fill out the BCO Information: Biomarker Data Google Form.
    1. The form captures the metadata and a description of how the biomarker data was collected, which is needed to add the submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the biomarker data page. (Example)
  11. If you have any further questions, please consult the GitHub documentation for contributing data or contact Daniall at the email address above.
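
Before sending a TSV file, it can help to check the core fields (step 2) and the biomarker_id grouping (steps 7 and 8) locally. The following is a minimal Python sketch of such a check, not the project's data_conversion.py script or an official validator. The helper names (read_tsv, missing_core_fields, group_rows_by_biomarker_id), the subset of columns, and all sample values are hypothetical placeholders chosen for illustration. The data dictionary remains the authority on field names and contents.

```python
import csv
import io
from collections import defaultdict

# Core fields named in step 2 that must always be present.
CORE_FIELDS = (
    "biomarker",
    "assessed_biomarker_entity",
    "assessed_biomarker_entity_id",
)
# Step 2 also requires one of these two pairs to be filled in.
CONDITION_FIELDS = ("condition", "condition_id")
EXPOSURE_FIELDS = ("exposure_agent", "exposure_agent_id")


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def _filled(row, fields):
    """True if every named field is present and non-empty in the row."""
    return all((row.get(name) or "").strip() for name in fields)


def missing_core_fields(row):
    """Return the core fields that are empty or absent in one row."""
    missing = [name for name in CORE_FIELDS if not _filled(row, (name,))]
    if not (_filled(row, CONDITION_FIELDS) or _filled(row, EXPOSURE_FIELDS)):
        missing.append("condition/condition_id or exposure_agent/exposure_agent_id")
    return missing


def group_rows_by_biomarker_id(rows):
    """Collect rows that share a biomarker_id (same panel or same biomarker)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["biomarker_id"]].append(row)
    return dict(groups)


# Placeholder data only: the IDs and names below are made up for the sketch.
HEADER = [
    "biomarker_id", "biomarker", "assessed_biomarker_entity",
    "assessed_biomarker_entity_id", "condition", "condition_id",
    "exposure_agent", "exposure_agent_id", "specimen",
]
SAMPLE_ROWS = [
    ["X1", "placeholder biomarker", "placeholder entity", "NS:0001",
     "placeholder condition", "DOID:0000000", "", "", "specimen one"],
    ["X1", "placeholder biomarker", "placeholder entity", "NS:0001",
     "placeholder condition", "DOID:0000000", "", "", "specimen two"],
    ["X2", "another placeholder biomarker", "", "",
     "", "", "placeholder agent", "NS:0002", ""],
]
SAMPLE_TSV = "\n".join("\t".join(cells) for cells in [HEADER] + SAMPLE_ROWS)

if __name__ == "__main__":
    rows = read_tsv(SAMPLE_TSV)
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        problems = missing_core_fields(row)
        if problems:
            print(f"line {line_number}: missing {', '.join(problems)}")
    for biomarker_id, members in group_rows_by_biomarker_id(rows).items():
        print(f"{biomarker_id}: {len(members)} row(s)")
```

In this sketch the first two rows share the biomarker_id X1 and differ only in specimen, which mirrors the situation described in step 8. The third row satisfies the exposure_agent alternative but is reported because its assessed_biomarker_entity fields are empty.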