Data Submission/Data Upload: Difference between revisions

From BiomarkerKB Wiki
Revision as of 15:29, 25 June 2025

Instructions to submit Biomarker Data

To submit data for the BiomarkerKB Portal, the biomarker data model must be followed. The instructions below cover how to format the data for submission, where to send it, and how to create a BCO (BioCompute Object) for the submitted data.

  1. Biomarker data collected should follow the biomarker data model.
  2. "Core" fields should be filled in from the data source where biomarker data is collected. Core fields:
    1. biomarker
    2. assessed_biomarker_entity and assessed_biomarker_entity_id
    3. condition and condition_id OR exposure_agent and exposure_agent_id
  3. Other fields and annotations may also be collected from the data source. If data is missing, it can be inferred or mapped from other sources.
  4. Apply the following standards to the data when possible:
    1. condition_id = DOID
    2. specimen_id = UBERON
    3. evidence_source = "SOURCE":"ID"
    4. For assessed_biomarker_entity_id, refer to the GitHub documentation for the standards to follow.
  5. Provide extra annotations from your DCC/data with the agreed upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
    1. For example: Relevant EHR data/LOINC data for biomarkers/biomarker entities can be included in a separate sheet.
  6. Create a tsv or json file with the agreed-upon fields, which correspond to the biomarker data model. The data dictionary provides details on what the different fields represent.
    1. The preferred format for data submission is json, as it allows the data to be ingested into the existing data efficiently. Submissions in tsv format are also accepted. The data_conversion.py script in the Data Conversion Folder on GitHub handles both tsv to json and json to tsv conversion.
    2. The biomarker data page has examples of tsv data submissions and how the data should be formatted with the appropriate biomarker fields. Example
  7. For panel biomarkers, if the biomarkers are part of the same panel, the biomarker_id value for each biomarker should be any string value that can uniquely identify which rows are part of the same biomarker panel. Documentation
  8. When curating data in tsv format, if rows belong to the same biomarker entry but differ in specimen, evidence, or role, the biomarker_id for each row should be any string value that uniquely identifies which rows are part of the same biomarker.
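
The checks described in the steps above can be sketched in Python before a tsv file is sent. This is a minimal illustration, not part of the BiomarkerKB tooling: the function names are hypothetical, and the sketch assumes that IDs are written as a prefix followed by a colon (for example a condition_id beginning with "DOID:" and a specimen_id beginning with "UBERON:").

```python
import csv
import io
from collections import defaultdict

# Core fields from step 2. A row also needs a condition pair OR an exposure agent pair.
REQUIRED = ("biomarker", "assessed_biomarker_entity", "assessed_biomarker_entity_id")
EITHER_PAIR = (("condition", "condition_id"), ("exposure_agent", "exposure_agent_id"))


def _value(row, field):
    """Return the stripped value of a field, treating missing or None as empty."""
    return (row.get(field) or "").strip()


def check_row(row):
    """Return a list of problems found in one row (an empty list if none)."""
    problems = [f"missing {name}" for name in REQUIRED if not _value(row, name)]
    has_pair = any(all(_value(row, f) for f in pair) for pair in EITHER_PAIR)
    if not has_pair:
        problems.append("missing condition/condition_id or exposure_agent/exposure_agent_id")
    # Assumption: IDs are written as PREFIX:LOCAL_ID.
    for field, prefix in (("condition_id", "DOID:"), ("specimen_id", "UBERON:")):
        value = _value(row, field)
        if value and not value.startswith(prefix):
            problems.append(f"{field} should start with {prefix}")
    return problems


def group_by_biomarker_id(tsv_text):
    """Group tsv rows by biomarker_id so panel or multi-row entries stay together."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        groups[row["biomarker_id"]].append(row)
    return dict(groups)
```

Rows that share a biomarker_id end up in the same group, which makes it easy to review whether a panel or a multi-row biomarker entry was assigned one consistent identifier.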

Once the data is formatted and cleaned, please send it to daniallmasood@gwu.edu

  1. When submitting data, please also fill out the BCO Information: Biomarker Data Google Form.
    1. The form collects metadata and a description of how the biomarker data was collected, which is needed to add submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the biomarker data page. Example: https://hivelab.biochemistry.gwu.edu/biomarker-partnership/data/BCO_000435
  2. For further questions, please consult the GitHub Documentation for contributing data (https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md) or reach out to Daniall using the email above.

Standardized and Controlled Vocabulary

There is a standard way to report some biomarker data. This section covers how the actual biomarker should be reported and how other fields should be filled out.

Condition

The condition should be reported in all lowercase, and the condition ID (a Disease Ontology ID, DOID) should be provided in the following column.

assessed_biomarker_entity

Should start with a capital letter, but if it is just a gene then it should remain in all capitals (e.g., Myosin-binding protein H-like or IL6).

assessed_entity_type

Report in all lowercase.

assessed_biomarker_entity_id

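Refer to the GitHub Documentation (https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md) for the correct resource.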

best_biomarker_role

Report in all lowercase. Refer to the BEST Resource (https://www.ncbi.nlm.nih.gov/books/NBK326791/) for the correct biomarker role.

specimen

Report in all lowercase. The specimen_id in the following column should be from UBERON.
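
The casing rules for the fields above can be applied with a small helper. The following is a rough sketch only: the function name and the entity_is_gene flag are hypothetical, and the sketch assumes a row is held as a plain dictionary keyed by field name. Gene symbols are left untouched rather than forced to uppercase, since the rule above says they should "remain" in capitals.

```python
# Fields the page says to report in all lowercase.
LOWERCASE_FIELDS = ("condition", "assessed_entity_type", "best_biomarker_role", "specimen")


def normalize_case(row, entity_is_gene=False):
    """Return a copy of the row with the casing rules applied.

    entity_is_gene is a hypothetical flag supplied by the curator; when True,
    assessed_biomarker_entity is kept exactly as given.
    """
    out = dict(row)
    for field in LOWERCASE_FIELDS:
        if out.get(field):
            out[field] = out[field].lower()
    entity = out.get("assessed_biomarker_entity")
    if entity and not entity_is_gene:
        # Capitalize only the first character; the rest is preserved as written.
        out["assessed_biomarker_entity"] = entity[0].upper() + entity[1:]
    return out
```

The helper only changes letter case. It does not look up or verify IDs, so condition_id and specimen_id still need to be checked against DOID and UBERON separately.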

biomarker

The biomarker field is the most important. Its format depends on the type of entity being reported, as described in the subsections below. The text should be in lowercase, except that gene names should remain in all uppercase.

Cell Biomarker

Should be reported as either:

  • increased *cell name* count
  • decreased *cell name* count
  • Example: increased WBC count

Chemical Element Biomarker

Should be reported as either:

  • increased *chemical element* level
  • decreased *chemical element* level
  • Example: increased Na+ level

DNA/RNA Biomarker

Should be reported as either:

  • increased *DNA/RNA* level
  • decreased *DNA/RNA* level
  • Example: increased cfDNA level

Gene Biomarker

If the entity is a gene, the biomarker is reported differently depending on the type of alteration being reported:

  • Expression of gene:
    • *gene symbol* overexpression
    • *gene symbol* underexpression
    • Example: EGFR overexpression
  • Amplification of gene: *gene symbol* amplification
  • Specific site mutation in the protein encoded by the gene: *gene symbol* *site mutation* mutation
    • Example: BRAF V600E mutation
  • SNPs: presence of *dbSNP ID* mutation in *gene symbol*
    • Example: presence of rs180177132 mutation in PALB2
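
The gene patterns listed above can be expressed as string templates. This is an illustrative sketch; the function name and the kind labels ("site_mutation", "snp", and so on) are hypothetical and are not field values from the data model.

```python
# One template per gene pattern listed on this page.
GENE_TEMPLATES = {
    "overexpression": "{gene} overexpression",
    "underexpression": "{gene} underexpression",
    "amplification": "{gene} amplification",
    "site_mutation": "{gene} {detail} mutation",
    "snp": "presence of {detail} mutation in {gene}",
}


def gene_biomarker(gene_symbol, kind, detail=""):
    """Build the biomarker string for a gene entity.

    detail holds the site mutation (for "site_mutation") or the dbSNP ID (for "snp").
    """
    if kind not in GENE_TEMPLATES:
        raise ValueError(f"unknown kind: {kind}")
    if kind in ("site_mutation", "snp") and not detail:
        raise ValueError(f"{kind} requires a detail value")
    return GENE_TEMPLATES[kind].format(gene=gene_symbol, detail=detail)


print(gene_biomarker("EGFR", "overexpression"))
print(gene_biomarker("BRAF", "site_mutation", "V600E"))
print(gene_biomarker("PALB2", "snp", "rs180177132"))
```

The three calls reproduce the examples given in the list above.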

Glycan Biomarker

Should be reported as: increased *glycan* level

  • Example: increased N-glycan level

Metabolite Biomarker

Should be reported as either:

  • increased *metabolite* level
  • decreased *metabolite* level
  • Example: increased UREA level

Protein Biomarker

Should be reported as either:

  • increased *protein symbol* level
  • decreased *protein symbol* level
  • Example: increased IL6 level


For more examples, please refer to the BiomarkerKB Data Page (https://data.biomarkerkb.org/).