Data Submission/Data Upload
Revision as of 18:43, 24 June 2025
Instructions to submit Biomarker Data
Data submitted to the BiomarkerKB Portal must follow the biomarker data model. The instructions below explain how to format the data for submission, where to send it, and how to create a BioCompute Object (BCO) for the submitted data.
- Collected biomarker data should follow the biomarker data model.
- Fill in the "core" fields from the data source where the biomarker data is collected. The core fields are:
  - biomarker
  - assessed_biomarker_entity and assessed_biomarker_entity_id
  - condition and condition_id, OR exposure_agent and exposure_agent_id
- Other fields and annotations may also be collected from the data source. If data is missing, it can be inferred or mapped from other sources.
- Apply the following standards to the data when possible:
  - condition_id = DOID (Disease Ontology ID)
  - specimen_id = UBERON (anatomy ontology ID)
  - evidence_source = "SOURCE":"ID"
  - For assessed_biomarker_entity_id, refer to the GitHub documentation for which standards to follow.
- Provide extra annotations from your DCC/data using the agreed-upon standards from the Biomarker Annotation RFC. This data does not have to follow the data model and can be submitted in a separate file.
  - For example, relevant EHR or LOINC data for biomarkers or biomarker entities can be included in a separate sheet.
- Create a TSV or JSON file with the agreed-upon fields that correspond to the biomarker data model. The data dictionary provides details on what each field represents.
  - JSON is the preferred submission format because it allows the data to be ingested into the existing data efficiently. TSV submissions are also accepted. In the GitHub repository, the data_conversion.py script in the Data Conversion folder handles conversion from TSV to JSON and from JSON to TSV.
  - The biomarker data page has examples of TSV data submissions showing how the data should be formatted with the appropriate biomarker fields. Example
- For panel biomarkers, all biomarkers that are part of the same panel should share one biomarker_id value. It can be any string that uniquely identifies which rows belong to that panel. Documentation
- If curating data in TSV format: when rows belong to the same biomarker entry but differ in specimen, evidence, or role, they should share one biomarker_id value. It can be any string that uniquely identifies which rows belong to that biomarker.
- Once the data is formatted and cleaned, send it to daniallmasood@gwu.edu.
- When submitting data, also fill out the "BCO Information: Biomarker Data" Google Form.
  - The form captures metadata and a description of how the biomarker data was collected, which is needed to add the submitted data to the Biomarker Data page. An example of a previous BCO is provided in the sheet and is also available on the biomarker data page: Example (https://hivelab.biochemistry.gwu.edu/biomarker-partnership/data/BCO_000435)
- For further questions, consult the GitHub Documentation for contributing data (https://github.com/clinical-biomarkers/biomarker-partnership/blob/main/supplementary_files/documentation/contributing_data.md) or contact Daniall at the email address above.
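Before sending a TSV file, it can help to check the core fields and the biomarker_id grouping locally. The following is a minimal sketch, not part of the BiomarkerKB tooling: the helper names (`is_filled`, `missing_core_fields`, `group_by_biomarker_id`) are hypothetical, the sample rows and every ID in them are made-up placeholders, and the real data model contains more fields than shown here. It does not replace data_conversion.py.

```python
import csv
import io
from collections import defaultdict

# Core fields listed in the instructions above.
ALWAYS_REQUIRED = ("biomarker", "assessed_biomarker_entity",
                   "assessed_biomarker_entity_id")
# At least one of these pairs must be fully filled in.
EITHER_PAIR = (("condition", "condition_id"),
               ("exposure_agent", "exposure_agent_id"))


def is_filled(row, field):
    """True if the field exists in the row and is not blank."""
    return bool((row.get(field) or "").strip())


def missing_core_fields(row):
    """Return the core fields missing from one row (empty list if none)."""
    missing = [f for f in ALWAYS_REQUIRED if not is_filled(row, f)]
    if not any(all(is_filled(row, f) for f in pair) for pair in EITHER_PAIR):
        missing.append("condition+condition_id or exposure_agent+exposure_agent_id")
    return missing


def group_by_biomarker_id(rows):
    """Collect rows that share a biomarker_id (same panel or same entry)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["biomarker_id"]].append(row)
    return dict(groups)


# Placeholder data only: these IDs are invented for illustration.
HEADER = ["biomarker_id", "biomarker", "assessed_biomarker_entity",
          "assessed_biomarker_entity_id", "condition", "condition_id",
          "specimen_id"]
ROWS = [
    ["X1", "increased levels of IL6", "Interleukin-6", "ENTITY:0001",
     "example condition", "DOID:0000000", "UBERON:0000001"],
    # Same biomarker entry as above, differing only in specimen.
    ["X1", "increased levels of IL6", "Interleukin-6", "ENTITY:0001",
     "example condition", "DOID:0000000", "UBERON:0000002"],
    # The biomarker field is deliberately left blank in this row.
    ["X2", "", "Interleukin-6", "ENTITY:0001",
     "example condition", "DOID:0000000", "UBERON:0000001"],
]
sample_tsv = "\n".join("\t".join(r) for r in [HEADER] + ROWS)
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

for line_no, row in enumerate(rows, start=2):  # line 1 is the header
    problems = missing_core_fields(row)
    if problems:
        print(f"line {line_no}: missing {', '.join(problems)}")

groups = group_by_biomarker_id(rows)
print({key: len(value) for key, value in groups.items()})
```

To run the same check on a real submission, replace `io.StringIO(sample_tsv)` with an opened file handle.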
Standardized Data Reporting
Some biomarker data has a standard reporting format. This section covers how the biomarker itself should be reported and how the other fields should be filled out.
Condition
Report the condition in all lowercase, and provide the condition ID in the column that follows it.
assessed_biomarker_entity
The value should start with a capital letter. If it is just a gene symbol, it should remain in all capitals (e.g., Myosin-binding protein H-like or IL6).
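The casing rules for condition and assessed_biomarker_entity can be sketched as two small helpers. This is an illustration only: the function names and the `is_gene_symbol` flag are assumptions, and the caller has to decide whether a value is a gene symbol, since nothing here detects that automatically.

```python
def format_condition(name):
    """Condition names are reported in all lowercase."""
    return name.strip().lower()


def format_entity(name, is_gene_symbol=False):
    """Capitalize the first letter; gene symbols stay in all capitals."""
    name = name.strip()
    if is_gene_symbol:
        return name.upper()
    # Only the first character changes; the rest is left untouched.
    return name[:1].upper() + name[1:]


print(format_condition("Example Condition"))
print(format_entity("myosin-binding protein H-like"))
print(format_entity("il6", is_gene_symbol=True))
```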
Biomarker
The biomarker field is the most important. Its format depends on the entity being reported. The text should be in lowercase, except that gene names remain in all uppercase.
- If the entity is a protein, report the biomarker as "increased/decreased levels of *protein*".
  - If reporting a mutation in a protein, use "*protein symbol* *site mutation*".
- If the entity is a gene, the format depends on how the change is reported:
  - If reporting gene expression, use "*gene* overexpression/underexpression".
  - If reporting a dbSNP variant, use "presence of *dbSNPID* mutation in *gene*".
  - If reporting a site change, use "*gene name* *site mutation*".
- Other entities, such as metabolites, cells, and RNA, should be reported the same way as protein biomarkers.
For more examples, refer to the BiomarkerKB Data Page (https://data.biomarkerkb.org/).
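The biomarker naming patterns above can be expressed as string templates. The sketch below is hypothetical: the function names are not part of any BiomarkerKB tool, and the gene symbols, mutation, and dbSNP ID in the usage lines are illustrative inputs rather than records from the knowledgebase.

```python
def protein_level(protein, direction):
    """Protein (and metabolite, cell, RNA) pattern."""
    if direction not in ("increased", "decreased"):
        raise ValueError("direction must be 'increased' or 'decreased'")
    return f"{direction} levels of {protein}"


def site_mutation(symbol, mutation):
    """Protein or gene site-change pattern."""
    return f"{symbol} {mutation}"


def gene_expression(gene, change):
    """Gene expression pattern."""
    if change not in ("overexpression", "underexpression"):
        raise ValueError("change must be 'overexpression' or 'underexpression'")
    return f"{gene} {change}"


def dbsnp_mutation(dbsnp_id, gene):
    """dbSNP pattern."""
    return f"presence of {dbsnp_id} mutation in {gene}"


# Illustrative inputs only; the dbSNP ID is a placeholder.
print(protein_level("IL6", "increased"))
print(site_mutation("BRAF", "V600E"))
print(gene_expression("IL6", "overexpression"))
print(dbsnp_mutation("rs0000000", "IL6"))
```

Keeping the allowed direction words in one place makes it harder for variants such as "elevated" or "upregulated" to slip into the biomarker field.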