<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.biomarkerkb.org/index.php?action=history&amp;feed=atom&amp;title=BiomarkerKB_Data_Processing_and_Modeling_Specification</id>
	<title>BiomarkerKB Data Processing and Modeling Specification - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.biomarkerkb.org/index.php?action=history&amp;feed=atom&amp;title=BiomarkerKB_Data_Processing_and_Modeling_Specification"/>
	<link rel="alternate" type="text/html" href="https://wiki.biomarkerkb.org/index.php?title=BiomarkerKB_Data_Processing_and_Modeling_Specification&amp;action=history"/>
	<updated>2026-05-08T14:10:37Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.3</generator>
	<entry>
		<id>https://wiki.biomarkerkb.org/index.php?title=BiomarkerKB_Data_Processing_and_Modeling_Specification&amp;diff=205&amp;oldid=prev</id>
		<title>MariaKim: Created page with &quot;= Biomarker identification = A biomarker’s canonical identity is defined by removing the disease or exposure agent dimension from its component combination. For each row in the source TSV, a combination list is first constructed out of three components: (1) assessed entity identifier, (2) condition or exposure agent identifier, and (3) controlled vocabulary term. The condition or exposure agent component is then removed to generate a canonical combination consisting on...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.biomarkerkb.org/index.php?title=BiomarkerKB_Data_Processing_and_Modeling_Specification&amp;diff=205&amp;oldid=prev"/>
		<updated>2026-04-28T20:07:16Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;= Biomarker identification = A biomarker’s canonical identity is defined by removing the disease or exposure agent dimension from its component combination. For each row in the source TSV, a combination list is first constructed out of three components: (1) assessed entity identifier, (2) condition or exposure agent identifier, and (3) controlled vocabulary term. The condition or exposure agent component is then removed to generate a canonical combination consisting on...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Biomarker identification =&lt;br /&gt;
A biomarker’s canonical identity is defined by removing the disease or exposure agent dimension from its component combination. For each row in the source TSV, a combination list is first constructed from three components: (1) assessed entity identifier, (2) condition or exposure agent identifier, and (3) controlled vocabulary term. The condition or exposure agent component is then removed, leaving a canonical combination consisting only of assessed entity identifier and controlled vocabulary term pairs. This canonical representation is hashed with MD5 to produce a stable identifier, which serves as a condition-independent anchor across datasets.&lt;br /&gt;
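&lt;br /&gt;
A minimal sketch of the canonical hashing step is shown below; the function and field names (canonical_biomarker_id, assessed_entity_id, cv_term) are illustrative, not the pipeline’s actual code.&lt;br /&gt;
&lt;pre&gt;
import hashlib
import json

def canonical_biomarker_id(rows):
    """Derive a condition-independent canonical ID for one biomarker.

    Each row is assumed to carry an assessed entity identifier and a
    controlled vocabulary term; the condition/exposure agent component
    has already been dropped.
    """
    # Sort the (assessed entity, vocabulary term) pairs so that row order
    # does not change the hash.
    pairs = sorted([row["assessed_entity_id"], row["cv_term"]] for row in rows)
    serialized = json.dumps(pairs)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
&lt;/pre&gt;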
&lt;br /&gt;
Second-level biomarker identifiers are assigned by appending an index to the canonical ID, where the index increments for each distinct condition or exposure agent linked to the same canonical combination, resulting in identifiers such as AN6278-1 and AN6278-2.&lt;br /&gt;
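&lt;br /&gt;
The index assignment could look like the following sketch, which assumes the distinct conditions are ordered deterministically before numbering (the actual ordering rule is not specified here).&lt;br /&gt;
&lt;pre&gt;
def second_level_ids(canonical_id, condition_ids):
    """Assign one suffixed identifier per distinct condition or exposure agent."""
    distinct = sorted(set(condition_ids))
    return {cond: "{}-{}".format(canonical_id, i)
            for i, cond in enumerate(distinct, start=1)}

# second_level_ids("AN6278", ["DOID:9352", "DOID:1612"])
# gives {"DOID:1612": "AN6278-1", "DOID:9352": "AN6278-2"}
&lt;/pre&gt;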
&lt;br /&gt;
To ensure persistence across releases, previously assigned canonical IDs are reused by scanning historical ID-tracking directories. New canonical entries receive sequential identifiers (e.g., BMKB000270). Changes over time are recorded through history tracking, capturing relationships such as inheritance, replacement, and discontinuation.&lt;br /&gt;
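&lt;br /&gt;
Identifier reuse might be sketched as follows; the lookup table loaded from the historical ID-tracking directories is assumed to map canonical combination hashes to accessions, and the zero-padded BMKB numbering is inferred from the example above.&lt;br /&gt;
&lt;pre&gt;
def assign_canonical_accession(combo_hash, known_ids):
    """Reuse a historical accession when the hash is already known; otherwise
    mint the next sequential one (format assumed from the example BMKB000270)."""
    if combo_hash in known_ids:
        return known_ids[combo_hash]
    used = [int(acc[4:]) for acc in known_ids.values()]
    next_number = max(used, default=0) + 1
    new_accession = "BMKB{:06d}".format(next_number)
    known_ids[combo_hash] = new_accession
    return new_accession
&lt;/pre&gt;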
&lt;br /&gt;
= Deduplication =&lt;br /&gt;
Deduplication is performed at both the biomarker component and row levels to ensure that identical biomarkers are consistently identified across sources. At the component level, the fundamental unit of deduplication is a normalized combination key, defined as a sorted, JSON-serialized list of assessed entity ID, condition or exposure agent ID, and controlled vocabulary term tuples. Records from different sources that resolve to the same lowercased combination key are treated as the same biomarker and assigned a shared identifier. Provenance is preserved through a mapping from each unique combination key to the set of source files that contributed it, allowing aggregation without duplication.&lt;br /&gt;
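&lt;br /&gt;
A sketch of the combination key and the provenance map is shown below, under the assumption that each biomarker arrives as a list of component tuples.&lt;br /&gt;
&lt;pre&gt;
import json
from collections import defaultdict

def combination_key(components):
    """components: list of (assessed entity ID, condition/exposure agent ID,
    controlled vocabulary term) tuples. The key is the sorted, JSON-serialized,
    lowercased form of that list."""
    return json.dumps(sorted(components)).lower()

# Provenance: each unique key records the source files that contributed it.
sources_by_key = defaultdict(set)

def register(components, source_file):
    key = combination_key(components)
    sources_by_key[key].add(source_file)
    return key
&lt;/pre&gt;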
&lt;br /&gt;
At the row level, exact duplicate entries within individual TSV files are removed during ingestion using a dictionary keyed by the serialized row content.&lt;br /&gt;
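&lt;br /&gt;
Row-level deduplication can be sketched with the dictionary idiom described above (column handling and encoding are simplified).&lt;br /&gt;
&lt;pre&gt;
import csv
import json

def read_unique_rows(tsv_path):
    """Drop exact duplicate rows from one TSV file, keeping first-seen order."""
    seen = {}
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = json.dumps(row, sort_keys=True)   # serialized row content
            if key not in seen:
                seen[key] = row
    return list(seen.values())
&lt;/pre&gt;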
&lt;br /&gt;
A critical prerequisite for effective deduplication is controlled vocabulary normalization. Semantically equivalent biomarker descriptions, such as “increased IL6 level” and “elevated IL-6 levels,” must first be standardized to a common controlled vocabulary term; otherwise, their component combinations will not match and will be treated as distinct entities.&lt;br /&gt;
&lt;br /&gt;
= Data Modeling =&lt;br /&gt;
Data modeling begins with controlled vocabulary normalization. Raw biomarker strings are tokenized and matched against a set of pattern definitions, where each rule maps to a structured label of the form change_type, aspect_type, and mod_type (for example, increased, level, not_specified). The matched rule is then used to generate a standardized representation such as “Increased level of protein IL6/UPKB:P05231.” Special cases, including single-nucleotide polymorphisms, mutations, and glycan modifications, are handled through explicit logic. Terms that do not match any pattern are flagged as “[biomarker_term_in_review],” while a supplementary rules file captures hard-coded overrides for edge cases.&lt;br /&gt;
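&lt;br /&gt;
The rule matching might look like the following simplified sketch; the two rules shown are illustrative stand-ins for the real pattern definitions, and the SNP, mutation, and glycan branches are omitted.&lt;br /&gt;
&lt;pre&gt;
import re

RULES = [
    (re.compile(r"\b(increased|elevated|higher)\b", re.I),
     {"change_type": "increased", "aspect_type": "level", "mod_type": "not_specified"}),
    (re.compile(r"\b(decreased|reduced|lower)\b", re.I),
     {"change_type": "decreased", "aspect_type": "level", "mod_type": "not_specified"}),
]

def normalize_term(raw_term, entity_name, entity_id):
    """Map a raw biomarker string to a standardized representation."""
    for pattern, label in RULES:
        if pattern.search(raw_term):
            return "{} {} of protein {}/{}".format(
                label["change_type"].capitalize(), label["aspect_type"],
                entity_name, entity_id)
    return "[biomarker_term_in_review]"

# normalize_term("elevated IL-6 levels", "IL6", "UPKB:P05231")
# gives "Increased level of protein IL6/UPKB:P05231"
&lt;/pre&gt;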
&lt;br /&gt;
The normalized data are then assembled into a document-oriented model, where each record in the c_biomarker MongoDB collection represents a single biomarker entry. Each document includes core biomarker identifiers (canonical IDs and second-level IDs) and a biomarker component array describing each assessed entity, including its identifier, type, controlled vocabulary term, associated specimens, and supporting evidence sources. Disease context is captured in a condition object containing standardized names and synonyms derived from a disease database. Additional fields capture biomarker roles (such as diagnostic or prognostic), aggregated evidence sources, and citation metadata sourced from PubMed. Where available, documents also include normal range statistics, a list of contributing upstream sources, and cross-references to external databases.&lt;br /&gt;
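&lt;br /&gt;
The document shape described above might look roughly like the following; all field names and values are assumptions based on this description, not the exact schema of the c_biomarker collection.&lt;br /&gt;
&lt;pre&gt;
# Illustrative c_biomarker document (field names and values are placeholders).
biomarker_doc = {
    "biomarker_canonical_id": "BMKB000270",
    "biomarker_id": "AN6278-1",
    "biomarker_component": [{
        "assessed_entity_id": "UPKB:P05231",
        "assessed_entity_type": "protein",
        "biomarker": "increased IL6 level",
        "specimen": [{"name": "blood", "id": "UBERON:0000178"}],
        "evidence_source": [{"database": "PubMed", "id": "12345678"}],
    }],
    "condition": {
        "id": "DOID:9352",
        "recommended_name": {"name": "type 2 diabetes mellitus"},
        "synonyms": [],
    },
    "best_biomarker_role": [{"role": "diagnostic"}],
    "citation": [],
    "crossref": [],
    "normal_range": {},
    "contributing_source": [],
}
&lt;/pre&gt;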
&lt;br /&gt;
= ETL (Extract, Transform, Load) Processes =&lt;br /&gt;
The ETL pipeline is organized as a staged workflow that transforms raw source data into structured, queryable databases. In the first step, literature ingestion is performed by a set of scripts that collect all PubMed identifiers referenced in the source TSV files, retrieve the corresponding MEDLINE XML records, and extract structured citation data into JSON files.&lt;br /&gt;
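&lt;br /&gt;
A minimal sketch of the citation retrieval, using the public NCBI E-utilities efetch endpoint, is shown below; the actual ingestion scripts may batch, throttle, and parse the MEDLINE records differently.&lt;br /&gt;
&lt;pre&gt;
import requests
from xml.etree import ElementTree

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_citations(pmids):
    """Fetch PubMed XML for a batch of PMIDs and extract basic citation fields."""
    response = requests.get(EUTILS, params={
        "db": "pubmed", "id": ",".join(pmids), "retmode": "xml"})
    root = ElementTree.fromstring(response.content)
    citations = []
    for article in root.iter("PubmedArticle"):
        citations.append({
            "pmid": article.findtext(".//PMID"),
            "title": article.findtext(".//ArticleTitle"),
            "journal": article.findtext(".//Journal/Title"),
        })
    return citations
&lt;/pre&gt;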
&lt;br /&gt;
The second step focuses on building reference databases. Disease objects are constructed from GlyGen and Disease Ontology sources, and statistical summaries are computed, including minimum, maximum, mean, median, interquartile range, and whiskers, stratified by age group and sex using clinical datasets from Oracle Health and GWDC.&lt;br /&gt;
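&lt;br /&gt;
The stratified summaries could be computed along the lines of the sketch below; the column names (analyte, age_group, sex, value) and the 1.5 * IQR whisker convention are assumptions.&lt;br /&gt;
&lt;pre&gt;
import pandas as pd

def normal_range_stats(df):
    """Per-analyte summary statistics stratified by age group and sex."""
    def summarize(values):
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        return pd.Series({
            "min": values.min(), "max": values.max(),
            "mean": values.mean(), "median": values.median(),
            "iqr": iqr,
            "lower_whisker": values[values.ge(q1 - 1.5 * iqr)].min(),
            "upper_whisker": values[values.le(q3 + 1.5 * iqr)].max(),
        })
    return df.groupby(["analyte", "age_group", "sex"])["value"].apply(summarize)
&lt;/pre&gt;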
&lt;br /&gt;
In the third step, the core databases are assembled. The main biomarker collection is generated, integrating normalized biomarker records with disease annotations, citation data, normal ranges, and cross-references. Supporting structures are then created for efficient querying by generating flattened list representations optimized for search results, including computed relevance scores and filter bitmaps; all searchable fields are tokenized into phrase-level indices to support the search layer.&lt;br /&gt;
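&lt;br /&gt;
One way to picture the flattened list records and phrase-level tokens is the sketch below; relevance scoring and filter bitmaps are omitted, and the field names follow the illustrative document shape shown earlier.&lt;br /&gt;
&lt;pre&gt;
def flatten_for_list(doc):
    """Flatten a biomarker document into a single search-result row and
    tokenize its searchable text into contiguous word phrases."""
    record = {
        "biomarker_id": doc["biomarker_id"],
        "condition": doc["condition"]["recommended_name"]["name"],
        "components": "; ".join(c["biomarker"] for c in doc["biomarker_component"]),
    }
    tokens = set()
    for text in record.values():
        words = text.lower().split()
        for size in range(1, len(words) + 1):
            for start in range(len(words) - size + 1):
                tokens.add(" ".join(words[start:start + size]))
    record["search_tokens"] = sorted(tokens)
    return record
&lt;/pre&gt;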
&lt;br /&gt;
The final step produces auxiliary databases. Aggregate statistics are generated, initialization metadata for the search system is written, and precomputed sort orders for list fields are created.&lt;br /&gt;
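&lt;br /&gt;
The precomputed sort orders, for example, might be produced as in this sketch (field handling is simplified and missing values are pushed to the end).&lt;br /&gt;
&lt;pre&gt;
def precompute_sort_orders(records, fields):
    """For each sortable list field, store record IDs in ascending order so the
    front end can page through results without re-sorting."""
    orders = {}
    for field in fields:
        ranked = sorted(records,
                        key=lambda r: (r.get(field) is None, r.get(field)))
        orders[field] = [r["biomarker_id"] for r in ranked]
    return orders
&lt;/pre&gt;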
&lt;br /&gt;
All stages of this pipeline depend on upstream dataset preparation. The preprocessing step ingests raw TSV files, applies controlled vocabulary normalization, assigns component group identifiers to link multi-component biomarkers, and outputs intermediate TSV files that serve as the input to the downstream object construction pipeline.&lt;/div&gt;</summary>
		<author><name>MariaKim</name></author>
	</entry>
</feed>