Christian Reich edited this page Nov 5, 2022 · 13 revisions

Background and Motivation

The availability of very large-scale clinical databases in electronic form has made it possible to generate systematic, large-scale evidence and insights about healthcare. This discipline is called Observational Outcome Research. It uses longitudinal patient-level clinical data to describe and understand the onset of disease and the effect of other clinical events and treatment interventions on the progression of the disease. Often, this research constitutes secondary use of the data, because they were collected for purposes other than research: administrative data such as insurance reimbursement claims, and Electronic Health or Medical Records (EHR, EMR).

Because the data are collected for a primary use, their format and representation follow that primary use, which also introduces artifacts and bias into the data. In addition, source datasets differ from each other in format and content representation. Since healthcare systems differ between countries, the problem becomes even harder for research carried out internationally. All this makes robust, reproducible and automated research a significant challenge.

The solution is to standardize both the data model (syntax) and the representation (coding). This allows methods and tools to operate on data of disparate origin, freeing the analyst from having to dissect the idiosyncrasies of a particular dataset and manipulate the data to make it fit for research. It also allows analytical methods to be developed on one dataset and then applied to any other dataset.

The OMOP Standardized Vocabularies were initially constructed together with the OMOP Common Data Model for the conduct of the OMOP Experiments. Their design is very simple and provides the minimum functionality necessary to conduct these experiments, while allowing this large resource to be maintained with the very restricted resources of an Open Source Project. Since OMOP ended in November 2013, the OMOP Standardized Vocabularies have been supported by OHDSI.

Definitions and Terms

For purposes of the OMOP Common Data Model (CDM), the following terms are used:

| Term | Description |
|---|---|
| Standardized Vocabularies | A system of Vocabularies, Classifications, Domains and Concepts, all consolidated into a common format and stored in a set of CDM tables. |
| Vocabulary | A set of codes or concepts, including, where available, the relationships among them and a hierarchy, ontology or taxonomy of the concepts. Many vocabularies are adopted from national or international organizations, such as ICD-9-CM, SNOMED-CT, RxNorm, Read. |
| Terminology | Similar to vocabulary and often used synonymously, but not used in this document. |
| Coding scheme | Similar to vocabulary and often used synonymously, but not used in this document. |
| Classification | A hierarchical system of concepts and concept relationships that defines semantically useful classes, like chemical structures for drugs. |
| Domain | A clinical semantic category, like Drug, Condition or Procedure, defined for all Concepts in the Standardized Vocabularies. |
| Concept | Basic unit of information defined in each Vocabulary. |
| Concept Class | An attribute of a concept characterizing its classification within a Vocabulary. Unlike a Classification, a Concept Class is a single attribute without any hierarchical structure. |

[Figure: Vocab intro]

Principles

The Standardized Vocabularies are constructed with a few principles in mind. Not every principle has been executed to perfection, but together they represent the general motivation and direction of the ongoing improvement and development process:

  1. Standardization: Different vocabularies used in observational data are consolidated into a common format. Their entries or codes are called "concepts". All vocabularies and their concepts are stored in one single data table, CONCEPT. This relieves the researchers from having to understand and handle multiple different formats and life cycle conventions of the originating Vocabularies.
  2. Unique Standard Concepts: For each Clinical Entity there is only one concept representing it, called the Standard Concept. Other equivalent or similar concepts are designated non-Standard and mapped to the Standard ones.
  3. Domains: Each concept is assigned one of a list of predefined domains. “Dirty” concepts, i.e. concepts that are not well defined and are mostly non-Standard, can also belong to more than one domain. The domain also determines in which CDM table a clinical entity should be placed, and in which table it should be looked up at query time.
  4. Comprehensive coverage: Every event in a patient's healthcare experience (e.g. Conditions, Procedures, Exposures to Drug) and some of the administrative artifacts of the healthcare system (e.g. Visits, Care Sites) are covered within the domains.
  5. Hierarchy: Within a domain all concepts are organized in a hierarchical structure. This makes it possible to query for all concepts (e.g. drug products) that are hierarchically subsumed under a higher level concept (e.g. a drug class). This entails addressing two separate problems:
    • Each concept should have one or more classifications (bottom up).
    • Each classification should contain all the relevant concepts (top down).
  6. Relationships and Mappings: Concepts are related to each other within and across vocabularies, and non-Standard Concepts are mapped to Standard Concepts.
  7. Life cycle: The data representation is kept up to date, while the processing of deprecated and upgraded concepts and relationships remains supported.
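
The two directions of the Hierarchy principle can be illustrated with a minimal sketch. This section does not describe how the hierarchy is stored, so the plain parent/child edge list below is an assumption made for illustration, and every concept id and name is made up. `descendants` answers the top-down question (which concepts fall under a class), and `ancestors` answers the bottom-up question (which classifications a concept has).

```python
from collections import defaultdict, deque

# Toy concepts; every id and name here is made up for illustration.
CONCEPT_NAMES = {
    1: "Drug class A",
    7: "Drug class B",
    2: "Ingredient X",
    3: "Ingredient Y",
    4: "Ingredient X 10 mg tablet",
    5: "Ingredient X 20 mg tablet",
    6: "Ingredient Y 5 mg capsule",
}

# Hypothetical (parent, child) hierarchy edges. Ingredient Y sits in two classes.
PARENT_CHILD = [(1, 2), (1, 3), (7, 3), (2, 4), (2, 5), (3, 6)]


def _reachable(start, adjacency):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def descendants(concept_id, edges):
    """Top down: all concepts subsumed under concept_id."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return _reachable(concept_id, children)


def ancestors(concept_id, edges):
    """Bottom up: all higher level concepts that subsume concept_id."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    return _reachable(concept_id, parents)


if __name__ == "__main__":
    for cid in sorted(descendants(1, PARENT_CHILD)):
        print(cid, CONCEPT_NAMES[cid])
```

A query for "all drug products in Drug class A" then reduces to filtering the events of interest by membership in `descendants(1, PARENT_CHILD)`.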

Work still needs to be done to achieve all the criteria in all of the domains. Currently, for the most important domains we can achieve the following compliance:

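| Domain | Standardization | Unique Concepts | Reliable Domains | Comprehensive Coverage | Hierarchy | Mapping |
|---|---|---|---|---|---|---|
| Drug | x | x | for most countries with data | x | x | x |
| Condition | x | x | mostly | x | x | mostly |
| Procedure | x | heavily overlapping | x | x | | x |
| Measurement | x | somewhat | mostly | x | minimal | |
| Device | | | mostly | | | |
| Unit | x | x | x | x | | |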

The life cycle is implemented for all concepts, and its rules are described in the CONCEPT table and in the discussion of the individual vocabularies.

It is important to note that these criteria are designed to serve observational research. In that regard the Standardized Vocabularies differ from large collections with equivalence mappings of concepts such as the UMLS, which supports indexing and searching of the biomedical literature. UMLS resources have been used heavily as a basis for constructing many of the Standardized Vocabulary components, but significant additional efforts have been made to adjust the framework:

  • Additional Vocabularies, mostly for metadata purposes, are established.
  • Mappings and relationships are being added to achieve comprehensive coverage. If equivalence cannot be achieved, “uphill” relationships from more granular non-standard to higher level Standard Concepts are created.
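
How mappings resolve a source code to a Standard Concept can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual table layout: the `Concept` fields, the `MAPS_TO` dictionary, the vocabulary names `SourceVocab` and `StandardVocab`, and all ids and codes are hypothetical. One source code maps to an equivalent Standard Concept, a more granular one maps "uphill" to the same higher level Standard Concept, and a third has no mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    concept_id: int
    name: str
    vocabulary: str
    code: str
    domain: str
    standard: bool


# All vocabularies share one table (toy rows; every id, code and name is made up).
CONCEPTS = [
    Concept(100, "Disorder Q", "StandardVocab", "STD-1", "Condition", True),
    Concept(200, "Disorder Q", "SourceVocab", "SRC-A", "Condition", False),
    Concept(201, "Disorder Q, left side", "SourceVocab", "SRC-B", "Condition", False),
    Concept(202, "Unmapped finding", "SourceVocab", "SRC-C", "Condition", False),
]

# Hypothetical mappings from non-Standard to Standard concept ids.
# 200 -> 100 is an equivalence mapping; 201 -> 100 is an "uphill" mapping.
MAPS_TO = {200: 100, 201: 100}

BY_ID = {c.concept_id: c for c in CONCEPTS}
BY_CODE = {(c.vocabulary, c.code): c for c in CONCEPTS}


def to_standard(vocabulary: str, code: str) -> Optional[Concept]:
    """Return the Standard Concept for a source code, or None if there is none."""
    source = BY_CODE.get((vocabulary, code))
    if source is None:
        return None  # unknown code
    if source.standard:
        return source  # already Standard, nothing to map
    target_id = MAPS_TO.get(source.concept_id)
    return BY_ID.get(target_id) if target_id is not None else None


if __name__ == "__main__":
    for code in ("SRC-A", "SRC-B", "SRC-C"):
        print(code, "->", to_standard("SourceVocab", code))
```

Note that the uphill mapping loses the "left side" detail: analyses on Standard Concepts see only "Disorder Q", which is the trade-off accepted when no equivalent Standard Concept exists.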