Module 1: Semantic Data Fundamentals

Duration: ~90 minutes self-paced
Prerequisites: None
Learning objectives:
  • Explain why data quality is the floor constraint on every other digital initiative
  • Distinguish syntactic interoperability from semantic interoperability
  • Describe why purely top-down and purely bottom-up modeling approaches both fail
  • Summarize the two-level modeling pattern (reference model + domain constraints)
  • Apply the minimum knowledge modeling principle to component design
  • Recognize why structural identity must be decoupled from descriptive labels
  • Identify the role of the SDC specification in the broader semantic data landscape
  • Explain how SDC models produce context graphs with structural governance and why this matters for client conversations


1.1 The Floor Constraint

Every SMB conversation about "AI" or "data strategy" eventually hits the same wall: the data layer is ambiguous, brittle, or untrusted. You can buy the best model, the slickest BI tool, or the most expensive practitioners — none of it matters if the underlying data cannot be reliably interpreted.

The floor constraint principle: An organization's effective digital maturity is capped by the weakest of its foundational data dimensions. You cannot build a 5-story building on a 1-story foundation.

This is why the Maturity Map starts with the foundational dimensions (Schema Integrity, Constraint Enforcement, Semantic Identity) and treats them as gating. A client scoring Level 4 in Governance but Level 1 in Schema Integrity is functionally at Level 1 — the governance is governing chaos.
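The gating rule above can be sketched in a few lines. This is an illustrative sketch only — the dimension names come from the text, and the 1-5 scoring scale and function shape are assumptions, not the Maturity Map's actual implementation.

```python
# Illustrative sketch of the floor-constraint rule: effective maturity is
# capped by the weakest foundational dimension. The 1-5 scale is assumed.

FOUNDATIONAL = ("schema_integrity", "constraint_enforcement", "semantic_identity")

def effective_maturity(scores: dict[str, int]) -> int:
    """Effective level is the minimum of the foundational dimension scores."""
    return min(scores[dim] for dim in FOUNDATIONAL)

client = {
    "schema_integrity": 1,
    "constraint_enforcement": 3,
    "semantic_identity": 2,
    "governance": 4,  # non-foundational scores cannot raise the floor
}
print(effective_maturity(client))  # → 1
```

The Level 4 Governance score never enters the calculation: the floor is set entirely by the weakest foundational dimension.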

Reflection: Think of a current client. What happens when two of their systems disagree about a customer record? Who reconciles it? How long does it take? What does it cost?


1.2 Syntactic vs Semantic Interoperability

Syntactic interoperability means two systems can exchange bytes. JSON in, JSON out. CSV in, CSV out. The pipes connect.

Semantic interoperability means two systems agree on what those bytes mean. When System A says bp: 120, does System B know that means systolic blood pressure in mmHg, taken seated, on the patient's left arm, by a calibrated cuff, at 9:14am?

Most "integration" projects deliver syntactic interoperability and call it done. Then six months later someone discovers the imported data is unusable because the meaning was lost in translation.

Example: A medical practice integrates its EHR with a billing system. Both systems exchange patient_dob. The EHR stores it as YYYY-MM-DD in the patient's local timezone. The billing system stores it as a UTC timestamp at midnight. Rendered back in a timezone behind UTC, that timestamp falls on the previous calendar day — 3% of patients are billed under the wrong birth date for insurance verification. The pipes worked. The meaning didn't.
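The failure mode is easy to reproduce. A minimal sketch, assuming America/New_York as the example timezone behind UTC (any such timezone behaves the same way):

```python
# Sketch of the birth-date failure mode: the EHR means a calendar date, the
# billing system stores UTC midnight. Rendering that timestamp in a timezone
# behind UTC lands on the previous calendar day.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ehr_dob = "1980-05-01"  # what the EHR means: a plain calendar date
utc_midnight = datetime.fromisoformat(ehr_dob).replace(tzinfo=timezone.utc)
rendered = utc_midnight.astimezone(ZoneInfo("America/New_York"))

print(rendered.date().isoformat())  # → 1980-04-30: wrong birth date
```

Both systems handled the bytes correctly; neither agreed on what the bytes meant.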


1.3 Why Top-Down Fails

The classical enterprise architecture approach: define a canonical model first, then force every system to map to it.

Why it fails in the wild:

  • The canonical model is always wrong on day one because nobody knows everything about the domain
  • Mapping every legacy system is a multi-year project that loses political support before completion
  • Real-world data has exceptions the canonical model didn't anticipate
  • The model becomes a museum piece maintained by a small team disconnected from the systems that produce data

Top-down is not stupid. It is correct in spirit. It just cannot be executed against a moving target with finite budget.


1.4 Why Bottom-Up Fails

The opposite approach: let every system define its own data, then use ML/heuristics to reconcile after the fact.

Why it fails:

  • The reconciliation problem grows combinatorially — every pair of systems needs its own mapping, so n systems need on the order of n² reconciliations
  • Probabilistic matches are not auditable for compliance
  • New systems require re-training the reconciliation layer
  • "Good enough" matches accumulate semantic drift over time

Bottom-up is also not stupid. It respects reality. It just cannot guarantee correctness when correctness matters (regulatory, financial, clinical).


1.5 Two-Level Modeling

The synthesis: separate the reference model (stable, slow-changing, generic primitives) from the domain constraints (fast-changing, specific to a use case).

  • Reference model: a small set of well-defined primitives — Quantity, Coded Value, Identifier, Time Point, Person. These rarely change.
  • Domain constraints: rules that specialize the primitives for a context — "systolic BP is a Quantity with unit mmHg, range 40-300, taken in posture X by role Y."

Domain modelers (clinicians, accountants, lawyers) author constraints in their own vocabulary. The reference model guarantees those constraints compose with constraints from other domains because they all bottom out in the same primitives.
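The layering can be sketched concretely. This is a minimal illustration, not the SDC schema: the class names follow the primitives listed above, and the constraint fields follow the systolic-BP example, but the validation API is an assumption.

```python
# Minimal sketch of two-level modeling. The reference model defines a generic
# Quantity primitive; a domain constraint, authored in clinical vocabulary,
# specializes it without changing the primitive itself.
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:            # reference-model primitive: stable, generic
    value: float
    unit: str

@dataclass(frozen=True)
class QuantityConstraint:  # domain layer: fast-changing, use-case specific
    unit: str
    minimum: float
    maximum: float

    def accepts(self, q: Quantity) -> bool:
        return q.unit == self.unit and self.minimum <= q.value <= self.maximum

# "Systolic BP is a Quantity with unit mmHg, range 40-300."
systolic_bp = QuantityConstraint(unit="mmHg", minimum=40, maximum=300)

print(systolic_bp.accepts(Quantity(120, "mmHg")))  # → True
print(systolic_bp.accepts(Quantity(120, "kPa")))   # → False: wrong unit
```

Note that an accounting constraint over the same Quantity primitive composes with this one automatically — both bottom out in the same type.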

This is the architecture behind openEHR, ISO 13606, and the SDC specification. It is the only modeling approach that has survived contact with real enterprises at scale over decades.


1.6 Minimum Knowledge Models

Two-level modeling tells you how to layer a model. It does not tell you how big each layer should be. The second core principle of the SDC approach answers that question.

Most modeling standards — FHIR, NIEM, HL7, SNOMED CT compositional grammar, and many others — try to model concepts maximally. Every possible attribute that might apply to a concept is added to the definition. The intellectual appeal is "we are being thorough." The practical cost is borne by every implementer for the rest of the standard's life: hard to author, hard to govern, hard to query, hard to validate, hard to map.

SDC takes the opposite approach.

A model component captures only what is essential to identify the concept and distinguish it from its nearby concepts.

Not every concept in the universe — just the ones close enough to be confused with it. Specificity beyond that minimum is achieved through composition with other minimum components at use time, not by inflating any single definition.

A concrete example: blood pressure

A maximal model of blood pressure tries to enumerate every cuff type, every patient position, every measurement device, every reference range, every confounding factor — because somewhere, someone needs each of those. The resulting cluster is enormous. Most attributes are never populated for any given measurement. The cluster is hard to author, hard to govern, hard to query, hard to validate, and hard to map to anything else.

A minimum model of blood pressure captures only what makes it blood pressure and distinguishes it from pulse pressure, arterial pressure, or intracranial pressure. Cuff type, patient position, and measurement device are separate components, composed in only when a specific deployment actually needs them. A clinic that never records patient position never deals with the patient-position component; a research study that does record it composes the component in explicitly. Both the clinic and the research study use the same blood pressure component.
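The contrast between the two deployments can be sketched as follows. All class and field names are illustrative stand-ins, not the SDC component set.

```python
# Sketch of minimum components composed at use time. Each component captures
# only what identifies its concept; specificity comes from composition.
from dataclasses import dataclass

@dataclass(frozen=True)
class BloodPressure:    # minimum component: just what makes it BP
    systolic_mmHg: float
    diastolic_mmHg: float

@dataclass(frozen=True)
class PatientPosition:  # a separate minimum component
    position: str       # e.g. "seated"

# The clinic's composition never touches PatientPosition...
clinic_reading = {"bp": BloodPressure(120, 80)}

# ...while the research study composes it in explicitly.
study_reading = {"bp": BloodPressure(118, 76),
                 "position": PatientPosition("seated")}

# Both deployments share the identical, unchanged BloodPressure component.
print(type(clinic_reading["bp"]) is type(study_reading["bp"]))  # → True
```

Neither deployment forced a change to the shared component — the variation lives entirely in the composition.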

Why maximal modeling breaks even with good people

The openEHR Clinical Knowledge Manager (CKM) is a community-driven archetype library built by people who care deeply about correctness. It still depends on a committee-approval workflow for every change. The bottleneck is not bureaucracy — it is the modeling philosophy. Maximal-modeling pressure makes every archetype contentious because every contributor wants to add their specialization, every reviewer wants to defend the boundaries, and the resulting artifact needs human ratification at every step.

A maximal-modeling philosophy generates governance burden that no community can outrun.

SNOMED CT compositional grammar (the IHTSDO post-coordination approach) is the same lesson at industry-standards scale. Beautiful in theory, brutal at the data management layer. Practitioners who have lived through either of these will recognize the pattern immediately and find SDC's discipline liberating.

Where the locus of specificity sits

The deepest implication of minimum modeling: the locus of specificity is composition, not definition. A maximal model puts specificity in the definition — every possible context baked in. A minimum model puts specificity in the composition — you compose blood-pressure + measurement-device + patient-position when and only when you need that specificity.

This is what makes SDC components reusable across engagements in a vertical practice. The minimum components are stable and shared across every client. The compositions are the local variation that captures each client's specific needs. The practitioner's library compounds because the components themselves do not over-fit to any single client. (See program/practitioner_economics.md for the economic consequences.)

Why structural identity is decoupled from descriptive labels

A natural failure mode of maximal modeling is to use the descriptive label as the structural identifier. A concept becomes blood_pressure_systolic_seated_left_arm_calibrated_cuff_2023_revision — the label and the structural identity are the same string, baked into every reference, every query, every export. This is convenient until you need to change anything: rename the concept, translate to another language, improve the wording, refine the meaning. Every change breaks every consumer.

The minimum knowledge modeling discipline forced the SDC ecosystem to a different answer. Structural identity and descriptive labels are completely decoupled. Every component is identified by a CUID2 — a stable, opaque, machine-generated identifier — that is the only thing the structure depends on. The descriptive label is metadata: it can be as expressive as the practitioner wants, in any language, and can be revised freely without breaking anything that references the component. The two never interfere.

This is also where the CUID2 choice was born. UUID v1 leaks timestamp and MAC address. ULID is sortable but reveals creation order. CUID2 is collision-resistant, URL-safe, reasonably short, and does not leak metadata. It is the kind of identifier you can put in a URL, a graph triple, or a JSON-LD @id and trust will mean the same thing in 20 years.

The practical consequence: your client can rename their components freely, translate them to Spanish for their Mexican operations, refine the wording after a regulatory audit, and none of those changes touch a single line of the underlying schema, validator, or generated application. The CUID2 carries the identity; the label carries the humanity.
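The decoupling is simple to illustrate. In this sketch the identifier is a fixed opaque string standing in for a real CUID2 (a real system would generate one with a CUID2 library), and the registry shape is an assumption for illustration.

```python
# Sketch of decoupled identity and labels. Structure references only the
# opaque identifier; labels are metadata and can be revised or translated
# freely without touching any consumer.
component_id = "tz4a98xxat96iws9zmbrgj3a"  # opaque, stable, structural

labels = {component_id: {"en": "Blood Pressure"}}
schema_reference = component_id            # what schemas and queries depend on

# Rename and translate: only metadata changes.
labels[component_id]["en"] = "Blood Pressure (non-invasive)"
labels[component_id]["es"] = "Presión arterial"

print(schema_reference == component_id)    # → True: no consumer broke
```

Every rename and translation happened on the metadata side; the structural reference never moved.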

The principle applied recursively

Minimum knowledge modeling applies at every level of granularity:

  • At the component level: each component models only what is essential to its concept
  • At the assembly level: a data model includes only the components essential to the use case, not every component that might apply
  • At the deployment level: the generated application exposes only the workflows the client actually performs

Maximalism at any layer destroys the benefit at every other layer. Minimalism compounds.


1.7 Where SDC Fits

The SDC specification (currently at the SDC4 reference model generation) is a concrete, open implementation of two-level modeling that:

  • Uses XSD 1.1 + Schematron for structural and constraint enforcement
  • Uses RDF/OWL for semantic identity and reasoning
  • Uses SHACL for graph-level validation
  • Uses JSON-LD for serialization that round-trips through both worlds
  • Binds CUID2 sovereign identifiers to every component for traceability
  • Stores data and constraints together so they cannot drift

Practitioners do not need to teach clients the spec. They need to recognize when a client problem is a two-level modeling problem and prescribe SDC ecosystem tools to solve it.


1.8 What You're Building: A Context Graph with Structural Governance

The industry is converging on the term "context graph" to describe what the enterprise world needs from AI infrastructure. Foundation Capital called it "AI's trillion dollar opportunity." Neo4j is hosting meetups about it. The buzzword is new. The architecture is not.

A context graph is a knowledge graph that contains all of the information necessary to make decisions throughout an organization. Not just the data — the reasoning behind the data, the authority that produced it, the process that governed it, and the provenance that traces it.

When you build an SDC data model, you are building a context graph. Every model automatically generates RDF/OWL/SHACL representations. Every component carries semantic bindings. Every governance element — workflow state machines, attestation models, party/role components, provenance chains — is structurally present in the graph. This is not a separate build step. It is an intrinsic property of the modeling process.

The critical distinction between what SDC produces and what most "context graph" initiatives attempt:

  • Most context graph approaches build the graph by LLM extraction (probabilistic, lossy, requires continuous human review) or by manual triple authoring (expensive, slow, requires graph specialists). Decision traces are logged after the fact by instrumenting an orchestration layer.
  • SDC context graphs are generated deterministically from the data model definition. Decision traces are not logged — they are structurally present in the payload. Governance is not bolted on — it is a property of the data itself. The practitioner's modeling work IS the graph construction.
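What "generated deterministically" means can be sketched in a few lines. The function, predicate names, and model shape below are illustrative assumptions, not the SDC generation pipeline — the point is that the same definition always yields the same graph, with no extraction step and no human review loop.

```python
# Sketch of deterministic context-graph generation: a pure function from a
# model definition to triples. Same input, same graph, every time.
def model_to_triples(model: dict) -> list[tuple[str, str, str]]:
    subject = model["id"]
    triples = [(subject, "rdf:type", model["type"])]
    for prop, value in sorted(model["properties"].items()):  # stable order
        triples.append((subject, prop, value))
    return triples

model = {"id": "sdc:bp01", "type": "sdc:Component",
         "properties": {"sdc:unit": "mmHg", "sdc:label": "Blood Pressure"}}

# Deterministic: repeated runs produce identical graphs, so graph diffs are
# meaningful and no probabilistic reconciliation is needed.
print(model_to_triples(model) == model_to_triples(model))  # → True
```

Contrast this with LLM extraction, where two runs over the same source can produce different graphs that then need human adjudication.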

As Jessica Talisman put it at the inaugural Context Graph meetup: "Context graph is like saying wet water — that's the benefit of graphs." She is right. The industry is discovering what knowledge graphs were always supposed to deliver. SDC delivers it — with governance built in, not added on.

When you talk to clients, use the vocabulary they are hearing from their vendors and conferences: "context graph," "decision traces," "the missing why." Then show them that SDC delivers it structurally, deterministically, and with governance that travels with the data — not in a separate dashboard, not in a vendor's proprietary engine, and not by LLM extraction that requires continuous babysitting.


1.9 Case Study: The $47 Billion Integration Problem

(Reference case study — full text in case_studies/integration_cost.md.)

Summary: A 2022 industry analysis estimated US healthcare spends $47B/year on data integration projects that primarily deliver syntactic interoperability. The semantic layer is rebuilt by every vendor, every project, every time. The Maturity Map exists to make this waste visible to a client in 90 minutes.


Module 1 Quiz

  1. A client reports their dashboard "shows different numbers than the source system." Which maturity dimension is most likely the root cause?
  2. True or false: A client at Level 5 Governance with Level 2 Schema Integrity is operating at functional Level 5.
  3. Why is "let ML reconcile it" insufficient for regulatory reporting?
  4. Name three primitives in a reference model.
  5. What is the difference between an XSD schema and a Schematron rule?
  6. A client's data team has spent two years building a single "Patient" component that contains 187 optional attributes covering every possible clinical scenario the team could think of. What is the most likely consequence of this approach, and what would the minimum knowledge modeling principle recommend instead?

(Answers in quiz_answers/module_1.md.)


Further Reading

  • ISO 13606 reference model overview
  • openEHR architectural overview
  • W3C Data on the Web Best Practices
  • The SDC specification (introduction chapter only for this module)