SDC for Knowledge & Context Graph Practitioners

If you're building knowledge graphs or context graphs for RAG systems — read this.

In February 2026, graphrag.info published a piece on contextual GraphRAG that opens with a sentence the SDC community has been waiting years to see in print:

"Vector-based (purely probabilistic) RAG is dead. Deterministic-based RAG is very much alive."

The author makes a strong case that ontologies, query languages, and 60+ years of database methodology are the right substrate for grounded LLM applications — and that most public graphRAG systems have neglected these in favor of vector similarity, PageRank, and brute-force traversal. They cite Graphwise's 95% accuracy on MuSiQue using RDF + SPARQL + LLM-generated ontologies as evidence that the deterministic path is not only viable but quantitatively superior.

SDC agrees with every word of the macro thesis. This page is not a rebuttal. It is an extension. There is one assumption in the prevailing graphRAG conversation — including in that article — that we think is worth re-examining, and surfacing it changes both the engineering picture and the economics of running a graph practice.


The agreement

Before the disagreement, the agreement. SDC and the contextual graphRAG community are aligned on:

  • Determinism over probability for grounded retrieval and multi-hop reasoning
  • RDF, OWL, SPARQL, and SHACL as the right runtime substrate
  • Ontologies as first-class citizens, not afterthoughts
  • The relevance of 60 years of database engineering to a problem the AI community sometimes treats as brand new
  • Auditability and reproducibility as non-negotiable for any system that touches regulated, financial, clinical, or legal data

The Maturity Map's six dimensions — Schema Integrity, Constraint Enforcement, Semantic Identity, Provenance, Interoperability, Governance — are exactly the dimensions a contextual graphRAG system needs to score well on. The SDC ecosystem produces RDF graphs with stable identifiers, validated by SHACL, queryable by SPARQL, with full provenance to source. It is the architecture the article describes.
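
To make that substrate concrete, here is a minimal sketch of the loop the paragraph above describes, using rdflib and pySHACL. Every IRI, class, and shape below is illustrative only, not drawn from the SDC component library:

    from rdflib import Graph
    from pyshacl import validate  # pip install rdflib pyshacl

    DATA = """
    @prefix ex:  <https://example.org/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:obs1 a ex:BloodPressureMeasurement ;
        ex:systolic "120"^^xsd:integer ;
        ex:source   ex:clinic42 .        # provenance back to the source system
    """

    SHAPES = """
    @prefix ex:  <https://example.org/> .
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:BPShape a sh:NodeShape ;
        sh:targetClass ex:BloodPressureMeasurement ;
        sh:property [ sh:path ex:systolic ; sh:datatype xsd:integer ;
                      sh:minCount 1 ; sh:maxCount 1 ] .
    """

    data = Graph().parse(data=DATA, format="turtle")
    shapes = Graph().parse(data=SHAPES, format="turtle")

    # Constraint enforcement: the graph conforms to its shapes or fails loudly.
    conforms, _, report = validate(data, shacl_graph=shapes)
    assert conforms, report

    # Deterministic retrieval: the same query returns the same rows every run.
    for row in data.query(
            "SELECT ?s WHERE { ?s a <https://example.org/BloodPressureMeasurement> }"):
        print(row.s)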


The unstated assumption

The piece, like most of the graphRAG conversation, assumes that ontology construction itself can safely be done by LLMs. The Graphwise approach generates the ontology with a model and then queries the resulting graph deterministically at runtime. The runtime is deterministic. The substrate is not.

This is the place where SDC pushes the deterministic argument one step further: ontology construction should be deterministic too.

LLM-generated ontologies have five structural problems:

  1. They are not reproducible. Run the same model on the same corpus tomorrow and you will get a different ontology — possibly subtly, possibly substantially. There is no way to say "this ontology" and mean a single, identifiable artifact.

  2. They drift with the model. When the underlying LLM is updated (which happens constantly), the ontology output changes. Your Q1 graph and your Q3 graph were not built against the same conceptual schema. Your application's behavior changes invisibly.

  3. They are hard to version. Which model? Which prompt? Which checkpoint? Which temperature? "The ontology is in ontology.ttl" is not a complete answer when the upstream construction depends on a black box.

  4. They are not defensible to auditors. "The model thought these classes were appropriate" is not an answer in a compliance review. A regulated industry that builds its knowledge graph on an LLM-generated ontology is buying a future audit problem it has not yet recognized.

  5. They do not compound economically. Every new client requires a fresh LLM ontology run. There is no library to carry forward. The graph consultant has no compounding moat — engagement #5 costs the same as engagement #1, and the practitioner is a perpetual price-taker on LLM inference costs.

Construction matters. Runtime determinism on a probabilistic substrate is half a victory.
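
Problems 1 and 3 in the list above have a standard deterministic antidote: pin the construction inputs and the emitted ontology by content hash in a manifest. The sketch below shows the general technique only; the file names, manifest fields, and toolchain string are invented rather than SDC's actual format, and it assumes ontology.ttl and schema.sql exist on disk:

    import hashlib
    import json
    import pathlib

    def sha256_of(path: str) -> str:
        """Content hash: the same bytes always produce the same identifier."""
        return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

    # "This ontology" now names a single, identifiable artifact.
    manifest = {
        "ontology":  {"file": "ontology.ttl", "sha256": sha256_of("ontology.ttl")},
        "inputs":    [{"file": "schema.sql", "sha256": sha256_of("schema.sql")}],
        "toolchain": "introspector 1.4.2",  # a pinned program, not a model checkpoint
    }
    pathlib.Path("manifest.json").write_text(json.dumps(manifest, indent=2))

A deterministic toolchain re-run on the same inputs reproduces the same hashes; an LLM run cannot make that promise.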

The deeper problem: maximal modeling baked into the standards themselves

There is a related failure mode that affects even the deterministic graph approaches the article praises. The standards being used as building blocks — FHIR, NIEM, HL7, SNOMED CT compositional grammar — are themselves maximally modeled. Every concept enumerates every attribute that any community has ever asked for. The intellectual appeal is "we are being thorough." The practical cost is borne by every implementer for the rest of the standard's life.

If you have ever lived through an IHTSDO post-coordinated SNOMED CT expression, you know exactly what this feels like:

    64572001 |Disease (disorder)| :
        363698007 |Finding site (attribute)| = 80891009 |Heart structure (body structure)|,
        116676008 |Associated morphology (attribute)| = 6920004 |Necrosis (morphologic abnormality)|

It is beautiful in theory, brutal at the data management layer. Querying, validating, and exchanging post-coordinated expressions becomes a full-time engineering problem. Even openEHR's Clinical Knowledge Manager, a community-driven, careful, well-intentioned archetype library, depends on a committee-approval workflow because maximal-modeling pressure makes every archetype contentious. The bottleneck is not bureaucracy. It is the philosophy.

The SDC ecosystem is built on the opposite discipline: minimum knowledge modeling. A component captures only what is essential to identify the concept and distinguish it from its nearby concepts. Specificity beyond that minimum is achieved through composition with other minimum components at use time, not by inflating any single definition. The locus of specificity is composition, not definition. (See Module 1 §1.6 for the full treatment.)

This is what makes SDC's component library reusable across engagements in a way LLM extraction and post-coordinated standards can never be. Your BloodPressureMeasurement component is the same component for every clinic, every research study, every clinical trial. The local variations — cuff type, patient position, measurement context — are separate components composed in only when a specific deployment needs them. The library compounds because the components themselves do not over-fit. A maximal BloodPressureMeasurement with 187 optional fields would be over-fit to its first client and would not transfer cleanly to the second.
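
What composition-over-definition looks like in triples, as a sketch: the IRIs and the ex:composes predicate are invented for illustration and are not SDC's vocabulary:

    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("https://example.org/")
    g = Graph()

    # The reusable core: just enough to identify the concept.
    g.add((EX.bp_core, RDF.type, EX.Component))
    g.add((EX.bp_core, EX.means, Literal("blood pressure measurement")))

    # Context components, minted once and reused wherever they apply.
    g.add((EX.cuff_large, RDF.type, EX.Component))
    g.add((EX.pos_supine, RDF.type, EX.Component))

    # One clinic's deployment composes the core with only the context it needs;
    # nothing is added to the core definition itself.
    for part in (EX.bp_core, EX.cuff_large, EX.pos_supine):
        g.add((EX.clinic_deployment, EX.composes, part))

The second clinic reuses EX.bp_core untouched and composes a different set of context components around it.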

For graph practitioners specifically, this also means the structural identity of every component is a stable CUID2 — completely decoupled from the descriptive label. Your client can rename a component freely, translate it to Spanish for their Mexican operations, refine the wording after a regulatory audit, and none of those changes touch the graph triples that reference the component. The CUID2 carries identity; the label carries humanity. Try doing that with a SNOMED CT concept ID whose meaning is defined by its post-coordinated expression and watch what breaks.
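
A minimal sketch of that decoupling, with a made-up CUID2 value and SKOS labels standing in for whatever labeling predicate a real deployment uses:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import SKOS

    COMP = Namespace("https://example.org/component/")
    bp = COMP["tz4a98xxat96iws9zmbrgj3a"]  # the CUID2 carries identity

    g = Graph()
    g.add((bp, SKOS.prefLabel, Literal("Blood Pressure Measurement", lang="en")))
    g.add((bp, SKOS.prefLabel, Literal("Medición de presión arterial", lang="es")))

    # A rename after an audit swaps one label triple; every triple elsewhere
    # in the graph that points at the CUID2 IRI is untouched.
    g.remove((bp, SKOS.prefLabel, Literal("Blood Pressure Measurement", lang="en")))
    g.add((bp, SKOS.prefLabel, Literal("Arterial Blood Pressure", lang="en")))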


The data-vs-text axis the conversation is missing

There is a deeper distinction the prevailing graphRAG conversation does not name explicitly. Most graphRAG approaches start from documents (PDFs, web pages, internal wikis, free-text fields) and use an LLM to extract entities and relations. The LLM is doing the work of inferring structure from unstructured text. That is genuinely hard, and for truly unstructured corpora a probabilistic step is probably unavoidable.

But the data most organizations actually run on is not unstructured text. It lives in transactional databases, spreadsheets, APIs, JSON files, vendor SaaS exports. The structure is already there. It was put there by the people who designed the system, and it is encoded in tables, columns, types, and constraints. Reading that structure is a deterministic operation. Emitting an RDF graph from it is mechanical.
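
Here is a sketch of that mechanical emission, using Python's standard-library sqlite3 and rdflib. The table, the IRI scheme, and the emitted vocabulary are invented; the point is that no step involves a model:

    import sqlite3
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("https://example.org/schema/")

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

    g = Graph()
    for (table,) in db.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
        g.add((EX[table], RDF.type, EX.Table))
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        for _, col, coltype, notnull, *_ in db.execute(f"PRAGMA table_info({table})"):
            col_iri = EX[f"{table}/{col}"]
            g.add((EX[table], EX.hasColumn, col_iri))
            g.add((col_iri, EX.sqlType, Literal(coltype)))
            g.add((col_iri, EX.required, Literal(bool(notnull))))

    # Zero LLM tokens; the same schema yields identical triples on every run.
    print(g.serialize(format="turtle"))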

SDC's framing assumes the input is data, not text. The graph is introspected from the schema, not extracted from a corpus. For organizations whose primary value is in transactional systems, and that describes almost all SMBs and most enterprise operational data, this changes the economics by orders of magnitude:

                        LLM-extraction approach    SDC introspection approach
  Per-record cost       LLM tokens per document    Effectively zero
  Reproducibility       Probabilistic              Deterministic
  Versioning            Black-box                  Component CUID2 + manifest
  Audit defensibility   "The model decided"        "The schema specified"
  Schema change cost    Re-extract everything      Mint a delta component
  Runtime performance   Same SPARQL substrate      Same SPARQL substrate

This is not an argument against LLMs. It is an argument for using LLMs where they add value (UI generation, natural language querying, document-corpus extraction when no structured source exists) and not using them where deterministic alternatives exist (schema definition, identity assignment, constraint authoring, ontology construction from structured sources).

For the document-corpus case where LLM extraction is genuinely necessary, SDC and LLM extraction are complementary. SDC produces the deterministic backbone (the ontology, the constraints, the identifiers); LLM extraction lifts unstructured documents into that backbone. The two approaches together produce a graph that is reproducible at the structural layer and probabilistic only at the boundary where it has to be. That is a much stronger product than LLM extraction alone, and it is what the Graphwise result actually points toward — they just have not finished naming it yet.
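
One way to picture that division of labor: the deterministic backbone publishes SHACL shapes, and LLM output enters the graph only if it conforms. In the sketch below, extract_candidate_triples is a hypothetical stand-in for an LLM call and the shape is invented; only the gating pattern is the point:

    from rdflib import Graph
    from pyshacl import validate

    SHAPES = """
    @prefix ex:  <https://example.org/> .
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:BPShape a sh:NodeShape ;
        sh:targetClass ex:BloodPressureMeasurement ;
        sh:property [ sh:path ex:systolic ; sh:datatype xsd:integer ; sh:minCount 1 ] .
    """

    def extract_candidate_triples(document_text: str) -> str:
        # Hypothetical LLM extraction; a canned answer stands in for model output.
        return """
        @prefix ex:  <https://example.org/> .
        @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
        ex:obs9 a ex:BloodPressureMeasurement ; ex:systolic "118"^^xsd:integer .
        """

    backbone = Graph()
    shapes = Graph().parse(data=SHAPES, format="turtle")

    candidate = Graph().parse(data=extract_candidate_triples("..."), format="turtle")
    conforms, _, report = validate(candidate, shacl_graph=shapes)
    if conforms:
        backbone += candidate  # probabilistic content behind a deterministic gate
    else:
        print("rejected at the boundary:", report)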


The economic counter-argument

This is the part the graphRAG conversation does not address at all, and the part that determines whether building knowledge graphs is a sustainable consulting practice.

Consider two graph practitioners, both serving small-to-mid-market clients in healthcare specialty practices. Both deliver high-quality deterministic graphs at runtime. Both bill their first engagement at $40,000.

Practitioner A uses LLM extraction with LLM-generated ontologies.

  • Engagement #1: 80 hours, $40K, ~$8K of LLM costs eaten by the practitioner
  • Engagement #2: 80 hours, $40K, ~$8K of LLM costs (no reuse)
  • Engagement #5: 80 hours, $40K, ~$8K of LLM costs (still no reuse)
  • Schema change request from existing client: "I'll need to re-run the extraction. That's a $5,000 project."
  • Year 3 effective hourly rate: ~$400/hour, capped by hours-in-the-day

Practitioner B uses SDC component reuse and per-touch stewardship.

  • Engagement #1: 80 hours, $40K, ~$200 of SDCStudio mint costs
  • Engagement #2: 35 hours, $35K (slight reference discount), ~$80 of mint costs (most components reused)
  • Engagement #5: 15 hours, $32K, ~$30 of mint costs (almost everything reused)
  • Schema change request from existing client: "That's a single touch — 45 minutes of my time, billed at $1,500, redeployed by end of day."
  • Year 3 effective hourly rate: ~$1,100/hour blended, with 60+ touches per year of recurring revenue on top of new engagements

The full numbers are in program/practitioner_economics.md. The difference is not subtle. Practitioner B is running a sustainable solo practice on roughly 6 hours of work per week by Year 3. Practitioner A is running an hours-bound consultancy that cannot scale past one person without proportional headcount.
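
As a back-of-envelope check of the per-activity rates those figures imply (the blended ~$1,100/hour comes from mixing these with earlier, lower-margin work; the full model is in program/practitioner_economics.md):

    # Practitioner A, any engagement: revenue minus the eaten LLM cost, over hours.
    a_rate = (40_000 - 8_000) / 80
    print(f"A, any engagement:  ${a_rate:,.0f}/hour")       # $400/hour

    # Practitioner B, engagement #5 and a single stewardship touch.
    b_engagement = (32_000 - 30) / 15
    b_touch = 1_500 / 0.75                                   # 45 minutes at $1,500
    print(f"B, engagement #5:   ${b_engagement:,.0f}/hour")  # ~$2,131/hour
    print(f"B, one touch:       ${b_touch:,.0f}/hour")       # $2,000/hour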

The technical case for deterministic graphs and the economic case for SDC point at the same product. The article makes the first half of the argument. We make the second.


What clients buy that the LLM-extraction approach cannot deliver

When you sell SDC-built graph infrastructure to a client, you offer two structural guarantees that no LLM-extraction approach can match:

  1. No forced software upgrades. The client's bespoke application changes only when they request a change. There is no vendor upgrade calendar, no model deprecation forcing a re-extraction, no SaaS migration breaking their workflows.

  2. No data migrations, ever. The client's data lives in their environment from day one and never moves. New schemas evolve in place. There is no point at which their graph is rebuilt from scratch with possible loss of historical context.

Most clients have personally lived through a forced upgrade or a botched migration. They will never forget it. Offering them a graph infrastructure that makes both impossible by construction is the strongest organic referral mechanism in the entire program.


How this maps to the SDC Practitioner program

If you are a knowledge graph consultant, an ontology engineer, a FHIR practitioner, or a master data consultant currently building graphs by LLM extraction, here is what becoming an SDC Certified Practitioner gives you:

  • A deterministic construction toolchain that produces RDF + SHACL + JSON-LD graphs from your client's structured sources
  • A component reuse model that turns your second engagement in a vertical into a 50% margin improvement and your fifth into a 90% improvement
  • A per-touch stewardship product that lets you charge $1,500-$3,000 for a 45-minute schema change you would otherwise eat
  • A credential that tells the market you have been trained against an open standard with 60 years of database engineering behind it
  • A community of practitioners who are converging on the same architectural answer the article points at

The runtime is the same RDF + SPARQL substrate the article recommends. The construction is deterministic. The economics compound. The clients are happier.


Where to go next

  1. The Practitioner Economics — the worked 3-year example that shows what the model produces in practice
  2. The Practitioner Landing Page — the program structure and how to apply
  3. The SDC Maturity Framework — the six dimensions and the floor constraint principle

If the macro argument in the graphrag.info article resonated with you and you want a path to actually building a sustainable practice on top of it, the Practitioner program is the on-ramp.


This document is a response to and an extension of the contextual GraphRAG article published February 11, 2026 at graphrag.info. We agree with the macro thesis and are grateful to the author for surfacing it. The rest of the argument — construction determinism, the data-vs-text axis, and the economic case — is what SDC contributes to the conversation.