Markdown Template Format Reference
This is the authoritative reference for the Markdown template format consumed by SDCStudio's md2pd parser. It mirrors the parser's actual behavior; if the parser and this document disagree, the parser is correct and this document is a bug.
The parser is implemented at src/md2pd/agents/template_parser_agent.py and orchestrated by src/md2pd/core/workflow_controller.py.
File anatomy
---
<YAML front matter>
---
# Dataset Overview
<prose plus optional **Purpose** / **Business Context** keywords>
## <NamedTree>: <Name>
...
A template has three pieces: YAML front matter, an optional # Dataset Overview H1 section, and one or more H2 named-tree sections. Only two things are required: template_version in the front matter and at least one ## Data: section. Everything else is optional.
YAML front matter
---
template_version: "4.0.0"
dataset:
  name: "Dataset Name"
  description: "What the dataset is for."
  creator: "Author or Organization"
  project: "project_ct_id"
enrichment:
  enable_llm: true
---
| Key | Required | Type | Notes |
|---|---|---|---|
| template_version | Yes | string | Major version must be 4 (e.g. "4.0.0", "4.2.1"). Anything else fails validation. |
| dataset.name | No | string | Falls back to the file stem if omitted. |
| dataset.description | No | string | Stored on the parsed model. |
| dataset.creator | No | string | Stored as dm_creator. |
| dataset.project | No | string | Stored as dm_project. |
| enrichment.enable_llm | No | bool | Default: true (overridable via the MD2PD_DEFAULT_ENRICHMENT setting). When true, an optional Phase 2 LLM enrichment runs after Phase 1 parsing succeeds. |
Other top-level YAML keys are ignored.
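The front-matter rules above can be sketched in a few lines of Python. This is an illustration only, not the parser's code: `check_front_matter` and the shape of its return value are hypothetical, the input is assumed to be the already-loaded YAML mapping, and the MD2PD_DEFAULT_ENRICHMENT override is not modelled.

```python
def check_front_matter(meta: dict, file_stem: str) -> dict:
    """Validate template_version and apply the documented fallbacks."""
    version = meta.get("template_version")
    # The major version must be 4; a missing or non-string value is rejected too.
    if not isinstance(version, str) or version.split(".")[0] != "4":
        raise ValueError(f"unsupported template_version: {version!r}")

    dataset = meta.get("dataset") or {}
    enrichment = meta.get("enrichment") or {}
    return {
        "name": dataset.get("name") or file_stem,  # falls back to the file stem
        "description": dataset.get("description"),
        "dm_creator": dataset.get("creator"),
        "dm_project": dataset.get("project"),
        # Documented default is true (settings override not modelled here).
        "enable_llm": enrichment.get("enable_llm", True),
    }
```

Unknown top-level keys are simply never read, which matches the "ignored" behavior described above.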
# Dataset Overview (H1, optional)
A single H1 named exactly Dataset Overview. Free prose is allowed. Two bold keywords are extracted if present:
| Keyword | Stored as |
|---|---|
| **Purpose**: | Overrides dataset.description. |
| **Business Context**: | Stored in semantic_context.business_context. |
All other prose in this section is informational and not parsed.
Named-tree sections (H2)
The parser recognizes exactly eight H2 section types, each in the form ## <Type>: <Name>. The colon and name are required for all types except Links. Sections not in this list are not parsed.
| Type | Multiplicity | Purpose |
|---|---|---|
| Data | 1+ required | Defines a flat cluster of components. |
| Workflow | optional, multiple | Same shape as Data (recognized but currently not stored in the parsed output; see note below). |
| Subject | optional, max 1 | The data subject as a Party. |
| Provider | optional, max 1 | The data provider as a Party. |
| Participation | optional, multiple | A Participation with a performer Party. |
| Attestation | optional, max 1 | An Attestation node (View / Proof / Reason). |
| Audit | optional, multiple | Audit metadata. |
| Links | optional, max 1 | External URI references. |
Note on multiple ## Data: sections. If you write more than one ## Data: section, only the last one parsed is retained as the cluster hierarchy; earlier ones are discarded. Use a single ## Data: per template.
Note on ## Workflow:. The parser recognizes this section and parses its contents, but the resulting Workflow cluster is not currently included in the parsed output structure. Use ## Data: for content you need to round-trip.
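As a rough illustration of the section rules above, the following sketch classifies H2 lines and mimics the "last ## Data: wins" behavior. The names `classify_h2` and `retained_data_section` are hypothetical, and the sketch assumes case-sensitive type names and a mandatory colon; the real parser may be more lenient.

```python
import re

SECTION_TYPES = ("Data", "Workflow", "Subject", "Provider",
                 "Participation", "Attestation", "Audit", "Links")
# Matches "## <Type>: <Name>"; the name may be empty only for Links.
H2_RE = re.compile(r"^##\s+(" + "|".join(SECTION_TYPES) + r")\s*:\s*(.*)$")


def classify_h2(line: str):
    """Return (type, name) for a recognized H2 line, else None."""
    m = H2_RE.match(line.strip())
    if not m:
        return None
    section_type, name = m.group(1), m.group(2).strip()
    if not name and section_type != "Links":
        return None  # a name is required for every type except Links
    return section_type, name


def retained_data_section(lines):
    """Name of the Data section that survives: the last one encountered."""
    kept = None
    for line in lines:
        hit = classify_h2(line)
        if hit and hit[0] == "Data":
            kept = hit[1]
    return kept
```

For example, `retained_data_section(["## Data: A", "## Data: B"])` returns `"B"`, which is why a single ## Data: section per template is recommended.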
## Data: <Name>
The required cluster section. Contains columns and optional cluster-level keywords.
## Data: Patient Demographics
Brief prose description of the cluster.
**Purpose**: What this cluster represents
**Business Context**: How it's used
**Rules**:
- end_date must be after start_date
- At least one of email or phone must be provided
### full_name
**Type**: text
**Description**: Patient's full name.
### date_of_birth
**Type**: date
**Description**: Patient's date of birth.
Cluster-level keywords (all optional):
| Keyword | Format | Effect |
|---|---|---|
| **Purpose**: | flat | Stored as cluster.purpose. |
| **Business Context**: | flat | Stored as cluster.business_context. |
| **Rules**: | bulleted list | Cross-field validation rules; stored as cluster.rules: List[str]. Translated to assertion expressions during downstream XSD generation. |
The first paragraph of prose (before any **Keyword**: or ### column) becomes the cluster description.
## Subject: <Name> and ## Provider: <Name>
Each captures a Party. Recognized fields:
## Subject: Patient
**Description**: Person whose health record this is.
### patient_id
**Type**: text
**Description**: Medical record number.
| Keyword | Effect |
|---|---|
| **Description**: | Stored on the PartyNode. |
| ### col_name | Columns belonging to this Party. |
## Subject: sets role='subject'; ## Provider: sets role='provider'. Multiple ## Subject: or ## Provider: sections are not supported (only one of each is retained).
## Participation: <Name>
## Participation: Examining Clinician
**Description**: Clinician who performed the examination.
**Function**: Examiner
**Function Description**: Performs and documents physical examination.
**Mode**: present
**Mode Description**: In-person examination.
### clinician_id
**Type**: text
**Description**: Clinician identifier.
| Keyword | Effect |
|---|---|
| **Description**: | Stored on the ParticipationNode. |
| **Function**: | The participation function label. |
| **Function Description**: | The function description. |
| **Mode**: | The participation mode label. |
| **Mode Description**: | The mode description. |
| ### col_name | Columns become an inline performer Party (role='performer'). |
## Participation: may appear multiple times.
## Attestation: <Name>
## Attestation: Encounter Sign-off
- **View**: Clinical Summary (`application/pdf`, content_mode=`url`)
- **Proof**: Clinician Signature (`application/pgp-signature`, content_mode=`embed`)
- **Reason**: Attestation Reason (XdString, min_length=10, max_length=500)
Each line uses the form **Field**: label (parenthetical). The parenthetical is optional but, when present, is parsed as follows:
| Field | Parenthetical contents |
|---|---|
| **View**: | Type (e.g. XdFile), backtick-quoted media type (`application/pdf`), content_mode=url or content_mode=embed. |
| **Proof**: | Same as View. |
| **Reason**: | Type (e.g. XdString), min_length=N, max_length=N. Default min/max = 1 / 500 if omitted. |
Order within the parenthetical does not matter; the parser extracts each piece by regex.
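Because each piece of the parenthetical is pulled out independently, order-insensitivity falls out naturally. The sketch below shows one way to do this with regular expressions. The patterns and the function name `parse_attestation_line` are assumptions made for illustration, not the parser's actual expressions.

```python
import re

LINE_RE = re.compile(
    r"^-?\s*\*\*(View|Proof|Reason)\*\*:\s*([^(]*?)\s*(?:\((.*)\))?\s*$"
)


def parse_attestation_line(line: str) -> dict:
    """Split '**Field**: label (parenthetical)' into its parts."""
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError(f"not an attestation line: {line!r}")
    field, label, extra = m.group(1), m.group(2), m.group(3) or ""
    out = {"field": field, "label": label}
    xd_type = re.search(r"\bXd[A-Z]\w*", extra)  # optional type, e.g. XdFile
    if xd_type:
        out["type"] = xd_type.group(0)
    if field == "Reason":
        lo = re.search(r"min_length\s*=\s*(\d+)", extra)
        hi = re.search(r"max_length\s*=\s*(\d+)", extra)
        out["min_length"] = int(lo.group(1)) if lo else 1    # documented default
        out["max_length"] = int(hi.group(1)) if hi else 500  # documented default
    else:
        media = re.search(r"`([\w.+-]+/[\w.+-]+)`", extra)   # backtick-quoted media type
        mode = re.search(r"content_mode\s*=\s*`?(url|embed)`?", extra)
        if media:
            out["media_type"] = media.group(1)
        if mode:
            out["content_mode"] = mode.group(1)
    return out
```

Each search runs over the whole parenthetical, so `(content_mode=url, XdFile)` and `(XdFile, content_mode=url)` yield the same result.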
## Audit: <Name>
## Audit: System Provenance
**System ID**: ehr-prod-01
**System User**: clinician_jdoe
**Location**: Porto Sereno General Hospital
Audit sections capture flat metadata and are stored in the model's dm_audit list. Recognized fields: **System ID**:, **System User**:, **Location**:. Multiple ## Audit: sections are allowed.
## Links:
## Links:
- https://www.w3.org/TR/prov-o/
- urn:oid:2.16.840.1.113883.4.1
- https://semanticdatacharter.com/ns/sdc4/
A bulleted list of URIs. Lines that begin with http, https, or urn: are stored in the model's dm_links list. Other content is ignored. The name after the colon is optional.
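The filtering rule can be pictured as a simple prefix test. In this sketch, `extract_links` is a hypothetical helper, and it assumes the bullet marker is stripped before the prefix is checked.

```python
def extract_links(section_lines):
    """Keep only entries that start with http (which covers https) or urn:."""
    links = []
    for raw in section_lines:
        item = raw.strip()
        if item.startswith("- "):
            item = item[2:].strip()  # drop the bullet marker
        if item.startswith(("http", "urn:")):
            links.append(item)
    return links
```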
Column format (### name)
Inside any section that contains columns (Data, Workflow, Subject, Provider, Participation), each column is introduced by an H3:
### column_name
**Type**: text
**Description**: ...
**Constraints**:
- required: true
**Examples**: example1, example2
The text after ### is the column name verbatim. Do not use ### Column: name — the parser will treat Column: name as the column name.
Column keyword allowlist
The parser validates each **Keyword**: against a fixed allowlist. Unknown keywords produce warnings; common typos produce errors that name the corrected keyword.
Parsed and stored:
| Keyword | Format | Notes |
|---|---|---|
| **Type**: | flat | The user type (lowercase). Defaults to text if omitted. See Type system. |
| **Description**: | flat | Free text. |
| **Units**: | flat | Measurement units (e.g. kg, USD, years). Drives type inference for numeric types. |
| **Constraints**: | bulleted list of key: value | See Constraints. |
| **Enumeration**: | bulleted list or markdown table | See Enumeration. |
| **Examples**: | flat, comma-separated | Stored as sample data; commas split into a list. |
| **Relationships**: | flat | Free text describing relationships to other columns. |
| **Business Rules**: | flat | Free text. |
| **Semantic Links**: | bulleted list of URIs | Each URI becomes {'uri': '<uri>'}. |
| **Reuse**: | flat | Direct reference by component CT_ID. See Component reuse. |
| **ReuseComponent**: | flat, @Project:Label form | See Component reuse. |
Allowed for documentation only (parser tolerates them but does not extract them as fields):
**Ontology Mappings**:, **Standard**:, **NIH CDE Source**:, **PHI Status**:, **Clinical Significance**:, **HL7 Security Classification**:, **Access Control Requirements**:, **De-identification Considerations**:, **Calculation Method**:, **Important Distinctions**:, **Important Notes**:
These remain in the column's surrounding text but are not pulled into named fields.
Common typo auto-corrections (errors)
The parser flags these as errors and suggests the corrected form:
| You wrote | Parser expects |
|---|---|
| **Values**:, **Value**:, **Enumerations**:, **Allowed Values**:, **Options**: | **Enumeration**: |
| **Unit**: | **Units**: |
| **DataType**:, **Data Type**: | **Type**: |
| **Example**: | **Examples**: |
| **Constraint**: | **Constraints**: |
| **Business Rule**: | **Business Rules**: |
| **Relationship**: | **Relationships**: |
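Taken together, the allowlist and the corrections table amount to a three-way lookup. The sketch below encodes the tables from this section. The function name `check_keyword`, the message wording, and the exact-match comparison are assumptions.

```python
PARSED = {
    "Type", "Description", "Units", "Constraints", "Enumeration", "Examples",
    "Relationships", "Business Rules", "Semantic Links", "Reuse", "ReuseComponent",
}
DOC_ONLY = {
    "Ontology Mappings", "Standard", "NIH CDE Source", "PHI Status",
    "Clinical Significance", "HL7 Security Classification",
    "Access Control Requirements", "De-identification Considerations",
    "Calculation Method", "Important Distinctions", "Important Notes",
}
CORRECTIONS = {
    "Values": "Enumeration", "Value": "Enumeration", "Enumerations": "Enumeration",
    "Allowed Values": "Enumeration", "Options": "Enumeration",
    "Unit": "Units", "DataType": "Type", "Data Type": "Type",
    "Example": "Examples", "Constraint": "Constraints",
    "Business Rule": "Business Rules", "Relationship": "Relationships",
}


def check_keyword(keyword: str):
    """Return (severity, message); severity is 'ok', 'error', or 'warning'."""
    if keyword in PARSED or keyword in DOC_ONLY:
        return ("ok", None)
    if keyword in CORRECTIONS:
        return ("error", f"use **{CORRECTIONS[keyword]}**: instead of **{keyword}**:")
    return ("warning", f"unknown keyword **{keyword}**:")
```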
Type system
**Type**: accepts either an explicit SDC4 type or a user-friendly type. User-friendly types are mapped to SDC4 types using context (units, enumeration shape, name patterns, constraints).
Explicit SDC4 types
Matched case-insensitively (shown here in lowercase). Each maps 1:1 to a canonical type name:
| You write | Maps to |
|---|---|
| xdstring | XdString |
| xdtoken | XdToken |
| xdcount | XdCount |
| xdordinal | XdOrdinal |
| xdquantity | XdQuantity |
| xdfloat | XdFloat |
| xddouble | XdDouble |
| xdboolean | XdBoolean |
| xdtemporal | XdTemporal |
| xdlink | XdLink |
| xdfile | XdFile |
| xdinterval | XdInterval |
When you write an explicit SDC4 type the parser uses it as-is and skips inference.
User-friendly types
| You write | Maps to (no other context) |
|---|---|
| text, string, varchar, char | XdString |
| integer, int, whole number, count | XdCount |
| decimal, float, number, numeric, double | XdFloat |
| boolean, bool, flag | XdBoolean |
| date, datetime, timestamp, time | XdTemporal |
| identifier, id, uuid, guid | XdString |
| email | XdString |
| url, uri, link | XdLink |
| anything else | XdString (with a warning) |
Type inference order
Context can override the default user-type mapping. The parser checks rules in this order; the first match wins:
- Explicit SDC4 type (above): used as-is.
- Enumeration shape: see Enumeration-driven inference.
- text + any enumeration → XdToken.
- Name patterns: see Name-pattern inference.
- User type + context (integer + units, decimal + precision, etc.).
- Default for the user type.
Enumeration-driven inference
When **Enumeration**: is present, the parser inspects the values:
| Enumeration shape | Result |
|---|---|
| Values match a boolean set: {yes,no}, {true,false}, {present,absent}, {positive,negative}, {active,inactive}, {enabled,disabled} (with optional unknown / n/a) | XdBoolean |
| Values match a known ordinal pattern: severity scales (none/mild/moderate/severe/critical), Likert scales (strongly disagree…strongly agree), grade scales, frequency scales, risk scales, satisfaction scales, maturity scales | XdOrdinal |
| Values are consecutive integers (e.g. 0,1,2,3 or 1,2,3,4) | XdOrdinal |
| Values match Level N / Stage N / Phase N / Grade N / Step N / Tier N | XdOrdinal |
If **Type**: text is combined with an enumeration that does not match any of the patterns above, the result is XdToken.
If **Type**: xdboolean is set explicitly and an enumeration is present, the parser raises an error: XdBoolean cannot have an enumeration.
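A simplified sketch of the enumeration checks follows. It covers the boolean sets, consecutive integers, and the Level N family, but deliberately omits the named ordinal scales (severity, Likert, and so on) because their full vocabularies are not listed here. `infer_from_enumeration` is a hypothetical name, and treating "consecutive" as "ascending in listed order" is an assumption.

```python
import re

BOOLEAN_SETS = [
    {"yes", "no"}, {"true", "false"}, {"present", "absent"},
    {"positive", "negative"}, {"active", "inactive"}, {"enabled", "disabled"},
]
OPTIONAL_EXTRAS = {"unknown", "n/a"}
LEVEL_RE = re.compile(r"^(level|stage|phase|grade|step|tier)\s+\d+$")


def infer_from_enumeration(values, user_type="text"):
    """Return an SDC4 type name, or None if this step makes no decision."""
    norm = [str(v).strip().lower() for v in values]
    core = set(norm) - OPTIONAL_EXTRAS  # unknown / n/a do not block a boolean match
    if core in BOOLEAN_SETS:
        return "XdBoolean"
    if norm and all(re.fullmatch(r"-?\d+", v) for v in norm):
        nums = [int(v) for v in norm]
        if nums == list(range(nums[0], nums[0] + len(nums))):
            return "XdOrdinal"
    if norm and all(LEVEL_RE.match(v) for v in norm):
        return "XdOrdinal"
    if user_type == "text":
        return "XdToken"  # text + any other enumeration
    return None
```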
Name-pattern inference
Name-pattern rules apply when no explicit SDC4 type and no enumeration override is present:
| Column name pattern | Result |
|---|---|
| ends with _uri or _url, or is exactly uri/url | XdLink |
| contains _id, _code, identifier, _key, _uuid, _guid | XdString |
| contains _link, website, homepage | XdLink |
| starts with is_, has_, can_, or ends with _flag | XdBoolean |
| contains year (with **Type**: integer/int/whole number/count/text/string) | XdTemporal |
| contains _time_, elapsed, latency, duration, processing_time, response_time | XdTemporal |
| contains _date, _time, _at, timestamp, created, updated, modified, deleted | XdTemporal |
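The name-pattern table can be read as an ordered chain of substring tests. The sketch below assumes the rows are evaluated top to bottom and that matching is case-insensitive; neither is confirmed here, and `infer_from_name` is a hypothetical helper.

```python
YEAR_TYPES = {"integer", "int", "whole number", "count", "text", "string"}


def infer_from_name(name: str, user_type: str = "text"):
    """Return an SDC4 type suggested by the column name, or None."""
    n = name.lower()
    if n.endswith(("_uri", "_url")) or n in ("uri", "url"):
        return "XdLink"
    if any(p in n for p in ("_id", "_code", "identifier", "_key", "_uuid", "_guid")):
        return "XdString"
    if any(p in n for p in ("_link", "website", "homepage")):
        return "XdLink"
    if n.startswith(("is_", "has_", "can_")) or n.endswith("_flag"):
        return "XdBoolean"
    if "year" in n and user_type.lower() in YEAR_TYPES:
        return "XdTemporal"
    if any(p in n for p in ("_time_", "elapsed", "latency", "duration",
                            "processing_time", "response_time")):
        return "XdTemporal"
    if any(p in n for p in ("_date", "_time", "_at", "timestamp", "created",
                            "updated", "modified", "deleted")):
        return "XdTemporal"
    return None
```

Under these assumptions, `homepage_url` resolves to XdLink through the first row, before the `homepage` substring in the third row is ever consulted.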
Numeric context inference
When the user wrote a numeric type and no override above applied:
| User type | Context | Result |
|---|---|---|
| integer family | enumeration present | XdOrdinal |
| integer family | units set | XdCount |
| integer family | range constraint set | XdCount |
| integer family | otherwise | XdCount |
| decimal family | units set | XdQuantity |
| decimal family | precision: 2 | XdQuantity (currency-like) |
| decimal family | precision: 10+ | XdDouble |
| decimal family | otherwise | XdFloat |
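The numeric table reduces to a short decision function. In this sketch, `infer_numeric` and its parameters are hypothetical; the family memberships come from the user-friendly types table, and the row order above is assumed to be the evaluation order.

```python
INTEGER_FAMILY = {"integer", "int", "whole number", "count"}
DECIMAL_FAMILY = {"decimal", "float", "number", "numeric", "double"}


def infer_numeric(user_type, units=None, precision=None, has_enumeration=False):
    """Return the SDC4 type for a numeric user type, or None if not numeric."""
    t = user_type.strip().lower()
    if t in INTEGER_FAMILY:
        # Units and range constraints leave the result at XdCount.
        return "XdOrdinal" if has_enumeration else "XdCount"
    if t in DECIMAL_FAMILY:
        if units:
            return "XdQuantity"
        if precision == 2:
            return "XdQuantity"  # currency-like
        if precision is not None and precision >= 10:
            return "XdDouble"
        return "XdFloat"
    return None
```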
Constraints
**Constraints**:
- required: true
- range: [0, 120]
The parser parses any key: value pair into a constraints dictionary. Of those, only some are read back to populate column fields:
| Key | Value | Effect |
|---|---|---|
| required | true / false | Sets column.nullability to required or optional. |
| range | [min, max] (a list of exactly two values) | Sets column.range_values to "min to max". |
| precision | integer | Affects type inference for decimal user types (==2 → XdQuantity, >=10 → XdDouble). |
Other constraint keys (min_value, max_value, format, unique, pattern, etc.) are accepted syntactically and stored in the constraints dictionary, but the parser does not currently use them to populate any structured column field. To express a numeric range, use range: [min, max] rather than separate min_value / max_value keys.
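The two-stage behavior (store everything, read back only a few keys) might look like the sketch below. The helper names are hypothetical, `ast.literal_eval` stands in for whatever value parsing the real parser uses, and leaving nullability unset when required is absent is an assumption.

```python
import ast


def parse_constraints(bullets):
    """Turn '- key: value' bullets into a dictionary, keeping every key."""
    constraints = {}
    for raw in bullets:
        item = raw.strip().lstrip("-").strip()
        key, sep, value = item.partition(":")
        if not sep:
            continue  # not a key: value pair
        value = value.strip()
        literal = value.capitalize() if value.lower() in ("true", "false") else value
        try:
            constraints[key.strip()] = ast.literal_eval(literal)
        except (ValueError, SyntaxError, TypeError):
            constraints[key.strip()] = value  # keep unparseable values as text
    return constraints


def apply_constraints(constraints):
    """Populate only the column fields that are documented as read back."""
    column = {}
    if "required" in constraints:
        column["nullability"] = "required" if constraints["required"] is True else "optional"
    rng = constraints.get("range")
    if isinstance(rng, list) and len(rng) == 2:
        column["range_values"] = f"{rng[0]} to {rng[1]}"
    return column
```

A key such as format survives in the dictionary but contributes nothing to the column, which mirrors the behavior described above.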
Enumeration
Two formats are supported. Both produce the same internal structure with Value, Label, and Description columns.
Bulleted list (recommended for short lists)
**Enumeration**:
- active: Account in good standing
- suspended: Temporarily suspended
- closed: Permanently closed
Each bullet is value: label or value: label: annotation. If the annotation is a URL it is stored as Description. Without a colon, the bullet text becomes both the value and the title-cased label.
You may also embed semantic URLs directly:
**Enumeration**:
- 0: Public: https://w3c.github.io/dpv/dpv/#PubliclyAvailable
- 1: Internal: https://w3c.github.io/dpv/dpv/#InternalUse
- 2: Restricted: https://w3c.github.io/dpv/dpv/#RestrictedAccess
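Splitting on at most two colons keeps the colons inside a trailing URL intact. The sketch below illustrates the bullet format under that approach. `parse_enum_bullet` is a hypothetical name, and dropping a non-URL annotation is an assumption, since only the URL case is specified.

```python
def parse_enum_bullet(bullet: str) -> dict:
    """Parse 'value', 'value: label', or 'value: label: annotation'."""
    text = bullet.strip()
    if text.startswith("- "):
        text = text[2:].strip()
    # maxsplit=2 so that 'https://...' in the annotation is not broken apart.
    parts = [p.strip() for p in text.split(":", 2)]
    if len(parts) == 1:
        # No colon: the text is the value and, title-cased, the label.
        return {"Value": parts[0], "Label": parts[0].title(), "Description": ""}
    row = {"Value": parts[0], "Label": parts[1], "Description": ""}
    if len(parts) == 3 and parts[2].startswith(("http://", "https://")):
        row["Description"] = parts[2]
    return row
```

This sketch does not handle a bullet of the form value: URL with no label in between; such a bullet would need a smarter split.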
Markdown table (recommended for richer descriptions)
**Enumeration**:
| Value | Label | Description |
|-------|-------|-------------|
| 1 | Active | Account in good standing |
| 2 | Suspended | Temporarily suspended |
| 3 | Closed | Permanently closed |
Any column headers are accepted; the parser preserves them as keys on each row. Internally, ordinal/boolean detection looks at the Value column.
Component reuse
There are two forms. Use one per column.
### state
**Type**: text
**ReuseComponent**: @NIEM:StateUSPostalServiceCode
### gender
**Type**: text
**Reuse**: ct_abc123xyz
| Keyword | Form | When to use |
|---|---|---|
| **ReuseComponent**: | @ProjectName:ComponentLabel (must start with @ and contain :) | New, recommended. |
| **Reuse**: | bare component CT_ID | Direct ID reference; legacy. |
A **ReuseComponent**: value missing the leading @ produces a warning; missing the : separator or empty project / label produces an error.
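The warning and error cases for **ReuseComponent**: can be expressed as a small check. `check_reuse_component` and its tuple return shape are hypothetical; only the severity rules come from the text above.

```python
def check_reuse_component(value: str):
    """Return (severity, project, label) for an @Project:Label reference."""
    v = value.strip()
    severity = "ok"
    if v.startswith("@"):
        v = v[1:]
    else:
        severity = "warning"  # missing leading @ is tolerated with a warning
    project, sep, label = v.partition(":")
    if not sep or not project.strip() or not label.strip():
        return ("error", None, None)  # no separator, or empty project / label
    return (severity, project.strip(), label.strip())
```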
Validation behavior
Phase 1 validation classifies issues into three buckets:
Errors (template is rejected):
- Missing or non-4.x template_version.
- No ## Data: section.
- A column has no **Type**: parsed and no recognizable type-bearing context (rare).
- An invalid keyword that has a known correction (e.g. **Values**:).
- An invalid **ReuseComponent**: format.
- XdBoolean with an enumeration.
Warnings (template still parses):
- An unrecognized but plausible-looking keyword (e.g. **Type:** typo variants not in the corrections table).
- A **ReuseComponent**: value that does not start with @.
Suggestions (advisory only):
- An integer column with no units and no enumeration.
- A decimal column with no units.
- text with an enumeration (consider XdToken).
Phase 2: optional LLM enrichment
If enrichment.enable_llm: true (the default) and Phase 1 succeeds, the parser runs a second pass that may:
- Add semantic-link URIs (when **Semantic Links**: is empty).
- Augment short descriptions.
- Suggest a different SDC4 type via embeddings (recorded as a metadata note, never silently overwriting your declared type).
Phase 2 failures are non-fatal — they are logged but do not invalidate the template.
To skip Phase 2 entirely, set enrichment.enable_llm: false in the front matter.
Minimal example
---
template_version: "4.0.0"
dataset:
  name: "Customer Records"
  description: "Customer master data."
  creator: "Acme Corp"
enrichment:
  enable_llm: true
---
# Dataset Overview
Master customer records for use across all Acme business systems.
**Purpose**: Single source of truth for customer identity.
**Business Context**: Used by billing, support, and marketing systems.
## Data: Customer
**Rules**:
- At least one of email or phone must be provided
### customer_id
**Type**: text
**Description**: Globally unique customer identifier.
**Constraints**:
- required: true
### full_name
**Type**: text
**Description**: Customer's full name.
**Constraints**:
- required: true
### signup_date
**Type**: date
**Description**: When the customer first registered.
### account_status
**Type**: text
**Description**: Current account status.
**Enumeration**:
- active: Account in good standing
- suspended: Temporarily suspended
- closed: Permanently closed
### lifetime_value_usd
**Type**: decimal
**Units**: USD
**Description**: Total revenue from this customer.
**Constraints**:
- precision: 2
This template parses cleanly and produces a single Data cluster with five columns: customer_id (XdString), full_name (XdString), signup_date (XdTemporal), account_status (XdToken — text + enumeration), and lifetime_value_usd (XdQuantity — decimal + units + currency-like precision).