Markdown Template Format Reference

This is the authoritative reference for the Markdown template format consumed by SDCStudio's md2pd parser. It mirrors the parser's actual behavior; if the parser and this document disagree, the parser is correct and this document is a bug.

The parser is implemented at src/md2pd/agents/template_parser_agent.py and orchestrated by src/md2pd/core/workflow_controller.py.

File anatomy

---
<YAML front matter>
---

# Dataset Overview

<prose plus optional **Purpose** / **Business Context** keywords>

## <NamedTree>: <Name>
...

A template has three pieces: YAML front matter, an optional # Dataset Overview H1 section, and one or more H2 named-tree sections. The front matter must declare template_version, and at least one ## Data: section is required; everything else is optional.

YAML front matter

---
template_version: "4.0.0"
dataset:
  name: "Dataset Name"
  description: "What the dataset is for."
  creator: "Author or Organization"
  project: "project_ct_id"
enrichment:
  enable_llm: true
---

| Key | Required | Type | Notes |
|-----|----------|------|-------|
| template_version | Yes | string | Major version must be 4 (e.g. "4.0.0", "4.2.1"). Anything else fails validation. |
| dataset.name | No | string | Falls back to the file stem if omitted. |
| dataset.description | No | string | Stored on the parsed model. |
| dataset.creator | No | string | Stored as dm_creator. |
| dataset.project | No | string | Stored as dm_project. |
| enrichment.enable_llm | No | bool | Default: true (overridable via the MD2PD_DEFAULT_ENRICHMENT setting). When true, an optional Phase 2 LLM enrichment runs after Phase 1 parsing succeeds. |

Other top-level YAML keys are ignored.
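
The front-matter rules above can be sketched in a few lines of Python. This is an illustration only, not the parser's code: the function name check_front_matter and the returned dictionary layout are hypothetical, the input is assumed to be the YAML already loaded into a dict, and the MD2PD_DEFAULT_ENRICHMENT override is not modelled.

```python
from typing import Any, Dict


def check_front_matter(meta: Dict[str, Any], file_stem: str) -> Dict[str, Any]:
    """Apply the documented front-matter rules to an already-loaded YAML mapping."""
    version = str(meta.get("template_version", ""))
    # The major version (text before the first dot) must be 4.
    if version.split(".")[0] != "4":
        raise ValueError(f"unsupported template_version: {version!r}")

    dataset = meta.get("dataset") or {}
    enrichment = meta.get("enrichment") or {}
    return {
        # dataset.name falls back to the file stem when omitted.
        "name": dataset.get("name") or file_stem,
        "description": dataset.get("description"),
        "dm_creator": dataset.get("creator"),
        "dm_project": dataset.get("project"),
        # Assumed default of true; the real default is configurable.
        "enable_llm": enrichment.get("enable_llm", True),
    }
```

Calling check_front_matter({"template_version": "4.2.1"}, "customers") yields a model named "customers" with enrichment enabled, while a "3.0.0" version raises ValueError.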

# Dataset Overview (H1, optional)

A single H1 named exactly Dataset Overview. Free prose is allowed. Two bold keywords are extracted if present:

| Keyword | Stored as |
|---------|-----------|
| **Purpose**: | Overrides dataset.description. |
| **Business Context**: | Stored in semantic_context.business_context. |

All other prose in this section is informational and not parsed.
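
As a rough illustration of how the two bold keywords might be pulled out of the overview prose, the sketch below uses a single regular expression. The pattern and the function name extract_overview_keywords are assumptions made for this example; the parser's actual regex may differ.

```python
import re
from typing import Dict

# Matches lines such as "**Purpose**: some text" (hypothetical pattern).
OVERVIEW_KEYWORD_RE = re.compile(
    r"^\*\*(Purpose|Business Context)\*\*:[ \t]*(.+)$", re.MULTILINE
)


def extract_overview_keywords(section_text: str) -> Dict[str, str]:
    """Return the recognized keywords found in a Dataset Overview section."""
    return {
        match.group(1): match.group(2).strip()
        for match in OVERVIEW_KEYWORD_RE.finditer(section_text)
    }
```

Prose lines that do not start with one of the two keywords are simply skipped, which matches the statement that the remaining text is informational.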

Named-tree sections (H2)

The parser recognizes exactly eight H2 section types, each in the form ## <Type>: <Name>. The colon and name are required for all types except Links. Sections not in this list are not parsed.

| Type | Multiplicity | Purpose |
|------|--------------|---------|
| Data | 1+ required | Defines a flat cluster of components. |
| Workflow | optional, multiple | Same shape as Data (recognized but currently not stored in the parsed output; see the note below). |
| Subject | optional, max 1 | The data subject as a Party. |
| Provider | optional, max 1 | The data provider as a Party. |
| Participation | optional, multiple | A Participation with a performer Party. |
| Attestation | optional, max 1 | An Attestation node (View / Proof / Reason). |
| Audit | optional, multiple | Audit metadata. |
| Links | optional, max 1 | External URI references. |

Note on multiple ## Data: sections. If you write more than one ## Data: section, only the last one parsed is retained as the cluster hierarchy; earlier ones are discarded. Use a single ## Data: per template.

Note on ## Workflow:. The parser recognizes this section and parses its contents, but the resulting Workflow cluster is not currently included in the parsed output structure. Use ## Data: for content you need to round-trip.
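
The H2 recognition rule can be sketched as follows. The regex and the helper name classify_h2 are hypothetical, and the sketch assumes the colon is always written (including for Links, as in the examples in this document).

```python
import re
from typing import Optional, Tuple

SECTION_TYPES = {
    "Data", "Workflow", "Subject", "Provider",
    "Participation", "Attestation", "Audit", "Links",
}

# "## <Type>: <Name>"; exactly two hashes followed by whitespace.
H2_RE = re.compile(r"^##\s+(\w+):\s*(.*)$")


def classify_h2(line: str) -> Optional[Tuple[str, str]]:
    """Return (type, name) for a recognized H2 header, otherwise None."""
    match = H2_RE.match(line)
    if not match or match.group(1) not in SECTION_TYPES:
        return None  # unknown section types are not parsed
    kind, name = match.group(1), match.group(2).strip()
    if not name and kind != "Links":
        return None  # the name is required for every type except Links
    return kind, name
```

With this sketch, classify_h2("## Data: Patient Demographics") returns ("Data", "Patient Demographics"), and a header such as "## Notes: Misc" returns None.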

## Data: <Name>

The required cluster section. Contains columns and optional cluster-level keywords.

## Data: Patient Demographics

Brief prose description of the cluster.

**Purpose**: What this cluster represents
**Business Context**: How it's used
**Rules**:
  - end_date must be after start_date
  - At least one of email or phone must be provided

### full_name
**Type**: text
**Description**: Patient's full name.

### date_of_birth
**Type**: date
**Description**: Patient's date of birth.

Cluster-level keywords (all optional):

| Keyword | Format | Effect |
|---------|--------|--------|
| **Purpose**: | flat | Stored as cluster.purpose. |
| **Business Context**: | flat | Stored as cluster.business_context. |
| **Rules**: | bulleted list | Cross-field validation rules; stored as cluster.rules: List[str]. Translated to assertion expressions during downstream XSD generation. |

The first paragraph of prose (before any **Keyword**: or ### column) becomes the cluster description.

## Subject: <Name> and ## Provider: <Name>

Each captures a Party. Recognized fields:

## Subject: Patient

**Description**: Person whose health record this is.

### patient_id
**Type**: text
**Description**: Medical record number.

| Keyword | Effect |
|---------|--------|
| **Description**: | Stored on the PartyNode. |
| ### col_name | Columns belonging to this Party. |

## Subject: sets role='subject'; ## Provider: sets role='provider'. Multiple ## Subject: or ## Provider: sections are not supported (only one of each is retained).

## Participation: <Name>

## Participation: Examining Clinician

**Description**: Clinician who performed the examination.
**Function**: Examiner
**Function Description**: Performs and documents physical examination.
**Mode**: present
**Mode Description**: In-person examination.

### clinician_id
**Type**: text
**Description**: Clinician identifier.

| Keyword | Effect |
|---------|--------|
| **Description**: | Stored on the ParticipationNode. |
| **Function**: | The participation function label. |
| **Function Description**: | The function description. |
| **Mode**: | The participation mode label. |
| **Mode Description**: | The mode description. |
| ### col_name | Columns become an inline performer Party (role='performer'). |

## Participation: may appear multiple times.

## Attestation: <Name>

## Attestation: Encounter Sign-off

- **View**: Clinical Summary (`application/pdf`, content_mode=`url`)
- **Proof**: Clinician Signature (`application/pgp-signature`, content_mode=`embed`)
- **Reason**: Attestation Reason (XdString, min_length=10, max_length=500)

Each line uses the form **Field**: label (parenthetical). The parenthetical is optional but, when present, is parsed as follows:

| Field | Parenthetical contents |
|-------|------------------------|
| **View**: | Type (e.g. XdFile), backtick-quoted media type (`application/pdf`), content_mode=url or content_mode=embed. |
| **Proof**: | Same as View. |
| **Reason**: | Type (e.g. XdString), min_length=N, max_length=N. Default min/max = 1 / 500 if omitted. |

Order within the parenthetical does not matter; the parser extracts each piece by regex.
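
Because each piece is extracted independently, the order-insensitive behaviour is easy to picture with one regex per piece. The patterns and function names below are assumptions for illustration and are not copied from the parser.

```python
import re
from typing import Any, Dict


def parse_view_or_proof(paren: str) -> Dict[str, Any]:
    """Extract type, media type and content mode from a View/Proof parenthetical."""
    xd_type = re.search(r"\b(Xd[A-Z]\w*)\b", paren)
    media = re.search(r"`([^`]+/[^`]+)`", paren)            # backtick-quoted media type
    mode = re.search(r"content_mode=`?(url|embed)`?", paren)
    return {
        "type": xd_type.group(1) if xd_type else None,
        "media_type": media.group(1) if media else None,
        "content_mode": mode.group(1) if mode else None,
    }


def parse_reason(paren: str) -> Dict[str, Any]:
    """Extract type and length bounds from a Reason parenthetical."""
    xd_type = re.search(r"\b(Xd[A-Z]\w*)\b", paren)
    min_len = re.search(r"min_length=(\d+)", paren)
    max_len = re.search(r"max_length=(\d+)", paren)
    return {
        "type": xd_type.group(1) if xd_type else None,
        "min_length": int(min_len.group(1)) if min_len else 1,    # documented default
        "max_length": int(max_len.group(1)) if max_len else 500,  # documented default
    }
```

Swapping the order of the pieces inside the parentheses gives the same result, since every search scans the whole string.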

## Audit: <Name>

## Audit: System Provenance

**System ID**: ehr-prod-01
**System User**: clinician_jdoe
**Location**: Porto Sereno General Hospital

Audit sections capture flat metadata and are stored in the model's dm_audit list. Recognized fields: **System ID**:, **System User**:, **Location**:. Multiple ## Audit: sections are allowed.

## Links:

- https://www.w3.org/TR/prov-o/
- urn:oid:2.16.840.1.113883.4.1
- https://semanticdatacharter.com/ns/sdc4/

A bulleted list of URIs. Lines that begin with http, https, or urn: are stored in the model's dm_links list. Other content is ignored. The name after the colon is optional.
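
A minimal sketch of the filtering rule, assuming the bullet marker is stripped before the prefix check (the helper name extract_links is hypothetical):

```python
from typing import List


def extract_links(section_body: str) -> List[str]:
    """Collect bullet items that look like URIs; ignore everything else."""
    links = []
    for line in section_body.splitlines():
        item = line.strip().lstrip("-*").strip()  # drop the bullet marker
        # "http" also covers "https"; "urn:" covers URNs such as OIDs.
        if item.startswith(("http", "urn:")):
            links.append(item)
    return links
```

Lines of prose or blank lines in the section body fall through the prefix check and are dropped.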

Column format (### name)

Inside any section that contains columns (Data, Workflow, Subject, Provider, Participation), each column is introduced by an H3:

### column_name
**Type**: text
**Description**: ...
**Constraints**:
  - required: true
**Examples**: example1, example2

The text after ### is the column name verbatim. Do not use ### Column: name — the parser will treat Column: name as the column name.
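
To make the "verbatim name" rule concrete, the sketch below splits a section body into columns and collects their single-line keywords. It is a simplified illustration: split_columns is a hypothetical helper, bulleted keywords such as Constraints are skipped, and trimming surrounding whitespace from the name is an assumption.

```python
import re
from typing import Dict, Optional

# A flat keyword line: "**Keyword**: value".
FLAT_KEYWORD_RE = re.compile(r"^\*\*([^*]+)\*\*:[ \t]*(.*)$")


def split_columns(body: str) -> Dict[str, Dict[str, str]]:
    """Map each column name to its flat (single-line) keyword values."""
    columns: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for line in body.splitlines():
        if line.startswith("### "):
            # Everything after "### " is the column name, taken as written.
            current = columns.setdefault(line[4:].strip(), {})
        elif current is not None:
            match = FLAT_KEYWORD_RE.match(line)
            if match and match.group(2):
                current[match.group(1)] = match.group(2).strip()
    return columns
```

Feeding it a header written as "### Column: name" produces a column literally called "Column: name", which is the pitfall described above.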

Column keyword allowlist

The parser validates each **Keyword**: against a fixed allowlist. Unknown keywords produce warnings; common typos produce errors with auto-corrections.

Parsed and stored:

| Keyword | Format | Notes |
|---------|--------|-------|
| **Type**: | flat | The user type (lowercase). Defaults to text if omitted. See Type system. |
| **Description**: | flat | Free text. |
| **Units**: | flat | Measurement units (e.g. kg, USD, years). Drives type inference for numeric types. |
| **Constraints**: | bulleted list of key: value | See Constraints. |
| **Enumeration**: | bulleted list or markdown table | See Enumeration. |
| **Examples**: | flat, comma-separated | Stored as sample data; commas split into a list. |
| **Relationships**: | flat | Free text describing relationships to other columns. |
| **Business Rules**: | flat | Free text. |
| **Semantic Links**: | bulleted list of URIs | Each URI becomes {'uri': '<uri>'}. |
| **Reuse**: | flat | Direct reference by component CT_ID. See Component reuse. |
| **ReuseComponent**: | flat, @Project:Label form | See Component reuse. |

Allowed for documentation only (parser tolerates them but does not extract them as fields):

**Ontology Mappings**:, **Standard**:, **NIH CDE Source**:, **PHI Status**:, **Clinical Significance**:, **HL7 Security Classification**:, **Access Control Requirements**:, **De-identification Considerations**:, **Calculation Method**:, **Important Distinctions**:, **Important Notes**:

These remain in the column's surrounding text but are not pulled into named fields.

Common typo auto-corrections (errors)

The parser flags these as errors and suggests the corrected form:

| You wrote | Parser expects |
|-----------|----------------|
| **Values**:, **Value**:, **Enumerations**:, **Allowed Values**:, **Options**: | **Enumeration**: |
| **Unit**: | **Units**: |
| **DataType**:, **Data Type**: | **Type**: |
| **Example**: | **Examples**: |
| **Constraint**: | **Constraints**: |
| **Business Rule**: | **Business Rules**: |
| **Relationship**: | **Relationships**: |
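
The allowlist and the correction table combine into a three-way classification. The sketch below models that with plain sets and a dict; check_keyword and the tuple it returns are hypothetical, and the data is transcribed from the lists in this section.

```python
from typing import Optional, Tuple

PARSED = {
    "Type", "Description", "Units", "Constraints", "Enumeration", "Examples",
    "Relationships", "Business Rules", "Semantic Links", "Reuse", "ReuseComponent",
}
DOCUMENTATION_ONLY = {
    "Ontology Mappings", "Standard", "NIH CDE Source", "PHI Status",
    "Clinical Significance", "HL7 Security Classification",
    "Access Control Requirements", "De-identification Considerations",
    "Calculation Method", "Important Distinctions", "Important Notes",
}
CORRECTIONS = {
    "Values": "Enumeration", "Value": "Enumeration", "Enumerations": "Enumeration",
    "Allowed Values": "Enumeration", "Options": "Enumeration",
    "Unit": "Units", "DataType": "Type", "Data Type": "Type",
    "Example": "Examples", "Constraint": "Constraints",
    "Business Rule": "Business Rules", "Relationship": "Relationships",
}


def check_keyword(keyword: str) -> Tuple[str, Optional[str]]:
    """Return (severity, suggested_keyword) for a column keyword."""
    if keyword in PARSED or keyword in DOCUMENTATION_ONLY:
        return "ok", None
    if keyword in CORRECTIONS:
        return "error", CORRECTIONS[keyword]  # known typo, with its correction
    return "warning", None                    # unknown keyword
```

For example, check_keyword("Values") reports an error with the suggestion "Enumeration", while an unlisted keyword only produces a warning.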

Type system

**Type**: accepts either an explicit SDC4 type or a user-friendly type. User-friendly types are mapped to SDC4 types using context (units, enumeration shape, name patterns, constraints).

Explicit SDC4 types

Shown here in lowercase; matching is case-insensitive. Each maps 1:1 to a canonical type name:

| You write | Maps to |
|-----------|---------|
| xdstring | XdString |
| xdtoken | XdToken |
| xdcount | XdCount |
| xdordinal | XdOrdinal |
| xdquantity | XdQuantity |
| xdfloat | XdFloat |
| xddouble | XdDouble |
| xdboolean | XdBoolean |
| xdtemporal | XdTemporal |
| xdlink | XdLink |
| xdfile | XdFile |
| xdinterval | XdInterval |

When you write an explicit SDC4 type the parser uses it as-is and skips inference.

User-friendly types

| You write | Maps to (no other context) |
|-----------|----------------------------|
| text, string, varchar, char | XdString |
| integer, int, whole number, count | XdCount |
| decimal, float, number, numeric, double | XdFloat |
| boolean, bool, flag | XdBoolean |
| date, datetime, timestamp, time | XdTemporal |
| identifier, id, uuid, guid | XdString |
| email | XdString |
| url, uri, link | XdLink |
| anything else | XdString (with a warning) |
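
The context-free mapping amounts to two lookup tables plus a fallback. The sketch below is illustrative only; default_sdc4_type and its (type, warning) return shape are assumptions, and no contextual inference is applied.

```python
from typing import Optional, Tuple

# Explicit SDC4 types, keyed by their lowercase spelling.
EXPLICIT = {name.lower(): name for name in (
    "XdString", "XdToken", "XdCount", "XdOrdinal", "XdQuantity", "XdFloat",
    "XdDouble", "XdBoolean", "XdTemporal", "XdLink", "XdFile", "XdInterval",
)}

# User-friendly types and their defaults when no other context applies.
FRIENDLY = {}
for sdc4, aliases in {
    "XdString": ("text", "string", "varchar", "char",
                 "identifier", "id", "uuid", "guid", "email"),
    "XdCount": ("integer", "int", "whole number", "count"),
    "XdFloat": ("decimal", "float", "number", "numeric", "double"),
    "XdBoolean": ("boolean", "bool", "flag"),
    "XdTemporal": ("date", "datetime", "timestamp", "time"),
    "XdLink": ("url", "uri", "link"),
}.items():
    for alias in aliases:
        FRIENDLY[alias] = sdc4


def default_sdc4_type(user_type: str = "text") -> Tuple[str, Optional[str]]:
    """Return (sdc4_type, warning) for a declared type, ignoring context."""
    key = user_type.strip().lower()
    if key in EXPLICIT:
        return EXPLICIT[key], None
    if key in FRIENDLY:
        return FRIENDLY[key], None
    return "XdString", f"unknown type {user_type!r}; defaulting to XdString"
```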

Type inference order

Context can override the default user-type mapping. The parser checks rules in this order; the first match wins:

  1. Explicit SDC4 type (above) — used as-is.
  2. Enumeration shape — see Enumeration-driven inference.
  3. text + any enumeration → XdToken.
  4. Name patterns — see Name-pattern inference.
  5. User type + context (integer + units, decimal + precision, etc.).
  6. Default for the user type.

Enumeration-driven inference

When **Enumeration**: is present, the parser inspects the values:

| Enumeration shape | Result |
|-------------------|--------|
| Values match a boolean set: {yes,no}, {true,false}, {present,absent}, {positive,negative}, {active,inactive}, {enabled,disabled} (with optional unknown / n/a) | XdBoolean |
| Values match a known ordinal pattern: severity scales (none/mild/moderate/severe/critical), Likert scales (strongly disagree to strongly agree), grade scales, frequency scales, risk scales, satisfaction scales, maturity scales | XdOrdinal |
| Values are consecutive integers (e.g. 0,1,2,3 or 1,2,3,4) | XdOrdinal |
| Values match Level N / Stage N / Phase N / Grade N / Step N / Tier N | XdOrdinal |

If **Type**: text is combined with an enumeration that does not match any of the patterns above, the result is XdToken.

If **Type**: xdboolean is set explicitly and an enumeration is present, the parser raises an error: XdBoolean cannot have an enumeration.
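
A reduced sketch of the enumeration rules follows. It covers the boolean sets, consecutive integers, the "Level N" family, the text fallback and the explicit-xdboolean error; the named ordinal scales (severity, Likert and so on) are deliberately left out, and infer_from_enumeration is a hypothetical name.

```python
import re
from typing import List, Optional

BOOLEAN_SETS = [
    {"yes", "no"}, {"true", "false"}, {"present", "absent"},
    {"positive", "negative"}, {"active", "inactive"}, {"enabled", "disabled"},
]
OPTIONAL_EXTRAS = {"unknown", "n/a"}
LEVEL_RE = re.compile(r"^(level|stage|phase|grade|step|tier)\s+\d+$", re.IGNORECASE)


def infer_from_enumeration(values: List[str], user_type: str = "text") -> Optional[str]:
    """Infer an SDC4 type from enumeration values, or None if no rule applies."""
    if user_type.lower() == "xdboolean" and values:
        raise ValueError("XdBoolean cannot have an enumeration")
    norm = [v.strip().lower() for v in values]
    if (set(norm) - OPTIONAL_EXTRAS) in BOOLEAN_SETS:
        return "XdBoolean"
    if norm and all(LEVEL_RE.match(v) for v in norm):
        return "XdOrdinal"
    if len(norm) > 1 and all(re.fullmatch(r"-?\d+", v) for v in norm):
        nums = [int(v) for v in norm]
        if all(b - a == 1 for a, b in zip(nums, nums[1:])):
            return "XdOrdinal"  # consecutive integers
    if norm and user_type.lower() == "text":
        return "XdToken"        # text + enumeration with no special shape
    return None
```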

Name-pattern inference

Name-pattern rules apply when no explicit SDC4 type and no enumeration override is present:

| Column name pattern | Result |
|---------------------|--------|
| ends with _uri or _url, or is exactly uri/url | XdLink |
| contains _id, _code, identifier, _key, _uuid, _guid | XdString |
| contains _link, website, homepage | XdLink |
| starts with is_, has_, can_, or ends with _flag | XdBoolean |
| contains year (with **Type**: integer/int/whole number/count/text/string) | XdTemporal |
| contains _time_, elapsed, latency, duration, processing_time, response_time | XdTemporal |
| contains _date, _time, _at, timestamp, created, updated, modified, deleted | XdTemporal |
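
The name patterns translate naturally into a chain of substring checks. The sketch below assumes the rows are evaluated top to bottom with the first match winning, which this document does not state explicitly; infer_from_name is a hypothetical helper.

```python
from typing import Optional

YEAR_TYPES = {"integer", "int", "whole number", "count", "text", "string"}
TEMPORAL_PARTS = (
    "_time_", "elapsed", "latency", "duration", "processing_time", "response_time",
    "_date", "_time", "_at", "timestamp", "created", "updated", "modified", "deleted",
)


def infer_from_name(name: str, user_type: str = "text") -> Optional[str]:
    """Infer an SDC4 type from a column name, or None if no pattern matches."""
    n = name.lower()
    if n in ("uri", "url") or n.endswith(("_uri", "_url")):
        return "XdLink"
    if any(p in n for p in ("_id", "_code", "identifier", "_key", "_uuid", "_guid")):
        return "XdString"
    if any(p in n for p in ("_link", "website", "homepage")):
        return "XdLink"
    if n.startswith(("is_", "has_", "can_")) or n.endswith("_flag"):
        return "XdBoolean"
    if "year" in n and user_type.lower() in YEAR_TYPES:
        return "XdTemporal"
    if any(p in n for p in TEMPORAL_PARTS):
        return "XdTemporal"
    return None
```

Under that ordering assumption, a name such as patient_id resolves to XdString and created_at resolves to XdTemporal, while full_name matches nothing and falls through to the later inference steps.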

Numeric context inference

When the user wrote a numeric type and no override above applied:

| User type | Context | Result |
|-----------|---------|--------|
| integer family | enumeration present | XdOrdinal |
| integer family | units set | XdCount |
| integer family | range constraint set | XdCount |
| integer family | otherwise | XdCount |
| decimal family | units set | XdQuantity |
| decimal family | precision: 2 | XdQuantity (currency-like) |
| decimal family | precision: 10+ | XdDouble |
| decimal family | otherwise | XdFloat |
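
The numeric rules reduce to a short decision function. In this sketch the rows are checked in table order (units before precision), which is an assumption, and infer_numeric with its parameters is a hypothetical signature.

```python
from typing import Optional

INTEGER_FAMILY = {"integer", "int", "whole number", "count"}
DECIMAL_FAMILY = {"decimal", "float", "number", "numeric", "double"}


def infer_numeric(
    user_type: str,
    units: Optional[str] = None,
    has_enumeration: bool = False,
    precision: Optional[int] = None,
) -> Optional[str]:
    """Resolve a numeric user type to an SDC4 type using its context."""
    t = user_type.strip().lower()
    if t in INTEGER_FAMILY:
        # Units and range constraints both leave the result at XdCount.
        return "XdOrdinal" if has_enumeration else "XdCount"
    if t in DECIMAL_FAMILY:
        if units:
            return "XdQuantity"
        if precision == 2:
            return "XdQuantity"  # currency-like
        if precision is not None and precision >= 10:
            return "XdDouble"
        return "XdFloat"
    return None  # not a numeric user type
```

A decimal column with Units: USD and precision: 2, like lifetime_value_usd in the minimal example, resolves to XdQuantity on the units rule alone.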

Constraints

**Constraints**:
  - required: true
  - range: [0, 120]

The parser parses any key: value pair into a constraints dictionary. Of those, only some are read back to populate column fields:

| Key | Value | Effect |
|-----|-------|--------|
| required | true / false | Sets column.nullability to required or optional. |
| range | [min, max] (a list of exactly two values) | Sets column.range_values to "min to max". |
| precision | integer | Affects type inference for decimal user types (== 2 → XdQuantity, >= 10 → XdDouble). |

Other constraint keys (min_value, max_value, format, unique, pattern, etc.) are accepted syntactically and stored in the constraints dictionary, but the parser does not currently use them to populate any structured column field. To express a numeric range, use range: [min, max] rather than separate min_value / max_value keys.
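
The split between "stored in the dictionary" and "read back into column fields" can be illustrated as two small steps. Both function names are hypothetical, values are kept as strings, and ast.literal_eval is used here only as a convenient way to read the [min, max] list; the parser may do this differently.

```python
import ast
from typing import Dict, List


def parse_constraints(bullets: List[str]) -> Dict[str, str]:
    """Turn "- key: value" bullets into a constraints dictionary."""
    constraints = {}
    for bullet in bullets:
        key, sep, value = bullet.lstrip("- ").partition(":")
        if sep:
            constraints[key.strip()] = value.strip()
    return constraints


def apply_constraints(constraints: Dict[str, str]) -> Dict[str, str]:
    """Derive the structured column fields that constraints can populate."""
    fields = {}
    if "required" in constraints:
        required = constraints["required"].lower() == "true"
        fields["nullability"] = "required" if required else "optional"
    if "range" in constraints:
        try:
            bounds = ast.literal_eval(constraints["range"])
        except (ValueError, SyntaxError):
            bounds = None
        if isinstance(bounds, list) and len(bounds) == 2:
            fields["range_values"] = f"{bounds[0]} to {bounds[1]}"
    return fields
```

Keys such as min_value survive in the dictionary returned by parse_constraints but contribute nothing to the output of apply_constraints, mirroring the behaviour described above.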

Enumeration

Two formats are supported. Both produce the same internal structure with Value, Label, and Description columns.

**Enumeration**:
  - active: Account in good standing
  - suspended: Temporarily suspended
  - closed: Permanently closed

Each bullet is value: label or value: label: annotation. If the annotation is a URL it is stored as Description. Without a colon, the bullet text becomes both the value and the title-cased label.
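
One way to picture the bullet rule is to split on at most two colons, so that a trailing URL keeps its own colons intact. The helper below is a sketch: parse_enum_bullet is a hypothetical name, and folding a non-URL third part back into the label is an assumption, since this document only specifies the URL case.

```python
from typing import Dict


def parse_enum_bullet(bullet: str) -> Dict[str, str]:
    """Parse "- value: label[: annotation]" into Value/Label/Description."""
    text = bullet.strip().lstrip("-").strip()
    parts = [p.strip() for p in text.split(":", 2)]
    if len(parts) == 1:
        # No colon: the text is the value and its title-cased form the label.
        return {"Value": parts[0], "Label": parts[0].title(), "Description": ""}
    value, label = parts[0], parts[1]
    description = ""
    if len(parts) == 3:
        if parts[2].startswith(("http://", "https://")):
            description = parts[2]             # URL annotation
        else:
            label = f"{label}: {parts[2]}"     # assumption: keep it in the label
    return {"Value": value, "Label": label, "Description": description}
```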

You may also embed semantic URLs directly:

**Enumeration**:
  - 0: Public: https://w3c.github.io/dpv/dpv/#PubliclyAvailable
  - 1: Internal: https://w3c.github.io/dpv/dpv/#InternalUse
  - 2: Restricted: https://w3c.github.io/dpv/dpv/#RestrictedAccess

The second format is a markdown table:

**Enumeration**:

| Value | Label | Description |
|-------|-------|-------------|
| 1 | Active | Account in good standing |
| 2 | Suspended | Temporarily suspended |
| 3 | Closed | Permanently closed |

Any column headers are accepted; the parser preserves them as keys on each row. Internally, ordinal/boolean detection looks at the Value column.

Component reuse

There are two forms. Use one per column.

### state
**Type**: text
**ReuseComponent**: @NIEM:StateUSPostalServiceCode

### gender
**Type**: text
**Reuse**: ct_abc123xyz

| Keyword | Form | When to use |
|---------|------|-------------|
| **ReuseComponent**: | @ProjectName:ComponentLabel (must start with @ and contain :) | New, recommended. |
| **Reuse**: | bare component CT_ID | Direct ID reference; legacy. |

A **ReuseComponent**: value missing the leading @ produces a warning; missing the : separator or empty project / label produces an error.
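
The warning and error cases for **ReuseComponent**: can be sketched as a small checker. The function name and the (severity, message) tuples are assumptions made for this example.

```python
from typing import List, Tuple


def check_reuse_component(value: str) -> List[Tuple[str, str]]:
    """Return the issues found in a ReuseComponent value (empty list if valid)."""
    issues = []
    ref = value.strip()
    if ref.startswith("@"):
        ref = ref[1:]
    else:
        issues.append(("warning", "value should start with @"))
    project, sep, label = ref.partition(":")
    if not sep:
        issues.append(("error", "missing ':' separator"))
    elif not project.strip() or not label.strip():
        issues.append(("error", "empty project or label"))
    return issues
```

A well-formed value such as @NIEM:StateUSPostalServiceCode yields no issues, NIEM:Code yields only a warning, and @NIEM yields an error.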

Validation behavior

Phase 1 validation classifies issues into three buckets:

Errors (template is rejected):

  • Missing or non-4.x template_version.
  • No ## Data: section.
  • A column has no **Type**: parsed and no recognizable type-bearing context (rare).
  • An invalid keyword that has a known correction (e.g. **Values**:).
  • An invalid **ReuseComponent**: format.
  • XdBoolean with an enumeration.

Warnings (template still parses):

  • An unrecognized but plausible-looking keyword (e.g. **Type:** typo variants not in the corrections table).
  • A **ReuseComponent**: value that does not start with @.

Suggestions (advisory only):

  • An integer column with no units and no enumeration.
  • A decimal column with no units.
  • text with an enumeration (consider XdToken).

Phase 2: optional LLM enrichment

If enrichment.enable_llm: true (the default) and Phase 1 succeeds, the parser runs a second pass that may:

  • Add semantic-link URIs (when **Semantic Links**: is empty).
  • Augment short descriptions.
  • Suggest a different SDC4 type via embeddings (recorded as a metadata note, never silently overwriting your declared type).

Phase 2 failures are non-fatal — they are logged but do not invalidate the template.

To skip Phase 2 entirely, set enrichment.enable_llm: false in the front matter.

Minimal example

---
template_version: "4.0.0"
dataset:
  name: "Customer Records"
  description: "Customer master data."
  creator: "Acme Corp"
enrichment:
  enable_llm: true
---

# Dataset Overview

Master customer records for use across all Acme business systems.

**Purpose**: Single source of truth for customer identity.
**Business Context**: Used by billing, support, and marketing systems.

## Data: Customer

**Rules**:
  - At least one of email or phone must be provided

### customer_id
**Type**: text
**Description**: Globally unique customer identifier.
**Constraints**:
  - required: true

### full_name
**Type**: text
**Description**: Customer's full name.
**Constraints**:
  - required: true

### signup_date
**Type**: date
**Description**: When the customer first registered.

### account_status
**Type**: text
**Description**: Current account status.
**Enumeration**:
  - active: Account in good standing
  - suspended: Temporarily suspended
  - closed: Permanently closed

### lifetime_value_usd
**Type**: decimal
**Units**: USD
**Description**: Total revenue from this customer.
**Constraints**:
  - precision: 2

This template parses cleanly and produces a single Data cluster with five columns: customer_id (XdString), full_name (XdString), signup_date (XdTemporal), account_status (XdToken — text + enumeration), and lifetime_value_usd (XdQuantity — decimal + units + currency-like precision).