Markdown Template Format Reference

This is the authoritative reference for the Markdown template format consumed by SDCStudio's md2pd parser. It mirrors the parser's actual behavior; if the parser and this document disagree, the parser is correct and this document is a bug.

The parser is implemented at src/md2pd/agents/template_parser_agent.py and orchestrated by src/md2pd/core/workflow_controller.py.

File anatomy

---
<YAML front matter>
---

# Dataset Overview

<prose plus optional **Purpose** / **Business Context** keywords>

## <NamedTree>: <Name>
...

A template has three pieces: YAML front matter, an optional # Dataset Overview H1 section, and one or more H2 named-tree sections. The front matter's template_version and at least one ## Data: section are required; everything else is optional.

YAML front matter

---
template_version: "4.0.0"
dataset:
  name: "Dataset Name"
  description: "What the dataset is for."
  creator: "Author or Organization"
  project: "project_ct_id"
enrichment:
  enable_llm: true
---

Key	Required	Type	Notes
`template_version`	Yes	string	Major version must be `4` (e.g. `"4.0.0"`, `"4.2.1"`). Anything else fails validation.
`dataset.name`	No	string	Falls back to the file stem if omitted.
`dataset.description`	No	string	Stored on the parsed model.
`dataset.creator`	No	string	Stored as `dm_creator`.
`dataset.project`	No	string	Stored as `dm_project`.
`enrichment.enable_llm`	No	bool	Default: `true` (overridable via the `MD2PD_DEFAULT_ENRICHMENT` setting). When true, an optional Phase 2 LLM enrichment runs after Phase 1 parsing succeeds.

Other top-level YAML keys are ignored.
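
The front-matter rules in the table above can be sketched as a small check over an already-loaded YAML mapping. This is an illustration under stated assumptions, not the parser's code: `check_front_matter` and the shape of its return value are hypothetical, and the `MD2PD_DEFAULT_ENRICHMENT` override is not modeled.

```python
from pathlib import Path


def check_front_matter(meta: dict, filename: str) -> dict:
    """Apply the documented front-matter rules to an already-parsed YAML mapping.

    Hypothetical helper: the name and return shape are assumptions for illustration.
    """
    version = str(meta.get("template_version", ""))
    major = version.split(".")[0]
    if major != "4":
        # Missing or non-4.x versions fail validation.
        raise ValueError(f"template_version must be 4.x, got {version!r}")

    dataset = meta.get("dataset") or {}
    enrichment = meta.get("enrichment") or {}
    return {
        # dataset.name falls back to the file stem when omitted.
        "name": dataset.get("name") or Path(filename).stem,
        "description": dataset.get("description"),
        "dm_creator": dataset.get("creator"),
        "dm_project": dataset.get("project"),
        # Documented default is true; the settings override is ignored here.
        "enable_llm": enrichment.get("enable_llm", True),
    }
```

Any other top-level keys in `meta` are simply never read, which matches the "ignored" behavior described above.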

`# Dataset Overview` (H1, optional)

A single H1 named exactly Dataset Overview. Free prose is allowed. Two bold keywords are extracted if present:

Keyword	Effect
`Purpose:`	Overrides `dataset.description`.
`Business Context:`	Stored in `semantic_context.business_context`.

All other prose in this section is informational and not parsed.

Named-tree sections (H2)

The parser recognizes exactly eight H2 section types, each in the form ## <Type>: <Name>. The colon and name are required for all types except Links. Sections not in this list are not parsed.

Type	Multiplicity	Purpose
`Data`	1+ required	Defines a flat cluster of components.
`Workflow`	optional, multiple	Same shape as `Data` (recognized but currently not stored in the parsed output — see note below).
`Subject`	optional, max 1	The data subject as a Party.
`Provider`	optional, max 1	The data provider as a Party.
`Participation`	optional, multiple	A Participation with a performer Party.
`Attestation`	optional, max 1	An Attestation node (View / Proof / Reason).
`Audit`	optional, multiple	Audit metadata.
`Links`	optional, max 1	External URI references.

Note on multiple ## Data: sections. If you write more than one ## Data: section, only the last one parsed is retained as the cluster hierarchy; earlier ones are discarded. Use a single ## Data: per template.

Note on ## Workflow:. The parser recognizes this section and parses its contents, but the resulting Workflow cluster is not currently included in the parsed output structure. Use ## Data: for content you need to round-trip.
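
As a rough illustration of the H2 recognition rules above, the following sketch classifies a header line. The regular expression, the `classify_h2` name, and the case-sensitive type match are assumptions; the real parser may differ in whitespace and case handling.

```python
import re

SECTION_TYPES = {
    "Data", "Workflow", "Subject", "Provider",
    "Participation", "Attestation", "Audit", "Links",
}

# "## <Type>: <Name>", with the colon and name captured as optional parts.
H2_RE = re.compile(r"^##\s+(?P<type>[A-Za-z]+)\s*(?::\s*(?P<name>.*))?$")


def classify_h2(line: str):
    """Return (type, name) for a recognized H2 header, else None (hypothetical helper)."""
    m = H2_RE.match(line.rstrip())
    if not m or m.group("type") not in SECTION_TYPES:
        return None  # sections outside the list are not parsed
    name = (m.group("name") or "").strip()
    if m.group("type") != "Links" and not name:
        return None  # a name is required for every type except Links
    return m.group("type"), name
```

For example, `classify_h2("## Data: Patient Demographics")` yields `("Data", "Patient Demographics")`, while `classify_h2("## Notes: Misc")` yields `None`.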

`## Data: <Name>`

The required cluster section. Contains columns and optional cluster-level keywords.

## Data: Patient Demographics

Brief prose description of the cluster.

**Purpose**: What this cluster represents
**Business Context**: How it's used
**Rules**:
  - end_date must be after start_date
  - At least one of email or phone must be provided

### full_name
**Type**: text
**Description**: Patient's full name.

### date_of_birth
**Type**: date
**Description**: Patient's date of birth.

Cluster-level keywords (all optional):

Keyword	Format	Effect
`Purpose:`	flat	Stored as `cluster.purpose`.
`Business Context:`	flat	Stored as `cluster.business_context`.
`Rules:`	bulleted list	Cross-field validation rules; stored as `cluster.rules: List[str]`. Translated to assertion expressions during downstream XSD generation.

The first paragraph of prose (before any **Keyword**: or ### column) becomes the cluster description.

`## Subject: <Name>` and `## Provider: <Name>`

Each captures a Party. Recognized fields:

## Subject: Patient

**Description**: Person whose health record this is.

### patient_id
**Type**: text
**Description**: Medical record number.

Keyword	Effect
`Description:`	Stored on the PartyNode.
`### col_name`	Columns belonging to this Party.

## Subject: sets role='subject'; ## Provider: sets role='provider'. Multiple ## Subject: or ## Provider: sections are not supported (only one of each is retained).

`## Participation: <Name>`

## Participation: Examining Clinician

**Description**: Clinician who performed the examination.
**Function**: Examiner
**Function Description**: Performs and documents physical examination.
**Mode**: present
**Mode Description**: In-person examination.

### clinician_id
**Type**: text
**Description**: Clinician identifier.

Keyword	Effect
`Description:`	Stored on the ParticipationNode.
`Function:`	The participation function label.
`Function Description:`	The function description.
`Mode:`	The participation mode label.
`Mode Description:`	The mode description.
`### col_name`	Columns become an inline performer Party (role='performer').

## Participation: may appear multiple times.

`## Attestation: <Name>`

## Attestation: Encounter Sign-off

- **View**: Clinical Summary (`application/pdf`, content_mode=`url`)
- **Proof**: Clinician Signature (`application/pgp-signature`, content_mode=`embed`)
- **Reason**: Attestation Reason (XdString, min_length=10, max_length=500)

Each line uses the form **Field**: label (parenthetical). The parenthetical is optional but, when present, is parsed as follows:

Field	Parenthetical contents
`View:`	Type (e.g. `XdFile`), backtick-quoted media type (`application/pdf`), `content_mode=url` or `content_mode=embed`.
`Proof:`	Same as View.
`Reason:`	Type (e.g. `XdString`), `min_length=N`, `max_length=N`. Default min/max = 1 / 500 if omitted.

Order within the parenthetical does not matter; the parser extracts each piece by regex.
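
The order-independent extraction described above could look roughly like the sketch below. The regular expressions and the `parse_attestation_line` name are assumptions for illustration; only the field names, the `content_mode` values, and the 1 / 500 defaults come from this document.

```python
import re

LINE_RE = re.compile(
    r"^-?\s*\*\*(?P<field>View|Proof|Reason)\*\*:\s*"
    r"(?P<label>[^()]+?)\s*(?:\((?P<paren>.*)\))?\s*$"
)


def parse_attestation_line(line: str) -> dict:
    """Parse one '**Field**: label (parenthetical)' line (hypothetical helper)."""
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError(f"not an attestation line: {line!r}")
    field = m.group("field")
    paren = m.group("paren") or ""
    xd = re.search(r"\bXd[A-Za-z]+\b", paren)
    out = {"field": field, "label": m.group("label").strip(),
           "type": xd.group(0) if xd else None}
    if field in ("View", "Proof"):
        # Each piece is searched independently, so order does not matter.
        media = re.search(r"`([\w.+-]+/[\w.+-]+)`", paren)
        mode = re.search(r"content_mode=`?(url|embed)`?", paren)
        out["media_type"] = media.group(1) if media else None
        out["content_mode"] = mode.group(1) if mode else None
    else:
        lo = re.search(r"min_length=(\d+)", paren)
        hi = re.search(r"max_length=(\d+)", paren)
        out["min_length"] = int(lo.group(1)) if lo else 1    # documented default
        out["max_length"] = int(hi.group(1)) if hi else 500  # documented default
    return out
```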

`## Audit: <Name>`

## Audit: System Provenance

**System ID**: ehr-prod-01
**System User**: clinician_jdoe
**Location**: Porto Sereno General Hospital

Audit sections capture flat metadata and are stored in the model's dm_audit list. Recognized fields: **System ID**:, **System User**:, **Location**:. Multiple ## Audit: sections are allowed.

`## Links:`

## Links:

- https://www.w3.org/TR/prov-o/
- urn:oid:2.16.840.1.113883.4.1
- https://semanticdatacharter.com/ns/sdc4/

A bulleted list of URIs. Items that begin with http, https, or urn: are stored in the model's dm_links list. Other content is ignored. The name after the colon is optional.
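
A minimal sketch of that filter, assuming the bullet marker is stripped before the prefix test (`extract_links` is a hypothetical name, not the parser's function):

```python
def extract_links(section_lines):
    """Collect URI entries from the body of a Links section (illustrative only)."""
    links = []
    for line in section_lines:
        # Drop indentation and the leading "-" bullet marker, if any.
        text = line.strip().lstrip("-").strip()
        # "http" also covers "https"; everything else is ignored.
        if text.startswith(("http", "urn:")):
            links.append(text)
    return links
```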

Column format (`### name`)

Inside any section that contains columns (Data, Workflow, Subject, Provider, Participation), each column is introduced by an H3:

### column_name
**Type**: text
**Description**: ...
**Constraints**:
  - required: true
**Examples**: example1, example2

The text after ### is the column name verbatim. Do not use ### Column: name — the parser will treat Column: name as the column name.

Column keyword allowlist

The parser validates each **Keyword**: against a fixed allowlist. Unknown keywords produce warnings; common typos produce errors with auto-corrections.

Parsed and stored:

Keyword	Format	Notes
`Type:`	flat	The user type (lowercase). Defaults to `text` if omitted. See Type system.
`Description:`	flat	Free text.
`Units:`	flat	Measurement units (e.g. `kg`, `USD`, `years`). Drives type inference for numeric types.
`Constraints:`	bulleted list of `key: value`	See Constraints.
`Enumeration:`	bulleted list or markdown table	See Enumeration.
`Examples:`	flat, comma-separated	Stored as sample data; commas split into a list.
`Relationships:`	flat	Free text describing relationships to other columns.
`Business Rules:`	flat	Free text.
`Semantic Links:`	bulleted list of URIs	Each URI becomes `{'uri': '<uri>'}`.
`Reuse:`	flat	Direct reference by component CT_ID. See Component reuse.
`ReuseComponent:`	flat, `@Project:Label` form	See Component reuse.

Allowed for documentation only (parser tolerates them but does not extract them as fields):

**Ontology Mappings**:, **Standard**:, **NIH CDE Source**:, **PHI Status**:, **Clinical Significance**:, **HL7 Security Classification**:, **Access Control Requirements**:, **De-identification Considerations**:, **Calculation Method**:, **Important Distinctions**:, **Important Notes**:

These remain in the column's surrounding text but are not pulled into named fields.

Common typo auto-corrections (errors)

The parser flags these as errors and suggests the corrected form:

You wrote	Parser expects
`Values:`, `Value:`, `Enumerations:`, `Allowed Values:`, `Options:`	`Enumeration:`
`Unit:`	`Units:`
`DataType:`, `Data Type:`	`Type:`
`Example:`	`Examples:`
`Constraint:`	`Constraints:`
`Business Rule:`	`Business Rules:`
`Relationship:`	`Relationships:`
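
The corrections table maps naturally onto a lookup. The sketch below is an assumption about shape only: `CORRECTIONS` and `suggest_keyword` are hypothetical names, and exact (case-sensitive) matching is assumed.

```python
CORRECTIONS = {
    "Values": "Enumeration",
    "Value": "Enumeration",
    "Enumerations": "Enumeration",
    "Allowed Values": "Enumeration",
    "Options": "Enumeration",
    "Unit": "Units",
    "DataType": "Type",
    "Data Type": "Type",
    "Example": "Examples",
    "Constraint": "Constraints",
    "Business Rule": "Business Rules",
    "Relationship": "Relationships",
}


def suggest_keyword(keyword: str):
    """Return the expected keyword for a known typo, or None if there is no correction."""
    return CORRECTIONS.get(keyword.strip().rstrip(":").strip())
```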

Type system

**Type**: accepts either an explicit SDC4 type or a user-friendly type. User-friendly types are mapped to SDC4 types using context (units, enumeration shape, name patterns, constraints).

Explicit SDC4 types

Written in lowercase here; matching is case-insensitive. Each maps 1:1 to a canonical type name:

You write	Maps to
`xdstring`	`XdString`
`xdtoken`	`XdToken`
`xdcount`	`XdCount`
`xdordinal`	`XdOrdinal`
`xdquantity`	`XdQuantity`
`xdfloat`	`XdFloat`
`xddouble`	`XdDouble`
`xdboolean`	`XdBoolean`
`xdtemporal`	`XdTemporal`
`xdlink`	`XdLink`
`xdfile`	`XdFile`
`xdinterval`	`XdInterval`

When you write an explicit SDC4 type the parser uses it as-is and skips inference.

User-friendly types

You write	Maps to (no other context)
`text`, `string`, `varchar`, `char`	`XdString`
`integer`, `int`, `whole number`, `count`	`XdCount`
`decimal`, `float`, `number`, `numeric`, `double`	`XdFloat`
`boolean`, `bool`, `flag`	`XdBoolean`
`date`, `datetime`, `timestamp`, `time`	`XdTemporal`
`identifier`, `id`, `uuid`, `guid`	`XdString`
`email`	`XdString`
`url`, `uri`, `link`	`XdLink`
anything else	`XdString` (with a warning)
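
The two mapping tables above can be combined into one context-free lookup. This is a sketch, not the parser's implementation: `base_type` and its `(type, warning)` return shape are assumptions, and no inference is applied here.

```python
EXPLICIT = {name.lower(): name for name in [
    "XdString", "XdToken", "XdCount", "XdOrdinal", "XdQuantity", "XdFloat",
    "XdDouble", "XdBoolean", "XdTemporal", "XdLink", "XdFile", "XdInterval",
]}

_GROUPS = {
    "XdString": ["text", "string", "varchar", "char",
                 "identifier", "id", "uuid", "guid", "email"],
    "XdCount": ["integer", "int", "whole number", "count"],
    "XdFloat": ["decimal", "float", "number", "numeric", "double"],
    "XdBoolean": ["boolean", "bool", "flag"],
    "XdTemporal": ["date", "datetime", "timestamp", "time"],
    "XdLink": ["url", "uri", "link"],
}
USER_FRIENDLY = {alias: target for target, aliases in _GROUPS.items() for alias in aliases}


def base_type(user_type: str):
    """Return (sdc4_type, warning) using only the type text, with no other context."""
    key = " ".join(user_type.strip().lower().split())
    if key in EXPLICIT:
        return EXPLICIT[key], None
    if key in USER_FRIENDLY:
        return USER_FRIENDLY[key], None
    # Anything else falls back to XdString with a warning.
    return "XdString", f"unknown type {user_type!r}; defaulting to XdString"
```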

Type inference order

Context can override the default user-type mapping. The parser checks rules in this order; the first match wins:

1. Explicit SDC4 type (above): used as-is.
2. Enumeration shape: see Enumeration-driven inference.
3. text + any enumeration → XdToken.
4. Name patterns: see Name-pattern inference.
5. User type + context (integer + units, decimal + precision, etc.).
6. Default for the user type.

Enumeration-driven inference

When **Enumeration**: is present, the parser inspects the values:

Enumeration shape	Result
Values match a boolean set: `{yes,no}`, `{true,false}`, `{present,absent}`, `{positive,negative}`, `{active,inactive}`, `{enabled,disabled}` (with optional `unknown` / `n/a`)	`XdBoolean`
Values match a known ordinal pattern: severity scales (`none/mild/moderate/severe/critical`), Likert scales (`strongly disagree`…`strongly agree`), grade scales, frequency scales, risk scales, satisfaction scales, maturity scales	`XdOrdinal`
Values are consecutive integers (e.g. `0,1,2,3` or `1,2,3,4`)	`XdOrdinal`
Values match `Level N` / `Stage N` / `Phase N` / `Grade N` / `Step N` / `Tier N`	`XdOrdinal`

If **Type**: text is combined with an enumeration that does not match any of the patterns above, the result is XdToken.

If **Type**: xdboolean is set explicitly and an enumeration is present, the parser raises an error: XdBoolean cannot have an enumeration.
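
A partial sketch of the enumeration checks above follows. It covers the boolean sets, consecutive integers, and the `Level N` family; the named ordinal scales (severity, Likert, and so on) are omitted because their full value lists are not given in this document. `infer_from_enumeration` is a hypothetical name.

```python
import re

BOOLEAN_PAIRS = [
    {"yes", "no"}, {"true", "false"}, {"present", "absent"},
    {"positive", "negative"}, {"active", "inactive"}, {"enabled", "disabled"},
]
OPTIONAL_EXTRAS = {"unknown", "n/a"}
LEVEL_RE = re.compile(r"^(level|stage|phase|grade|step|tier)\s+\d+$", re.IGNORECASE)


def infer_from_enumeration(values, user_type="text"):
    """Return an SDC4 type implied by the enumeration values, or None (illustrative)."""
    norm = [str(v).strip().lower() for v in values]
    # Boolean sets may carry an optional "unknown" / "n/a" member.
    if set(norm) - OPTIONAL_EXTRAS in BOOLEAN_PAIRS:
        return "XdBoolean"
    if norm and all(re.fullmatch(r"-?\d+", v) for v in norm):
        nums = sorted(int(v) for v in norm)
        if nums == list(range(nums[0], nums[0] + len(nums))):
            return "XdOrdinal"  # consecutive integers
    if norm and all(LEVEL_RE.match(v) for v in norm):
        return "XdOrdinal"  # "Level N" / "Stage N" / ...
    if user_type.strip().lower() == "text":
        return "XdToken"  # text + any other enumeration
    return None
```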

Name-pattern inference

Name-pattern rules apply when no explicit SDC4 type and no enumeration override is present:

Column name pattern	Result
ends with `_uri` or `_url`, or is exactly `uri`/`url`	`XdLink`
contains `_id`, `_code`, `identifier`, `_key`, `_uuid`, `_guid`	`XdString`
contains `_link`, `website`, `homepage`	`XdLink`
starts with `is_`, `has_`, `can_`, or ends with `_flag`	`XdBoolean`
contains `year` (with `Type: integer`/`int`/`whole number`/`count`/`text`/`string`)	`XdTemporal`
contains `_time_`, `elapsed`, `latency`, `duration`, `processing_time`, `response_time`	`XdTemporal`
contains `_date`, `_time`, `_at`, `timestamp`, `created`, `updated`, `modified`, `deleted`	`XdTemporal`
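
The table above can be read as an ordered rule list. The sketch below assumes the rows are evaluated top to bottom with the first match winning, which this document does not state explicitly; `infer_from_name` is a hypothetical name.

```python
YEAR_TYPES = {"integer", "int", "whole number", "count", "text", "string"}


def infer_from_name(name: str, user_type: str = "text"):
    """Return an SDC4 type implied by the column name, or None (illustrative)."""
    n = name.lower()
    t = user_type.strip().lower()
    if n in ("uri", "url") or n.endswith(("_uri", "_url")):
        return "XdLink"
    if any(p in n for p in ("_id", "_code", "identifier", "_key", "_uuid", "_guid")):
        return "XdString"
    if any(p in n for p in ("_link", "website", "homepage")):
        return "XdLink"
    if n.startswith(("is_", "has_", "can_")) or n.endswith("_flag"):
        return "XdBoolean"
    # The "year" rule only applies to the listed user types.
    if "year" in n and t in YEAR_TYPES:
        return "XdTemporal"
    if any(p in n for p in ("_time_", "elapsed", "latency", "duration",
                            "processing_time", "response_time")):
        return "XdTemporal"
    if any(p in n for p in ("_date", "_time", "_at", "timestamp",
                            "created", "updated", "modified", "deleted")):
        return "XdTemporal"
    return None
```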

Numeric context inference

When the user wrote a numeric type and no override above applied:

User type	Context	Result
`integer` family	enumeration present	`XdOrdinal`
`integer` family	`units` set	`XdCount`
`integer` family	`range` constraint set	`XdCount`
`integer` family	otherwise	`XdCount`
`decimal` family	`units` set	`XdQuantity`
`decimal` family	`precision: 2`	`XdQuantity` (currency-like)
`decimal` family	`precision: 10+`	`XdDouble`
`decimal` family	otherwise	`XdFloat`
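
Read as code, the numeric table reduces to a few branches. This is a sketch under the assumption that rows are checked top to bottom; `infer_numeric` and its parameters are hypothetical.

```python
INTEGER_FAMILY = {"integer", "int", "whole number", "count"}
DECIMAL_FAMILY = {"decimal", "float", "number", "numeric", "double"}


def infer_numeric(user_type, units=None, precision=None, has_enumeration=False):
    """Map a numeric user type plus context to an SDC4 type (illustrative)."""
    t = user_type.strip().lower()
    if t in INTEGER_FAMILY:
        # Units, a range constraint, and the fallback all yield XdCount.
        return "XdOrdinal" if has_enumeration else "XdCount"
    if t in DECIMAL_FAMILY:
        if units:
            return "XdQuantity"
        if precision == 2:
            return "XdQuantity"  # currency-like
        if precision is not None and precision >= 10:
            return "XdDouble"
        return "XdFloat"
    return None  # not a numeric user type
```

For the `lifetime_value_usd` style of column (decimal, units `USD`, precision 2), this returns `XdQuantity`.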

Constraints

**Constraints**:
  - required: true
  - range: [0, 120]

The parser reads every key: value pair into a constraints dictionary. Only the following keys are read back, either to populate column fields or to steer type inference:

Key	Value	Effect
`required`	`true` / `false`	Sets `column.nullability` to `required` or `optional`.
`range`	`[min, max]` (a list of exactly two values)	Sets `column.range_values` to `"min to max"`.
`precision`	integer	Affects type inference for `decimal` user types (`==2` → `XdQuantity`, `>=10` → `XdDouble`).

Other constraint keys (min_value, max_value, format, unique, pattern, etc.) are accepted syntactically and stored in the constraints dictionary, but the parser does not currently use them to populate any structured column field. To express a numeric range, use range: [min, max] rather than separate min_value / max_value keys.
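
To make the distinction between "stored" and "used" concrete, here is a minimal sketch. It assumes values are kept as raw strings and that `required` is compared case-insensitively; `parse_constraints` is a hypothetical name.

```python
def parse_constraints(bullets):
    """Return (constraints, fields) from '- key: value' bullet lines (illustrative)."""
    constraints = {}
    for bullet in bullets:
        key, sep, raw = bullet.strip().lstrip("-").strip().partition(":")
        if sep:
            constraints[key.strip()] = raw.strip()  # every pair is stored

    fields = {}
    if "required" in constraints:
        is_required = constraints["required"].lower() == "true"
        fields["nullability"] = "required" if is_required else "optional"
    if "range" in constraints:
        parts = [p.strip() for p in constraints["range"].strip("[]").split(",")]
        if len(parts) == 2:  # exactly two values, per the table above
            fields["range_values"] = f"{parts[0]} to {parts[1]}"
    return constraints, fields
```

A key such as `unique: true` ends up in `constraints` but leaves `fields` untouched.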

Enumeration

Two formats are supported. Both produce the same internal structure with Value, Label, and Description columns.

Bulleted list (recommended for short lists)

**Enumeration**:
  - active: Account in good standing
  - suspended: Temporarily suspended
  - closed: Permanently closed

Each bullet is value: label or value: label: annotation. If the annotation is a URL it is stored as Description. Without a colon, the bullet text becomes both the value and the title-cased label.

You may also embed semantic URLs directly:

**Enumeration**:
  - 0: Public: https://w3c.github.io/dpv/dpv/#PubliclyAvailable
  - 1: Internal: https://w3c.github.io/dpv/dpv/#InternalUse
  - 2: Restricted: https://w3c.github.io/dpv/dpv/#RestrictedAccess
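
The bullet forms above can be split with a bounded `str.split`, so the colons inside a trailing URL survive. This sketch is an assumption about the mechanics, not the parser's code: `parse_enum_bullet` is hypothetical, and a non-URL annotation is simply dropped here because its handling is not specified in this document.

```python
def parse_enum_bullet(bullet: str) -> dict:
    """Parse 'value', 'value: label', or 'value: label: annotation' (illustrative)."""
    text = bullet.strip().lstrip("-").strip()
    # At most two splits, so "https://..." in the third part keeps its colons.
    parts = [p.strip() for p in text.split(":", 2)]
    if len(parts) == 1:
        # No colon: the text is the value and its title-cased form is the label.
        return {"Value": parts[0], "Label": parts[0].title(), "Description": ""}
    row = {"Value": parts[0], "Label": parts[1], "Description": ""}
    if len(parts) == 3 and parts[2].startswith(("http://", "https://")):
        row["Description"] = parts[2]  # URL annotations are stored as Description
    return row
```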

Markdown table (recommended for richer descriptions)

**Enumeration**:

| Value | Label | Description |
|-------|-------|-------------|
| 1 | Active | Account in good standing |
| 2 | Suspended | Temporarily suspended |
| 3 | Closed | Permanently closed |

Any column headers are accepted; the parser preserves them as keys on each row. Internally, ordinal/boolean detection looks at the Value column.

Component reuse

There are two forms. Use one per column.

### state
**Type**: text
**ReuseComponent**: @NIEM:StateUSPostalServiceCode

### gender
**Type**: text
**Reuse**: ct_abc123xyz

Keyword	Form	When to use
`ReuseComponent:`	`@ProjectName:ComponentLabel` (must start with `@` and contain `:`)	New, recommended.
`Reuse:`	bare component CT_ID	Direct ID reference; legacy.

A **ReuseComponent**: value missing the leading @ produces a warning; missing the : separator or empty project / label produces an error.
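
The warning and error conditions above could be checked along these lines. `check_reuse_component` and the `(level, message)` tuples are hypothetical; only the conditions themselves come from this document.

```python
def check_reuse_component(value: str):
    """Return a list of (level, message) issues for a ReuseComponent value (illustrative)."""
    issues = []
    v = value.strip()
    if v.startswith("@"):
        body = v[1:]
    else:
        issues.append(("warning", "value should start with '@'"))
        body = v
    project, sep, label = body.partition(":")
    if not sep:
        issues.append(("error", "missing ':' separator"))
    elif not project.strip() or not label.strip():
        issues.append(("error", "empty project or label"))
    return issues
```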

Validation behavior

Phase 1 validation classifies issues into three buckets:

Errors (template is rejected):
- Missing or non-4.x template_version.
- No ## Data: section.
- A column has no **Type**: parsed and no recognizable type-bearing context (rare).
- An invalid keyword that has a known correction (e.g. **Values**:).
- An invalid **ReuseComponent**: format.
- XdBoolean with an enumeration.

Warnings (template still parses):
- An unrecognized but plausible-looking keyword (e.g. **Type:** typo variants not in the corrections table).
- A **ReuseComponent**: value that does not start with @.

Suggestions (advisory only):
- An integer column with no units and no enumeration.
- A decimal column with no units.
- text with an enumeration (consider XdToken).

Phase 2: optional LLM enrichment

If enrichment.enable_llm: true (the default) and Phase 1 succeeds, the parser runs a second pass that may:

- Add semantic-link URIs (when **Semantic Links**: is empty).
- Augment short descriptions.
- Suggest a different SDC4 type via embeddings (recorded as a metadata note, never silently overwriting your declared type).

Phase 2 failures are non-fatal — they are logged but do not invalidate the template.

To skip Phase 2 entirely, set enrichment.enable_llm: false in the front matter.

Minimal example

---
template_version: "4.0.0"
dataset:
  name: "Customer Records"
  description: "Customer master data."
  creator: "Acme Corp"
enrichment:
  enable_llm: true
---

# Dataset Overview

Master customer records for use across all Acme business systems.

**Purpose**: Single source of truth for customer identity.
**Business Context**: Used by billing, support, and marketing systems.

## Data: Customer

**Rules**:
  - At least one of email or phone must be provided

### customer_id
**Type**: text
**Description**: Globally unique customer identifier.
**Constraints**:
  - required: true

### full_name
**Type**: text
**Description**: Customer's full name.
**Constraints**:
  - required: true

### signup_date
**Type**: date
**Description**: When the customer first registered.

### account_status
**Type**: text
**Description**: Current account status.
**Enumeration**:
  - active: Account in good standing
  - suspended: Temporarily suspended
  - closed: Permanently closed

### lifetime_value_usd
**Type**: decimal
**Units**: USD
**Description**: Total revenue from this customer.
**Constraints**:
  - precision: 2

This template parses cleanly and produces a single Data cluster with five columns: customer_id (XdString), full_name (XdString), signup_date (XdTemporal), account_status (XdToken — text + enumeration), and lifetime_value_usd (XdQuantity — decimal + units + currency-like precision).
