Uploading Data and AI Processing

Reading time: ~7 minutes | Level: Beginner
Prerequisites: Tutorial 2: SDCStudio Overview

What You'll Learn

Which file formats SDCStudio accepts
How to upload a file and monitor its progress
What happens during Stage 1 (structural parsing) and Stage 2 (AI enhancement)
How automatic string format inference works
Tips for getting the best results from AI processing

Supported File Formats

SDCStudio currently accepts two file formats:

CSV (recommended for tabular data):
- Must have a header row with column names
- UTF-8 encoding required
- Comma-separated values
- Best for: spreadsheets, database exports, structured datasets

Markdown (for documentation-driven modeling):
- Structured with headers and sections
- Follows SDC template format (Form2SDCTemplate or SDCObsidianTemplate)
- Best for: specifications, data dictionaries, domain models

Other formats (JSON, XML, PDF, DOCX, Excel) are not currently supported. For Excel files, export to CSV first.

How to Upload

Upload page with drag-and-drop zone and format instructions

Open your project and click the Data Sources tab
Click Upload Data
Select your CSV or Markdown file (maximum 50 MB, recommended under 10 MB)
Click Upload

Your file appears in the Data Sources list with a status badge that updates in real time.
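Before uploading, it can help to check a file locally against the requirements listed above. The following is a minimal sketch, not part of SDCStudio: the function name `check_before_upload` is hypothetical, and the `.csv` and `.md` extensions are assumptions about how the two accepted formats are usually named. The size limits, the UTF-8 requirement, and the header-row requirement come from this tutorial.

```python
import csv
import io
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024          # maximum size stated in this tutorial
RECOMMENDED_BYTES = 10 * 1024 * 1024  # recommended size stated in this tutorial


def check_before_upload(path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    p = Path(path)
    suffix = p.suffix.lower()
    # Assumption: CSV files end in .csv and Markdown files end in .md.
    if suffix not in {".csv", ".md"}:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
        return problems
    size = p.stat().st_size
    if size > MAX_BYTES:
        problems.append("file is larger than 50 MB")
    elif size > RECOMMENDED_BYTES:
        problems.append("file is larger than the recommended 10 MB")
    try:
        text = p.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems
    if suffix == ".csv":
        # The first row must exist and every column needs a name.
        header = next(csv.reader(io.StringIO(text)), [])
        if not header or any(not name.strip() for name in header):
            problems.append("CSV header row is missing or has blank column names")
    return problems
```

Calling `check_before_upload("customers.csv")` returns an empty list when none of these checks fail. It does not verify that a Markdown file follows an SDC template.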

The Two-Stage Pipeline

SDCStudio processes your file in two distinct stages. This design means that if AI processing fails, you can retry without re-uploading or re-parsing the file.

Stage 1: Structural Parsing

Data source showing UPLOADED status badge

Data source showing PARSING status with progress indicator

Duration: 30 seconds to 2 minutes
Status flow: UPLOADED → PARSING → PARSED

This stage is fast and deterministic — no AI or LLM calls. The system:

Detects the file format (CSV vs Markdown)
Reads the structure — headers, columns, row counts
Infers basic data types — integer, decimal, string, date, boolean
Detects string format patterns automatically (see below)
Creates a ParsedData record — a standardized representation stored in the database

For a CSV file, the output at this stage is a list of columns with detected types and sample values. The detected types look like this:

customer_id     → integer
first_name      → string
email           → string (email pattern detected)
signup_date     → date (ISO 8601)
status          → string (categorical: 2 unique values)
total_purchases → integer
price           → decimal
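To make the idea of deterministic type inference concrete, here is a minimal sketch of how basic types could be inferred from a CSV without any AI calls. It is an illustration only and not SDCStudio's parser: the function names, the list of boolean words, and the rule that a column mixing integers and decimals is treated as decimal are all assumptions.

```python
import csv
import io
import re
from datetime import date

INTEGER_RE = re.compile(r"[+-]?\d+")
DECIMAL_RE = re.compile(r"[+-]?\d+\.\d+")
BOOLEAN_WORDS = {"true", "false", "yes", "no"}  # assumed vocabulary


def classify(value):
    """Classify one cell as boolean, integer, decimal, date, or string."""
    v = value.strip()
    if v.lower() in BOOLEAN_WORDS:
        return "boolean"
    if INTEGER_RE.fullmatch(v):
        return "integer"
    if DECIMAL_RE.fullmatch(v):
        return "decimal"
    try:
        date.fromisoformat(v)  # accepts ISO 8601 dates such as 2024-01-15
        return "date"
    except ValueError:
        return "string"


def infer_column_types(csv_text):
    """Map each column name to one inferred type for the whole column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in (reader.fieldnames or [])}
    for row in reader:
        for name in seen:
            cell = row[name]
            if cell and cell.strip():  # empty cells do not vote
                seen[name].add(classify(cell))
    result = {}
    for name, kinds in seen.items():
        if len(kinds) == 1:
            result[name] = next(iter(kinds))
        elif kinds == {"integer", "decimal"}:
            result[name] = "decimal"  # assumed widening rule
        else:
            result[name] = "string"   # mixed or empty columns fall back to string
    return result
```

The fallback to string for mixed columns shows why keeping one type per column matters: a single stray text value is enough to lose a numeric type in a scheme like this one.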

Automatic String Format Inference

During parsing, the system scans sample values in each column and detects common patterns without using AI. Detected formats include:

Email addresses
UUIDs
IPv4 addresses
US ZIP codes (5-digit and ZIP+4)
Phone numbers
ISO dates
URLs
SSNs
MAC addresses
Hex color codes

When a pattern is found in the majority of sample values, the system pre-populates an XML Schema-compatible regex for that column. This saves manual configuration and improves the validation rules that the AI builds on in Stage 2.
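The majority rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not SDCStudio's implementation: the patterns in `CANDIDATE_PATTERNS` are simplified examples, the function name `detect_format` is hypothetical, and "majority" is interpreted here as more than half of the non-empty sample values. The patterns are written without `^` and `$` because XML Schema patterns always match the whole value, which `re.fullmatch` mirrors in Python.

```python
import re

# Simplified, hypothetical patterns for four of the listed formats.
CANDIDATE_PATTERNS = {
    "email": r"[^@\s]+@[^@\s]+\.[^@\s]+",
    "uuid": (
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
    "us_zip": r"\d{5}(-\d{4})?",
    "hex_color": r"#[0-9a-fA-F]{6}",
}


def detect_format(samples, threshold=0.5):
    """Return (name, pattern) for the first format matched by a majority.

    Returns None when no candidate matches more than `threshold` of the
    non-empty samples.
    """
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return None
    for name, pattern in CANDIDATE_PATTERNS.items():
        compiled = re.compile(pattern)
        hits = sum(1 for v in values if compiled.fullmatch(v))
        if hits / len(values) > threshold:
            return name, pattern
    return None
```

For example, `detect_format(["12345", "12345-6789", "abc"])` selects the `us_zip` pattern because two of the three values match it, while a column of free text returns `None` and would be left without a pre-populated pattern.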

Stage 2: AI Enhancement

Data source showing COMPLETED status with generated model link

Duration: 1 to 5 minutes
Status flow: PARSED → AGENT_PROCESSING → COMPLETED

This stage adds semantic meaning to the structure produced in Stage 1. Multiple AI agents work together:

DataModelAgent — creates the top-level model and coordinates other agents
ClusterAgent — organizes components into logical groups
Type-specific agents — create SDC4 components for each column:
  Semantic analysis: understands what each column represents
  Pattern recognition: refines format detection from Stage 1
  Ontology matching: links to your uploaded ontologies and built-in vocabularies
  Validation rules: suggests constraints (min/max, patterns, enumerations, required/optional)
  Labels and descriptions: generates human-readable names and documentation

The AI uses your uploaded ontologies during this stage. If you uploaded a healthcare ontology, an mrn column gets mapped to FHIR Patient.identifier. If you uploaded a product vocabulary, a sku column gets matched to your product codes. This is why uploading ontologies before data files matters.

Status Reference

Status	Badge Color	Meaning	Action
`UPLOADED`	Blue	File received	Automatic — parsing starts
`PARSING`	Blue	Reading structure	Wait
`PARSED`	Blue	Structure analyzed	Automatic — AI starts
`AGENT_PROCESSING`	Yellow	AI creating components	Wait
`COMPLETED`	Green	Model ready	Review your model
`AGENT_ERROR`	Red	AI failed	Click Retry
`ERROR`	Red	File processing failed	Check file format, re-upload
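The statuses in this table can be read as a small state machine. The sketch below models it in Python for illustration. The status names and the two status flows come from this tutorial; the exact edges into `ERROR` and `AGENT_ERROR`, and the assumption that Retry moves a source from `AGENT_ERROR` back to `AGENT_PROCESSING`, are inferred from the table and are not confirmed behavior.

```python
from enum import Enum


class Status(Enum):
    UPLOADED = "UPLOADED"
    PARSING = "PARSING"
    PARSED = "PARSED"
    AGENT_PROCESSING = "AGENT_PROCESSING"
    COMPLETED = "COMPLETED"
    AGENT_ERROR = "AGENT_ERROR"
    ERROR = "ERROR"


# Happy-path edges follow the two status flows in this tutorial.
# Error and retry edges are assumptions.
TRANSITIONS = {
    Status.UPLOADED: {Status.PARSING},
    Status.PARSING: {Status.PARSED, Status.ERROR},
    Status.PARSED: {Status.AGENT_PROCESSING},
    Status.AGENT_PROCESSING: {Status.COMPLETED, Status.AGENT_ERROR},
    Status.AGENT_ERROR: {Status.AGENT_PROCESSING},  # Retry skips re-parsing
    Status.COMPLETED: set(),
    Status.ERROR: set(),  # fix the file and upload it again
}

# Statuses where the table asks the user to do something.
USER_ACTION = {Status.COMPLETED, Status.AGENT_ERROR, Status.ERROR}


def can_move(current, target):
    """Return True when `target` is a direct successor of `current`."""
    return target in TRANSITIONS[current]


def needs_user_action(status):
    """Return True when the pipeline is waiting on the user."""
    return status in USER_ACTION
```

In this model `AGENT_ERROR` leads back into Stage 2 only, which reflects the point made earlier that a failed AI run can be retried without re-uploading or re-parsing.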

Tips for Best Results

Use descriptive column names:
- Good: customer_email, order_total_amount, signup_date
- Bad: col1, data, field_x

Keep data consistent:
- Same date format throughout a column (don't mix 2024-01-15 with 1/15/24)
- Same type per column (don't mix numbers and text)

Include enough rows:
- Minimum 10 rows for reliable type inference
- 20-100 rows is ideal for pattern detection
- Include edge cases (min values, max values, empty cells)

Upload ontologies first:
- Go to Settings → Ontologies before your first data upload
- Standard ontologies (FHIR, SNOMED, schema.org) are already built in
- Upload your organization's custom vocabularies

Start small:
- Upload a sample (50-100 rows) first
- Validate the generated model
- Upload the full dataset after you are satisfied
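One way to follow the "start small" tip is to cut a sample file from a larger CSV before uploading. The sketch below is a hypothetical helper, not an SDCStudio feature. It copies the header row plus the first `max_rows` data rows and assumes the source file is UTF-8, as this tutorial requires.

```python
import csv


def write_sample(source_path, sample_path, max_rows=100):
    """Copy the header and up to `max_rows` data rows; return rows written."""
    written = 0
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(sample_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty source file, nothing to copy
        writer.writerow(header)
        for row in reader:
            if written >= max_rows:
                break
            writer.writerow(row)
            written += 1
    return written
```

Taking the first rows is the simplest choice, but it may miss the edge cases mentioned above. If the interesting values sit later in the file, add a few of those rows to the sample by hand.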

What to Do After Processing

Once your file shows COMPLETED:

Navigate to Data Models in your project
Click the generated model
Review the structure, types, and validation rules
Edit any component that needs refinement
When satisfied, publish and generate outputs

Summary

SDCStudio accepts CSV and Markdown files
Processing happens in two stages: fast structural parsing, then AI enhancement
The AI creates a complete model with types, validation, labels, and semantic links
Uploading ontologies before data files improves the quality of the AI-generated model
If AI processing fails, click Retry — no re-upload needed

Next Tutorial

Understanding Components and Data Types — Learn about the SDC4 type system and how to interpret the components the AI created for your data.