Uploading Data and AI Processing

Reading time: ~7 minutes | Level: Beginner Prerequisites: Tutorial 2: SDCStudio Overview

What You'll Learn

  • Which file formats SDCStudio accepts
  • How to upload a file and monitor its progress
  • What happens during Stage 1 (structural parsing) and Stage 2 (AI enhancement)
  • How automatic string format inference works
  • Tips for getting the best results from AI processing

Supported File Formats

SDCStudio currently accepts two file formats:

CSV (recommended for tabular data): - Must have a header row with column names - UTF-8 encoding required - Comma-separated values - Best for: spreadsheets, database exports, structured datasets

Markdown (for documentation-driven modeling): - Structured with headers and sections - Follows SDC template format (Form2SDCTemplate or SDCObsidianTemplate) - Best for: specifications, data dictionaries, domain models

Other formats (JSON, XML, PDF, DOCX, Excel) are not currently supported. For Excel files, export to CSV first.

How to Upload

Upload page with drag-and-drop zone and format instructions

  1. Open your project and click the Data Sources tab
  2. Click Upload Data
  3. Select your CSV or Markdown file (maximum 50 MB, recommended under 10 MB)
  4. Click Upload

Your file appears in the Data Sources list with a status badge that updates in real-time.

The Two-Stage Pipeline

SDCStudio processes your file in two distinct stages. This design means that if AI processing fails, you can retry without re-uploading or re-parsing the file.

Stage 1: Structural Parsing

Data source showing UPLOADED status badge

Data source showing PARSING status with progress indicator

Duration: 30 seconds to 2 minutes Status flow: UPLOADEDPARSINGPARSED

This stage is fast and deterministic — no AI or LLM calls. The system:

  1. Detects the file format (CSV vs Markdown)
  2. Reads the structure — headers, columns, row counts
  3. Infers basic data types — integer, decimal, string, date, boolean
  4. Detects string format patterns automatically (see below)
  5. Creates a ParsedData record — a standardized representation stored in the database

For a CSV file, the output at this stage is a list of columns with detected types and sample values:

customer_id     → integer
first_name      → string
email           → string (email pattern detected)
signup_date     → date (ISO 8601)
status          → string (categorical: 2 unique values)
total_purchases → integer
price           → decimal

Automatic String Format Inference

During parsing, the system scans sample values in each column and detects common patterns without using AI. Detected formats include:

  • Email addresses
  • UUIDs
  • IPv4 addresses
  • US ZIP codes (5-digit and ZIP+4)
  • Phone numbers
  • ISO dates
  • URLs
  • SSNs
  • MAC addresses
  • Hex color codes

When a pattern is found in the majority of sample values, the system pre-populates an XML Schema-compatible regex for that column. This saves manual configuration and improves the validation rules that the AI builds on in Stage 2.

Stage 2: AI Enhancement

Data source showing COMPLETED status with generated model link

Duration: 1 to 5 minutes Status flow: PARSEDAGENT_PROCESSINGCOMPLETED

This is where the intelligence happens. Multiple AI agents work together:

  1. DataModelAgent — creates the top-level model and coordinates other agents
  2. ClusterAgent — organizes components into logical groups
  3. Type-specific agents — create SDC4 components for each column:
  4. Semantic analysis: understands what each column represents
  5. Pattern recognition: refines format detection from Stage 1
  6. Ontology matching: links to your uploaded ontologies and built-in vocabularies
  7. Validation rules: suggests constraints (min/max, patterns, enumerations, required/optional)
  8. Labels and descriptions: generates human-readable names and documentation

The AI uses your uploaded ontologies during this stage. If you uploaded a healthcare ontology, an mrn column gets mapped to FHIR Patient.identifier. If you uploaded a product vocabulary, a sku column gets matched to your product codes. This is why uploading ontologies before data files matters.

Status Reference

Status Badge Color Meaning Action
UPLOADED Blue File received Automatic — parsing starts
PARSING Blue Reading structure Wait
PARSED Blue Structure analyzed Automatic — AI starts
AGENT_PROCESSING Yellow AI creating components Wait
COMPLETED Green Model ready Review your model
AGENT_ERROR Red AI failed Click Retry
ERROR Red File processing failed Check file format, re-upload

Tips for Best Results

Use descriptive column names: - Good: customer_email, order_total_amount, signup_date - Bad: col1, data, field_x

Keep data consistent: - Same date format throughout a column (don't mix 2024-01-15 with 1/15/24) - Same type per column (don't mix numbers and text)

Include enough rows: - Minimum 10 rows for reliable type inference - 20-100 rows is ideal for pattern detection - Include edge cases (min values, max values, empty cells)

Upload ontologies first: - Go to Settings → Ontologies before your first data upload - Standard ontologies (FHIR, SNOMED, schema.org) are already built in - Upload your organization's custom vocabularies

Start small: - Upload a sample (50-100 rows) first - Validate the generated model - Upload the full dataset after you are satisfied

What to Do After Processing

Once your file shows COMPLETED:

  1. Navigate to Data Models in your project
  2. Click the generated model
  3. Review the structure, types, and validation rules
  4. Edit any component that needs refinement
  5. When satisfied, publish and generate outputs

Summary

  • SDCStudio accepts CSV and Markdown files
  • Processing happens in two stages: fast structural parsing, then AI enhancement
  • The AI creates a complete model with types, validation, labels, and semantic links
  • Uploading ontologies before data files improves AI quality
  • If AI processing fails, click Retry — no re-upload needed

Next Tutorial

Understanding Components and Data Types — Learn about the SDC4 type system and how to interpret the components the AI created for your data.