Uploading and Processing Data

Overview

SDCStudio's intelligent data processing pipeline analyzes your uploaded files and automatically creates comprehensive data models. This guide explains how to upload files, understand the two-stage processing system, and get the best results from AI analysis.

Supported File Formats

SDCStudio currently supports two file formats for data upload:

1. CSV Files (.csv)

Best For: Tabular data, spreadsheets, database exports, structured datasets

Requirements:
  • Must have a header row with column names
  • UTF-8 encoding required
  • Comma-separated values
  • Consistent data types per column

Why CSV?
  • Fast Processing: Quick parsing and analysis
  • Clear Structure: Columns and types easily detected
  • Best AI Results: Excellent for type inference and validation
  • Universal Format: Works with any spreadsheet or database

Example:

product_id,product_name,price,in_stock,last_updated
101,Laptop,999.99,true,2024-10-01
102,Mouse,29.99,true,2024-10-02
103,Keyboard,79.99,false,2024-10-03

When to Use CSV:
  • Exporting data from databases
  • Spreadsheet data with clear columns
  • Large datasets with consistent structure
  • First-time users learning SDCStudio
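Before uploading, you can sanity-check a file against these requirements locally. The script below is a hypothetical pre-flight check, not part of SDCStudio, and the file name is a placeholder:

import csv

def preflight_check(path):
    """Hypothetical pre-flight check for the requirements above:
    UTF-8 encoding, a header row, consistent column counts."""
    try:
        with open(path, newline="", encoding="utf-8") as f:  # read fails on non-UTF-8 bytes
            reader = csv.reader(f)
            header = next(reader, None)
            if not header:
                return "Empty file: no header row found"
            widths = {len(row) for row in reader if row}
            if not widths:
                return "Header only: file has no data rows"
            if widths != {len(header)}:
                return (f"Inconsistent columns: header has {len(header)}, "
                        f"data rows have {sorted(widths)}")
            return "Looks OK"
    except UnicodeDecodeError:
        return "Not UTF-8: re-save the file with UTF-8 encoding"

print(preflight_check("products.csv"))  # e.g. the example file above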

2. Markdown Templates (.md)

Best For: Structured specifications, data dictionaries, documentation-driven development, domain models

Requirements:
  • Structured with clear headers and sections
  • Follows the SDC template format
  • UTF-8 encoding
  • Consistent formatting

Why Markdown Templates?
  • Documentation First: Design your model in readable documentation
  • Human Readable: Easy to review and collaborate
  • Version Control: Works great with Git
  • Flexible: Supports complex hierarchies and relationships
  • Domain Modeling: Perfect for capturing domain knowledge

Template Repositories:

We provide two official Markdown template systems:

1. Form2SDCTemplate (Google Forms to SDC)

Repository: https://github.com/SemanticDataCharter/Form2SDCTemplate

Purpose: Convert Google Forms into SDC4-compliant data models

Best For:
  • Rapid data model prototyping
  • Collaborative model design with stakeholders
  • Survey-based data structures
  • Form-driven workflows

How It Works:
  1. Create a Google Form with your data fields
  2. Export to the SDC template format
  3. Upload the Markdown template to SDCStudio
  4. AI generates your data model

2. SDCObsidianTemplate (Obsidian Vault Templates)

Repository: https://github.com/SemanticDataCharter/SDCObsidianTemplate

Purpose: Use Obsidian note-taking app to create SDC4 data models

Best For:
  • Knowledge management workflows
  • Complex domain modeling
  • Linked data structures
  • Documentation-driven design
  • Version-controlled model specifications

How It Works:
  1. Use the Obsidian vault template
  2. Create data model specifications as notes
  3. Link and organize your model components
  4. Export Markdown and upload to SDCStudio
  5. AI processes your structured specification

Why Obsidian?
  • Visual graph view of model relationships
  • Bi-directional linking between components
  • Rich Markdown editing with templates
  • Local files with Git integration
  • Plugin ecosystem for enhancements

Unsupported Formats

The following formats are not currently supported:

  • JSON: Planned for future release
  • XML: Planned for future release
  • PDF: Not supported for data upload
  • DOCX: Not supported for data upload
  • Excel (.xlsx): Export to CSV first

Workarounds:
  • Excel/Google Sheets: Export as CSV
  • Databases: Export query results as CSV
  • JSON/XML Data: Convert to CSV or a Markdown template
  • PDF Documents: Extract tables to CSV or transcribe to Markdown
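For the Excel and JSON cases, a few lines of Python with pandas (assuming pandas and openpyxl are installed; the file names are placeholders) do the conversion:

import pandas as pd

# Excel -> CSV: one CSV per sheet keeps the "one table per file" structure
# (reading .xlsx requires the openpyxl package)
for sheet, df in pd.read_excel("export.xlsx", sheet_name=None).items():
    df.to_csv(f"{sheet}.csv", index=False, encoding="utf-8")

# JSON -> CSV: works for a flat list of records; nested JSON needs flattening first
pd.read_json("records.json").to_csv("records.csv", index=False, encoding="utf-8")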

Upload Process

Step-by-Step Upload

  1. Navigate to Your Project:
     • Sign in to the SDCStudio React interface
     • Click "Projects" in the navigation
     • Open the project where you want to upload data
     • Click the "Data Sources" tab

  2. Start Upload:
     • Click the "Upload Data" or "Upload New Data" button
     • A file picker dialog will appear
     • Or drag and drop your file onto the upload area

  3. Select Your File:
     • Choose your CSV or Markdown (.md) file
     • Verify the file name is correct
     • Maximum 50 MB per file (10 MB recommended)

  4. Confirm Upload:
     • Review the file name and size
     • Click "Upload" or "Upload and Process"
     • You'll see the file appear in your Data Sources list

  5. Monitor Progress:
     • Watch the status badge change in real time
     • The React interface updates automatically (every 60 seconds)
     • Click on the data source name for detailed processing logs
     • No manual refresh needed!

File Size Limits

  • Maximum File Size: 50 MB per file
  • Recommended Size: Under 10 MB for faster processing
  • Large Files: Contact support for bulk data processing

For Large Datasets:
  • Split into multiple smaller CSV files
  • Upload representative samples first
  • Validate the model with samples before processing the full dataset
  • Contact support@axius-sdc.com for increased limits
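Splitting can be scripted so that each part keeps the header row and stays independently uploadable. A minimal sketch (the 5,000-row chunk size and file name are arbitrary choices):

import csv

def split_csv(path, rows_per_file=5000):
    """Split a large CSV into numbered parts, repeating the header
    in each part so every file can be uploaded on its own."""
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, out, writer = 0, None, None
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:               # start a new part file
                if out:
                    out.close()
                part += 1
                out = open(f"{path.rsplit('.', 1)[0]}_part{part}.csv",
                           "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)              # repeat the header row
            writer.writerow(row)
        if out:
            out.close()

split_csv("customers.csv")  # hypothetical file name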

Two-Stage Processing Pipeline

SDCStudio uses a sophisticated two-stage pipeline that balances speed with comprehensive AI analysis.

Stage 1: Structural Parsing

Duration: 30 seconds - 2 minutes
Status Flow: UPLOADED → PARSING → PARSED

What Happens

File Reading:
  • Format detection (CSV vs Markdown)
  • Character encoding detection (UTF-8 validation)
  • Structure analysis and validation

For CSV Files:
  • Header row extraction
  • Column identification and naming
  • Row count and data sampling
  • Initial delimiter detection

For Markdown Templates:
  • Header hierarchy parsing
  • Section structure analysis
  • Component definition extraction
  • Relationship mapping

Initial Type Inference:
  • Numeric detection (integers, decimals)
  • Date/time pattern recognition
  • Text pattern analysis
  • Boolean value detection
  • Email/phone/URL pattern detection

Output:
  • ParsedData record created in the database
  • Column/component information stored
  • Basic structure mapped and cached
  • Ready for AI enhancement in Stage 2
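To build intuition for this stage, here is a deliberately naive sketch of per-column type inference. SDCStudio's actual parser is more thorough; the patterns below are simplified assumptions:

import re

def infer_type(values):
    """Very rough per-column type inference, in the spirit of Stage 1."""
    values = [v for v in values if v != ""]          # ignore blank cells
    checks = [
        ("boolean", lambda v: v.lower() in ("true", "false")),
        ("integer", lambda v: re.fullmatch(r"-?\d+", v)),
        ("decimal", lambda v: re.fullmatch(r"-?\d+\.\d+", v)),
        ("date",    lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
        ("email",   lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v)),
    ]
    for type_name, test in checks:
        if values and all(test(v) for v in values):  # every value must match
            return type_name
    return "text"                                    # safe fallback

print(infer_type(["999.99", "29.99", "79.99"]))   # decimal
print(infer_type(["2024-10-01", "2024-10-02"]))   # date
print(infer_type(["true", "false", "true"]))      # boolean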

What You See in React Interface

  • Status badge changes from UPLOADED to PARSING
  • Typically completes within 1 minute
  • Badge turns to PARSED when complete
  • No user action required - fully automatic
  • Progress updates in real-time

Stage 2: AI Enhancement

Duration: 1-5 minutes (depends on complexity and ontologies)
Status Flow: PARSED → AGENT_PROCESSING → COMPLETED

What Happens

Multi-Agent Processing System:

  1. DataModelAgent (Orchestrator):
     • Creates the main data model container
     • Coordinates all specialized agents
     • Manages workflow and state
     • Uses your uploaded ontologies for context

  2. ClusterAgent (Structure):
     • Creates the main data cluster
     • Organizes components logically
     • Defines cluster hierarchy
     • Groups related fields

  3. Type-Specific Agents (Component Creation):
     • XdStringAgent: Processes text columns
     • XdCountAgent: Processes integer columns
     • XdQuantityAgent: Processes decimal/measurement columns
     • XdTemporalAgent: Processes date/time columns
     • XdBooleanAgent: Processes true/false columns
     • XdFloatAgent: Processes floating-point numbers

AI Analysis Per Column (Using Your Ontologies!):

  1. Semantic Context Understanding:
     • Column name interpretation
     • Business meaning inference
     • Ontology matching from your uploaded vocabularies
     • Domain-specific terminology recognition

  2. Pattern Recognition:
     • Email format detection
     • Phone number patterns
     • URL/URI patterns
     • Date format standardization
     • Enumeration value detection

  3. Validation Rule Suggestions:
     • Min/max value ranges
     • Required vs optional fields
     • Format patterns (regex)
     • Allowed value lists (enumerations)
     • Data quality constraints

  4. Semantic Linking:
     • Matches to ontology concepts
     • Standard vocabulary alignment
     • FHIR, NIEM, SNOMED, LOINC mappings
     • Custom organization vocabularies

Component Creation:
  • Individual SDC4 component for each column
  • Appropriate data type assignment (XdString, XdCount, etc.)
  • Validation rules and constraints applied
  • Human-readable labels and descriptions
  • Semantic definitions from ontologies
  • Documentation and examples

What You See in React Interface

  • Status badge changes to AGENT_PROCESSING
  • Processing typically takes 1-5 minutes
  • Longer for complex files or many columns
  • Your ontologies are working here! (Phase 2)
  • Badge turns green and shows COMPLETED when done
  • Or shows red AGENT_ERROR if there's an issue
  • Automatic updates - interface refreshes every 60 seconds

Status Reference

Status            Badge Color  Meaning                                  Duration     Action
UPLOADED          Blue         File received by server                  < 10 sec     Auto: starts parsing
PARSING           Blue         Reading file structure                   30s - 2min   Wait - automatic
PARSED            Blue         Structure analyzed                       < 10 sec     Auto: starts AI processing
AGENT_PROCESSING  Yellow       AI creating components with ontologies   1-5 min      Wait - AI working
COMPLETED         Green ✅     Model ready to review                    -            Review and customize
AGENT_ERROR       Red ⚠️       AI processing failed                     -            Click "Retry" button
ERROR             Red ❌       File processing failed                   -            Check file, re-upload

Understanding AI Analysis

What the AI Looks For

Column Analysis Process

Data Type Detection:
  • Statistical analysis of all values
  • Pattern recognition across rows
  • Format consistency checking
  • Range and distribution analysis

Semantic Understanding:
  • Column name interpretation (e.g., "email" → email validation)
  • Value pattern analysis (e.g., @ symbol → email type)
  • Ontology vocabulary matching (your uploaded ontologies!)
  • Context from neighboring columns
  • Business domain recognition

Validation Rules:
  • Automatic min/max value detection
  • Required vs optional inference (null count)
  • Format pattern generation (regex)
  • Enumeration detection (limited value sets)
  • Data quality constraints
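As a rough illustration of the validation-rules step, the sketch below derives required/enumeration/range suggestions from sample values. It is a simplification of what the agents actually do, and every threshold in it is an assumption:

def suggest_rules(values, enum_threshold=5):
    """Sketch of how validation rules could be inferred from sample values.
    Illustrative only - SDCStudio's agents use AI plus your ontologies."""
    non_null = [v for v in values if v not in ("", None)]
    rules = {"required": len(non_null) == len(values)}   # no nulls seen in sample
    distinct = set(non_null)
    if len(distinct) <= enum_threshold and len(non_null) > len(distinct):
        rules["enumeration"] = sorted(distinct)          # small, repeated value set
    try:
        nums = [float(v) for v in non_null]              # numeric column -> range
        rules["min"], rules["max"] = min(nums), max(nums)
    except ValueError:                                   # text column -> length limit
        rules["max_length"] = max(len(v) for v in non_null)
    return rules

print(suggest_rules(["active", "inactive", "active", "active"]))
# {'required': True, 'enumeration': ['active', 'inactive'], 'max_length': 8}
print(suggest_rules(["65.5", "72.0", "58.3"]))
# {'required': True, 'min': 58.3, 'max': 72.0}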

Example: Email Column

Input CSV Data:

email
user1@example.com
user2@example.com
admin@test.org

AI Analysis Output:
  • Type: XdString (text)
  • Pattern Detected: Email format with @ and domain
  • Validation Rule: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
  • Max Length: 320 characters (RFC 5321 standard)
  • Required: Yes (no null values found in sample)
  • Label: "Email Address" (inferred from column name)
  • Ontology Match: schema.org email property (if ontology uploaded)
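You can confirm the suggested pattern against your own values before or after upload; for example, in Python:

import re

# The validation rule suggested above, applied to the sample values
EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

for value in ["user1@example.com", "admin@test.org", "not-an-email"]:
    print(value, "->", bool(EMAIL.match(value)))
# user1@example.com -> True
# admin@test.org -> True
# not-an-email -> False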

Example: Date Column

Input CSV Data:

signup_date
2024-01-15
2024-02-20
2024-03-10

AI Analysis Output:
  • Type: XdTemporal (date/time)
  • Format Detected: ISO 8601 date (YYYY-MM-DD)
  • Validation: Must be a valid calendar date
  • Range Detected: 2024-01-15 to 2024-03-10
  • Required: Yes
  • Label: "Signup Date" or "Registration Date"
  • Semantic: Temporal event marker

Example: Status Column (Enumeration)

Input CSV Data:

status
active
inactive
active
active

AI Analysis Output:
  • Type: XdString (text)
  • Pattern Detected: Categorical/enumeration with limited values
  • Allowed Values: ["active", "inactive"]
  • Default Value: "active" (most common, 75% frequency)
  • Required: Yes
  • Label: "Status" or "Account Status"
  • Enumeration: Two-state categorization detected

Example: Quantity with Units

Input CSV Data:

weight_kg
65.5
72.0
58.3

AI Analysis Output:
  • Type: XdQuantity (quantified measurement)
  • Units Detected: kilograms (from the column name suffix "_kg")
  • Units: kg (standardized)
  • Range: 58.3 - 72.0
  • Precision: 1 decimal place
  • Required: Likely yes
  • Label: "Weight"
  • Semantic: Physical measurement with units
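A unit hint like the "_kg" suffix can be recovered with a simple lookup before any AI is involved. The mapping below is purely illustrative; the real agents combine AI analysis with your ontologies:

# Illustrative suffix-to-unit lookup; not SDCStudio's actual mechanism
UNIT_SUFFIXES = {"kg": "kilogram", "m": "metre", "celsius": "degree Celsius",
                 "usd": "US dollar", "pct": "percent"}

def split_units(column_name):
    """Split a header like 'weight_kg' into ('weight', 'kilogram')."""
    base, _, suffix = column_name.rpartition("_")
    if suffix in UNIT_SUFFIXES:
        return base, UNIT_SUFFIXES[suffix]
    return column_name, None

print(split_units("weight_kg"))   # ('weight', 'kilogram')
print(split_units("status"))      # ('status', None)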

Factors Affecting AI Quality

✅ Better AI Results:
  • Clear, descriptive column names (customer_email, not col1)
  • Consistent data formats throughout a column
  • Sufficient data samples (10+ rows recommended)
  • Clean, validated data before upload
  • Meaningful value patterns
  • Uploaded ontologies relevant to your domain!
  • Standard vocabulary usage (e.g., ISO codes)

❌ AI Challenges:
  • Ambiguous column names (data1, field_a, col2)
  • Mixed data types within the same column
  • Very sparse data (lots of nulls)
  • Inconsistent date/number formats
  • Too few samples (< 5 rows)
  • Special characters or encoding issues
  • No ontologies uploaded (generic suggestions only)

How Ontologies Improve AI

Without Ontologies:
  • Generic type detection
  • Basic pattern matching
  • Simple validation rules
  • Generic labels

With Ontologies (Uploaded in Settings):
  • Domain-specific type mapping
  • Semantic vocabulary alignment
  • Standard terminology usage
  • Richer validation rules from domain knowledge
  • Better labels and descriptions
  • Ontology property mappings

Example - Healthcare Data with FHIR Ontology:

patient_mrn,diagnosis_code,procedure_date
MRN12345,E11.9,2024-01-15
  • Without ontology: Generic text/date types
  • With FHIR ontology:
     • patient_mrn → FHIR Patient.identifier
     • diagnosis_code → FHIR Condition.code (ICD-10)
     • procedure_date → FHIR Procedure.performedDateTime

Best Practices

Prepare Your Data

Column Names:
  • Descriptive: customer_email, not email1
  • Consistent: first_name, last_name (not firstName, surname)
  • No Special Characters: order_date, not order-date or order.date
  • Lowercase with Underscores: total_price_usd for readability
  • Include Units: weight_kg, distance_m, temp_celsius
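If your headers don't follow these conventions yet, renaming them can be scripted. A small illustrative normalizer (the regex rules simply mirror the list above):

import re

def normalize_column(name):
    """Normalize a header to lowercase_with_underscores (illustrative)."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)   # split camelCase
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)            # dots, dashes, spaces -> _
    return name.strip("_").lower()

for raw in ["firstName", "order-date", "Total Price (USD)"]:
    print(raw, "->", normalize_column(raw))
# firstName -> first_name
# order-date -> order_date
# Total Price (USD) -> total_price_usd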

Data Quality:
  • Remove exact duplicate rows
  • Handle missing values consistently (use empty string or a standard null)
  • Use ISO 8601 for dates: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS
  • Ensure consistent data types per column (no mixing)
  • Validate data integrity before upload
  • Use standard codes (ISO country codes, currency codes, etc.)

Data Volume:
  • Minimum: 10 rows for reliable AI analysis
  • Recommended: 20-100 rows for best type inference
  • Include Edge Cases: Min/max values, boundary conditions
  • First Upload: Under 10,000 rows for faster processing
  • Full Dataset: Can upload more data after the model is validated

File Preparation

CSV Files:

✅ Good CSV Example:
customer_id,first_name,last_name,email,signup_date,status
1001,Jane,Smith,jane@example.com,2024-01-15,active
1002,John,Doe,john@example.com,2024-02-20,active
1003,Maria,Garcia,maria@example.com,2024-03-10,inactive

❌ Bad CSV Example:
Col1,Col2,Col3,Col4,Col5,Col6
1001,Jane,Smith,jane@example.com,1/15/24,1
,John,Doe,john@example.com,2-20-2024,
1003,Maria,Garcia,"maria@example.com",03-10-24,0

Issues in Bad Example:
  • ❌ Generic column names (Col1, Col2)
  • ❌ Inconsistent date formats
  • ❌ Missing values without a clear pattern
  • ❌ Numeric status codes instead of descriptive values

Encoding Best Practices:
  • ✅ Always use UTF-8 encoding
  • ✅ Test with international characters (é, ñ, ü, 中文)
  • ❌ Avoid proprietary encodings (Windows-1252, etc.)
  • ❌ Don't use Excel's default encoding (it can corrupt UTF-8)
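If a legacy export is not UTF-8 and you know its actual encoding, re-encoding is straightforward; for example, for a Windows-1252 file (the file names are placeholders):

# Re-encode a legacy Windows-1252 export to UTF-8
# (the source encoding is an assumption - check with your export tool)
with open("export.csv", encoding="cp1252") as src:
    text = src.read()
with open("export_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)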

CSV Structure Requirements:
  • One header row at the very top
  • Consistent number of columns per row
  • Quote text fields that contain commas: "Smith, Jr."
  • Escape quotes in quoted fields: "She said ""hello"""
  • No empty rows between data rows
  • No extra columns or merged cells
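You rarely need to apply the quoting and escaping rules by hand: any standard CSV writer does it for you. For example, Python's csv module produces correctly quoted output:

import csv

rows = [
    ["customer_id", "name", "note"],
    [1001, 'Smith, Jr.', 'She said "hello"'],   # comma and quotes handled for you
]
with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
# Data row is written as: 1001,"Smith, Jr.","She said ""hello"""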

Markdown Template Best Practices:
  • Use the official templates (Form2SDCTemplate or SDCObsidianTemplate)
  • Follow a consistent header hierarchy (H1, H2, H3)
  • Include clear component definitions
  • Document validation rules explicitly
  • Add examples for clarity
  • Version control your templates with Git

Upload Workflow Tips

  1. Start Small:
     • Upload a sample of your data first (50-100 rows)
     • Validate the generated model
     • Refine your data based on results
     • Upload the full dataset after validation

  2. Use Settings First:
     • Upload relevant ontologies before data upload
     • Configure profile information
     • This improves AI suggestions significantly

  3. Organize by Project:
     • Group related datasets in the same project
     • AI learns from previous uploads in the project
     • Reuse components across datasets

  4. Iterate and Refine:
     • Review AI-generated components
     • Customize validation rules
     • Add business context and documentation
     • The system learns from your corrections

Troubleshooting

Upload Fails

"File too large" Error: - Cause: File exceeds 50 MB limit - Solution: - Reduce rows to under 10,000 - Split into multiple files - Remove unnecessary columns - Contact support@axius-sdc.com for bulk upload options

"Invalid file format" Error: - Cause: File is not CSV or Markdown - Solution: - Verify file extension is .csv or .md - Check file isn't corrupted (open in text editor) - Ensure UTF-8 encoding - Convert JSON/XML/Excel to CSV - Don't upload PDF or DOCX files

"Upload timeout" Error: - Cause: Network or server issue - Solution: - Check internet connection stability - Try smaller file (reduce rows) - Clear browser cache and cookies - Try different browser (Chrome recommended) - Disable browser extensions temporarily - Try again during off-peak hours

Parsing Fails

Status Stuck on PARSING:
  • Wait: Give it 2-3 minutes
  • Refresh: Click browser refresh or check the status again
  • Try Again: Re-upload if still stuck after 5 minutes
  • Check File: Open the CSV in a text editor to verify the format
  • Contact: support@axius-sdc.com if the problem persists

Status Shows ERROR:
  • View Details: Click on the data source name for the error log
  • Common Issues:
     • Malformed CSV: Missing quotes around text with commas
     • Extra Columns: Some rows have more columns than the header
     • Invalid Encoding: Non-UTF-8 characters
     • Empty File: File has headers but no data rows
     • No Headers: CSV missing the header row
  • Fix and Re-upload: Correct the issue in the source file and upload again
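Most of these structural issues can be pinpointed locally before re-uploading. A small diagnostic sketch (not an SDCStudio tool; the file name is a placeholder):

import csv

def diagnose(path):
    """Report the common parsing problems listed above, row by row."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        rows = list(csv.reader(f))
    if not rows:
        print("Empty file: no header row")
        return
    header, data = rows[0], rows[1:]
    if not data:
        print("Headers but no data rows")
    for lineno, row in enumerate(data, start=2):      # line 1 is the header
        if not row:
            print(f"Line {lineno}: empty row")
        elif len(row) != len(header):
            print(f"Line {lineno}: {len(row)} columns, expected {len(header)}")

diagnose("upload.csv")  # hypothetical file name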

Agent Processing Fails

Status Shows AGENT_ERROR:

First Action: Click the "Retry" button
  • Most AI processing issues are transient
  • Retry succeeds 80% of the time
  • No need to re-upload the file

If Retry Fails:

  1. Check the Error Log:
     • Click on the data source name
     • View the processing logs
     • Look for specific error messages

  2. Common Causes:
     • Very ambiguous column names: Rename columns to be more descriptive
     • Extremely mixed data: Clean data to have consistent types
     • Insufficient samples: Add more data rows (minimum 10)
     • LLM service unavailable: Temporary service issue, retry later
     • No ontologies: Upload relevant ontologies in Settings first

  3. Solutions:
     • Improve column names: Use descriptive names
     • Clean your data: Ensure type consistency
     • Add more rows: Include at least 10-20 samples
     • Upload ontologies: Helps the AI understand your domain
     • Simplify first: Try a subset of columns first

Processing Takes Too Long (> 10 minutes):
  • Wait a bit more: Large files (1000+ rows, 50+ columns) can take 10-15 minutes
  • Refresh the page: Check whether the status actually updated
  • Check system status: It may be a high-load period
  • Contact support: If processing exceeds 20 minutes

After Processing

Once your file shows COMPLETED status with green badge:

1. Review the Generated Data Model

  • Navigate: Click on "Data Models" tab in your project
  • Find Your Model: Named after your uploaded file
  • Examine Components: Click through each generated component
  • Check Types: Verify XdString, XdCount, XdTemporal assignments
  • Review Validation: Check min/max, patterns, required fields

2. Customize Components

  • Edit: Click "Edit" on any component
  • Adjust: Modify types, validation rules, descriptions
  • Enhance: Add business context and documentation
  • Relate: Link components with relationships
  • See: Data Modeling Guide for details

3. Add Semantic Enrichment

  • Ontology Links: Map components to ontology concepts
  • Standard Vocabularies: Align with FHIR, NIEM, SNOMED, etc.
  • Documentation: Add rich descriptions and examples
  • See: Semantic Enhancement Guide

4. Publish Your Model

  • When Satisfied: Review all components thoroughly
  • Publish: Makes model immutable and enables generation
  • Generate Outputs: XSD, XML, JSON, RDF, SHACL, GQL, HTML
  • See: Generating Outputs Guide

Template Resources

Official Templates

Form2SDCTemplate:
  • Repository: https://github.com/SemanticDataCharter/Form2SDCTemplate
  • Documentation: See the repository README
  • Use Case: Google Forms → SDC4 data models
  • Best For: Collaborative design, rapid prototyping

SDCObsidianTemplate:
  • Repository: https://github.com/SemanticDataCharter/SDCObsidianTemplate
  • Documentation: See the repository README
  • Use Case: Obsidian notes → SDC4 data models
  • Best For: Knowledge management, complex domains, documentation-driven design

Getting Started with Templates

  1. Clone repository or download ZIP
  2. Review examples included in repository
  3. Customize template for your domain
  4. Export Markdown file
  5. Upload to SDCStudio via Data Sources
  6. AI processes and creates your model

Next Steps

Learn More

Try It Out

  • Upload a CSV: Start with simple tabular data
  • Try a Template: Use Form2SDCTemplate or SDCObsidianTemplate
  • Upload Ontologies First: Configure Settings before uploading data
  • Iterate: Review, customize, publish, generate

Getting Help

  • Troubleshooting Guide - Common issues and solutions
  • Template Repositories: See READMEs for detailed instructions
  • Support Email: support@axius-sdc.com
  • Community: Join discussions and share templates

Ready to upload? Make sure you've uploaded your ontologies in Settings first, then head to your project and click "Upload Data" to get started!