Uploading and Processing Data
Overview
SDCStudio's intelligent data processing pipeline analyzes your uploaded files and automatically creates comprehensive data models. This guide explains how to upload files, understand the two-stage processing system, and get the best results from AI analysis.
Supported File Formats
SDCStudio currently supports two file formats for data upload:
CSV Files (Recommended for Tabular Data)
Best For: Tabular data, spreadsheets, database exports, structured datasets
Requirements:
- Must have a header row with column names
- UTF-8 encoding required
- Comma-separated values
- Consistent data types per column

Why CSV?
- Fast Processing: Quick parsing and analysis
- Clear Structure: Columns and types easily detected
- Best AI Results: Excellent for type inference and validation
- Universal Format: Works with any spreadsheet or database
Example:
product_id,product_name,price,in_stock,last_updated
101,Laptop,999.99,true,2024-10-01
102,Mouse,29.99,true,2024-10-02
103,Keyboard,79.99,false,2024-10-03
When to Use CSV:
- Exporting data from databases
- Spreadsheet data with clear columns
- Large datasets with consistent structure
- First-time users learning SDCStudio
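Because the header row names each field, a well-formed CSV can be read programmatically exactly as SDCStudio will interpret it. A minimal Python sketch reading the example above (the file name is hypothetical):

```python
# Reading the example CSV above with Python's csv module
# (file name is hypothetical; the header row drives the field names)
import csv

with open("products.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        # consistent types per column make values safe to convert
        print(row["product_name"], float(row["price"]), row["in_stock"] == "true")
```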
Markdown Templates (Recommended for Data Models from Documentation)
Best For: Structured specifications, data dictionaries, documentation-driven development, domain models
Requirements:
- Structured with clear headers and sections
- Follows SDC template format
- UTF-8 encoding
- Consistent formatting

Why Markdown Templates?
- Documentation First: Design your model in readable documentation
- Human Readable: Easy to review and collaborate
- Version Control: Works great with Git
- Flexible: Complex hierarchies and relationships
- Domain Modeling: Perfect for capturing domain knowledge
Template Repositories:
We provide two official Markdown template systems:
1. Form2SDCTemplate (Google Forms to SDC)
Repository: https://github.com/SemanticDataCharter/Form2SDCTemplate
Purpose: Convert Google Forms into SDC4-compliant data models
Best For:
- Rapid data model prototyping
- Collaborative model design with stakeholders
- Survey-based data structures
- Form-driven workflows

How It Works:
1. Create a Google Form with your data fields
2. Export to the SDC template format
3. Upload the Markdown template to SDCStudio
4. AI generates your data model
2. SDCObsidianTemplate (Obsidian Vault Templates)
Repository: https://github.com/SemanticDataCharter/SDCObsidianTemplate
Purpose: Use Obsidian note-taking app to create SDC4 data models
Best For:
- Knowledge management workflows
- Complex domain modeling
- Linked data structures
- Documentation-driven design
- Version-controlled model specifications

How It Works:
1. Use the Obsidian vault template
2. Create data model specifications as notes
3. Link and organize your model components
4. Export Markdown and upload to SDCStudio
5. AI processes your structured specification

Why Obsidian?
- Visual graph view of model relationships
- Bi-directional linking between components
- Rich Markdown editing with templates
- Local files with Git integration
- Plugin ecosystem for enhancements
Unsupported Formats
The following formats are not currently supported:
- ❌ JSON: Planned for future release
- ❌ XML: Planned for future release
- ❌ PDF: Not supported for data upload
- ❌ DOCX: Not supported for data upload
- ❌ Excel (.xlsx): Export to CSV first
Workarounds:
- Excel/Google Sheets: Export as CSV
- Databases: Export query results as CSV
- JSON/XML Data: Convert to CSV or Markdown template
- PDF Documents: Extract tables to CSV or transcribe to Markdown
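For the JSON workaround, a flat array of objects converts to CSV in a few lines. A sketch assuming Python and a flat record structure (file names are hypothetical):

```python
# json_to_csv.py - convert a flat JSON array of objects to CSV
# (assumes a flat list of records; file names are hypothetical)
import csv
import json

with open("records.json", encoding="utf-8") as f:
    records = json.load(f)   # e.g. [{"id": 1, "name": "Laptop"}, ...]

fieldnames = sorted({key for record in records for key in record})
with open("records.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)   # missing keys become empty fields
```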
Upload Process
Step-by-Step Upload
1. Navigate to Your Project:
   - Sign in to the SDCStudio React interface
   - Click "Projects" in the navigation
   - Open the project where you want to upload data
   - Click the "Data Sources" tab

2. Start Upload:
   - Click the "Upload Data" or "Upload New Data" button
   - A file picker dialog will appear
   - Or drag and drop your file onto the upload area

3. Select Your File:
   - Choose your CSV or Markdown (.md) file
   - Verify the file name is correct
   - Maximum 50 MB per file (10 MB recommended)

4. Confirm Upload:
   - Review the file name and size
   - Click "Upload" or "Upload and Process"
   - You'll see the file appear in your Data Sources list

5. Monitor Progress:
   - Watch the status badge change in real-time
   - The React interface updates automatically (every 60 seconds)
   - Click on the data source name for detailed processing logs
   - No manual refresh needed!
File Size Limits
- Maximum File Size: 50 MB per file
- Recommended Size: Under 10 MB for faster processing
- Large Files: Contact support for bulk data processing
For Large Datasets:
- Split into multiple smaller CSV files
- Upload representative samples first
- Validate the model with samples before processing the full dataset
- Contact support@axius-sdc.com for increased limits
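Splitting a large CSV is easy to script. A minimal Python sketch (the 10,000-row chunk size is an assumption; tune it to stay under the limits above):

```python
# split_csv.py - split a large CSV into upload-sized chunks
# (the 10,000-row chunk size is an assumption; tune it to your data)
import csv

def split_csv(path, rows_per_file=10_000):
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        out, writer, part = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:   # start a new chunk
                if out:
                    out.close()
                part += 1
                out = open(f"{path.rsplit('.', 1)[0]}_part{part}.csv",
                           "w", encoding="utf-8", newline="")
                writer = csv.writer(out)
                writer.writerow(header)   # every chunk keeps the header row
            writer.writerow(row)
        if out:
            out.close()

split_csv("big_dataset.csv")
```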
Two-Stage Processing Pipeline
SDCStudio uses a sophisticated two-stage pipeline that balances speed with comprehensive AI analysis.
Stage 1: Structural Parsing
Duration: 30 seconds - 2 minutes
Status Flow: UPLOADED → PARSING → PARSED
What Happens
File Reading:
- Format detection (CSV vs Markdown)
- Character encoding detection (UTF-8 validation)
- Structure analysis and validation

For CSV Files:
- Header row extraction
- Column identification and naming
- Row count and data sampling
- Initial delimiter detection

For Markdown Templates:
- Header hierarchy parsing
- Section structure analysis
- Component definition extraction
- Relationship mapping

Initial Type Inference:
- Numeric detection (integers, decimals)
- Date/time pattern recognition
- Text pattern analysis
- Boolean value detection
- Email/phone/URL pattern detection
Output:
- ParsedData record created in database
- Column/component information stored
- Basic structure mapped and cached
- Ready for AI enhancement in Stage 2
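SDCStudio's parser is internal to the platform, but the kind of type-inference heuristics Stage 1 describes can be illustrated in a few lines. A toy Python sketch, not the actual implementation:

```python
# infer_type.py - a toy version of the Stage 1 heuristics described above
# (illustrative only; SDCStudio's real parser is internal to the platform)
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def infer_type(values):
    """Guess a column type from its non-empty sample values."""
    values = [v for v in values if v != ""]
    if not values:
        return "unknown"
    if all(v.lower() in ("true", "false") for v in values):
        return "boolean"
    if all(re.fullmatch(r"-?\d+", v) for v in values):
        return "integer"
    if all(re.fullmatch(r"-?\d+\.\d+", v) for v in values):
        return "decimal"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
        return "date"
    if all(EMAIL.match(v) for v in values):
        return "email"
    return "text"

print(infer_type(["999.99", "29.99", "79.99"]))   # decimal
print(infer_type(["2024-10-01", "2024-10-02"]))   # date
```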
What You See in React Interface
- Status badge changes from UPLOADED to PARSING
- Typically completes within 1 minute
- Badge turns to PARSED when complete
- No user action required - fully automatic
- Progress updates in real-time
Stage 2: AI Enhancement
Duration: 1-5 minutes (depends on complexity and ontologies)
Status Flow: PARSED → AGENT_PROCESSING → COMPLETED
What Happens
Multi-Agent Processing System:
1. DataModelAgent (Orchestrator):
   - Creates the main data model container
   - Coordinates all specialized agents
   - Manages workflow and state
   - Uses your uploaded ontologies for context

2. ClusterAgent (Structure):
   - Creates the main data cluster
   - Organizes components logically
   - Defines cluster hierarchy
   - Groups related fields

3. Type-Specific Agents (Component Creation):
   - XdStringAgent: Processes text columns
   - XdCountAgent: Processes integer columns
   - XdQuantityAgent: Processes decimal/measurement columns
   - XdTemporalAgent: Processes date/time columns
   - XdBooleanAgent: Processes true/false columns
   - XdFloatAgent: Processes floating-point numbers
AI Analysis Per Column (Using Your Ontologies!):
1. Semantic Context Understanding:
   - Column name interpretation
   - Business meaning inference
   - Ontology matching from your uploaded vocabularies
   - Domain-specific terminology recognition

2. Pattern Recognition:
   - Email format detection
   - Phone number patterns
   - URL/URI patterns
   - Date format standardization
   - Enumeration value detection

3. Validation Rule Suggestions:
   - Min/max value ranges
   - Required vs optional fields
   - Format patterns (regex)
   - Allowed value lists (enumerations)
   - Data quality constraints

4. Semantic Linking:
   - Matches to ontology concepts
   - Standard vocabulary alignment
   - FHIR, NIEM, SNOMED, LOINC mappings
   - Custom organization vocabularies
Component Creation:
- Individual SDC4 component for each column
- Appropriate data type assignment (XdString, XdCount, etc.)
- Validation rules and constraints applied
- Human-readable labels and descriptions
- Semantic definitions from ontologies
- Documentation and examples
What You See in React Interface
- Status badge changes to AGENT_PROCESSING
- Processing typically takes 1-5 minutes
- Longer for complex files or many columns
- Your ontologies are working here! (Phase 2)
- Badge turns green and shows COMPLETED when done
- Or shows red AGENT_ERROR if there's an issue
- Automatic updates - the interface refreshes every 60 seconds
Status Reference
| Status | Badge Color | Meaning | Duration | Action |
|---|---|---|---|---|
| UPLOADED | Blue | File received by server | < 10 sec | Auto: starts parsing |
| PARSING | Blue | Reading file structure | 30s - 2min | Wait - automatic |
| PARSED | Blue | Structure analyzed | < 10 sec | Auto: starts AI processing |
| AGENT_PROCESSING | Yellow | AI creating components with ontologies | 1-5 min | Wait - AI working |
| COMPLETED | Green ✅ | Model ready to review | - | Review and customize |
| AGENT_ERROR | Red ⚠️ | AI processing failed | - | Click "Retry" button |
| ERROR | Red ❌ | File processing failed | - | Check file, re-upload |
Understanding AI Analysis
What the AI Looks For
Column Analysis Process
Data Type Detection:
- Statistical analysis of all values
- Pattern recognition across rows
- Format consistency checking
- Range and distribution analysis

Semantic Understanding:
- Column name interpretation (e.g., "email" → email validation)
- Value pattern analysis (e.g., @ symbol → email type)
- Ontology vocabulary matching (your uploaded ontologies!)
- Context from neighboring columns
- Business domain recognition

Validation Rules:
- Automatic min/max value detection
- Required vs optional inference (null count)
- Format pattern generation (regex)
- Enumeration detection (limited value sets)
- Data quality constraints
Example: Email Column
Input CSV Data:
email
user1@example.com
user2@example.com
admin@test.org
AI Analysis Output:
- Type: XdString (text)
- Pattern Detected: Email format with @ and domain
- Validation Rule: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- Max Length: 320 characters (RFC 5321 standard)
- Required: Yes (no null values found in sample)
- Label: "Email Address" (inferred from column name)
- Ontology Match: schema.org email property (if ontology uploaded)
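You can apply the suggested pattern locally before uploading to see which values would fail. A small Python check using the regex shown above:

```python
# Validate sample values against the email pattern suggested above
import re

EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

for value in ["user1@example.com", "admin@test.org", "not-an-email"]:
    print(value, "->", bool(EMAIL_PATTERN.match(value)))
```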
Example: Date Column
Input CSV Data:
signup_date
2024-01-15
2024-02-20
2024-03-10
AI Analysis Output:
- Type: XdTemporal (date/time)
- Format Detected: ISO 8601 date (YYYY-MM-DD)
- Validation: Must be valid calendar date
- Range Detected: 2024-01-15 to 2024-03-10
- Required: Yes
- Label: "Signup Date" or "Registration Date"
- Semantic: Temporal event marker
Example: Status Column (Enumeration)
Input CSV Data:
status
active
inactive
active
active
AI Analysis Output:
- Type: XdString (text)
- Pattern Detected: Categorical/enumeration with limited values
- Allowed Values: ["active", "inactive"]
- Default Value: "active" (most common, 75% frequency)
- Required: Yes
- Label: "Status" or "Account Status"
- Enumeration: Two-state categorization detected
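Enumeration detection of this kind can be approximated by counting distinct values. A toy Python sketch, not SDCStudio's actual logic (the max_distinct cutoff is an assumption):

```python
# Toy enumeration detector (illustrative only, not SDCStudio's actual logic)
from collections import Counter

def detect_enumeration(values, max_distinct=10):
    """Flag a column as categorical when it repeats a small set of values."""
    counts = Counter(values)
    if len(counts) <= max_distinct and len(counts) < len(values):
        default, freq = counts.most_common(1)[0]   # most frequent value
        return {
            "allowed_values": sorted(counts),
            "default": default,
            "default_frequency": freq / len(values),
        }
    return None   # looks free-form, not an enumeration

print(detect_enumeration(["active", "inactive", "active", "active"]))
# {'allowed_values': ['active', 'inactive'], 'default': 'active', 'default_frequency': 0.75}
```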
Example: Quantity with Units
Input CSV Data:
weight_kg
65.5
72.0
58.3
AI Analysis Output:
- Type: XdQuantity (quantified measurement)
- Units Detected: kilograms (from column name "_kg")
- Units: kg (standardized)
- Range: 58.3 - 72.0
- Precision: 1 decimal place
- Required: Likely yes
- Label: "Weight"
- Semantic: Physical measurement with units
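Unit inference from column-name suffixes can be mimicked with a simple lookup. A toy sketch; the suffix table and unit codes are assumptions for illustration, not SDCStudio's mapping:

```python
# Toy unit extraction from column-name suffixes; the suffix table is an
# assumption for illustration, not SDCStudio's actual mapping
UNIT_SUFFIXES = {"_kg": "kg", "_m": "m", "_celsius": "Cel", "_usd": "USD"}

def extract_units(column_name):
    for suffix, unit in UNIT_SUFFIXES.items():
        if column_name.endswith(suffix):
            label = column_name[: -len(suffix)].replace("_", " ").title()
            return label, unit
    return column_name.replace("_", " ").title(), None

print(extract_units("weight_kg"))   # ('Weight', 'kg')
```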
Factors Affecting AI Quality
✅ Better AI Results:
- Clear, descriptive column names (customer_email not col1)
- Consistent data formats throughout column
- Sufficient data samples (10+ rows recommended)
- Clean, validated data before upload
- Meaningful value patterns
- Uploaded ontologies relevant to your domain!
- Standard vocabulary usage (e.g., ISO codes)
❌ AI Challenges:
- Ambiguous column names (data1, field_a, col2)
- Mixed data types within same column
- Very sparse data (lots of nulls)
- Inconsistent date/number formats
- Too few samples (< 5 rows)
- Special characters or encoding issues
- No ontologies uploaded (generic suggestions only)
How Ontologies Improve AI
Without Ontologies:
- Generic type detection
- Basic pattern matching
- Simple validation rules
- Generic labels

With Ontologies (Uploaded in Settings):
- Domain-specific type mapping
- Semantic vocabulary alignment
- Standard terminology usage
- Richer validation rules from domain knowledge
- Better labels and descriptions
- Ontology property mappings
Example - Healthcare Data with FHIR Ontology:
patient_mrn,diagnosis_code,procedure_date
MRN12345,E11.9,2024-01-15
- Without ontology: Generic text/date types
- With FHIR ontology:
  - patient_mrn → FHIR Patient.identifier
  - diagnosis_code → FHIR Condition.code (ICD-10)
  - procedure_date → FHIR Procedure.performedDateTime
Best Practices
Prepare Your Data
Column Names:
- Descriptive: customer_email not email1
- Consistent: first_name, last_name (not firstName, surname)
- No Special Characters: order_date not order-date or order.date
- Lowercase with Underscores: total_price_usd for readability
- Include Units: weight_kg, distance_m, temp_celsius
Data Quality:
- Remove exact duplicate rows
- Handle missing values consistently (use empty string or standard null)
- Use ISO 8601 for dates: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS
- Ensure consistent data types per column (no mixing)
- Validate data integrity before upload
- Use standard codes (ISO country codes, currency codes, etc.)
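Most of these cleanup steps are one-liners in pandas (using pandas here is an assumption; any tool that writes clean UTF-8 CSV works, and the column names are hypothetical):

```python
# clean_before_upload.py - cleanup sketch using pandas (pandas and the
# column names are assumptions; any tool that writes clean UTF-8 CSV works)
import pandas as pd

df = pd.read_csv("raw_export.csv")

df = df.drop_duplicates()                              # remove exact duplicate rows
df["signup_date"] = pd.to_datetime(df["signup_date"]).dt.strftime("%Y-%m-%d")  # ISO 8601
df["email"] = df["email"].str.strip().str.lower()      # consistent text formatting

df.to_csv("clean_export.csv", index=False, encoding="utf-8")
```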
Data Volume:
- Minimum: 10 rows for reliable AI analysis
- Recommended: 20-100 rows for best type inference
- Include Edge Cases: Min/max values, boundary conditions
- First Upload: Under 10,000 rows for faster processing
- Full Dataset: Can upload more data after the model is validated
File Preparation
CSV Files:
✅ Good CSV Example:
customer_id,first_name,last_name,email,signup_date,status
1001,Jane,Smith,jane@example.com,2024-01-15,active
1002,John,Doe,john@example.com,2024-02-20,active
1003,Maria,Garcia,maria@example.com,2024-03-10,inactive
❌ Bad CSV Example:
Col1,Col2,Col3,Col4,Col5,Col6
1001,Jane,Smith,jane@example.com,1/15/24,1
,John,Doe,john@example.com,2-20-2024,
1003,Maria,Garcia,"maria@example.com",03-10-24,0
Issues in Bad Example:
- ❌ Generic column names (Col1, Col2)
- ❌ Inconsistent date formats
- ❌ Missing values without clear pattern
- ❌ Numeric status instead of descriptive values

Encoding Best Practices:
- ✅ Always use UTF-8 encoding
- ✅ Test with international characters (é, ñ, ü, 中文)
- ❌ Avoid proprietary encoding (Windows-1252, etc.)
- ❌ Don't use Excel default encoding (can corrupt UTF-8)
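If a file arrives in a legacy encoding, re-saving it as UTF-8 is straightforward. A Python sketch assuming the source is Windows-1252; adjust the codec (and file names) to match your file:

```python
# reencode_utf8.py - re-save a legacy-encoded CSV as UTF-8
# (the Windows-1252 source encoding is an assumption; adjust to your file)
with open("legacy_export.csv", encoding="windows-1252") as src:
    text = src.read()
with open("utf8_export.csv", "w", encoding="utf-8") as dst:
    dst.write(text)
```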
CSV Structure Requirements:
- One header row at the very top
- Consistent number of columns per row
- Quote text fields that contain commas: "Smith, Jr."
- Escape quotes in quoted fields: "She said ""hello"""
- No empty rows between data rows
- No extra columns or merged cells
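Python's csv module applies these quoting and escaping rules automatically when writing, so you rarely need to hand-quote fields:

```python
# csv.writer applies the quoting and escaping rules above automatically
import csv

rows = [
    ["customer_id", "full_name", "note"],
    [1001, "Smith, Jr.", 'She said "hello"'],   # comma and quotes handled for you
]
with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```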
Markdown Template Best Practices:
- Use official templates (Form2SDCTemplate or SDCObsidianTemplate)
- Follow consistent header hierarchy (H1, H2, H3)
- Include clear component definitions
- Document validation rules explicitly
- Add examples for clarity
- Version control your templates with Git
Upload Workflow Tips
1. Start Small:
   - Upload a sample of your data first (50-100 rows)
   - Validate the generated model
   - Refine your data based on results
   - Upload the full dataset after validation

2. Use Settings First:
   - Upload relevant ontologies before data upload
   - Configure profile information
   - This improves AI suggestions significantly

3. Organize by Project:
   - Group related datasets in the same project
   - AI learns from previous uploads in the project
   - Reuse components across datasets

4. Iterate and Refine:
   - Review AI-generated components
   - Customize validation rules
   - Add business context and documentation
   - The system learns from your corrections
Troubleshooting
Upload Fails
"File too large" Error: - Cause: File exceeds 50 MB limit - Solution: - Reduce rows to under 10,000 - Split into multiple files - Remove unnecessary columns - Contact support@axius-sdc.com for bulk upload options
"Invalid file format" Error:
- Cause: File is not CSV or Markdown
- Solution:
- Verify file extension is .csv or .md
- Check file isn't corrupted (open in text editor)
- Ensure UTF-8 encoding
- Convert JSON/XML/Excel to CSV
- Don't upload PDF or DOCX files
"Upload timeout" Error: - Cause: Network or server issue - Solution: - Check internet connection stability - Try smaller file (reduce rows) - Clear browser cache and cookies - Try different browser (Chrome recommended) - Disable browser extensions temporarily - Try again during off-peak hours
Parsing Fails
Status Stuck on PARSING:
- Wait: Give it 2-3 minutes
- Refresh: Click browser refresh or check status
- Try Again: Re-upload if still stuck after 5 minutes
- Check File: Open CSV in text editor to verify format
- Contact: support@axius-sdc.com if persists
Status Shows ERROR:
- View Details: Click on data source name for error log
- Common Issues:
  - Malformed CSV: Missing quotes around text with commas
  - Extra Columns: Some rows have more columns than the header
  - Invalid Encoding: Non-UTF-8 characters
  - Empty File: File has headers but no data rows
  - No Headers: CSV missing header row
- Fix and Re-upload: Correct the issue in the source file and upload again
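Before re-uploading, a quick local check can pinpoint most of these structural issues. A minimal Python sketch (the file name is hypothetical):

```python
# validate_structure.py - check for the common ERROR causes listed above
# (file name is hypothetical)
import csv

def validate(path):
    problems = []
    try:
        with open(path, encoding="utf-8", newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if not header:
                return ["Empty file: no header row"]
            rows = list(reader)
            if not rows:
                problems.append("File has headers but no data rows")
            for i, row in enumerate(rows, start=2):   # row 1 is the header
                if len(row) != len(header):
                    problems.append(f"Row {i}: {len(row)} columns, header has {len(header)}")
    except UnicodeDecodeError:
        problems.append("Invalid encoding: file is not UTF-8")
    return problems or ["No structural problems found"]

for message in validate("my_data.csv"):
    print(message)
```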
Agent Processing Fails
Status Shows AGENT_ERROR:
First Action: Click the "Retry" button
- Most AI processing issues are transient
- Retry succeeds 80% of the time
- No need to re-upload the file

If Retry Fails:

1. Check Error Log:
   - Click on the data source name
   - View processing logs
   - Look for specific error messages

2. Common Causes:
   - Very ambiguous column names: Rename columns to be more descriptive
   - Extremely mixed data: Clean data to have consistent types
   - Insufficient samples: Add more data rows (minimum 10)
   - LLM service unavailable: Temporary service issue, retry later
   - No ontologies: Upload relevant ontologies in Settings first

3. Solutions:
   - Improve column names: Use descriptive names
   - Clean your data: Ensure type consistency
   - Add more rows: Include at least 10-20 samples
   - Upload ontologies: Helps AI understand domain
   - Simplify first: Try a subset of columns first
Processing Takes Too Long (> 10 minutes):
- Wait a bit more: Large files (1000+ rows, 50+ columns) can take 10-15 minutes
- Refresh the page: Check if the status actually updated
- Check system status: May be a high-load period
- Contact support: If processing exceeds 20 minutes
After Processing
Once your file shows COMPLETED status with green badge:
1. Review the Generated Data Model
- Navigate: Click on "Data Models" tab in your project
- Find Your Model: Named after your uploaded file
- Examine Components: Click through each generated component
- Check Types: Verify XdString, XdCount, XdTemporal assignments
- Review Validation: Check min/max, patterns, required fields
2. Customize Components
- Edit: Click "Edit" on any component
- Adjust: Modify types, validation rules, descriptions
- Enhance: Add business context and documentation
- Relate: Link components with relationships
- See: Data Modeling Guide for details
3. Add Semantic Enrichment
- Ontology Links: Map components to ontology concepts
- Standard Vocabularies: Align with FHIR, NIEM, SNOMED, etc.
- Documentation: Add rich descriptions and examples
- See: Semantic Enhancement Guide
4. Publish Your Model
- When Satisfied: Review all components thoroughly
- Publish: Makes model immutable and enables generation
- Generate Outputs: XSD, XML, JSON, RDF, SHACL, GQL, HTML
- See: Generating Outputs Guide
Template Resources
Official Templates
Form2SDCTemplate:
- Repository: https://github.com/SemanticDataCharter/Form2SDCTemplate
- Documentation: See repository README
- Use Case: Google Forms → SDC4 data models
- Best For: Collaborative design, rapid prototyping

SDCObsidianTemplate:
- Repository: https://github.com/SemanticDataCharter/SDCObsidianTemplate
- Documentation: See repository README
- Use Case: Obsidian notes → SDC4 data models
- Best For: Knowledge management, complex domains, documentation-driven design
Getting Started with Templates
- Clone repository or download ZIP
- Review examples included in repository
- Customize template for your domain
- Export Markdown file
- Upload to SDCStudio via Data Sources
- AI processes and creates your model
Next Steps
Learn More
- Data Modeling Guide - Customize your generated model
- AI Processing Guide - Deep dive into AI analysis
- Semantic Enhancement - Improve AI with ontologies
- Generating Outputs - Create schemas and applications
Try It Out
- Upload a CSV: Start with simple tabular data
- Try a Template: Use Form2SDCTemplate or SDCObsidianTemplate
- Upload Ontologies First: Configure Settings before uploading data
- Iterate: Review, customize, publish, generate
Getting Help
- Troubleshooting Guide - Common issues and solutions
- Template Repositories: See READMEs for detailed instructions
- Support Email: support@axius-sdc.com
- Community: Join discussions and share templates
Ready to upload? Make sure you've uploaded your ontologies in Settings first, then head to your project and click "Upload Data" to get started!