Uploading and Processing Data

Overview

SDCStudio's intelligent data processing pipeline analyzes your uploaded files and automatically creates comprehensive data models. This guide explains how to upload files, understand the two-stage processing system, and get the best results from AI analysis.

Supported File Formats

SDCStudio currently supports two file formats for data upload:

1. CSV Files (.csv)

Best For: Tabular data, spreadsheets, database exports, structured datasets

Requirements:
  • Must have a header row with column names
  • UTF-8 encoding required
  • Comma-separated values
  • Consistent data types per column

Why CSV?
  • Fast Processing: Quick parsing and analysis
  • Clear Structure: Columns and types easily detected
  • Best AI Results: Excellent for type inference and validation
  • Universal Format: Works with any spreadsheet or database

Example:

product_id,product_name,price,in_stock,last_updated
101,Laptop,999.99,true,2024-10-01
102,Mouse,29.99,true,2024-10-02
103,Keyboard,79.99,false,2024-10-03

When to Use CSV:
  • Exporting data from databases
  • Spreadsheet data with clear columns
  • Large datasets with consistent structure
  • First-time users learning SDCStudio
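Before uploading, you can sanity-check a file against these requirements locally. The script below is a hypothetical pre-flight check, not part of SDCStudio, and the file name is a placeholder:

import csv

def preflight_check(path):
    """Hypothetical pre-flight check for the requirements above:
    UTF-8 encoding, a header row, consistent column counts."""
    try:
        with open(path, newline="", encoding="utf-8") as f:  # read fails on non-UTF-8 bytes
            reader = csv.reader(f)
            header = next(reader, None)
            if not header:
                return "Empty file: no header row found"
            widths = {len(row) for row in reader if row}
            if not widths:
                return "Header only: file has no data rows"
            if widths != {len(header)}:
                return (f"Inconsistent columns: header has {len(header)}, "
                        f"data rows have {sorted(widths)}")
            return "Looks OK"
    except UnicodeDecodeError:
        return "Not UTF-8: re-save the file with UTF-8 encoding"

print(preflight_check("products.csv"))  # e.g. the example file above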

2. Markdown Templates (.md)

Best For: Structured specifications, data dictionaries, documentation-driven development, domain models

Requirements:
  • Structured with clear headers and sections
  • Follows the SDC template format
  • UTF-8 encoding
  • Consistent formatting

Why Markdown Templates?
  • Documentation First: Design your model in readable documentation
  • Human Readable: Easy to review and collaborate
  • Version Control: Works great with Git
  • Flexible: Supports complex hierarchies and relationships
  • Domain Modeling: Perfect for capturing domain knowledge

Template Repositories:

We provide two official Markdown template systems:

1. Form2SDCTemplate (Google Forms to SDC)

Repository: https://github.com/SemanticDataCharter/Form2SDCTemplate

Purpose: Convert Google Forms into SDC4-compliant data models

Best For:
  • Rapid data model prototyping
  • Collaborative model design with stakeholders
  • Survey-based data structures
  • Form-driven workflows

How It Works:
  1. Create a Google Form with your data fields
  2. Export to the SDC template format
  3. Upload the Markdown template to SDCStudio
  4. AI generates your data model

2. SDCObsidianTemplate (Obsidian Vault Templates)

Repository: https://github.com/SemanticDataCharter/SDCObsidianTemplate

Purpose: Use Obsidian note-taking app to create SDC4 data models

Best For:
  • Knowledge management workflows
  • Complex domain modeling
  • Linked data structures
  • Documentation-driven design
  • Version-controlled model specifications

How It Works:
  1. Use the Obsidian vault template
  2. Create data model specifications as notes
  3. Link and organize your model components
  4. Export Markdown and upload to SDCStudio
  5. AI processes your structured specification

Why Obsidian?
  • Visual graph view of model relationships
  • Bi-directional linking between components
  • Rich Markdown editing with templates
  • Local files with Git integration
  • Plugin ecosystem for enhancements

Unsupported Formats

The following formats are not currently supported:

  • JSON: Planned for future release
  • XML: Planned for future release
  • PDF: Not supported for data upload
  • DOCX: Not supported for data upload
  • Excel (.xlsx): Export to CSV first

Workarounds:
  • Excel/Google Sheets: Export as CSV
  • Databases: Export query results as CSV
  • JSON/XML Data: Convert to CSV or a Markdown template
  • PDF Documents: Extract tables to CSV or transcribe to Markdown
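For the Excel and JSON cases, a few lines of Python with pandas (assuming pandas and openpyxl are installed; the file names are placeholders) do the conversion:

import pandas as pd

# Excel -> CSV: one CSV per sheet keeps the "one table per file" structure
# (reading .xlsx requires the openpyxl package)
for sheet, df in pd.read_excel("export.xlsx", sheet_name=None).items():
    df.to_csv(f"{sheet}.csv", index=False, encoding="utf-8")

# JSON -> CSV: works for a flat list of records; nested JSON needs flattening first
pd.read_json("records.json").to_csv("records.csv", index=False, encoding="utf-8")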

Upload Process

Step-by-Step Upload

  1. Navigate to Your Project:
     • Sign in to the SDCStudio React interface
     • Click "Projects" in the navigation
     • Open the project where you want to upload data
     • Click the "Data Sources" tab

  2. Start Upload:
     • Click the "Upload Data" or "Upload New Data" button
     • A file picker dialog will appear
     • Or drag and drop your file onto the upload area

  3. Select Your File:
     • Choose your CSV or Markdown (.md) file
     • Verify the file name is correct
     • Maximum 50 MB per file (10 MB recommended)

  4. Confirm Upload:
     • Review the file name and size
     • Click "Upload" or "Upload and Process"
     • You'll see the file appear in your Data Sources list

  5. Monitor Progress:
     • Watch the status badge change in real time
     • The React interface updates automatically (every 60 seconds)
     • Click on the data source name for detailed processing logs
     • No manual refresh needed!

File Size Limits

  • Maximum File Size: 50 MB per file
  • Recommended Size: Under 10 MB for faster processing
  • Large Files: Contact support for bulk data processing

For Large Datasets:
  • Split into multiple smaller CSV files
  • Upload representative samples first
  • Validate the model with samples before processing the full dataset
  • Contact support@axius-sdc.com for increased limits
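Splitting can be scripted so that each part keeps the header row and stays independently uploadable. A minimal sketch (the 5,000-row chunk size and file name are arbitrary choices):

import csv

def split_csv(path, rows_per_file=5000):
    """Split a large CSV into numbered parts, repeating the header
    in each part so every file can be uploaded on its own."""
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, out, writer = 0, None, None
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:               # start a new part file
                if out:
                    out.close()
                part += 1
                out = open(f"{path.rsplit('.', 1)[0]}_part{part}.csv",
                           "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)              # repeat the header row
            writer.writerow(row)
        if out:
            out.close()

split_csv("customers.csv")  # hypothetical file name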

Two-Stage Processing Pipeline

SDCStudio uses a sophisticated two-stage pipeline that balances speed with comprehensive AI analysis.

Stage 1: Structural Parsing

Duration: 30 seconds - 2 minutes
Status Flow: UPLOADED → PARSING → PARSED

What Happens

File Reading:
  • Format detection (CSV vs Markdown)
  • Character encoding detection (UTF-8 validation)
  • Structure analysis and validation

For CSV Files:
  • Header row extraction
  • Column identification and naming
  • Row count and data sampling
  • Initial delimiter detection

For Markdown Templates:
  • Header hierarchy parsing
  • Section structure analysis
  • Component definition extraction
  • Relationship mapping

Initial Type Inference:
  • Numeric detection (integers, decimals)
  • Date/time pattern recognition
  • Text pattern analysis
  • Boolean value detection
  • Email/phone/URL pattern detection

Output:
  • ParsedData record created in the database
  • Column/component information stored
  • Basic structure mapped and cached
  • Ready for AI enhancement in Stage 2
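To build intuition for this stage, here is a deliberately naive sketch of per-column type inference. SDCStudio's actual parser is more thorough; the patterns below are simplified assumptions:

import re

def infer_type(values):
    """Very rough per-column type inference, in the spirit of Stage 1."""
    values = [v for v in values if v != ""]          # ignore blank cells
    checks = [
        ("boolean", lambda v: v.lower() in ("true", "false")),
        ("integer", lambda v: re.fullmatch(r"-?\d+", v)),
        ("decimal", lambda v: re.fullmatch(r"-?\d+\.\d+", v)),
        ("date",    lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
        ("email",   lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v)),
    ]
    for type_name, test in checks:
        if values and all(test(v) for v in values):  # every value must match
            return type_name
    return "text"                                    # safe fallback

print(infer_type(["999.99", "29.99", "79.99"]))   # decimal
print(infer_type(["2024-10-01", "2024-10-02"]))   # date
print(infer_type(["true", "false", "true"]))      # boolean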

What You See in React Interface

  • Status badge changes from UPLOADED to PARSING
  • Typically completes within 1 minute
  • Badge turns to PARSED when complete
  • No user action required - fully automatic
  • Progress updates in real-time

Stage 2: AI Enhancement

Duration: 1-5 minutes (depends on complexity and ontologies)
Status Flow: PARSED → AGENT_PROCESSING → COMPLETED

What Happens

Multi-Agent Processing System:

  1. DataModelAgent (Orchestrator):
     • Creates the main data model container
     • Coordinates all specialized agents
     • Manages workflow and state
     • Uses your uploaded ontologies for context

  2. ClusterAgent (Structure):
     • Creates the main data cluster
     • Organizes components logically
     • Defines cluster hierarchy
     • Groups related fields

  3. Type-Specific Agents (Component Creation):
     • XdStringAgent: Processes text columns
     • XdCountAgent: Processes integer columns
     • XdQuantityAgent: Processes decimal/measurement columns
     • XdTemporalAgent: Processes date/time columns
     • XdBooleanAgent: Processes true/false columns
     • XdFloatAgent: Processes floating-point numbers

AI Analysis Per Column (Using Your Ontologies!):

  1. Semantic Context Understanding:
     • Column name interpretation
     • Business meaning inference
     • Ontology matching from your uploaded vocabularies
     • Domain-specific terminology recognition

  2. Pattern Recognition:
     • Email format detection
     • Phone number patterns
     • URL/URI patterns
     • Date format standardization
     • Enumeration value detection

  3. Validation Rule Suggestions:
     • Min/max value ranges
     • Required vs optional fields
     • Format patterns (regex)
     • Allowed value lists (enumerations)
     • Data quality constraints

  4. Semantic Linking:
     • Matches to ontology concepts
     • Standard vocabulary alignment
     • FHIR, NIEM, SNOMED, LOINC mappings
     • Custom organization vocabularies

Component Creation:
  • Individual SDC4 component for each column
  • Appropriate data type assignment (XdString, XdCount, etc.)
  • Validation rules and constraints applied
  • Human-readable labels and descriptions
  • Semantic definitions from ontologies
  • Documentation and examples

What You See in React Interface

  • Status badge changes to AGENT_PROCESSING
  • Processing typically takes 1-5 minutes
  • Longer for complex files or many columns
  • Your ontologies are working here! (Phase 2)
  • Badge turns green and shows COMPLETED when done
  • Or shows red AGENT_ERROR if there's an issue
  • Automatic updates - interface refreshes every 60 seconds

Status Reference

Status            Badge Color  Meaning                                  Duration     Action
UPLOADED          Blue         File received by server                  < 10 sec     Auto: starts parsing
PARSING           Blue         Reading file structure                   30s - 2min   Wait - automatic
PARSED            Blue         Structure analyzed                       < 10 sec     Auto: starts AI processing
AGENT_PROCESSING  Yellow       AI creating components with ontologies   1-5 min      Wait - AI working
COMPLETED         Green ✅     Model ready to review                    -            Review and customize
AGENT_ERROR       Red ⚠️       AI processing failed                     -            Click "Retry" button
ERROR             Red ❌       File processing failed                   -            Check file, re-upload

Understanding AI Analysis

What the AI Looks For

Column Analysis Process

Data Type Detection:
  • Statistical analysis of all values
  • Pattern recognition across rows
  • Format consistency checking
  • Range and distribution analysis

Semantic Understanding:
  • Column name interpretation (e.g., "email" → email validation)
  • Value pattern analysis (e.g., @ symbol → email type)
  • Ontology vocabulary matching (your uploaded ontologies!)
  • Context from neighboring columns
  • Business domain recognition

Validation Rules:
  • Automatic min/max value detection
  • Required vs optional inference (null count)
  • Format pattern generation (regex)
  • Enumeration detection (limited value sets)
  • Data quality constraints
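As a rough illustration of the validation-rules step, the sketch below derives required/enumeration/range suggestions from sample values. It is a simplification of what the agents actually do, and every threshold in it is an assumption:

def suggest_rules(values, enum_threshold=5):
    """Sketch of how validation rules could be inferred from sample values.
    Illustrative only - SDCStudio's agents use AI plus your ontologies."""
    non_null = [v for v in values if v not in ("", None)]
    rules = {"required": len(non_null) == len(values)}   # no nulls seen in sample
    distinct = set(non_null)
    if len(distinct) <= enum_threshold and len(non_null) > len(distinct):
        rules["enumeration"] = sorted(distinct)          # small, repeated value set
    try:
        nums = [float(v) for v in non_null]              # numeric column -> range
        rules["min"], rules["max"] = min(nums), max(nums)
    except ValueError:                                   # text column -> length limit
        rules["max_length"] = max(len(v) for v in non_null)
    return rules

print(suggest_rules(["active", "inactive", "active", "active"]))
# {'required': True, 'enumeration': ['active', 'inactive'], 'max_length': 8}
print(suggest_rules(["65.5", "72.0", "58.3"]))
# {'required': True, 'min': 58.3, 'max': 72.0}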

Example: Email Column

Input CSV Data:

email
user1@example.com
user2@example.com
admin@test.org

AI Analysis Output:
  • Type: XdString (text)
  • Pattern Detected: Email format with @ and domain
  • Validation Rule: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
  • Max Length: 320 characters (RFC 5321 standard)
  • Required: Yes (no null values found in sample)
  • Label: "Email Address" (inferred from column name)
  • Ontology Match: schema.org email property (if ontology uploaded)
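You can confirm the suggested pattern against your own values before or after upload; for example, in Python:

import re

# The validation rule suggested above, applied to the sample values
EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

for value in ["user1@example.com", "admin@test.org", "not-an-email"]:
    print(value, "->", bool(EMAIL.match(value)))
# user1@example.com -> True
# admin@test.org -> True
# not-an-email -> False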

Example: Date Column

Input CSV Data:

signup_date
2024-01-15
2024-02-20
2024-03-10

AI Analysis Output:
  • Type: XdTemporal (date/time)
  • Format Detected: ISO 8601 date (YYYY-MM-DD)
  • Validation: Must be a valid calendar date
  • Range Detected: 2024-01-15 to 2024-03-10
  • Required: Yes
  • Label: "Signup Date" or "Registration Date"
  • Semantic: Temporal event marker

Example: Status Column (Enumeration)

Input CSV Data:

status
active
inactive
active
active

AI Analysis Output:
  • Type: XdString (text)
  • Pattern Detected: Categorical/enumeration with limited values
  • Allowed Values: ["active", "inactive"]
  • Default Value: "active" (most common, 75% frequency)
  • Required: Yes
  • Label: "Status" or "Account Status"
  • Enumeration: Two-state categorization detected

Example: Quantity with Units

Input CSV Data:

weight_kg
65.5
72.0
58.3

AI Analysis Output:
  • Type: XdQuantity (quantified measurement)
  • Units Detected: kilograms (from the column name suffix "_kg")
  • Units: kg (standardized)
  • Range: 58.3 - 72.0
  • Precision: 1 decimal place
  • Required: Likely yes
  • Label: "Weight"
  • Semantic: Physical measurement with units
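A unit hint like the "_kg" suffix can be recovered with a simple lookup before any AI is involved. The mapping below is purely illustrative; the real agents combine AI analysis with your ontologies:

# Illustrative suffix-to-unit lookup; not SDCStudio's actual mechanism
UNIT_SUFFIXES = {"kg": "kilogram", "m": "metre", "celsius": "degree Celsius",
                 "usd": "US dollar", "pct": "percent"}

def split_units(column_name):
    """Split a header like 'weight_kg' into ('weight', 'kilogram')."""
    base, _, suffix = column_name.rpartition("_")
    if suffix in UNIT_SUFFIXES:
        return base, UNIT_SUFFIXES[suffix]
    return column_name, None

print(split_units("weight_kg"))   # ('weight', 'kilogram')
print(split_units("status"))      # ('status', None)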

Factors Affecting AI Quality

✅ Better AI Results:
  • Clear, descriptive column names (customer_email, not col1)
  • Consistent data formats throughout a column
  • Sufficient data samples (10+ rows recommended)
  • Clean, validated data before upload
  • Meaningful value patterns
  • Uploaded ontologies relevant to your domain!
  • Standard vocabulary usage (e.g., ISO codes)

❌ AI Challenges:
  • Ambiguous column names (data1, field_a, col2)
  • Mixed data types within the same column
  • Very sparse data (lots of nulls)
  • Inconsistent date/number formats
  • Too few samples (< 5 rows)
  • Special characters or encoding issues
  • No ontologies uploaded (generic suggestions only)

How Ontologies Improve AI

Without Ontologies:
  • Generic type detection
  • Basic pattern matching
  • Simple validation rules
  • Generic labels

With Ontologies (Uploaded in Settings):
  • Domain-specific type mapping
  • Semantic vocabulary alignment
  • Standard terminology usage
  • Richer validation rules from domain knowledge
  • Better labels and descriptions
  • Ontology property mappings

Example - Healthcare Data with FHIR Ontology:

patient_mrn,diagnosis_code,procedure_date
MRN12345,E11.9,2024-01-15
  • Without ontology: Generic text/date types
  • With FHIR ontology:
     • patient_mrn → FHIR Patient.identifier
     • diagnosis_code → FHIR Condition.code (ICD-10)
     • procedure_date → FHIR Procedure.performedDateTime

Best Practices

Prepare Your Data

Column Names:
  • Descriptive: customer_email, not email1
  • Consistent: first_name, last_name (not firstName, surname)
  • No Special Characters: order_date, not order-date or order.date
  • Lowercase with Underscores: total_price_usd for readability
  • Include Units: weight_kg, distance_m, temp_celsius
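If your headers don't follow these conventions yet, renaming them can be scripted. A small illustrative normalizer (the regex rules simply mirror the list above):

import re

def normalize_column(name):
    """Normalize a header to lowercase_with_underscores (illustrative)."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)   # split camelCase
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)            # dots, dashes, spaces -> _
    return name.strip("_").lower()

for raw in ["firstName", "order-date", "Total Price (USD)"]:
    print(raw, "->", normalize_column(raw))
# firstName -> first_name
# order-date -> order_date
# Total Price (USD) -> total_price_usd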

Data Quality:
  • Remove exact duplicate rows
  • Handle missing values consistently (use empty string or a standard null)
  • Use ISO 8601 for dates: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS
  • Ensure consistent data types per column (no mixing)
  • Validate data integrity before upload
  • Use standard codes (ISO country codes, currency codes, etc.)

Data Volume:
  • Minimum: 10 rows for reliable AI analysis
  • Recommended: 20-100 rows for best type inference
  • Include Edge Cases: Min/max values, boundary conditions
  • First Upload: Under 10,000 rows for faster processing
  • Full Dataset: Can upload more data after the model is validated

File Preparation

CSV Files:

✅ Good CSV Example:
customer_id,first_name,last_name,email,signup_date,status
1001,Jane,Smith,jane@example.com,2024-01-15,active
1002,John,Doe,john@example.com,2024-02-20,active
1003,Maria,Garcia,maria@example.com,2024-03-10,inactive

❌ Bad CSV Example:
Col1,Col2,Col3,Col4,Col5,Col6
1001,Jane,Smith,jane@example.com,1/15/24,1
,John,Doe,john@example.com,2-20-2024,
1003,Maria,Garcia,"maria@example.com",03-10-24,0

Issues in Bad Example:
  • ❌ Generic column names (Col1, Col2)
  • ❌ Inconsistent date formats
  • ❌ Missing values without a clear pattern
  • ❌ Numeric status codes instead of descriptive values

Encoding Best Practices:
  • ✅ Always use UTF-8 encoding
  • ✅ Test with international characters (é, ñ, ü, 中文)
  • ❌ Avoid proprietary encodings (Windows-1252, etc.)
  • ❌ Don't use Excel's default encoding (it can corrupt UTF-8)
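If a legacy export is not UTF-8 and you know its actual encoding, re-encoding is straightforward; for example, for a Windows-1252 file (the file names are placeholders):

# Re-encode a legacy Windows-1252 export to UTF-8
# (the source encoding is an assumption - check with your export tool)
with open("export.csv", encoding="cp1252") as src:
    text = src.read()
with open("export_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)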

CSV Structure Requirements:
  • One header row at the very top
  • Consistent number of columns per row
  • Quote text fields that contain commas: "Smith, Jr."
  • Escape quotes in quoted fields: "She said ""hello"""
  • No empty rows between data rows
  • No extra columns or merged cells
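You rarely need to apply the quoting and escaping rules by hand: any standard CSV writer does it for you. For example, Python's csv module produces correctly quoted output:

import csv

rows = [
    ["customer_id", "name", "note"],
    [1001, 'Smith, Jr.', 'She said "hello"'],   # comma and quotes handled for you
]
with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
# Data row is written as: 1001,"Smith, Jr.","She said ""hello"""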

Markdown Template Best Practices:
  • Use the official templates (Form2SDCTemplate or SDCObsidianTemplate)
  • Follow a consistent header hierarchy (H1, H2, H3)
  • Include clear component definitions
  • Document validation rules explicitly
  • Add examples for clarity
  • Version control your templates with Git

Upload Workflow Tips

  1. Start Small:
     • Upload a sample of your data first (50-100 rows)
     • Validate the generated model
     • Refine your data based on results
     • Upload the full dataset after validation

  2. Use Settings First:
     • Upload relevant ontologies before data upload
     • Configure profile information
     • This improves AI suggestions significantly

  3. Organize by Project:
     • Group related datasets in the same project
     • AI learns from previous uploads in the project
     • Reuse components across datasets

  4. Iterate and Refine:
     • Review AI-generated components
     • Customize validation rules
     • Add business context and documentation
     • The system learns from your corrections

Troubleshooting

Upload Fails

"File too large" Error: - Cause: File exceeds 50 MB limit - Solution: - Reduce rows to under 10,000 - Split into multiple files - Remove unnecessary columns - Contact support@axius-sdc.com for bulk upload options

"Invalid file format" Error: - Cause: File is not CSV or Markdown - Solution: - Verify file extension is .csv or .md - Check file isn't corrupted (open in text editor) - Ensure UTF-8 encoding - Convert JSON/XML/Excel to CSV - Don't upload PDF or DOCX files

"Upload timeout" Error: - Cause: Network or server issue - Solution: - Check internet connection stability - Try smaller file (reduce rows) - Clear browser cache and cookies - Try different browser (Chrome recommended) - Disable browser extensions temporarily - Try again during off-peak hours

Parsing Fails

Status Stuck on PARSING:
  • Wait: Give it 2-3 minutes
  • Refresh: Click browser refresh or check the status again
  • Try Again: Re-upload if still stuck after 5 minutes
  • Check File: Open the CSV in a text editor to verify the format
  • Contact: support@axius-sdc.com if the problem persists

Status Shows ERROR:
  • View Details: Click on the data source name for the error log
  • Common Issues:
     • Malformed CSV: Missing quotes around text with commas
     • Extra Columns: Some rows have more columns than the header
     • Invalid Encoding: Non-UTF-8 characters
     • Empty File: File has headers but no data rows
     • No Headers: CSV missing the header row
  • Fix and Re-upload: Correct the issue in the source file and upload again
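Most of these structural issues can be pinpointed locally before re-uploading. A small diagnostic sketch (not an SDCStudio tool; the file name is a placeholder):

import csv

def diagnose(path):
    """Report the common parsing problems listed above, row by row."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        rows = list(csv.reader(f))
    if not rows:
        print("Empty file: no header row")
        return
    header, data = rows[0], rows[1:]
    if not data:
        print("Headers but no data rows")
    for lineno, row in enumerate(data, start=2):      # line 1 is the header
        if not row:
            print(f"Line {lineno}: empty row")
        elif len(row) != len(header):
            print(f"Line {lineno}: {len(row)} columns, expected {len(header)}")

diagnose("upload.csv")  # hypothetical file name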

Agent Processing Fails

Status Shows AGENT_ERROR:

First Action: Click the "Retry" button
  • Most AI processing issues are transient
  • Retry succeeds 80% of the time
  • No need to re-upload the file

If Retry Fails:

  1. Check the Error Log:
     • Click on the data source name
     • View the processing logs
     • Look for specific error messages

  2. Common Causes:
     • Very ambiguous column names: Rename columns to be more descriptive
     • Extremely mixed data: Clean data to have consistent types
     • Insufficient samples: Add more data rows (minimum 10)
     • LLM service unavailable: Temporary service issue, retry later
     • No ontologies: Upload relevant ontologies in Settings first

  3. Solutions:
     • Improve column names: Use descriptive names
     • Clean your data: Ensure type consistency
     • Add more rows: Include at least 10-20 samples
     • Upload ontologies: Helps the AI understand your domain
     • Simplify first: Try a subset of columns first

Processing Takes Too Long (> 10 minutes):
  • Wait a bit more: Large files (1000+ rows, 50+ columns) can take 10-15 minutes
  • Refresh the page: Check whether the status actually updated
  • Check system status: It may be a high-load period
  • Contact support: If processing exceeds 20 minutes

After Processing

Once your file shows COMPLETED status with green badge:

1. Review the Generated Data Model

  • Navigate: Click on "Data Models" tab in your project
  • Find Your Model: Named after your uploaded file
  • Examine Components: Click through each generated component
  • Check Types: Verify XdString, XdCount, XdTemporal assignments
  • Review Validation: Check min/max, patterns, required fields

2. Customize Components

  • Edit: Click "Edit" on any component
  • Adjust: Modify types, validation rules, descriptions
  • Enhance: Add business context and documentation
  • Relate: Link components with relationships
  • See: Data Modeling Guide for details

3. Add Semantic Enrichment

  • Ontology Links: Map components to ontology concepts
  • Standard Vocabularies: Align with FHIR, NIEM, SNOMED, etc.
  • Documentation: Add rich descriptions and examples
  • See: Semantic Enhancement Guide

4. Publish Your Model

  • When Satisfied: Review all components thoroughly
  • Publish: Makes model immutable and enables generation
  • Generate Outputs: XSD, XML, JSON, RDF, SHACL, GQL, HTML
  • See: Generating Outputs Guide

Template Resources

Official Templates

Form2SDCTemplate:
  • Repository: https://github.com/SemanticDataCharter/Form2SDCTemplate
  • Documentation: See the repository README
  • Use Case: Google Forms → SDC4 data models
  • Best For: Collaborative design, rapid prototyping

SDCObsidianTemplate:
  • Repository: https://github.com/SemanticDataCharter/SDCObsidianTemplate
  • Documentation: See the repository README
  • Use Case: Obsidian notes → SDC4 data models
  • Best For: Knowledge management, complex domains, documentation-driven design

Getting Started with Templates

  1. Clone repository or download ZIP
  2. Review examples included in repository
  3. Customize template for your domain
  4. Export Markdown file
  5. Upload to SDCStudio via Data Sources
  6. AI processes and creates your model

Next Steps

Learn More

Try It Out

  • Upload a CSV: Start with simple tabular data
  • Try a Template: Use Form2SDCTemplate or SDCObsidianTemplate
  • Upload Ontologies First: Configure Settings before uploading data
  • Iterate: Review, customize, publish, generate

Getting Help

  • Troubleshooting Guide - Common issues and solutions
  • Template Repositories: See READMEs for detailed instructions
  • Support Email: support@axius-sdc.com
  • Community: Join discussions and share templates

Ready to upload? Make sure you've uploaded your ontologies in Settings first, then head to your project and click "Upload Data" to get started!