AI Processing Guide
Overview
SDCStudio uses advanced AI technology to analyze your data and automatically create comprehensive data models. This guide explains how the AI works, what it does, and how to get the best results.
The csv2pd System
csv2pd (CSV to ParsedData) is SDCStudio's intelligent data processing engine that transforms raw data files into structured, semantic data models.
Multi-Agent Architecture
The csv2pd system uses multiple specialized AI agents working together:
```
┌─────────────────────────────────────────┐
│            Dispatcher Agent             │
│       (Coordinates the workflow)        │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│             Profiler Agent              │
│        (Analyzes data structure)        │
└────────────────────┬────────────────────┘
                     │
                     ▼
     ┌─────────┬─────┴─────┬─────────┐
     ▼         ▼           ▼         ▼
Quantified Temporal     String    Boolean
   Agent     Agent       Agent     Agent
     │         │           │         │
     └─────────┴─────┬─────┴─────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│            Synthesizer Agent            │
│      (Combines results into model)      │
└─────────────────────────────────────────┘
```
Agent Responsibilities
1. Dispatcher Agent
Role: Workflow coordinator
Responsibilities:
- Receives uploaded file
- Creates ParsedData record
- Triggers processing pipeline
- Manages agent sequence
- Handles error recovery
2. Profiler Agent
Role: Data analyst
Responsibilities:
- Reads file contents
- Identifies columns and data types
- Performs statistical analysis
- Detects patterns and anomalies
- Creates initial data profile

What It Analyzes:
- Column names and count
- Data types per column
- Value distributions
- Missing data patterns
- Unique value counts
- Min/max values
- Data quality metrics
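For intuition, here is a minimal sketch of this kind of column profiling using pandas. The function name and output shape are illustrative, not SDCStudio's internal API:

```python
import pandas as pd

def profile_columns(path: str) -> list[dict]:
    """Build a basic per-column profile: type, nulls, uniqueness, range."""
    df = pd.read_csv(path)
    profile = []
    for name in df.columns:
        col = df[name]
        profile.append({
            "name": name,
            "dtype": str(col.dtype),              # pandas-inferred type
            "null_count": int(col.isna().sum()),  # missing data pattern
            "unique_count": int(col.nunique()),   # categorical candidates
            "min": col.min() if pd.api.types.is_numeric_dtype(col) else None,
            "max": col.max() if pd.api.types.is_numeric_dtype(col) else None,
        })
    return profile
```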
3. Type-Specific Agents
Each agent handles a specific data type:
Quantified Agent (Numbers with Units):
- Detects measurements: weight, height, distance
- Identifies units: kg, meters, dollars
- Determines precision requirements
- Suggests validation ranges
- Maps to XdQuantity or XdCount

Temporal Agent (Dates and Times):
- Detects date/time patterns
- Identifies formats: ISO 8601, US, European
- Recognizes durations and intervals
- Suggests temporal constraints
- Maps to XdTemporal

String Agent (Text Data):
- Analyzes text patterns
- Detects categorical data
- Identifies enumeration values
- Suggests string constraints
- Maps to XdString with validation

Boolean Agent (True/False):
- Detects boolean patterns
- Recognizes variations: true/false, yes/no, 1/0
- Maps to XdBoolean
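As an illustration of what these agents look for, here is a minimal regex-based sketch of the Quantified Agent's unit detection. The real agent's analysis is more sophisticated; all names here are illustrative:

```python
import re

# Matches a number with an optional currency prefix or unit suffix,
# e.g. "1.5 kg", "$19.99", "42"
QUANTITY_RE = re.compile(
    r"^\s*(?P<currency>[$€£])?\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z%]+)?\s*$"
)

def parse_quantity(raw: str):
    m = QUANTITY_RE.match(raw)
    if not m:
        return None
    unit = m.group("currency") or m.group("unit")  # "$", "kg", "km", ...
    # Decimals or units suggest XdQuantity; bare integers suggest XdCount
    xd_type = "XdQuantity" if unit or "." in m.group("value") else "XdCount"
    return {"value": float(m.group("value")), "unit": unit, "type": xd_type}

print(parse_quantity("1.5 kg"))  # {'value': 1.5, 'unit': 'kg', 'type': 'XdQuantity'}
print(parse_quantity("42"))      # {'value': 42.0, 'unit': None, 'type': 'XdCount'}
```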
4. Synthesizer Agent
Role: Results combiner
Responsibilities:
- Collects results from all type agents
- Creates hierarchical cluster structure
- Assembles complete data model
- Generates component definitions
- Adds metadata and documentation
- Validates final model
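A toy version of this assembly step might look like the following; the output structure is illustrative, not the actual SDC4 serialization:

```python
def synthesize_model(dataset_name: str, column_results: list[dict]) -> dict:
    """Combine per-agent column results into one hierarchical model."""
    return {
        "model_name": dataset_name,
        "root_cluster": {
            "name": f"{dataset_name}_cluster",
            "components": [
                {
                    "name": col["name"],
                    "type": col["type"],  # e.g. "XdString", "XdCount"
                    "constraints": col.get("constraints", {}),
                }
                for col in column_results
            ],
        },
        "status": "COMPLETED",
    }
```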
The Agentic Workflow
ADK (Agent Development Kit) Integration
SDCStudio uses Google's ADK for advanced agent capabilities:
ADK Agents:
- WorkflowAgent: Orchestrates multi-step processes
- DataModelAgent: Creates and manages models
- ClusterAgent: Organizes component hierarchy
- ComponentAgents: Specialized for each SDC4 type

ADK Tools:
- ParsedDataAnalysisTool: Analyzes structured data
- ComponentCreationTool: Creates model components
- SemanticLinkingTool: Adds semantic definitions
- ConstraintTool: Defines validation rules
Processing Phases
Phase 1: Structural Analysis (Fast)
Duration: 30 seconds to 2 minutes

What Happens:

1. File Reading:

   Input: customers.csv
   Output: Raw data structure

2. Column Detection:

   ```
   Columns Found:
   - customer_id
   - first_name
   - last_name
   - email
   - signup_date
   - status
   - total_purchases
   ```

3. Initial Type Inference:

   ```
   customer_id     → Integer
   first_name      → String
   last_name       → String
   email           → String (pattern detected)
   signup_date     → Date
   status          → String (categorical)
   total_purchases → Integer
   ```

4. ParsedData Created:

   ```json
   {
     "dataset_name": "customers",
     "source_type": "csv",
     "columns_data": [
       {
         "name": "customer_id",
         "detected_type": "integer",
         "sample_values": [1001, 1002, 1003]
       },
       ...
     ]
   }
   ```
Phase 2: AI Enhancement (Comprehensive)
Duration: 1 to 5 minutes

What Happens:

1. Semantic Analysis:

   ```
   customer_id → Unique identifier (primary key)
   email       → Contact information (requires validation)
   status      → Account state (active/inactive)
   ```

2. Pattern Recognition:

   ```
   email pattern: username@domain.tld
   status values: ['active', 'inactive']
   date format:   YYYY-MM-DD
   ```

3. Component Creation (per column; see the validation sketch after this list):

   ```
   For "email" column:
   ├── Type: XdString
   ├── Label: "Customer Email Address"
   ├── Max Length: 320
   ├── Pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
   ├── Required: true
   └── Description: "Primary contact email for customer"
   ```

4. Cluster Organization:

   ```
   customers_cluster
   ├── customer_id (XdCount)
   ├── first_name (XdString)
   ├── last_name (XdString)
   ├── email (XdString)
   ├── signup_date (XdTemporal)
   ├── status (XdString)
   └── total_purchases (XdCount)
   ```

5. Model Finalization:

   ```
   Data Model Created: "customers"
   Components: 7
   Clusters: 1 (root)
   Status: COMPLETED
   ```
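The generated email pattern above can be exercised directly; a quick sketch in Python:

```python
import re

# The pattern generated for the "email" component above
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

for value in ["jane@example.com", "not-an-email", "a@b.co"]:
    print(value, "->", bool(EMAIL_RE.match(value)))
# jane@example.com -> True
# not-an-email -> False
# a@b.co -> True
```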
Semantic Engines
SDCStudio uses multiple semantic engines to enhance AI understanding:
1. Bio-Ontology Engine
Purpose: Healthcare and biological data
Capabilities:
- SNOMED CT integration
- LOINC code mapping
- Medical term recognition
- Clinical data validation

Example:

```
Input: "blood_pressure_systolic"
Output: Links to SNOMED: 271649006 (Systolic blood pressure)
```
2. WikiData Engine
Purpose: General knowledge and entities
Capabilities:
- Entity recognition
- Concept linking
- Relationship mapping
- Multi-language support

Example:

```
Input: "country"
Output: Links to WikiData: Q6256 (Country)
```
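Conceptually, both engines resolve a term to a code in an external vocabulary. Here is a minimal stand-in using a static lookup table; the real engines query live terminology services:

```python
# Illustrative lookup table; real engines query SNOMED CT / WikiData services
CONCEPT_MAP = {
    "blood_pressure_systolic": ("SNOMED CT", "271649006", "Systolic blood pressure"),
    "country": ("WikiData", "Q6256", "Country"),
}

def link_concept(column_name: str):
    entry = CONCEPT_MAP.get(column_name.lower())
    if entry is None:
        return None
    source, code, label = entry
    return {"source": source, "code": code, "label": label}

print(link_concept("country"))
# {'source': 'WikiData', 'code': 'Q6256', 'label': 'Country'}
```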
3. RDF Generator
Purpose: Semantic web integration
Capabilities:
- RDF triple generation
- Ontology alignment
- SPARQL query support
- Knowledge graph creation
4. Rich Context Processor
Purpose: Contextual understanding
Capabilities:
- Domain-specific knowledge
- Business rule inference
- Cross-field relationships
- Data quality insights
Understanding AI Decisions
Type Selection Logic
The AI chooses component types based on:
XdCount (Integer):
Criteria:
- All values are whole numbers
- No decimal points
- Represents counts or IDs
- Range: -2^31 to 2^31-1
Examples:
- customer_id: 1001, 1002, 1003
- quantity: 5, 10, 15
- age: 25, 30, 45
XdQuantity (Decimal with Units):
Criteria:
- Decimal numbers
- Measurement values
- Units detected (kg, m, $)
- Precision requirements
Examples:
- price: 19.99, 29.99, 49.99
- weight: 1.5, 2.3, 3.7 (kg)
- distance: 5.2, 10.8, 15.3 (km)
XdString (Text):
Criteria:
- Text values
- May have patterns
- May be categorical
- Variable length
Examples:
- name: "John", "Jane", "Maria"
- email: "user@example.com"
- status: "active", "inactive"
XdTemporal (Date/Time):
Criteria:
- Date/time patterns detected
- Standard formats recognized
- Temporal logic applicable
Examples:
- signup_date: 2024-01-15
- timestamp: 2024-10-01T14:30:00Z
- duration: P7D (7 days)
XdBoolean (True/False):
Criteria:
- Binary values only
- Boolean patterns: true/false, yes/no, 1/0, T/F
Examples:
- active: true, false
- verified: yes, no
- enabled: 1, 0
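The criteria above can be read as a decision procedure. A simplified sketch of that logic (pure heuristics; the actual agents combine such checks with LLM analysis):

```python
from datetime import datetime

BOOL_TOKENS = {"true", "false", "yes", "no", "1", "0", "t", "f"}

def select_type(values: list[str]) -> str:
    lowered = [v.strip().lower() for v in values]
    # Binary values in a known boolean vocabulary -> XdBoolean
    if set(lowered) <= BOOL_TOKENS and len(set(lowered)) <= 2:
        return "XdBoolean"
    # ISO 8601 date/time patterns -> XdTemporal
    try:
        for v in values:
            datetime.fromisoformat(v)
        return "XdTemporal"
    except ValueError:
        pass
    # Numeric: whole numbers -> XdCount, decimals -> XdQuantity
    try:
        nums = [float(v) for v in values]
        return "XdCount" if all(n.is_integer() for n in nums) else "XdQuantity"
    except ValueError:
        return "XdString"  # fall back to text

print(select_type(["1001", "1002"]))        # XdCount
print(select_type(["19.99", "29.99"]))      # XdQuantity
print(select_type(["2024-01-15"]))          # XdTemporal
print(select_type(["active", "inactive"]))  # XdString
```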
Validation Rule Inference
The AI suggests validation rules based on:
Data Patterns:

```
Email column:
  Pattern detected: username@domain.tld
  Suggested rule: Email format validation
```

Value Ranges:

```
Age column:
  Min: 18, Max: 95
  Suggested rule: Range 18-120
```

Categorical Data:

```
Status column:
  Values: ['active', 'inactive', 'pending']
  Suggested rule: Enumeration constraint
```

Required Fields:

```
ID column:
  No null values found
  Suggested rule: Required field
```
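A sketch of how such rules could be derived from observed values (thresholds and wording are illustrative):

```python
def suggest_rules(values: list, max_enum_size: int = 10) -> list[str]:
    """Derive candidate validation rules from observed column values."""
    rules = []
    non_null = [v for v in values if v is not None]
    if len(non_null) == len(values):
        rules.append("required: no nulls observed")
    all_numeric = bool(non_null) and all(isinstance(v, (int, float)) for v in non_null)
    distinct = set(non_null)
    if not all_numeric and len(distinct) <= max_enum_size:
        rules.append(f"enumeration: {sorted(distinct)}")
    if all_numeric:
        # Observed range is a floor; widen it before enforcing
        rules.append(f"range: {min(non_null)}..{max(non_null)}")
    return rules

print(suggest_rules(["active", "inactive", "pending"]))
print(suggest_rules([18, 25, 95]))
```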
Optimizing AI Results
Improve Input Data Quality
Better Column Names:
✅ Good:
- customer_email
- order_total_amount
- signup_date
❌ Bad:
- col1
- data
- field_x
Clean Data:
✅ Good:
- Consistent formats
- No mixed types
- Proper encoding
- Valid values
❌ Bad:
- Mixed date formats
- Text in number columns
- Special characters
- Inconsistent nulls
Sufficient Samples:
✅ Good:
- 10+ rows of data
- Representative values
- Edge cases included
❌ Bad:
- Only 1-2 rows
- All identical values
- No variation
Use Ontologies
Upload relevant ontologies to improve AI understanding:
- Healthcare Data: Upload SNOMED CT, LOINC
- Geographic Data: Upload GeoNames ontology
- Domain-Specific: Upload industry ontologies
See Semantic Enhancement Guide for details.
Provide Context
Project Industry Selection:
- Helps AI apply domain knowledge
- Improves semantic understanding
- Better validation suggestions
File Naming:
✅ Good: customer_demographics.csv
❌ Bad: data.csv
Column Descriptions (in CSV, if supported):

```
customer_id,first_name,email
# ID,Name,Contact Email
1001,Jane,jane@example.com
```
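If you include a `#`-prefixed description row like this, make sure your tooling skips it; with pandas, for example, the `comment` parameter does exactly that:

```python
import pandas as pd

# Lines starting with '#' are treated as comments and skipped
df = pd.read_csv("customers.csv", comment="#")
print(df.columns.tolist())  # ['customer_id', 'first_name', 'email']
```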
Monitoring AI Processing
Real-Time Status Updates
WebSocket Integration:
- Live progress updates
- No page refresh needed
- Real-time error notifications
- Task completion alerts
Status Indicators:
🔵 PARSING → Reading file structure
🟡 AGENT_PROCESSING → AI analyzing data
🟢 COMPLETED → Model ready
🔴 AGENT_ERROR → Processing failed (can retry)
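A client can subscribe to these updates over the WebSocket connection. The sketch below uses the Python websockets library; the endpoint URL and message fields are placeholders, not SDCStudio's documented API:

```python
# Hypothetical client sketch; endpoint and message schema are placeholders.
import asyncio
import json
import websockets

async def watch_status():
    async with websockets.connect("wss://your-sdcstudio-host/ws/status") as ws:
        async for raw in ws:
            msg = json.loads(raw)
            print(msg.get("status"), msg.get("progress"))
            if msg.get("status") in ("COMPLETED", "AGENT_ERROR"):
                break  # terminal states: model ready, or retry needed

asyncio.run(watch_status())
```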
Progress Tracking
View detailed progress in real-time:
- Phase Indication: Current processing phase
- Agent Activity: Which agent is working
- Completion Percentage: Overall progress
- Estimated Time: Time remaining
Error Handling
Automatic Retry:
- Click "Retry" button for AGENT_ERROR status
- System retries with optimized parameters
- Most issues resolve on retry
Error Logs:
- View detailed error information
- Understand what went wrong
- Get suggestions for fixes
Advanced Features
Custom Agent Configuration
Configure AI behavior (admin only):
LLM Model Selection:
- Choose between different AI models
- Balance speed vs. accuracy
- Configure for specific domains

Processing Priorities:
- Accuracy over speed
- Speed over detail
- Balanced approach
RAG (Retrieval-Augmented Generation)
Knowledge Base Integration:
- AI queries knowledge base for context
- Improves domain-specific understanding
- Better suggestions and validation
How It Works:
1. User uploads healthcare data
2. AI queries medical knowledge base
3. Returns healthcare-specific suggestions
4. Creates medically accurate components
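In outline, the retrieval step looks like the following sketch, with an in-memory dictionary standing in for the knowledge base (SDCStudio's actual retrieval backend and LLM interface are not shown):

```python
# Illustrative RAG sketch; the knowledge base here is a plain dict stand-in.
KNOWLEDGE_BASE = {
    "blood_pressure_systolic":
        "Systolic pressure, measured in mmHg, typical adult range 90-180.",
}

def retrieve(column_name: str) -> str:
    """Fetch domain context for the column (here, a dict lookup)."""
    return KNOWLEDGE_BASE.get(column_name, "")

def build_prompt(column_name: str) -> str:
    """Ground the generation request in the retrieved context."""
    context = retrieve(column_name)
    return (f"Define an SDC4 component for column {column_name!r}.\n"
            f"Domain context: {context}")

print(build_prompt("blood_pressure_systolic"))
```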
Best Practices
Data Preparation
Structure Your Data Well:
- Use clear column names
- Ensure consistent types
- Clean data before upload
- Include representative samples

Provide Context:
- Name files descriptively
- Select appropriate industry
- Add descriptions where possible
- Use standard formats
Review AI Results
Always Review:
- Don't blindly accept AI suggestions
- Verify types are correct
- Check validation rules
- Test with edge cases

Iterate and Improve:
- Start with AI suggestions
- Refine based on requirements
- Test thoroughly
- Republish as needed
Leverage Semantic Enhancement
Use Ontologies:
- Upload relevant ontologies
- Link components to concepts
- Benefit from semantic understanding

Provide Examples:
- Include sample data
- Show edge cases
- Demonstrate patterns
Troubleshooting
AI Processing Fails
Check Data Quality:
- Review file format
- Verify encoding (UTF-8)
- Check for corruption
- Simplify and retry

Retry Processing:
- Click the "Retry" button
- System optimizes parameters
- Usually resolves on second attempt

Contact Support:
- If retry fails repeatedly
- Include error log details
- Provide sample data if possible
Unexpected Results
Review Input Data:
- Check column name clarity
- Verify data consistency
- Look for pattern issues

Customize Components:
- AI provides a starting point
- Refine to your requirements
- Override suggestions as needed
Next Steps
- Data Modeling Guide - Customize AI-generated models
- Semantic Enhancement - Improve AI with ontologies
- Uploading Data - Optimize your data uploads
Getting Help
- Troubleshooting Guide - Common issues
- Developer Documentation - Technical details
- Support: support@axius-sdc.com
Ready to leverage AI? Upload your data and watch the intelligent processing in action!