AI Processing Guide

Overview

SDCStudio uses advanced AI technology to analyze your data and automatically create comprehensive data models. This guide explains how the AI works, what it does, and how to get the best results.

The csv2pd System

csv2pd (CSV to ParsedData) is SDCStudio's intelligent data processing engine that transforms raw data files into structured, semantic data models.

Multi-Agent Architecture

The csv2pd system uses multiple specialized AI agents working together:

┌─────────────────────────────────────────┐
│         Dispatcher Agent                │
│   (Coordinates the workflow)            │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│         Profiler Agent                  │
│   (Analyzes data structure)             │
└──────────────┬──────────────────────────┘
               │
               ▼
┌────────────────┬────────────┬───────────┐
│                │            │           │
▼                ▼            ▼           ▼
Quantified       Temporal     String      Boolean
Agent            Agent        Agent       Agent
│                │            │           │
└────────────────┴────────────┴───────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│       Synthesizer Agent                 │
│   (Combines results into model)         │
└─────────────────────────────────────────┘

Agent Responsibilities

1. Dispatcher Agent

Role: Workflow coordinator

Responsibilities:
- Receives uploaded file
- Creates ParsedData record
- Triggers processing pipeline
- Manages agent sequence
- Handles error recovery

2. Profiler Agent

Role: Data analyst

Responsibilities:
- Reads file contents
- Identifies columns and data types
- Performs statistical analysis
- Detects patterns and anomalies
- Creates initial data profile

What It Analyzes:
- Column names and count
- Data types per column
- Value distributions
- Missing data patterns
- Unique value counts
- Min/max values
- Data quality metrics

3. Type-Specific Agents

Each agent handles a specific data type:

Quantified Agent (Numbers with Units):
- Detects measurements: weight, height, distance
- Identifies units: kg, meters, dollars
- Determines precision requirements
- Suggests validation ranges
- Maps to XdQuantity or XdCount

Temporal Agent (Dates and Times):
- Detects date/time patterns
- Identifies formats: ISO 8601, US, European
- Recognizes durations and intervals
- Suggests temporal constraints
- Maps to XdTemporal

String Agent (Text Data):
- Analyzes text patterns
- Detects categorical data
- Identifies enumeration values
- Suggests string constraints
- Maps to XdString with validation

Boolean Agent (True/False):
- Detects boolean patterns
- Recognizes variations: true/false, yes/no, 1/0
- Maps to XdBoolean

4. Synthesizer Agent

Role: Results combiner

Responsibilities:
- Collects results from all type agents
- Creates hierarchical cluster structure
- Assembles complete data model
- Generates component definitions
- Adds metadata and documentation
- Validates final model
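
The sequence is easier to see in code. Below is a minimal, hypothetical sketch of the dispatcher-to-synthesizer flow in Python; the function names and routing rules are illustrative assumptions, not SDCStudio's actual internals.

```python
# Hypothetical sketch of the csv2pd agent sequence. None of these names are
# SDCStudio's real classes; they only illustrate how the agents hand off work.
import csv

def profiler_agent(path: str) -> list[dict]:
    """Profiler Agent stand-in: read the file and build per-column profiles."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = rows[0].keys() if rows else []
    return [{"name": col, "samples": [row[col] for row in rows[:10]]}
            for col in columns]

def route_to_type_agent(column: dict) -> str:
    """Dispatcher-style routing: pick a type agent from the sample values."""
    samples = [s for s in column["samples"] if s]
    if samples and all(s.lower() in {"true", "false", "yes", "no", "1", "0"}
                       for s in samples):
        return "boolean_agent"
    if samples and all(s.lstrip("-").isdigit() for s in samples):
        return "quantified_agent"
    return "string_agent"  # temporal detection omitted for brevity

def synthesizer_agent(assignments: dict) -> dict:
    """Synthesizer Agent stand-in: combine per-column results into one model."""
    return {"components": assignments, "status": "COMPLETED"}

profile = profiler_agent("customers.csv")
model = synthesizer_agent({c["name"]: route_to_type_agent(c) for c in profile})
print(model)
```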

The Agentic Workflow

ADK (Agent Development Kit) Integration

SDCStudio uses Google's ADK for advanced agent capabilities:

ADK Agents:
- WorkflowAgent: Orchestrates multi-step processes
- DataModelAgent: Creates and manages models
- ClusterAgent: Organizes component hierarchy
- ComponentAgents: Specialized for each SDC4 type

ADK Tools:
- ParsedDataAnalysisTool: Analyzes structured data
- ComponentCreationTool: Creates model components
- SemanticLinkingTool: Adds semantic definitions
- ConstraintTool: Defines validation rules
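
For readers familiar with ADK, the wiring follows ADK's published quickstart pattern: a tool is a plain Python function that an agent can call. In the sketch below, the tool body, model choice, and instruction text are placeholder assumptions, not SDCStudio's real configuration.

```python
# Sketch of an ADK agent with one tool, in the ADK quickstart style.
# The tool logic and strings are placeholders, not SDCStudio's actual setup.
from google.adk.agents import Agent

def analyze_parsed_data(dataset_name: str) -> dict:
    """Hypothetical stand-in for ParsedDataAnalysisTool: summarize a dataset."""
    return {"dataset": dataset_name, "columns": 7, "status": "profiled"}

workflow_agent = Agent(
    name="workflow_agent",
    model="gemini-2.0-flash",  # assumed model choice
    description="Orchestrates the csv2pd multi-step modeling process.",
    instruction="Analyze the uploaded dataset and propose SDC4 components.",
    tools=[analyze_parsed_data],
)
```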

Processing Phases

Phase 1: Structural Analysis (Fast)

Duration: 30 seconds - 2 minutes

What Happens:

  1. File Reading:
     Input: customers.csv
     Output: Raw data structure

  2. Column Detection:

     ```
     Columns Found:
     - customer_id
     - first_name
     - last_name
     - email
     - signup_date
     - status
     - total_purchases
     ```

  3. Initial Type Inference:

     customer_id → Integer
     first_name → String
     last_name → String
     email → String (pattern detected)
     signup_date → Date
     status → String (categorical)
     total_purchases → Integer

  4. ParsedData Created:

     ```json
     {
       "dataset_name": "customers",
       "source_type": "csv",
       "columns_data": [
         {
           "name": "customer_id",
           "detected_type": "integer",
           "sample_values": [1001, 1002, 1003]
         },
         ...
       ]
     }
     ```
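
As a rough illustration of what this phase produces, here is a sketch that builds a ParsedData-like dictionary with pandas. The heuristics are simplified assumptions, not the Profiler Agent's actual implementation.

```python
# Sketch of Phase 1-style type inference; simplified, not SDCStudio's code.
import pandas as pd

df = pd.read_csv("customers.csv")

def detect_type(series: pd.Series) -> str:
    """Very rough stand-in for the Profiler's per-column type inference."""
    if pd.api.types.is_integer_dtype(series):
        return "integer"
    if pd.api.types.is_float_dtype(series):
        return "decimal"
    if pd.to_datetime(series, errors="coerce").notna().all():
        return "date"
    return "string"

parsed_data = {
    "dataset_name": "customers",
    "source_type": "csv",
    "columns_data": [
        {
            "name": col,
            "detected_type": detect_type(df[col]),
            "sample_values": df[col].head(3).tolist(),
        }
        for col in df.columns
    ],
}
```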

Phase 2: AI Enhancement (Comprehensive)

Duration: 1-5 minutes

What Happens:

  1. Semantic Analysis:

     customer_id → Unique identifier (primary key)
     email → Contact information (requires validation)
     status → Account state (active/inactive)

  2. Pattern Recognition:

     email pattern: username@domain.tld
     status values: ['active', 'inactive']
     date format: YYYY-MM-DD

  3. Component Creation (per column):

     For "email" column:
     ├── Type: XdString
     ├── Label: "Customer Email Address"
     ├── Max Length: 320
     ├── Pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
     ├── Required: true
     └── Description: "Primary contact email for customer"

  4. Cluster Organization:

     customers_cluster
     ├── customer_id (XdCount)
     ├── first_name (XdString)
     ├── last_name (XdString)
     ├── email (XdString)
     ├── signup_date (XdTemporal)
     ├── status (XdString)
     └── total_purchases (XdCount)

  5. Model Finalization:

     Data Model Created: "customers"
     Components: 7
     Clusters: 1 (root)
     Status: COMPLETED
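
To see how the suggested constraints behave, here is a small sketch that applies the email component's pattern and max-length rules; the dictionary keys are illustrative, not SDCStudio's exact schema.

```python
# Sketch: applying the AI-suggested constraints for the "email" component.
# The component dict mirrors the example above; field names are illustrative.
import re

email_component = {
    "type": "XdString",
    "label": "Customer Email Address",
    "max_length": 320,
    "pattern": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "required": True,
}

def validate(value: str | None, component: dict) -> bool:
    """Check a single value against the component's constraints."""
    if value is None:
        return not component["required"]
    return (len(value) <= component["max_length"]
            and re.fullmatch(component["pattern"], value) is not None)

assert validate("jane@example.com", email_component)
assert not validate("not-an-email", email_component)
```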

Semantic Engines

SDCStudio uses multiple semantic engines to enhance AI understanding:

1. Bio-Ontology Engine

Purpose: Healthcare and biological data

Capabilities:
- SNOMED CT integration
- LOINC code mapping
- Medical term recognition
- Clinical data validation

Example:

Input: "blood_pressure_systolic"
Output: Links to SNOMED: 271649006 (Systolic blood pressure)

2. WikiData Engine

Purpose: General knowledge and entities

Capabilities:
- Entity recognition
- Concept linking
- Relationship mapping
- Multi-language support

Example:

Input: "country"
Output: Links to WikiData: Q6256 (Country)
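
A lookup like this can be reproduced against WikiData's public API. The sketch below uses the wbsearchentities endpoint; it illustrates the idea and is not necessarily how the WikiData Engine queries internally.

```python
# Sketch of a WikiData entity lookup via the public wbsearchentities API.
# Error handling is omitted for brevity.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbsearchentities", "search": "country",
            "language": "en", "format": "json"},
    timeout=10,
)
top_hit = resp.json()["search"][0]
print(top_hit["id"], "-", top_hit["label"])  # e.g. Q6256 - country
```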

3. RDF Generator

Purpose: Semantic web integration

Capabilities:
- RDF triple generation
- Ontology alignment
- SPARQL query support
- Knowledge graph creation
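
As an illustration of triple generation, the sketch below uses the rdflib library; the namespace URI is a placeholder, and SDCStudio's actual RDF output may differ.

```python
# Sketch of RDF triple generation with rdflib; URIs are illustrative only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

SDC = Namespace("https://example.org/sdc/")  # hypothetical namespace
g = Graph()

component = SDC["email"]
g.add((component, RDF.type, SDC.XdString))
g.add((component, RDFS.label, Literal("Customer Email Address")))

print(g.serialize(format="turtle"))
```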

4. Rich Context Processor

Purpose: Contextual understanding

Capabilities:
- Domain-specific knowledge
- Business rule inference
- Cross-field relationships
- Data quality insights

Understanding AI Decisions

Type Selection Logic

The AI chooses component types based on:

XdCount (Integer):

Criteria:
- All values are whole numbers
- No decimal points
- Represents counts or IDs
- Range: -2^31 to 2^31-1

Examples:
- customer_id: 1001, 1002, 1003
- quantity: 5, 10, 15
- age: 25, 30, 45

XdQuantity (Decimal with Units):

Criteria:
- Decimal numbers
- Measurement values
- Units detected (kg, m, $)
- Precision requirements

Examples:
- price: 19.99, 29.99, 49.99
- weight: 1.5, 2.3, 3.7 (kg)
- distance: 5.2, 10.8, 15.3 (km)

XdString (Text):

Criteria:
- Text values
- May have patterns
- May be categorical
- Variable length

Examples:
- name: "John", "Jane", "Maria"
- email: "user@example.com"
- status: "active", "inactive"

XdTemporal (Date/Time):

Criteria:
- Date/time patterns detected
- Standard formats recognized
- Temporal logic applicable

Examples:
- signup_date: 2024-01-15
- timestamp: 2024-10-01T14:30:00Z
- duration: P7D (7 days)

XdBoolean (True/False):

Criteria:
- Binary values only
- Boolean patterns: true/false, yes/no, 1/0, T/F

Examples:
- active: true, false
- verified: yes, no
- enabled: 1, 0
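
Taken together, these criteria amount to a cascade of checks. The sketch below expresses them as one heuristic function; it is a simplification for illustration, not SDCStudio's actual selection logic.

```python
# Sketch of the type-selection criteria above as a single heuristic cascade.
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}")
BOOL_VALUES = {"true", "false", "yes", "no", "1", "0", "t", "f"}

def select_type(values: list[str]) -> str:
    """Pick an SDC4 component type from raw string samples."""
    vals = [v.strip() for v in values if v.strip()]
    if not vals:
        return "XdString"
    if all(v.lower() in BOOL_VALUES for v in vals):
        return "XdBoolean"                      # binary values only
    if all(re.fullmatch(r"-?\d+", v) for v in vals):
        return "XdCount"                        # whole numbers, no decimals
    if all(re.fullmatch(r"-?\d+\.\d+", v) for v in vals):
        return "XdQuantity"                     # decimal measurements
    if all(DATE_RE.match(v) for v in vals):
        return "XdTemporal"                     # date/time patterns
    return "XdString"

print(select_type(["1001", "1002", "1003"]))    # XdCount
print(select_type(["19.99", "29.99"]))          # XdQuantity
print(select_type(["2024-01-15"]))              # XdTemporal
```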

Validation Rule Inference

The AI suggests validation rules based on:

Data Patterns:

Email column:
Pattern detected: username@domain.tld
Suggested rule: Email format validation

Value Ranges:

Age column:
Min: 18, Max: 95
Suggested rule: Range 18-120

Categorical Data:

Status column:
Values: ['active', 'inactive', 'pending']
Suggested rule: Enumeration constraint

Required Fields:

ID column:
No null values found
Suggested rule: Required field
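
These inferences can be pictured as simple checks over a column's values. In the sketch below, the categorical cutoff (10 distinct values) is an assumed threshold for illustration only.

```python
# Sketch of rule inference from a column's values; thresholds are assumptions.
def infer_rules(values: list) -> list[str]:
    rules = []
    non_null = [v for v in values if v is not None]
    if len(non_null) == len(values):
        rules.append("Required field")  # no null values found
    distinct = set(non_null)
    if distinct and len(distinct) <= 10 and all(isinstance(v, str) for v in distinct):
        rules.append(f"Enumeration constraint: {sorted(distinct)}")
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        rules.append(f"Range {min(non_null)}-{max(non_null)}")
    return rules

print(infer_rules(["active", "inactive", "pending"]))
print(infer_rules([18, 25, 95]))
```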

Optimizing AI Results

Improve Input Data Quality

Better Column Names:

✅ Good:
- customer_email
- order_total_amount
- signup_date

❌ Bad:
- col1
- data
- field_x

Clean Data:

✅ Good:
- Consistent formats
- No mixed types
- Proper encoding
- Valid values

❌ Bad:
- Mixed date formats
- Text in number columns
- Special characters
- Inconsistent nulls

Sufficient Samples:

✅ Good:
- 10+ rows of data
- Representative values
- Edge cases included

❌ Bad:
- Only 1-2 rows
- All identical values
- No variation

Use Ontologies

Upload relevant ontologies to improve AI understanding:

  1. Healthcare Data: Upload SNOMED CT, LOINC
  2. Geographic Data: Upload GeoNames ontology
  3. Domain-Specific: Upload industry ontologies

See Semantic Enhancement Guide for details.

Provide Context

Project Industry Selection:
- Helps AI apply domain knowledge
- Improves semantic understanding
- Better validation suggestions

File Naming:

✅ Good: customer_demographics.csv
❌ Bad: data.csv

Column Descriptions (in CSV if supported):

customer_id,first_name,email
# ID,Name,Contact Email
1001,Jane,jane@example.com
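
If your own tooling reads such a file with pandas, the description row can be skipped via the comment parameter, as in this sketch (assuming the file layout shown above):

```python
# Sketch: reading the file above while skipping the '#' description row.
# pandas treats lines starting with the comment character as non-data.
import pandas as pd

df = pd.read_csv("customers.csv", comment="#")
print(df.columns.tolist())  # ['customer_id', 'first_name', 'email']
```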

Monitoring AI Processing

Real-Time Status Updates

WebSocket Integration:
- Live progress updates
- No page refresh needed
- Real-time error notifications
- Task completion alerts

Status Indicators:

🔵 PARSING → Reading file structure
🟡 AGENT_PROCESSING → AI analyzing data
🟢 COMPLETED → Model ready
🔴 AGENT_ERROR → Processing failed (can retry)
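
A client-side listener for these updates might look like the sketch below, using the websockets library; the endpoint URL and message fields are assumptions, since the actual WebSocket API is not documented here.

```python
# Sketch of a status listener; the URL and message shape are assumptions.
import asyncio
import json

import websockets

async def watch_status():
    async with websockets.connect("wss://example.org/ws/processing/") as ws:
        async for message in ws:
            event = json.loads(message)
            print(event.get("status"), event.get("progress"))
            if event.get("status") in ("COMPLETED", "AGENT_ERROR"):
                break  # terminal states: model ready or retry needed

asyncio.run(watch_status())
```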

Progress Tracking

View detailed progress in real-time:

  1. Phase Indication: Current processing phase
  2. Agent Activity: Which agent is working
  3. Completion Percentage: Overall progress
  4. Estimated Time: Time remaining

Error Handling

Automatic Retry:
- Click "Retry" button for AGENT_ERROR status
- System retries with optimized parameters
- Most issues resolve on retry

Error Logs:
- View detailed error information
- Understand what went wrong
- Get suggestions for fixes

Advanced Features

Custom Agent Configuration

Configure AI behavior (admin only):

LLM Model Selection:
- Choose between different AI models
- Balance speed vs. accuracy
- Configure for specific domains

Processing Priorities:
- Accuracy over speed
- Speed over detail
- Balanced approach

RAG (Retrieval-Augmented Generation)

Knowledge Base Integration:
- AI queries knowledge base for context
- Improves domain-specific understanding
- Better suggestions and validation

How It Works:

1. User uploads healthcare data
2. AI queries medical knowledge base
3. Returns healthcare-specific suggestions
4. Creates medically accurate components

Best Practices

Data Preparation

Structure Your Data Well:
- Use clear column names
- Ensure consistent types
- Clean data before upload
- Include representative samples

Provide Context:
- Name files descriptively
- Select appropriate industry
- Add descriptions where possible
- Use standard formats

Review AI Results

Always Review:
- Don't blindly accept AI suggestions
- Verify types are correct
- Check validation rules
- Test with edge cases

Iterate and Improve:
- Start with AI suggestions
- Refine based on requirements
- Test thoroughly
- Republish as needed

Leverage Semantic Enhancement

Use Ontologies:
- Upload relevant ontologies
- Link components to concepts
- Benefit from semantic understanding

Provide Examples:
- Include sample data
- Show edge cases
- Demonstrate patterns

Troubleshooting

AI Processing Fails

Check Data Quality:
- Review file format
- Verify encoding (UTF-8)
- Check for corruption
- Simplify and retry

Retry Processing:
- Click "Retry" button
- System optimizes parameters
- Usually resolves on second attempt

Contact Support:
- If retry fails repeatedly
- Include error log details
- Provide sample data if possible

Unexpected Results

Review Input Data:
- Check column names for clarity
- Verify data consistency
- Look for pattern issues

Customize Components:
- AI provides a starting point
- Refine to your requirements
- Override suggestions as needed

Next Steps

Ready to leverage AI? Upload your data and watch the intelligent processing in action!