AI Processing Guide

Overview

SDCStudio uses advanced AI technology to analyze your data and automatically create comprehensive data models. This guide explains how the AI works, what it does, and how to get the best results.

The csv2pd System

csv2pd (CSV to ParsedData) is SDCStudio's intelligent data processing engine that transforms raw data files into structured, semantic data models.

Multi-Agent Architecture

The csv2pd system uses multiple specialized AI agents working together:

┌─────────────────────────────────────────┐
│         Dispatcher Agent                │
│   (Coordinates the workflow)            │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│         Profiler Agent                  │
│   (Analyzes data structure)             │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────┬─────────────┬─────────────┐
│             │             │             │
▼             ▼             ▼             ▼
Quantified    Temporal      String        Boolean
Agent         Agent         Agent         Agent
│             │             │             │
└─────────────┴─────────────┴─────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│       Synthesizer Agent                 │
│   (Combines results into model)         │
└─────────────────────────────────────────┘
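The flow above can be sketched as a small sequential pipeline. This is an illustrative sketch only, using hypothetical agent callables; the real csv2pd internals are not shown here:

```python
# Minimal sketch of the csv2pd fan-out, mirroring the diagram above.
# All names here are hypothetical stand-ins, not SDCStudio code.

def profiler(raw_rows):
    """Group column values by name (stand-in for the Profiler Agent)."""
    profile = {}
    for row in raw_rows:
        for col, value in row.items():
            profile.setdefault(col, []).append(value)
    return profile

def dispatch(profile, type_agents):
    """Fan each column out to the first type agent that claims it (Dispatcher role)."""
    results = {}
    for col, values in profile.items():
        for agent in type_agents:
            claimed = agent(col, values)
            if claimed is not None:
                results[col] = claimed
                break
    return results

def boolean_agent(col, values):
    truthy = {"true", "false", "yes", "no", "1", "0"}
    if all(str(v).strip().lower() in truthy for v in values):
        return "XdBoolean"
    return None

def string_agent(col, values):
    return "XdString"  # fallback agent: claims anything

def synthesize(results):
    """Combine per-column results into a flat cluster structure (Synthesizer role)."""
    return {"cluster": [{"name": c, "type": t} for c, t in sorted(results.items())]}

rows = [{"active": "yes", "name": "Jane"}, {"active": "no", "name": "John"}]
model = synthesize(dispatch(profiler(rows), [boolean_agent, string_agent]))
```

The key design point the diagram expresses is that type agents are independent: each only sees one column's values, and the Synthesizer alone sees the whole model.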

Agent Responsibilities

1. Dispatcher Agent

Role: Workflow coordinator

Responsibilities:
- Receives uploaded file
- Creates ParsedData record
- Triggers processing pipeline
- Manages agent sequence
- Handles error recovery

2. Profiler Agent

Role: Data analyst

Responsibilities:
- Reads file contents
- Identifies columns and data types
- Performs statistical analysis
- Detects patterns and anomalies
- Creates initial data profile

What It Analyzes:
- Column names and count
- Data types per column
- Value distributions
- Missing data patterns
- Unique value counts
- Min/max values
- Data quality metrics

3. Type-Specific Agents

Each agent handles a specific data type:

Quantified Agent (Numbers with Units):
- Detects measurements: weight, height, distance
- Identifies units: kg, meters, dollars
- Determines precision requirements
- Suggests validation ranges
- Maps to XdQuantity or XdCount

Temporal Agent (Dates and Times):
- Detects date/time patterns
- Identifies formats: ISO 8601, US, European
- Recognizes durations and intervals
- Suggests temporal constraints
- Maps to XdTemporal

String Agent (Text Data):
- Analyzes text patterns
- Detects categorical data
- Identifies enumeration values
- Suggests string constraints
- Maps to XdString with validation

Boolean Agent (True/False):
- Detects boolean patterns
- Recognizes variations: true/false, yes/no, 1/0
- Maps to XdBoolean
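The boolean variants above can be normalized with a small lookup table. This is an illustrative sketch, not SDCStudio code; the assumption is that matching is case-insensitive and whitespace is ignored:

```python
# Illustrative normalizer for the boolean spellings listed above.
_BOOL_MAP = {
    "true": True, "false": False,
    "yes": True, "no": False,
    "1": True, "0": False,
    "t": True, "f": False,
}

def parse_boolean(value):
    """Return True/False for a recognized spelling, or None if not boolean."""
    return _BOOL_MAP.get(str(value).strip().lower())

def column_is_boolean(values):
    """A column maps to XdBoolean only if every value parses as boolean."""
    return all(parse_boolean(v) is not None for v in values)
```

A single unrecognized value is enough to reject the column, which is why mixed columns fall through to the String Agent instead.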

4. Synthesizer Agent

Role: Results combiner

Responsibilities:
- Collects results from all type agents
- Creates flat cluster structure
- Assembles complete data model
- Generates component definitions
- Adds metadata and documentation
- Validates final model

The Agentic Workflow

ADK (Agent Development Kit) Integration

SDCStudio uses Google's ADK for advanced agent capabilities:

ADK Agents:
- WorkflowAgent: Orchestrates multi-step processes
- DataModelAgent: Creates and manages models
- ClusterAgent: Organizes components into clusters
- ComponentAgents: Specialized for each SDC4 type

ADK Tools:
- ParsedDataAnalysisTool: Analyzes structured data
- ComponentCreationTool: Creates model components
- SemanticLinkingTool: Adds semantic definitions
- ConstraintTool: Defines validation rules

Processing Phases

Phase 1: Structural Analysis (Fast)

Duration: 30 seconds - 2 minutes

What Happens:

  1. File Reading:
     Input: customers.csv
     Output: Raw data structure

  2. Column Detection:
     Columns Found:
     - customer_id
     - first_name
     - last_name
     - email
     - signup_date
     - status
     - total_purchases

  3. Initial Type Inference:
     customer_id → Integer
     first_name → String
     last_name → String
     email → String (pattern detected)
     signup_date → Date
     status → String (categorical)
     total_purchases → Integer

  4. ParsedData Created:
     {
       "dataset_name": "customers",
       "source_type": "csv",
       "columns_data": [
         {
           "name": "customer_id",
           "detected_type": "integer",
           "sample_values": [1001, 1002, 1003]
         },
         ...
       ]
     }
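The Phase 1 record can be approximated with the standard library's csv module. The field names below mirror the example record in this guide; the real ParsedData schema may differ:

```python
import csv
import io

def build_parsed_data(name, csv_text, sample_size=3):
    """Approximate the Phase 1 ParsedData record from raw CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {field: [] for field in reader.fieldnames}
    for row in reader:
        for field, value in row.items():
            columns[field].append(value)

    def detect(values):
        # Naive inference: "integer" only if every value parses as int.
        try:
            [int(v) for v in values]
            return "integer"
        except ValueError:
            return "string"

    columns_data = []
    for field, values in columns.items():
        detected = detect(values)
        samples = values[:sample_size]
        if detected == "integer":
            samples = [int(v) for v in samples]
        columns_data.append(
            {"name": field, "detected_type": detected, "sample_values": samples}
        )
    return {"dataset_name": name, "source_type": "csv", "columns_data": columns_data}

text = "customer_id,first_name\n1001,Jane\n1002,John\n1003,Ana\n"
parsed = build_parsed_data("customers", text)
```

This is why Phase 1 is fast: it only reads values and applies cheap structural checks, leaving semantic work to the AI agents in Phase 2.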

Phase 2: AI Enhancement (Comprehensive)

Duration: 1-5 minutes

What Happens:

  1. Semantic Analysis:
     customer_id → Unique identifier (primary key)
     email → Contact information (requires validation)
     status → Account state (active/inactive)

  2. Pattern Recognition:
     email pattern: username@domain.tld
     status values: ['active', 'inactive']
     date format: YYYY-MM-DD

  3. Component Creation (per column):
     For "email" column:
     ├── Type: XdString
     ├── Label: "Customer Email Address"
     ├── Max Length: 320
     ├── Pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
     ├── Required: true
     └── Description: "Primary contact email for customer"

  4. Cluster Organization:
     customers_cluster
     ├── customer_id (XdCount)
     ├── first_name (XdString)
     ├── last_name (XdString)
     ├── email (XdString)
     ├── signup_date (XdTemporal)
     ├── status (XdString)
     └── total_purchases (XdCount)

  5. Model Finalization:
     Data Model Created: "customers"
     Components: 7
     Clusters: 1 (root)
     Status: COMPLETED

Semantic Engines

SDCStudio uses multiple semantic engines to enhance AI understanding:

1. Bio-Ontology Engine

Purpose: Healthcare and biological data

Capabilities:
- SNOMED CT integration
- LOINC code mapping
- Medical term recognition
- Clinical data validation

Example:

Input: "blood_pressure_systolic"
Output: Links to SNOMED: 271649006 (Systolic blood pressure)

2. WikiData Engine

Purpose: General knowledge and entities

Capabilities:
- Entity recognition
- Concept linking
- Relationship mapping
- Multi-language support

Example:

Input: "country"
Output: Links to WikiData: Q6256 (Country)

3. RDF Generator

Purpose: Semantic web integration

Capabilities:
- RDF triple generation
- Ontology alignment
- SPARQL query support
- Knowledge graph creation

4. Rich Context Processor

Purpose: Contextual understanding

Capabilities:
- Domain-specific knowledge
- Business rule inference
- Cross-field relationships
- Data quality insights

Understanding AI Decisions

Type Selection Logic

The AI chooses component types based on:

XdCount (Integer):

Criteria:
- All values are whole numbers
- No decimal points
- Represents counts or IDs
- Range: -2^31 to 2^31-1

Examples:
- customer_id: 1001, 1002, 1003
- quantity: 5, 10, 15
- age: 25, 30, 45

XdQuantity (Decimal with Units):

Criteria:
- Decimal numbers
- Measurement values
- Units detected (kg, m, $)
- Precision requirements

Examples:
- price: 19.99, 29.99, 49.99
- weight: 1.5, 2.3, 3.7 (kg)
- distance: 5.2, 10.8, 15.3 (km)

XdString (Text):

Criteria:
- Text values
- May have patterns
- May be categorical
- Variable length

Examples:
- name: "John", "Jane", "Maria"
- email: "user@example.com"
- status: "active", "inactive"

XdTemporal (Date/Time):

Criteria:
- Date/time patterns detected
- Standard formats recognized
- Temporal logic applicable

Examples:
- signup_date: 2024-01-15
- timestamp: 2024-10-01T14:30:00Z
- duration: P7D (7 days)

XdBoolean (True/False):

Criteria:
- Binary values only
- Boolean patterns: true/false, yes/no, 1/0, T/F

Examples:
- active: true, false
- verified: yes, no
- enabled: 1, 0
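The selection criteria above can be combined into a single heuristic. This sketch is a deliberate simplification (the real csv2pd agents also weigh units, precision, and semantic context), and the boolean/date checks are illustrative assumptions:

```python
import re

# Illustrative subset of the criteria listed above; not SDCStudio's actual logic.
BOOL_VALUES = {"true", "false", "yes", "no", "1", "0", "t", "f"}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}([T ].*)?$")

def choose_type(values):
    """Pick an SDC4 component type for a column of string samples."""
    vals = [str(v).strip() for v in values if str(v).strip()]
    if vals and all(v.lower() in BOOL_VALUES for v in vals) \
            and len({v.lower() for v in vals}) <= 2:
        return "XdBoolean"       # binary values only
    if vals and all(ISO_DATE.match(v) for v in vals):
        return "XdTemporal"      # date/time patterns detected
    try:
        [int(v) for v in vals]
        return "XdCount"         # whole numbers: counts or IDs
    except ValueError:
        pass
    try:
        [float(v) for v in vals]
        return "XdQuantity"      # decimals: measurement values
    except ValueError:
        pass
    return "XdString"            # fallback: free text or categorical
```

Order matters here: boolean and temporal checks run before the numeric ones so that `1/0` flags and ISO dates are not misread as counts.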

Validation Rule Inference

The AI suggests validation rules based on:

Data Patterns:

Email column:
Pattern detected: username@domain.tld
Suggested rule: Email format validation

Value Ranges:

Age column:
Min: 18, Max: 95
Suggested rule: Range 18-120

Categorical Data:

Status column:
Values: ['active', 'inactive', 'pending']
Suggested rule: Enumeration constraint

Required Fields:

ID column:
No null values found
Suggested rule: Required field
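The four inference patterns above can be sketched together. The thresholds here (e.g. the 20-value enumeration cutoff) are illustrative assumptions, not SDCStudio's actual limits:

```python
def suggest_rules(values, max_enum=20):
    """Suggest validation rules for a column, mirroring the patterns above."""
    rules = []
    present = [v for v in values if v not in (None, "", "NULL")]

    # Required field: no null-like values found in the sample.
    if len(present) == len(values):
        rules.append({"rule": "required"})

    # Value range: numeric columns get a min/max suggestion.
    try:
        nums = [float(v) for v in present]
        rules.append({"rule": "range", "min": min(nums), "max": max(nums)})
    except (ValueError, TypeError):
        # Enumeration: few distinct text values suggests a categorical field.
        distinct = sorted(set(present))
        if 0 < len(distinct) <= max_enum:
            rules.append({"rule": "enumeration", "values": distinct})
    return rules
```

Note that the range suggestion is only a starting point: as in the age example above (observed 18-95, suggested 18-120), a human should widen observed ranges to plausible domain limits.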

Optimizing AI Results

Improve Input Data Quality

Better Column Names:

✅ Good:
- customer_email
- order_total_amount
- signup_date

❌ Bad:
- col1
- data
- field_x

Clean Data:

✅ Good:
- Consistent formats
- No mixed types
- Proper encoding
- Valid values

❌ Bad:
- Mixed date formats
- Text in number columns
- Special characters
- Inconsistent nulls

Sufficient Samples:

✅ Good:
- 10+ rows of data
- Representative values
- Edge cases included

❌ Bad:
- Only 1-2 rows
- All identical values
- No variation

Use Ontologies

Upload relevant ontologies to improve AI understanding:

  1. Healthcare Data: Upload SNOMED CT, LOINC
  2. Geographic Data: Upload GeoNames ontology
  3. Domain-Specific: Upload industry ontologies

See Semantic Enhancement Guide for details.

Provide Context

Project Industry Selection:
- Helps AI apply domain knowledge
- Improves semantic understanding
- Better validation suggestions

File Naming:

✅ Good: customer_demographics.csv
❌ Bad: data.csv

Column Descriptions (in CSV if supported):

customer_id,first_name,email
# ID,Name,Contact Email
1001,Jane,jane@example.com
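If you embed a description row like this, it must be stripped before parsing. Here is a minimal sketch using the standard csv module; note the `#` comment convention is this guide's example, not part of the CSV format itself:

```python
import csv
import io

def read_csv_skipping_comments(csv_text, comment_prefix="#"):
    """Parse CSV text, dropping rows whose first cell starts with the prefix."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    return [
        dict(zip(header, row))
        for row in rows
        if row and not row[0].startswith(comment_prefix)
    ]

text = (
    "customer_id,first_name,email\n"
    "# ID,Name,Contact Email\n"
    "1001,Jane,jane@example.com\n"
)
records = read_csv_skipping_comments(text)
```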

Monitoring AI Processing

Real-Time Status Updates

WebSocket Integration:
- Live progress updates
- No page refresh needed
- Real-time error notifications
- Task completion alerts

Status Indicators:

🔵 PARSING → Reading file structure
🟡 AGENT_PROCESSING → AI analyzing data
🟢 COMPLETED → Model ready
🔴 AGENT_ERROR → Processing failed (can retry)

Progress Tracking

View detailed progress in real-time:

  1. Phase Indication: Current processing phase
  2. Agent Activity: Which agent is working
  3. Completion Percentage: Overall progress
  4. Estimated Time: Time remaining

Error Handling

Automatic Retry:
- Click "Retry" button for AGENT_ERROR status
- System retries with optimized parameters
- Most issues resolve on retry

Error Logs:
- View detailed error information
- Understand what went wrong
- Get suggestions for fixes

Advanced Features

Automatic String Format Inference

When processing CSV data, SDCStudio automatically detects common string format patterns from sample values. This feature runs during the agentic pipeline without additional LLM calls:

Detected Patterns:
- Email addresses: user@example.com
- UUIDs: 550e8400-e29b-41d4-a716-446655440000
- IPv4 addresses: 192.168.1.1
- US ZIP codes: 90210 or 90210-1234
- Phone numbers: (555) 123-4567
- ISO dates: 2024-01-15
- URLs: https://example.com
- SSNs: 123-45-6789
- MAC addresses: 00:1A:2B:3C:4D:5E
- Hex colors: #FF5733

When a pattern is detected in the majority of sample values for a column, the system automatically pre-populates the str_fmt field with an XML Schema-compatible regex pattern. This saves manual configuration and improves data validation accuracy. You can verify any auto-generated pattern using the XML Regex Reference & Sandbox.
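Majority-vote format detection can be sketched with a few of the patterns listed above. The regexes and the 50% threshold here are illustrative; also note that real str_fmt values use XML Schema regex syntax, which is implicitly anchored and does not use `^`/`$`:

```python
import re

# Illustrative subset of the formats listed above (Python regex syntax).
FORMATS = {
    "email": re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}

def infer_str_fmt(samples, threshold=0.5):
    """Return the format name matching a majority of samples, or None."""
    if not samples:
        return None
    for name, pattern in FORMATS.items():
        hits = sum(1 for s in samples if pattern.match(s))
        if hits / len(samples) > threshold:
            return name
    return None
```

Majority voting rather than requiring all samples to match makes the detection tolerant of a few dirty values, which is why reviewing the auto-generated pattern afterward still matters.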

Custom Agent Configuration

Configure AI behavior (admin only):

LLM Model Selection:
- SDCStudio uses Gemini 3.0 Flash for default generation tasks and Gemini 3.0 Pro Preview for complex analysis (CSV2PD analyst/scanner)
- Balance speed vs. accuracy by selecting the appropriate model
- Configure for specific domains

Processing Priorities:
- Accuracy over speed
- Speed over detail
- Balanced approach

RAG (Retrieval-Augmented Generation)

Knowledge Base Integration:
- AI queries knowledge base for context
- Improves domain-specific understanding
- Better suggestions and validation

How It Works:

1. User uploads healthcare data
2. AI queries medical knowledge base
3. Returns healthcare-specific suggestions
4. Creates medically accurate components

Best Practices

Data Preparation

Structure Your Data Well:
- Use clear column names
- Ensure consistent types
- Clean data before upload
- Include representative samples

Provide Context:
- Name files descriptively
- Select appropriate industry
- Add descriptions where possible
- Use standard formats

Review AI Results

Always Review:
- Don't blindly accept AI suggestions
- Verify types are correct
- Check validation rules
- Test with edge cases

Iterate and Improve:
- Start with AI suggestions
- Refine based on requirements
- Test thoroughly
- Republish as needed

Leverage Semantic Enhancement

Use Ontologies:
- Upload relevant ontologies
- Link components to concepts
- Benefit from semantic understanding

Provide Examples:
- Include sample data
- Show edge cases
- Demonstrate patterns

Vertex AI Safety Filters

Google Cloud Vertex AI enforces content safety filters on all LLM requests and responses. These filters cannot be disabled on Vertex AI (unlike the free Gemini API) and may affect SDCStudio's AI features in the following ways:

What Gets Filtered

Vertex AI may truncate or block responses when the prompt or expected output contains terms associated with:

  • Personally Identifiable Information (PII): Social Security Numbers, passport numbers, credit card numbers, bank account numbers, driver's license numbers
  • Sensitive contexts: Terms like "arrestee", "victim", "suspect" combined with identifier patterns
  • Other sensitive content categories: Hate speech, dangerous content, sexually explicit material, harassment

How This Affects SDCStudio

  • AI Regex Suggestion (wand button): If your XdString component's label or description contains PII-related terms, the generated regex may be truncated or incomplete. SDCStudio automatically sanitizes known PII terms from prompts, but unusual combinations may still trigger filters.
  • Atlas Chatbot: Questions about modeling PII data fields may receive incomplete answers if the response triggers safety filters.
  • Agentic Processing: Automated model generation from uploaded data may produce incomplete results for columns that contain or describe PII data.

Workarounds

  1. Use generic labels and descriptions when working with PII fields — for example, use "Identifier Code" instead of "Social Security Number" during the AI-assisted steps, then update the label afterward.
  2. Provide format: examples in your description — the structural patterns (e.g., format: 999-99-9999) help the AI even when the field name is sanitized.
  3. Review and edit AI suggestions — always verify AI-generated regex patterns, constraints, and descriptions before publishing.
  4. Write regex manually for PII fields if the AI suggestion is consistently truncated. Use the XML Regex Reference & Sandbox to find common patterns and test your regex.

Technical Details

Vertex AI returns a finish_reason of SAFETY (code 2) when content is filtered. SDCStudio logs these events at the WARNING level. You can check for safety filter issues in the web container logs:

docker logs sdcstudio_web_dev 2>&1 | grep "finish_reason"

This is a Google Cloud platform limitation, not an SDCStudio bug. For more information, see Google's Vertex AI safety filters documentation.

Troubleshooting

AI Processing Fails

Check Data Quality:
- Review file format
- Verify encoding (UTF-8)
- Check for corruption
- Simplify and retry

Retry Processing:
- Click "Retry" button
- System optimizes parameters
- Usually resolves on second attempt

Contact Support:
- If retry fails repeatedly
- Include error log details
- Provide sample data if possible

Unexpected Results

Review Input Data:
- Check column name clarity
- Verify data consistency
- Look for pattern issues

Customize Components:
- AI provides a starting point
- Refine to your requirements
- Override suggestions as needed

Next Steps

Getting Help


Ready to leverage AI? Upload your data and watch the intelligent processing in action!