CSV Import
The CSV Import feature (/import) lets you bulk-import audience topics from spreadsheet files. The 4-step wizard guides you through uploading your file, mapping columns, choosing a classification mode, and executing the import with real-time progress tracking. The system processes up to 50,000 rows in chunked batches of 500, with automatic duplicate detection, taxonomy path parsing, and post-import review.
Preparing Your CSV File
Before importing, ensure your CSV file is properly formatted. The system is flexible with column names and can auto-detect many common formats.
Required Columns
| Column | Required | Description |
|---|---|---|
| Topic name | Yes | The primary audience topic name. This is the only strictly required column. |
Optional Columns
| Column | Description | Example |
|---|---|---|
| Keywords | Comma-separated keyword signals | "hybrid SUV, compact crossover, AWD" |
| Category | Parent category or taxonomy type hint | "Auto", "Business Technology" |
| Segment type | B2B, B2C, B2B2C, B2E, or B2G | "B2C" |
| External ID | Your source system's identifier for this topic | "EXT-12345" |
| Source | Where this topic originated | "Data Alliance", "Experian" |
| Taxonomy path | Full taxonomy path using > separator | "Automotive > Auto > Electric Vehicles > Tesla" |
Column names are matched flexibly during the column mapping step. The system recognizes common variations like "name", "topic", "topic_name", "segment_name" for the topic name column.
Taxonomy Path Format
If your CSV includes a taxonomy path column, the system automatically parses it into the hierarchy levels. The > character is used as the separator.
Provider > Taxonomy Type > Parent Category > Subcategory > Topic Name
Example paths:
| Taxonomy Path |
|---|
| "Data Alliance > Automotive & Vehicles > Auto > Electric Vehicles > Tesla Model Y" |
| "Experian > Technology & Telecom > Business Technology > CRM > Salesforce" |
| "IAB > Consumer Goods & Retail > Food & Beverage > QSR > Chipotle" |
The parser:
- Splits on the `>` separator
- Strips leading/trailing whitespace from each segment
- Detects and removes provider prefixes (e.g., "Data Alliance", "Experian") that are not part of the taxonomy
- Removes structural segments that match taxonomy type or parent category labels
- Extracts the leaf (last segment) as the topic name
Provider prefix stripping is automatic. If the first segment of the path matches a known data provider name, it is stripped. The system maintains a list of recognized providers.
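The parsing steps above can be sketched as follows. This is a minimal illustration in Python, not the actual implementation; the `KNOWN_PROVIDERS` set and the `parse_taxonomy_path` helper are assumed names, and the real provider list is maintained by the system.

```python
# Illustrative subset; the system maintains the real list of recognized providers.
KNOWN_PROVIDERS = {"Data Alliance", "Experian", "IAB"}

def parse_taxonomy_path(path: str, structural_labels=frozenset()) -> dict:
    """Split a '>'-delimited taxonomy path into hierarchy levels and a leaf topic."""
    # Split on '>' and strip whitespace from each segment
    segments = [s.strip() for s in path.strip('"').split(">")]
    segments = [s for s in segments if s]
    # Strip a leading provider prefix if it matches a known provider
    if segments and segments[0] in KNOWN_PROVIDERS:
        segments = segments[1:]
    # Drop structural segments (taxonomy type / parent category labels)
    segments = [s for s in segments if s not in structural_labels]
    # The leaf (last segment) becomes the topic name
    return {"levels": segments[:-1], "topic_name": segments[-1] if segments else ""}
```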
File Requirements
| Parameter | Limit |
|---|---|
| File format | CSV (comma-separated) |
| Maximum rows | 50,000 |
| Maximum file size | No hard limit (constrained by row count) |
| Encoding | UTF-8 recommended |
| Header row | Required (first row must be column headers) |
The 4-Step Import Wizard
Step 1: Upload
Navigate to the Import page (/import) and click Upload CSV or drag and drop your file onto the upload area.
The system parses the file immediately and shows:
- Total row count
- Detected column headers
- A preview of the first few rows
If your file exceeds 50,000 rows, only the first 50,000 will be imported. A warning message indicates how many rows were truncated.
Step 2: Column Mapping
Map each column in your CSV to an AudienceGPT field. The system attempts to auto-detect mappings based on column header names, but you can adjust any mapping manually.
| Your CSV Column | Maps To |
|---|---|
| "name", "topic", "topic_name", "segment_name" | Topic Name |
| "keywords", "tags", "keyword" | Keywords |
| "category", "parent_category", "taxonomy" | Category |
| "segment", "segment_type", "type" | Segment Type |
| "external_id", "id", "source_id" | External ID |
| "source", "provider", "data_source" | Source |
| "path", "taxonomy_path", "full_path" | Taxonomy Path |
For each column, select the target field from the dropdown, or choose "Skip" to ignore a column.
The Topic Name mapping is required. The import cannot proceed without at least one column mapped to Topic Name.
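The flexible header matching in the table above amounts to normalizing each header and checking it against per-field alias sets. The sketch below is a plausible approximation in Python; `FIELD_ALIASES` and `auto_map_columns` are illustrative names, not the product's API.

```python
FIELD_ALIASES = {
    "topic_name": {"name", "topic", "topic_name", "segment_name"},
    "keywords": {"keywords", "tags", "keyword"},
    "category": {"category", "parent_category", "taxonomy"},
    "segment_type": {"segment", "segment_type", "type"},
    "external_id": {"external_id", "id", "source_id"},
    "source": {"source", "provider", "data_source"},
    "taxonomy_path": {"path", "taxonomy_path", "full_path"},
}

def auto_map_columns(headers: list) -> dict:
    """Map CSV headers to target fields; unmatched headers map to None (Skip)."""
    def normalize(h):
        return h.strip().lower().replace(" ", "_").replace("-", "_")
    mapping = {}
    for header in headers:
        norm = normalize(header)
        target = next(
            (field for field, aliases in FIELD_ALIASES.items() if norm in aliases),
            None,
        )
        mapping[header] = target
    return mapping
```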
Step 3: Classification Mode
Choose how imported topics should be classified:
| Mode | Description | Speed | Cost | Best For |
|---|---|---|---|---|
| Rule-Based | Deterministic local classification | Very fast | Free | Large imports, well-known categories |
| AI-Powered | Claude Sonnet 4.6 with optional web search | Slower | Per-topic API cost | Ambiguous topics, brand verification |
When AI-powered mode is selected:
- New-to-global topics (not already in the global catalog) are classified via the AI
- Topics that match existing global catalog entries are adopted without AI classification
- Duplicate topics skip AI classification entirely
- A maximum of 10 topics per chunk are classified via AI to prevent route timeouts (the `IMPORT_MAX_LLM_PER_CHUNK` limit)
For large imports, consider using rule-based mode first, then selectively reclassifying ambiguous topics with AI afterward. This minimizes cost while ensuring accuracy where it matters most.
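The per-topic routing described above can be sketched as a small decision loop. This is an assumption-laden illustration: the fallback to rule-based classification once the per-chunk AI budget is exhausted is inferred, not documented, and `plan_classification` is a hypothetical helper.

```python
IMPORT_MAX_LLM_PER_CHUNK = 10  # per-chunk AI cap described above

def plan_classification(topics, is_duplicate, in_global_catalog):
    """Decide, per topic, how an AI-powered chunk is handled (illustrative)."""
    llm_budget = IMPORT_MAX_LLM_PER_CHUNK
    plan = []
    for topic in topics:
        if is_duplicate(topic):
            plan.append((topic, "skip"))        # duplicates skip AI entirely
        elif in_global_catalog(topic):
            plan.append((topic, "adopt"))       # adopted without AI classification
        elif llm_budget > 0:
            plan.append((topic, "ai"))          # new-to-global topic, AI classified
            llm_budget -= 1
        else:
            # ASSUMPTION: topics beyond the AI budget fall back to rule-based
            plan.append((topic, "rule_based"))
    return plan
```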
Step 4: Execute
Click Start Import to begin. The system creates an import batch and begins processing.
Chunked Processing Architecture
Imports are processed in chunks of 500 rows to ensure reliability and enable progress tracking. Here is how it works:
```
Parse CSV → Create batch (POST /api/import)
  → Sequential chunks of 500 rows (POST /api/import/{batchId}/chunk)
  → Each chunk: classify → embed → deduplicate → batch INSERT
  → Import complete
```
Processing Pipeline Per Chunk
For each 500-row chunk, the system:
- Classifies each topic through the 7-layer engine (rule-based or AI depending on your mode selection)
- Generates embeddings -- 256-dimensional vectors for duplicate detection
- Checks for duplicates -- Compares against existing topics using cosine similarity (95% threshold blocks, 75% warns) plus brand alias matching
- Enriches duplicates -- If a topic already exists, the system enriches the existing record with new metadata (external ID, source) using a COALESCE pattern rather than creating a redundant entry
- Batch inserts new topics into the database
Idempotency
Each chunk tracks its completion status. If a chunk is accidentally re-sent (e.g., due to a network retry), the system detects it and returns { skipped: true } without creating duplicate records.
Progress Tracking
During import, a progress bar shows real-time status:
- Chunks completed out of total (e.g., "12 / 20 chunks")
- Topics processed -- Running count of classified topics
- New topics -- Topics added to your Library
- Duplicates -- Topics that matched existing records (metadata enriched)
- Errors -- Any topics that failed classification
- Estimated time remaining
You can cancel an in-progress import at any time. Cancellation is graceful -- topics from already-completed chunks remain in your Library, but no further chunks are processed.
The import status can be polled at any time via GET /api/import/{batchId}/status. If you navigate away from the page during import, you can return later to check the result.
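A client-side status poll can be sketched as below. The terminal state names (`completed`, `cancelled`, `failed`) and the response shape are assumptions for illustration; `fetch_status` stands in for an HTTP call to `GET /api/import/{batchId}/status`.

```python
import time

def poll_import_status(fetch_status, interval_s: float = 2.0, max_polls: int = 100):
    """Poll the status endpoint until the batch reaches a terminal state.

    `fetch_status` is a caller-supplied callable returning the status payload.
    """
    for _ in range(max_polls):
        status = fetch_status()
        # ASSUMPTION: these are the terminal states; the real schema may differ
        if status.get("state") in {"completed", "cancelled", "failed"}:
            return status
        time.sleep(interval_s)
    raise TimeoutError("import did not finish within the polling window")
```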
Retry Behavior
If a chunk fails (network error, server timeout, etc.), the system automatically retries up to 3 times with exponential backoff:
| Attempt | Wait Before Retry |
|---|---|
| 1st retry | ~1 second |
| 2nd retry | ~2 seconds |
| 3rd retry | ~4 seconds |
After 3 failed attempts, the chunk is marked as failed and the import continues with the next chunk. Failed chunks are reported in the final import summary.
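The retry schedule above (1s, 2s, 4s) is classic exponential backoff. A minimal sketch, assuming a `send` callable that raises on failure; `send_chunk_with_retry` is an illustrative name, not the product's API.

```python
import time

def send_chunk_with_retry(send, max_retries: int = 3, base_delay_s: float = 1.0):
    """Attempt a chunk send, retrying up to 3 times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception:
            if attempt == max_retries:
                # Chunk is marked failed; the import continues with the next chunk
                raise
            time.sleep(base_delay_s * (2 ** attempt))  # ~1s, ~2s, ~4s
```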
Duplicate Detection During Import
The import pipeline uses the same dual-layer duplicate detection as single-topic classification:
- Semantic similarity (embeddings) -- Each imported topic is compared against all existing topics in your Library. Topics with 95%+ cosine similarity are treated as duplicates and are not re-inserted.
- Brand alias matching -- Known brand aliases (e.g., "Chevy" / "Chevrolet") are caught deterministically.
When a duplicate is found, instead of skipping the row entirely, the system enriches the existing topic's metadata:
- The `external_id` from the import is applied if the existing topic does not have one
- The `source` field is updated
- Other metadata fields are merged using a COALESCE pattern (existing values are preserved, empty fields are filled)
This means importing the same file twice will not create duplicates but will ensure all metadata is as complete as possible.
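The enrichment rules above can be sketched as a merge over plain dictionaries. The per-field behavior for `source` (unconditional update) is read from the list above; the exact semantics for every field are an assumption, and `enrich_existing` is a hypothetical helper, not the real database logic.

```python
def enrich_existing(existing: dict, incoming: dict) -> dict:
    """COALESCE-style merge of import metadata into an existing topic record."""
    merged = dict(existing)
    # external_id: applied only if the existing topic does not have one
    if not merged.get("external_id") and incoming.get("external_id"):
        merged["external_id"] = incoming["external_id"]
    # source: updated from the import (ASSUMED to overwrite)
    if incoming.get("source"):
        merged["source"] = incoming["source"]
    # Remaining fields: existing values preserved, empty fields filled
    for key, value in incoming.items():
        if key in ("external_id", "source"):
            continue
        if value not in (None, "") and merged.get(key) in (None, ""):
            merged[key] = value
    return merged
```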
Post-Import Review
After the import completes, you enter the import review step in the chatbot. This lets you review each imported topic one at a time and take quick actions:
Quick Actions
| Action | Description |
|---|---|
| Keep | Accept the topic as classified -- no changes needed |
| Skip | Remove the topic from your Library |
| Rename | Edit the topic name (the classification is preserved) |
| Field Edit | Modify specific classification fields (category, segment type, keywords) |
Fix All Names (Batch Action)
If many imported topics have naming issues (e.g., they still contain provider prefixes, structural segments, or formatting artifacts), use the Fix All Names batch action. This sends all remaining unreviewed topics to the AI for name cleanup in a single operation.
The AI applies the following fixes:
- Strips provider prefixes (e.g., "Data Alliance - " prefix removed)
- Removes structural taxonomy segments that were incorrectly included in the name
- Fixes CSV quoting artifacts (e.g., extra quotes or escaped characters)
- Normalizes capitalization and spacing
Fix All Names is the fastest way to clean up large imports. It processes all remaining topics at once rather than requiring you to review each one individually.
Review Scoring
Each imported topic receives a review score based on name quality heuristics. Topics with low scores (indicating potential naming issues) are surfaced first in the review queue, so you address the most problematic imports first.
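A name-quality heuristic of this kind might penalize the artifacts that Fix All Names targets. The specific penalties and weights below are invented for illustration; only the idea (lower score = surfaced earlier for review) comes from the text above.

```python
def review_score(name: str, known_providers=("Data Alliance", "Experian", "IAB")) -> int:
    """Heuristic name-quality score, 0-100; lower scores are reviewed first.

    All penalty values here are illustrative assumptions.
    """
    score = 100
    if any(name.startswith(p) for p in known_providers):
        score -= 40  # provider prefix left in the name
    if ">" in name:
        score -= 30  # unparsed taxonomy separator
    if '"' in name or "\\" in name:
        score -= 20  # CSV quoting artifacts
    if name != name.strip() or "  " in name:
        score -= 10  # spacing issues
    return max(score, 0)
```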
Import History
All past imports are tracked in the import history. You can view:
- Batch ID -- Unique identifier for each import
- Date -- When the import was executed
- Row count -- Total rows in the original file
- Results -- New topics, duplicates, errors, and skipped
- Classification mode -- Whether AI-powered or rule-based was used
- Status -- Completed, cancelled, or partially failed
Access import history from the Import page to review past operations or re-import files with different settings.
Troubleshooting
Common Issues
| Problem | Cause | Solution |
|---|---|---|
| "Maximum 50,000 rows allowed" | File exceeds the row limit | Split your file into multiple CSVs of 50,000 rows or fewer |
| Column mapping not auto-detected | Unusual column header names | Manually map columns in Step 2 of the wizard |
| Many duplicates detected | Re-importing previously imported topics | This is expected behavior. Existing topics have their metadata enriched. |
| Chunk processing timeout | AI-powered mode on complex topics | Switch to rule-based mode. AI mode is limited to 10 LLM calls per chunk to prevent timeouts. |
| Topics have provider prefix in name | Taxonomy path not parsed correctly | Check that your taxonomy path uses > as the separator. Provider prefix stripping requires the correct path format. |
| Import shows "cancelled" | You or another user cancelled the import | Completed chunks are preserved. Restart the import with remaining data if needed. |
| CSV parsing errors | File encoding issues | Ensure your file is saved as UTF-8. Avoid special characters in column headers. |
Re-importing After Errors
If an import partially fails:
- Review the import summary to see which chunks succeeded and which failed.
- The topics from completed chunks are already in your Library.
- You can re-import the same file -- duplicate detection prevents double-counting, and only the previously failed topics will be newly processed.
The idempotent chunk processing means re-importing is always safe. You will never create duplicate topics by running the same import twice.
Next Steps
- Library Management -- Browse and manage your imported topics
- Classification Deep Dive -- Understand the 7-layer classification applied during import
- Matrix Generation -- Generate combinatorial taxonomies instead of importing them
- Campaign Brief Analysis -- Upload briefs for AI-recommended topics