CSV Import

The CSV Import feature (/import) lets you bulk-import audience topics from spreadsheet files. The 4-step wizard guides you through uploading your file, mapping columns, choosing a classification mode, and executing the import with real-time progress tracking. The system processes up to 50,000 rows in chunked batches of 500, with automatic duplicate detection, taxonomy path parsing, and post-import review.

Preparing Your CSV File

Before importing, ensure your CSV file is properly formatted. The system is flexible with column names and can auto-detect many common formats.

Required Columns

| Column | Required | Description |
|---|---|---|
| Topic name | Yes | The primary audience topic name. This is the only strictly required column. |

Optional Columns

| Column | Description | Example |
|---|---|---|
| Keywords | Comma-separated keyword signals | "hybrid SUV, compact crossover, AWD" |
| Category | Parent category or taxonomy type hint | "Auto", "Business Technology" |
| Segment type | B2B, B2C, B2B2C, B2E, or B2G | "B2C" |
| External ID | Your source system's identifier for this topic | "EXT-12345" |
| Source | Where this topic originated | "Data Alliance", "Experian" |
| Taxonomy path | Full taxonomy path using the > separator | "Automotive > Auto > Electric Vehicles > Tesla" |
tip

Column names are matched flexibly during the column mapping step. The system recognizes common variations like "name", "topic", "topic_name", "segment_name" for the topic name column.

Taxonomy Path Format

If your CSV includes a taxonomy path column, the system automatically parses it into the hierarchy levels. The > character is used as the separator.

Provider > Taxonomy Type > Parent Category > Subcategory > Topic Name

Example paths:

  • "Data Alliance > Automotive & Vehicles > Auto > Electric Vehicles > Tesla Model Y"
  • "Experian > Technology & Telecom > Business Technology > CRM > Salesforce"
  • "IAB > Consumer Goods & Retail > Food & Beverage > QSR > Chipotle"

The parser:

  1. Splits on the > separator
  2. Strips leading/trailing whitespace from each segment
  3. Detects and removes provider prefixes (e.g., "Data Alliance", "Experian") that are not part of the taxonomy
  4. Removes structural segments that match taxonomy type or parent category labels
  5. Extracts the leaf (last segment) as the topic name
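The parsing steps above can be sketched as follows. This is an illustrative implementation, not the system's actual code: the `KNOWN_PROVIDERS` list is a made-up subset, and step 4 (removing structural segments) is omitted because it depends on the system's taxonomy-type and category label lists.

```typescript
// Hypothetical provider list; the real system maintains its own.
const KNOWN_PROVIDERS = new Set(["data alliance", "experian", "iab"]);

interface ParsedPath {
  provider?: string;
  levels: string[];   // intermediate hierarchy levels
  topicName: string;  // the leaf segment
}

function parseTaxonomyPath(path: string): ParsedPath {
  // Steps 1-2: split on ">" and trim whitespace from each segment
  let segments = path.split(">").map((s) => s.trim()).filter(Boolean);

  // Step 3: strip a leading provider prefix if recognized
  let provider: string | undefined;
  if (segments.length > 1 && KNOWN_PROVIDERS.has(segments[0].toLowerCase())) {
    provider = segments[0];
    segments = segments.slice(1);
  }

  // Step 5: the last segment is the topic name; the rest is hierarchy
  const topicName = segments[segments.length - 1] ?? "";
  return { provider, levels: segments.slice(0, -1), topicName };
}
```

For example, `parseTaxonomyPath("Data Alliance > Auto > Electric Vehicles > Tesla Model Y")` yields provider `"Data Alliance"`, levels `["Auto", "Electric Vehicles"]`, and topic name `"Tesla Model Y"`.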
info

Provider prefix stripping is automatic. If the first segment of the path matches a known data provider name, it is stripped. The system maintains a list of recognized providers.

File Requirements

| Parameter | Limit |
|---|---|
| File format | CSV (comma-separated) |
| Maximum rows | 50,000 |
| Maximum file size | No hard limit (constrained by row count) |
| Encoding | UTF-8 recommended |
| Header row | Required (first row must be column headers) |

The 4-Step Import Wizard

Step 1: Upload

Navigate to the Import page (/import) and click Upload CSV or drag and drop your file onto the upload area.

The system parses the file immediately and shows:

  • Total row count
  • Detected column headers
  • A preview of the first few rows

If your file exceeds 50,000 rows, only the first 50,000 will be imported. A warning message indicates how many rows were truncated.

Step 2: Column Mapping

Map each column in your CSV to an AudienceGPT field. The system attempts to auto-detect mappings based on column header names, but you can adjust any mapping manually.

| Your CSV Column | Maps To |
|---|---|
| "name", "topic", "topic_name", "segment_name" | Topic Name |
| "keywords", "tags", "keyword" | Keywords |
| "category", "parent_category", "taxonomy" | Category |
| "segment", "segment_type", "type" | Segment Type |
| "external_id", "id", "source_id" | External ID |
| "source", "provider", "data_source" | Source |
| "path", "taxonomy_path", "full_path" | Taxonomy Path |

For each column, select the target field from the dropdown, or choose "Skip" to ignore a column.

warning

The Topic Name mapping is required. The import cannot proceed without at least one column mapped to Topic Name.
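Auto-detection amounts to normalizing each header and looking it up against per-field alias lists. The alias lists below mirror the mapping table above, but the matching code itself is a hypothetical sketch, not the system's implementation.

```typescript
// Field alias lists, taken from the mapping table on this page.
const FIELD_ALIASES: Record<string, string[]> = {
  topicName: ["name", "topic", "topic_name", "segment_name"],
  keywords: ["keywords", "tags", "keyword"],
  category: ["category", "parent_category", "taxonomy"],
  segmentType: ["segment", "segment_type", "type"],
  externalId: ["external_id", "id", "source_id"],
  source: ["source", "provider", "data_source"],
  taxonomyPath: ["path", "taxonomy_path", "full_path"],
};

function autoMapColumns(headers: string[]): Record<string, string> {
  const mapping: Record<string, string> = {};
  for (const header of headers) {
    // Normalize: lowercase, spaces and dashes become underscores
    const key = header.trim().toLowerCase().replace(/[\s-]+/g, "_");
    const match = Object.entries(FIELD_ALIASES).find(([, aliases]) =>
      aliases.includes(key)
    );
    mapping[header] = match ? match[0] : "skip"; // unmatched columns default to Skip
  }
  return mapping;
}
```

Under this scheme a header like "Topic Name" normalizes to `topic_name` and maps to Topic Name, while an unrecognized header defaults to Skip until you map it manually.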

Step 3: Classification Mode

Choose how imported topics should be classified:

| Mode | Description | Speed | Cost | Best For |
|---|---|---|---|---|
| Rule-Based | Deterministic local classification | Very fast | Free | Large imports, well-known categories |
| AI-Powered | Claude Sonnet 4.6 with optional web search | Slower | Per-topic API cost | Ambiguous topics, brand verification |

When AI-powered mode is selected:

  • New-to-global topics (not already in the global catalog) are classified via the AI
  • Topics that match existing global catalog entries are adopted without AI classification
  • Duplicate topics skip AI classification entirely
  • A maximum of 10 topics per chunk are classified via AI to prevent route timeouts (the IMPORT_MAX_LLM_PER_CHUNK limit)
tip

For large imports, consider using rule-based mode first, then selectively reclassifying ambiguous topics with AI afterward. This minimizes cost while ensuring accuracy where it matters most.
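The AI-mode rules above amount to a per-chunk triage. The sketch below is illustrative only: real matching uses embeddings rather than exact name equality, and routing over-cap topics to rule-based classification is an assumption about how the cap is enforced.

```typescript
// Mirrors the IMPORT_MAX_LLM_PER_CHUNK limit described on this page.
const IMPORT_MAX_LLM_PER_CHUNK = 10;

interface Topic { name: string }

function triageChunk(
  topics: Topic[],
  globalCatalog: Set<string>,   // simplified: real matching is embedding-based
  existingLibrary: Set<string>
) {
  const adopted: Topic[] = [];
  const duplicates: Topic[] = [];
  const toClassifyAI: Topic[] = [];
  const ruleFallback: Topic[] = []; // assumption: over-cap topics fall back to rules

  for (const t of topics) {
    if (existingLibrary.has(t.name)) duplicates.push(t);          // skip AI entirely
    else if (globalCatalog.has(t.name)) adopted.push(t);          // adopt without AI
    else if (toClassifyAI.length < IMPORT_MAX_LLM_PER_CHUNK) toClassifyAI.push(t);
    else ruleFallback.push(t);
  }
  return { adopted, duplicates, toClassifyAI, ruleFallback };
}
```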

Step 4: Execute

Click Start Import to begin. The system creates an import batch and begins processing.

Chunked Processing Architecture

Imports are processed in chunks of 500 rows to ensure reliability and enable progress tracking. Here is how it works:

Parse CSV → Create batch (POST /api/import)
→ Sequential chunks of 500 rows (POST /api/import/{batchId}/chunk)
→ Each chunk: classify → embed → deduplicate → batch INSERT
→ Import complete
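The flow above implies a client-side driver along these lines. The endpoint paths come from this page; the request and response shapes (`totalRows`, `batchId`, `chunkIndex`) are assumptions for illustration.

```typescript
const CHUNK_SIZE = 500;

// Split parsed rows into sequential chunks of 500.
function chunkRows<T>(rows: T[], size = CHUNK_SIZE): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

async function runImport(rows: object[]): Promise<void> {
  // Create the batch
  const res = await fetch("/api/import", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ totalRows: rows.length }),
  });
  const { batchId } = await res.json();

  // Send chunks one at a time, in order
  const chunks = chunkRows(rows);
  for (let chunkIndex = 0; chunkIndex < chunks.length; chunkIndex++) {
    await fetch(`/api/import/${batchId}/chunk`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ chunkIndex, rows: chunks[chunkIndex] }),
    });
  }
}
```

A 1,200-row file, for example, becomes three chunks of 500, 500, and 200 rows.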

Processing Pipeline Per Chunk

For each 500-row chunk, the system:

  1. Classifies each topic through the 7-layer engine (rule-based or AI depending on your mode selection)
  2. Generates embeddings -- 256-dimensional vectors for duplicate detection
  3. Checks for duplicates -- Compares against existing topics using cosine similarity (95% threshold blocks, 75% warns) plus brand alias matching
  4. Enriches duplicates -- If a topic already exists, the system enriches the existing record with new metadata (external ID, source) using a COALESCE pattern rather than creating a redundant entry
  5. Batch inserts new topics into the database

Idempotency

Each chunk tracks its completion status. If a chunk is accidentally re-sent (e.g., due to a network retry), the system detects it and returns { skipped: true } without creating duplicate records.
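A minimal sketch of that idempotency check, assuming completion is tracked per batch by chunk index (the real system presumably persists this in the database rather than in memory):

```typescript
// batchId -> set of completed chunk indexes (illustrative in-memory store)
const completedChunks = new Map<string, Set<number>>();

function handleChunk(batchId: string, chunkIndex: number, process: () => void) {
  const done = completedChunks.get(batchId) ?? new Set<number>();
  if (done.has(chunkIndex)) {
    return { skipped: true }; // duplicate delivery: acknowledged, nothing inserted
  }
  process(); // classify → embed → deduplicate → batch INSERT
  done.add(chunkIndex);
  completedChunks.set(batchId, done);
  return { skipped: false };
}
```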

Progress Tracking

During import, a progress bar shows real-time status:

  • Chunks completed out of total (e.g., "12 / 20 chunks")
  • Topics processed -- Running count of classified topics
  • New topics -- Topics added to your Library
  • Duplicates -- Topics that matched existing records (metadata enriched)
  • Errors -- Any topics that failed classification
  • Estimated time remaining

You can cancel an in-progress import at any time. Cancellation is graceful -- topics from already-completed chunks remain in your Library, but no further chunks are processed.

info

The import status can be polled at any time via GET /api/import/{batchId}/status. If you navigate away from the page during import, you can return later to check the result.
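Polling that endpoint might look like the sketch below. The response shape (`status`, `chunksCompleted`, `totalChunks`) and the set of terminal states are assumptions, not a documented contract.

```typescript
// Assumed terminal states, based on the statuses listed on this page.
const TERMINAL_STATES = ["completed", "cancelled", "failed"];
const isTerminal = (status: string): boolean => TERMINAL_STATES.includes(status);

async function pollImportStatus(batchId: string, intervalMs = 2000) {
  for (;;) {
    const res = await fetch(`/api/import/${batchId}/status`);
    const body = await res.json();
    if (isTerminal(body.status)) return body;
    console.log(`${body.chunksCompleted} / ${body.totalChunks} chunks`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```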

Retry Behavior

If a chunk fails (network error, server timeout, etc.), the system automatically retries up to 3 times with exponential backoff:

| Attempt | Wait Before Retry |
|---|---|
| 1st retry | ~1 second |
| 2nd retry | ~2 seconds |
| 3rd retry | ~4 seconds |

After 3 failed attempts, the chunk is marked as failed and the import continues with the next chunk. Failed chunks are reported in the final import summary.
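The schedule above doubles a ~1-second base delay on each retry. A sketch of that behavior, with the send function and failure handling left as illustrative placeholders:

```typescript
// 1000, 2000, 4000 ms for the default 3-retry schedule
function backoffDelays(retries = 3, baseMs = 1000): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// Returns the response on success, or null once all retries are exhausted
// (at which point the chunk is marked failed and the import moves on).
async function sendWithRetry<T>(send: () => Promise<T>): Promise<T | null> {
  const delays = backoffDelays();
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await send();
    } catch {
      if (attempt === delays.length) return null;
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
  return null;
}
```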

Duplicate Detection During Import

The import pipeline uses the same dual-layer duplicate detection as single-topic classification:

  1. Semantic similarity (embeddings) -- Each imported topic is compared against all existing topics in your Library. Topics with 95%+ cosine similarity are treated as duplicates and are not re-inserted.
  2. Brand alias matching -- Known brand aliases (e.g., "Chevy" / "Chevrolet") are caught deterministically.

When a duplicate is found, instead of skipping the row entirely, the system enriches the existing topic's metadata:

  • The external_id from the import is applied if the existing topic does not have one
  • The source field is updated
  • Other metadata fields are merged using a COALESCE pattern (existing values are preserved, empty fields are filled)

This means importing the same file twice will not create duplicates but will ensure all metadata is as complete as possible.
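The similarity check and the COALESCE-style merge can be sketched as below. The thresholds and field names come from this page; the metadata shape is simplified for illustration, and in practice the merge happens in SQL rather than application code.

```typescript
const BLOCK_THRESHOLD = 0.95; // treated as a duplicate, not re-inserted
const WARN_THRESHOLD = 0.75;  // flagged for review

// Standard cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// COALESCE pattern: existing values win; only empty fields are filled.
interface TopicMeta { external_id: string | null; source: string | null }

function enrich(existing: TopicMeta, incoming: TopicMeta): TopicMeta {
  return {
    external_id: existing.external_id ?? incoming.external_id,
    source: existing.source ?? incoming.source,
  };
}
```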

Post-Import Review

After the import completes, you enter the import review step in the chatbot. This lets you review each imported topic one at a time and take quick actions:

Quick Actions

| Action | Description |
|---|---|
| Keep | Accept the topic as classified -- no changes needed |
| Skip | Remove the topic from your Library |
| Rename | Edit the topic name (the classification is preserved) |
| Field Edit | Modify specific classification fields (category, segment type, keywords) |

Fix All Names (Batch Action)

If many imported topics have naming issues (e.g., they still contain provider prefixes, structural segments, or formatting artifacts), use the Fix All Names batch action. This sends all remaining unreviewed topics to the AI for name cleanup in a single operation.

The AI applies the following fixes:

  • Strips provider prefixes (e.g., "Data Alliance - " prefix removed)
  • Removes structural taxonomy segments that were incorrectly included in the name
  • Fixes CSV quoting artifacts (e.g., extra quotes or escaped characters)
  • Normalizes capitalization and spacing
tip

Fix All Names is the fastest way to clean up large imports. It processes all remaining topics at once rather than requiring you to review each one individually.

Review Scoring

Each imported topic receives a review score based on name quality heuristics. Topics with low scores (indicating potential naming issues) are surfaced first in the review queue, so you address the most problematic imports first.
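A heuristic in that spirit might penalize the naming issues this page calls out (provider prefixes, path residue, quoting artifacts). The checks and weights below are assumptions, not the system's actual scoring:

```typescript
// Hypothetical name-quality score in [0, 1]; lower means more suspect.
function reviewScore(name: string): number {
  let score = 1.0;
  if (/^(data alliance|experian|iab)\b/i.test(name)) score -= 0.4; // provider prefix
  if (name.includes(">")) score -= 0.3;                            // unparsed path residue
  if (/["\\]/.test(name)) score -= 0.2;                            // CSV quoting artifacts
  if (/\s{2,}/.test(name)) score -= 0.1;                           // irregular spacing
  return Math.max(0, score);
}

// Lowest-scoring (most problematic) topics are surfaced first.
function sortForReview(names: string[]): string[] {
  return [...names].sort((a, b) => reviewScore(a) - reviewScore(b));
}
```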

Import History

All past imports are tracked in the import history. You can view:

  • Batch ID -- Unique identifier for each import
  • Date -- When the import was executed
  • Row count -- Total rows in the original file
  • Results -- New topics, duplicates, errors, and skipped
  • Classification mode -- Whether AI-powered or rule-based was used
  • Status -- Completed, cancelled, or partially failed

Access import history from the Import page to review past operations or re-import files with different settings.

Troubleshooting

Common Issues

| Problem | Cause | Solution |
|---|---|---|
| "Maximum 50,000 rows allowed" | File exceeds the row limit | Split your file into multiple CSVs of 50,000 rows or fewer |
| Column mapping not auto-detected | Unusual column header names | Manually map columns in Step 2 of the wizard |
| Many duplicates detected | Re-importing previously imported topics | This is expected behavior; existing topics have their metadata enriched |
| Chunk processing timeout | AI-powered mode on complex topics | Switch to rule-based mode; AI mode is limited to 10 LLM calls per chunk to prevent timeouts |
| Topics have provider prefix in name | Taxonomy path not parsed correctly | Check that your taxonomy path uses > as the separator; provider prefix stripping requires the correct path format |
| Import shows "cancelled" | You or another user cancelled the import | Completed chunks are preserved; restart the import with remaining data if needed |
| CSV parsing errors | File encoding issues | Ensure your file is saved as UTF-8 and avoid special characters in column headers |

Re-importing After Errors

If an import partially fails:

  1. Review the import summary to see which chunks succeeded and which failed.
  2. The topics from completed chunks are already in your Library.
  3. You can re-import the same file -- duplicate detection prevents double-counting, and only the previously failed topics will be newly processed.
info

The idempotent chunk processing means re-importing is always safe. You will never create duplicate topics by running the same import twice.

Next Steps