Dedup & Merge
The dedup and merge system identifies near-duplicate topics in the global catalog using pgvector embedding similarity, groups them into clusters with a designated winner and losers, and merges them with a full cascade that transfers org links, activations, and history. All merges use a soft-delete pattern (merged_into column) rather than hard deletes, making rollback possible.
This guide covers how duplicate detection works, how to review and confirm clusters, what happens during a merge cascade, and how to roll back a merge if needed.
How Duplicate Detection Works
AudienceGPT uses a two-layer approach to detect duplicates:
1. Embedding Similarity (pgvector)
Every topic has a 256-dimensional hash embedding stored in the topics.embedding column. These embeddings are generated from a composite text string that includes the topic name, parent category, taxonomy type, subcategory, segment type, and keywords.
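This guide does not describe the hashing scheme itself. As a rough illustration only, the sketch below joins the fields listed above into one string and maps it to a 256-dimensional vector with signed feature hashing. The field order, tokenization, hash function, and both function names are assumptions, not the production implementation.

```python
import hashlib
import math
import re

EMBEDDING_DIM = 256  # matches the dimensionality stated above


def composite_text(topic: dict) -> str:
    """Join the classification fields into one string (field order is assumed)."""
    parts = [
        topic.get("topicName", ""),
        topic.get("parentCategory", ""),
        topic.get("taxonomyType", ""),
        topic.get("subcategory", ""),
        topic.get("segmentType", ""),
        " ".join(topic.get("keywords", [])),
    ]
    return " ".join(p for p in parts if p)


def hash_embedding(text: str, dim: int = EMBEDDING_DIM) -> list:
    """Signed feature hashing of lowercase word tokens, L2-normalized."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # sign bit reduces collision bias
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Because the vector depends only on the field values, changing any classification field changes the embedding, which is why reclassification can create new near-duplicates.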
Similarity is measured using cosine distance via the <=> operator in PostgreSQL with pgvector. The similarity score is calculated as 1 - cosine_distance (that is, the cosine similarity), so 1 means identical and values near 0 mean completely different.
| Similarity | Action | Description |
|---|---|---|
| >= 0.95 | Block / Flag | Topics are near-identical and should be merged |
| 0.75 to < 0.95 | Warn | Topics are similar and may be duplicates |
| < 0.75 | Allow | Topics are sufficiently different |
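The scoring and thresholds above can be sketched in a few lines of Python. The function names and the lowercase action labels are illustrative assumptions; in production the distance is computed by pgvector, not in application code.

```python
import math

BLOCK_THRESHOLD = 0.95  # from the table above
WARN_THRESHOLD = 0.75   # from the table above


def cosine_similarity(a: list, b: list) -> float:
    """Equivalent to 1 - (a <=> b) in pgvector for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedup_action(similarity: float) -> str:
    """Map a similarity score to the action in the threshold table."""
    if similarity >= BLOCK_THRESHOLD:
        return "block"
    if similarity >= WARN_THRESHOLD:
        return "warn"
    return "allow"
```

Checking the higher threshold first means a score of 0.97 is reported as a block, never as a warning.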
An HNSW index on topics.embedding accelerates similarity searches, enabling sub-second neighbor lookups even across tens of thousands of topics.
2. Brand Alias Dictionary
In addition to embedding similarity, the system maintains a deterministic brand alias dictionary for known brand name variations (e.g., "Microsoft" / "MSFT" / "Microsoft Corp"). This catches duplicates that embedding similarity might miss due to different phrasing.
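A deterministic alias check can be as simple as a lookup table. The sketch below is a minimal illustration using the example variants above; the table contents, the normalization rule, and the function names are assumptions rather than the actual dictionary.

```python
from typing import Optional

# Hypothetical alias table: every known variant maps to one canonical brand key.
BRAND_ALIASES = {
    "microsoft": "microsoft",
    "msft": "microsoft",
    "microsoft corp": "microsoft",
}


def canonical_brand(name: str) -> Optional[str]:
    """Return the canonical brand key for a name, or None if it is not a known alias."""
    # Normalize case, drop periods, and collapse whitespace before the lookup.
    key = " ".join(name.lower().replace(".", "").split())
    return BRAND_ALIASES.get(key)


def is_brand_alias_duplicate(name_a: str, name_b: str) -> bool:
    """True when both names resolve to the same known brand."""
    brand_a = canonical_brand(name_a)
    return brand_a is not None and brand_a == canonical_brand(name_b)
```

Unknown names never match each other here, so the alias layer only adds matches on top of embedding similarity and cannot suppress them.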
Dedup Sweep Job
A dedup sweep is a background job that systematically scans topics changed by a reclassification job and identifies duplicate clusters.
Starting a Dedup Sweep
After a reclassify job completes, the admin UI offers the option to run a dedup sweep:
- Navigate to Admin > Topics > Dedup tab
- Select the source reclassify job (the sourceBatchId)
- Click Start Dedup Sweep
The system creates a background job of type dedup_sweep.
API: POST /api/admin/jobs
{
  "jobType": "dedup_sweep",
  "config": {
    "sourceBatchId": "job_1234567890_abc123"
  }
}
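For scripted use, the same request can be assembled with the Python standard library. This is a sketch under assumptions: the host name is a placeholder, the helper name is hypothetical, and authentication (which this guide does not describe) is omitted.

```python
import json
import urllib.request

BASE_URL = "https://audiencegpt.example.com"  # placeholder host, not from this guide


def build_job_request(job_type: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/admin/jobs request."""
    body = json.dumps({"jobType": job_type, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/admin/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_job_request("dedup_sweep", {"sourceBatchId": "job_1234567890_abc123"})
# urllib.request.urlopen(request) would submit it; credentials are omitted here.
```

The same helper works for any job type that the endpoint accepts, since only jobType and config vary.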
Sweep Processing
The dedup sweep handler processes topics in batches of 50 (DEDUP_BATCH_SIZE):
- Fetch batch: Retrieves distinct topic_id values from the topic_history table where batch_id matches the source reclassify job
- Find neighbors: For each topic, queries the embedding index for topics with similarity >= 0.95 (limit 5 neighbors per topic)
- Determine winner: Compares org adoption counts (org_topics references) -- the topic with more org links becomes the winner
- Deduplicate pairs: Tracks seen pairs to prevent duplicate cluster entries (A:B and B:A)
- Accumulate clusters: Builds up the cluster list across batches, storing intermediate results in the job's result JSONB field
The sweep completes when all topics from the source batch have been checked. The final result contains all discovered duplicate clusters.
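The processing steps above can be sketched as a single loop. This is an in-memory illustration, not the production handler: find_neighbors stands in for the pgvector query, org_counts stands in for the org_topics counts, the tie-breaking rule is an assumption, and the winnerName and loserNames fields are left out for brevity.

```python
from typing import Callable

SIMILARITY_THRESHOLD = 0.95
NEIGHBOR_LIMIT = 5
DEDUP_BATCH_SIZE = 50

# find_neighbors(topic_id, threshold, limit) -> [(neighbor_id, similarity), ...]
NeighborFn = Callable[[str, float, int], list]


def sweep(topic_ids: list, find_neighbors: NeighborFn, org_counts: dict) -> list:
    """Scan topics batch by batch and return duplicate clusters as dicts."""
    seen_pairs = set()
    clusters = []
    for start in range(0, len(topic_ids), DEDUP_BATCH_SIZE):
        for topic_id in topic_ids[start:start + DEDUP_BATCH_SIZE]:
            for neighbor_id, similarity in find_neighbors(
                topic_id, SIMILARITY_THRESHOLD, NEIGHBOR_LIMIT
            ):
                pair = frozenset((topic_id, neighbor_id))
                if neighbor_id == topic_id or pair in seen_pairs:
                    continue  # skip self-matches and the mirrored B:A pair
                seen_pairs.add(pair)
                # More org links wins; on a tie the scanned topic is kept (assumed).
                if org_counts.get(neighbor_id, 0) > org_counts.get(topic_id, 0):
                    winner, loser = neighbor_id, topic_id
                else:
                    winner, loser = topic_id, neighbor_id
                clusters.append(
                    {"winnerId": winner, "loserIds": [loser], "similarity": similarity}
                )
    return clusters
```

Storing each pair as a frozenset makes A:B and B:A compare equal, so a pair found from either side is recorded once.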
Sweep Results
When the sweep completes, the job's result field contains:
{
  "clusters": [
    {
      "winnerId": "topic_abc",
      "winnerName": "Salesforce CRM",
      "loserIds": ["topic_xyz"],
      "loserNames": ["Salesforce CRM Platform"],
      "similarity": 0.97
    }
  ]
}
Reviewing Duplicate Clusters

After a dedup sweep completes, review the clusters in the Dedup tab:
- Each cluster shows the winner (kept) and losers (to be merged)
- The similarity score is displayed for each pair
- Org adoption counts help you understand the impact of merging
- You can:
  - Confirm all clusters -- proceed with merge for all detected duplicates
  - Remove clusters -- exclude specific pairs you want to keep separate
  - Swap winner/loser -- if the algorithm chose the wrong winner
Review clusters carefully before merging. While merges can be rolled back, the rollback process is more complex than simply not merging in the first place. Pay special attention to clusters where the similarity is close to the 0.95 threshold.
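The review actions amount to simple edits of the cluster list before it is submitted. The sketch below shows one way to express removing a cluster and swapping its winner on the cluster structure from the sweep result; both helper names are hypothetical, and the admin UI may implement these differently.

```python
import copy


def remove_cluster(clusters: list, winner_id: str) -> list:
    """Drop the cluster for a given winner so its topics stay separate."""
    return [c for c in clusters if c["winnerId"] != winner_id]


def swap_winner(cluster: dict, new_winner_id: str) -> dict:
    """Promote one loser to winner; the old winner takes its place as a loser."""
    index = cluster["loserIds"].index(new_winner_id)  # raises ValueError if absent
    swapped = copy.deepcopy(cluster)                  # leave the original untouched
    swapped["winnerId"] = cluster["loserIds"][index]
    swapped["winnerName"] = cluster["loserNames"][index]
    swapped["loserIds"][index] = cluster["winnerId"]
    swapped["loserNames"][index] = cluster["winnerName"]
    return swapped
```

Returning new objects rather than mutating in place keeps the original sweep result available for comparison during review.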
Merge Operation
Starting a Merge
After reviewing clusters, initiate the merge:
- Confirm the clusters you want to merge
- Click Start Merge
- The system creates a background job of type merge
API: POST /api/admin/jobs
{
  "jobType": "merge",
  "config": {
    "clusters": [
      {
        "winnerId": "topic_abc",
        "winnerName": "Salesforce CRM",
        "loserIds": ["topic_xyz"],
        "loserNames": ["Salesforce CRM Platform"],
        "similarity": 0.97
      }
    ]
  }
}
Merge Cascade
For each cluster, the merge handler processes loser topics in batches of 10 (MERGE_BATCH_SIZE). Here is exactly what happens for each loser topic being merged into the winner:
Step 1: Transfer Org Topics
For each org that has adopted the loser topic:
- If the org does NOT have the winner: The org topic's global_topic_id is repointed from the loser to the winner. The org retains its link seamlessly.
- If the org ALREADY has the winner: The system consolidates:
  - Transfers segment_activations from the loser's org topic to the winner's org topic
  - Transfers org_topic_history entries from the loser's org topic to the winner's org topic
  - Merges metadata: external_id and source fill gaps on the winner via COALESCE, and performance_score keeps the higher value
  - Deletes the loser's org topic record
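The two branches of Step 1 can be sketched on in-memory rows. This is an illustration only: the real handler runs SQL against org_topics, segment_activations, and org_topic_history, and the row shapes (id, org_id, org_topic_id) and the function name are assumptions.

```python
def transfer_org_topics(org_topics, activations, history, loser_id, winner_id):
    """Apply Step 1 to in-memory rows; returns the surviving org_topics list."""
    winner_by_org = {
        ot["org_id"]: ot for ot in org_topics if ot["global_topic_id"] == winner_id
    }
    surviving = []
    for ot in org_topics:
        if ot["global_topic_id"] != loser_id:
            surviving.append(ot)
            continue
        target = winner_by_org.get(ot["org_id"])
        if target is None:
            ot["global_topic_id"] = winner_id  # repoint: the org keeps its row
            surviving.append(ot)
            continue
        # Consolidate: move child rows, merge metadata, drop the loser's row.
        for row in activations + history:
            if row["org_topic_id"] == ot["id"]:
                row["org_topic_id"] = target["id"]
        for field in ("external_id", "source"):
            if target[field] is None:  # COALESCE-style gap fill
                target[field] = ot[field]
        scores = [
            s for s in (target["performance_score"], ot["performance_score"])
            if s is not None
        ]
        target["performance_score"] = max(scores) if scores else None
    return surviving
```

Only the consolidation branch discards a row, which is why that branch is the one a later rollback cannot undo automatically.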
Step 2: Soft-Delete the Global Topic
The loser's global topic gets merged_into set to the winner's ID:
UPDATE topics SET merged_into = $winnerId, updated_at = NOW()
WHERE id = $loserId
The loser topic is not deleted. It remains in the database with merged_into pointing to the winner, allowing rollback and historical reference.
Step 3: Record History
A history entry is created with changeType = 'merged' and metadata = { winnerId }, linking to the merge job's batchId for potential rollback.
Soft-Delete Pattern
All standard queries filter merged topics:
WHERE merged_into IS NULL
This is applied consistently across:
- Topic browser queries
- Catalog search
- Statistics and metrics
- Dedup sweep neighbor searches
- Export queries
Merged topics still consume storage space and appear in raw database queries. They are invisible to the application but retained for audit trail and rollback purposes.
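The difference between application queries and raw queries can be demonstrated with a small in-memory database. The sketch uses SQLite in place of PostgreSQL and a reduced, assumed column set (id, topic_name, merged_into); only the merged_into filter itself comes from this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topics (id TEXT PRIMARY KEY, topic_name TEXT, merged_into TEXT)"
)
conn.executemany(
    "INSERT INTO topics (id, topic_name, merged_into) VALUES (?, ?, NULL)",
    [("topic_abc", "Salesforce CRM"), ("topic_xyz", "Salesforce CRM Platform")],
)

# Soft-delete the loser by pointing it at the winner (Step 2 of the cascade).
conn.execute(
    "UPDATE topics SET merged_into = ? WHERE id = ?", ("topic_abc", "topic_xyz")
)

# Application-style query: merged topics are filtered out.
visible = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM topics WHERE merged_into IS NULL ORDER BY id"
    )
]

# Raw query: the merged row is still stored and records its winner.
raw = conn.execute("SELECT id, merged_into FROM topics ORDER BY id").fetchall()
```

Here visible contains only topic_abc, while raw still returns both rows, with topic_xyz pointing at topic_abc.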
Rollback
If a merge produced undesirable results, you can roll it back.
Rolling Back a Merge Job
- Navigate to Admin > Background Jobs
- Find the completed merge job
- Click Rollback
API: POST /api/admin/jobs/[mergeJobId]/rollback
This creates a new reclassify job with config.rollbackOf set to the original merge job's ID.
How Rollback Works
The rollback handler:
- Queries topic_history for all entries with batch_id = <mergeJobId> and change_type = 'merged'
- For each entry, reads the previous_values snapshot
- Restores the loser topic by:
  - Clearing merged_into
  - Restoring all classification fields from the snapshot
  - Regenerating the embedding from restored values
  - Recording a rolled_back history entry
Merge rollback clears the global topic's merged_into value and restores its classification data. However, org topic transfers are not automatically reversed. If org topics were consolidated (both the winner and loser existed in the same org), the deletion of the loser's org topic is permanent. In practice, this means:
- Org topics that were repointed (org only had the loser) will remain pointed at the winner
- Org topics that were deleted during consolidation cannot be automatically restored
- Manual intervention may be needed to fully reverse complex merges
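The rollback steps can be sketched on in-memory records. This is a simplified illustration: the topics dict and history rows stand in for the topics and topic_history tables, embedding regeneration is omitted, and recording the new history entry under the rollback job's ID is an assumption. Consistent with the caveats above, the sketch never touches org topics.

```python
def rollback_merge(topics: dict, topic_history: list,
                   merge_job_id: str, rollback_job_id: str) -> int:
    """Restore losers merged under merge_job_id; returns how many were restored."""
    merged_entries = [
        h for h in topic_history
        if h["batch_id"] == merge_job_id and h["change_type"] == "merged"
    ]
    for entry in merged_entries:
        topic = topics[entry["topic_id"]]
        topic.update(entry["previous_values"])  # restore classification fields
        topic["merged_into"] = None             # make the topic visible again
        # The real handler also regenerates the embedding here (omitted).
        topic_history.append({
            "topic_id": entry["topic_id"],
            "batch_id": rollback_job_id,
            "change_type": "rolled_back",
            "previous_values": None,
        })
    return len(merged_entries)
```

Filtering on both batch_id and change_type ensures that only the losers from the chosen merge job are restored, even if the same topics appear in other history entries.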
Rollback Processing
Rollback processes in batches of 50 (RECLASSIFY_BATCH_SIZE), using the same CRON-driven chunk processing as other admin jobs. Each topic:
- Has its previous_values read from the history entry
- Gets a new embedding generated from the restored field values
- Is updated via updateGlobalTopic() with the restored data
- Gets a new history entry with changeType = 'rolled_back'
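Chunked processing of this kind usually keeps a cursor on the job record so that each scheduler tick resumes where the previous one stopped. The sketch below shows that general pattern; the cursor, topicIds, and status fields and the function name are hypothetical, not the actual job schema.

```python
RECLASSIFY_BATCH_SIZE = 50


def process_next_chunk(job: dict, handle_topic) -> dict:
    """Process one batch per scheduler tick, tracking progress on the job record."""
    cursor = job.get("cursor", 0)
    batch = job["topicIds"][cursor:cursor + RECLASSIFY_BATCH_SIZE]
    for topic_id in batch:
        handle_topic(topic_id)  # e.g. restore one topic from its snapshot
    job["cursor"] = cursor + len(batch)
    done = job["cursor"] >= len(job["topicIds"])
    job["status"] = "completed" if done else "running"
    return job
```

With 120 topics, this takes three ticks: two full batches of 50 and a final batch of 20.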
Manual Topic Search (for Merge)
The search endpoint supports finding specific topics for manual merge operations:
API: GET /api/admin/topics/search?q=<term>&limit=25
Returns lightweight results with: id, topicName, parentCategory, taxonomyType, segmentType, subcategory
Minimum search term length is 2 characters. Results are filtered to exclude merged and archived topics.
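The filtering rules for this endpoint can be sketched over an in-memory list. The substring matching rule, the archived flag, and returning an empty list for short terms are assumptions; the guide only states the minimum length, the exclusions, and the returned fields.

```python
MIN_QUERY_LENGTH = 2
DEFAULT_LIMIT = 25
RESULT_FIELDS = (
    "id", "topicName", "parentCategory", "taxonomyType", "segmentType", "subcategory",
)


def search_topics(topics: list, q: str, limit: int = DEFAULT_LIMIT) -> list:
    """Case-insensitive substring search over topic names (matching rule assumed)."""
    term = q.strip().lower()
    if len(term) < MIN_QUERY_LENGTH:
        return []
    matches = [
        {field: t.get(field) for field in RESULT_FIELDS}  # lightweight projection
        for t in topics
        if term in t["topicName"].lower()
        and t.get("merged_into") is None    # exclude merged topics
        and not t.get("archived", False)    # exclude archived topics
    ]
    return matches[:limit]
```

Excluding merged topics here prevents an already merged loser from being selected as the target of a new manual merge.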
Best Practices
- Always run a dedup sweep after bulk reclassification -- reclassification can change embeddings enough to create new near-duplicates
- Review clusters with similarity between 0.95 and 0.97 carefully -- these are on the boundary and may represent legitimately distinct topics
- Check org adoption counts before merging -- merging a topic adopted by many orgs has wider impact
- Prefer the topic with more org adoptions as the winner -- the system does this by default, but verify it makes sense semantically
- Document your merge decisions -- if you remove clusters from a sweep, note why for future reference
Next Steps
- Global Topics -- Browse and manage the global catalog
- Background Jobs -- Monitor dedup and merge jobs
- Review Queue -- Handle topics that need human review after reclassification