Dedup & Merge

The dedup and merge system identifies near-duplicate topics in the global catalog using pgvector embedding similarity, groups them into clusters with a designated winner and one or more losers, and merges them with a full cascade that transfers org links, activations, and history. Global topics are merged with a soft-delete pattern (the merged_into column) rather than hard deletes, which makes rollback of the global topic possible.

This guide covers how duplicate detection works, how to review and confirm clusters, what happens during a merge cascade, and how to roll back a merge if needed.

How Duplicate Detection Works

AudienceGPT uses a two-layer approach to detect duplicates:

1. Embedding Similarity (pgvector)

Every topic has a 256-dimensional hash embedding stored in the topics.embedding column. These embeddings are generated from a composite text string that includes the topic name, parent category, taxonomy type, subcategory, segment type, and keywords.

Similarity is measured using cosine distance via pgvector's <=> operator in PostgreSQL. The similarity score is calculated as 1 - cosine_distance: a score of 1 means the embeddings point in the same direction (identical), and a score near 0 means the topics are unrelated.
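As a minimal sketch of the score described above, the function below computes cosine similarity in application code, which is the value that 1 - (a <=> b) yields in pgvector. The NEIGHBOR_QUERY text is a hypothetical illustration: the topics table, embedding column, and merged_into column come from this guide, but the parameter layout and the exact query used by the system are assumptions.

```typescript
// Cosine similarity between two equal-length vectors.
// pgvector's <=> operator returns cosine distance, i.e. 1 minus this value.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: a zero vector is treated as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical neighbor lookup. Assumed parameters:
// $1 = query embedding, $2 = id of the topic being checked,
// $3 = similarity threshold, $4 = neighbor limit.
const NEIGHBOR_QUERY = `
  SELECT id, 1 - (embedding <=> $1) AS similarity
  FROM topics
  WHERE merged_into IS NULL
    AND id <> $2
    AND 1 - (embedding <=> $1) >= $3
  ORDER BY embedding <=> $1
  LIMIT $4`;
```

Ordering by the raw distance expression (rather than by the computed similarity) is the form pgvector can serve from a vector index.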

Threshold | Action | Description
>= 0.95 | Block / Flag | Topics are near-identical and should be merged
>= 0.75 | Warn | Topics are similar and may be duplicates
< 0.75 | Allow | Topics are sufficiently different
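
The thresholds above can be read as a simple top-down check. The following is an illustrative sketch only; the constant and function names are hypothetical and not taken from the codebase.

```typescript
type DedupAction = "block" | "warn" | "allow";

// Threshold values from the table above; the constant names are assumed.
const BLOCK_THRESHOLD = 0.95;
const WARN_THRESHOLD = 0.75;

// Map a similarity score to the action in the threshold table.
// The higher threshold is checked first, so a score of 0.97 is "block", not "warn".
function classifySimilarity(similarity: number): DedupAction {
  if (similarity >= BLOCK_THRESHOLD) return "block";
  if (similarity >= WARN_THRESHOLD) return "warn";
  return "allow";
}
```

For example, classifySimilarity(0.8) falls in the warn band, while classifySimilarity(0.5) is allowed.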

An HNSW index on topics.embedding accelerates similarity searches, enabling sub-second neighbor lookups even across tens of thousands of topics.

2. Brand Alias Dictionary

In addition to embedding similarity, the system maintains a deterministic brand alias dictionary for known brand name variations (e.g., "Microsoft" / "MSFT" / "Microsoft Corp"). This catches duplicates that embedding similarity might miss due to different phrasing.
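
A deterministic alias lookup of this kind might look like the sketch below. The dictionary contents, the case-insensitive matching, and the function names are assumptions for illustration; this guide does not describe how the real dictionary is stored or normalized.

```typescript
// Hypothetical alias table, using the example from the text above.
const BRAND_ALIASES: Record<string, string[]> = {
  Microsoft: ["MSFT", "Microsoft Corp"],
};

// Build a lookup from every known spelling (lowercased) to its canonical brand.
const aliasLookup = new Map<string, string>();
for (const [canonical, aliases] of Object.entries(BRAND_ALIASES)) {
  for (const name of [canonical, ...aliases]) {
    aliasLookup.set(name.trim().toLowerCase(), canonical);
  }
}

// Returns the canonical brand for a name, or undefined if the name is unknown.
function canonicalBrand(name: string): string | undefined {
  return aliasLookup.get(name.trim().toLowerCase());
}

// Two names are aliases when both resolve to the same known brand.
function isBrandAlias(a: string, b: string): boolean {
  const canonicalA = canonicalBrand(a);
  return canonicalA !== undefined && canonicalA === canonicalBrand(b);
}
```

Because the lookup is exact rather than fuzzy, it only catches variations that have been added to the dictionary.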

Dedup Sweep Job

A dedup sweep is a background job that systematically scans topics changed by a reclassification job and identifies duplicate clusters.

Starting a Dedup Sweep

After a reclassify job completes, the admin UI offers the option to run a dedup sweep:

  1. Navigate to Admin > Topics > Dedup tab
  2. Select the source reclassify job (the sourceBatchId)
  3. Click Start Dedup Sweep

The system creates a background job of type dedup_sweep.

API: POST /api/admin/jobs

{
  "jobType": "dedup_sweep",
  "config": {
    "sourceBatchId": "job_1234567890_abc123"
  }
}
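
A client call for this endpoint could be sketched as follows. The request body matches the example above; the baseUrl parameter, the absence of authentication headers, and the untyped response are assumptions, since this guide does not document them.

```typescript
interface DedupSweepRequest {
  jobType: "dedup_sweep";
  config: { sourceBatchId: string };
}

// Build the request body shown above for a given reclassify job ID.
function buildDedupSweepRequest(sourceBatchId: string): DedupSweepRequest {
  if (sourceBatchId.trim() === "") {
    throw new Error("sourceBatchId is required");
  }
  return { jobType: "dedup_sweep", config: { sourceBatchId } };
}

// Hypothetical helper: POST the body to the jobs endpoint.
// Real deployments will likely need session or token headers, which are omitted here.
async function startDedupSweep(baseUrl: string, sourceBatchId: string): Promise<unknown> {
  const response = await fetch(`${baseUrl}/api/admin/jobs`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildDedupSweepRequest(sourceBatchId)),
  });
  if (!response.ok) {
    throw new Error(`Job creation failed with status ${response.status}`);
  }
  return response.json();
}
```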

Sweep Processing

The dedup sweep handler processes topics in batches of 50 (DEDUP_BATCH_SIZE):

  1. Fetch batch: Retrieves distinct topic_id values from the topic_history table where batch_id matches the source reclassify job
  2. Find neighbors: For each topic, queries the embedding index for topics with similarity >= 0.95 (limit 5 neighbors per topic)
  3. Determine winner: Compares org adoption counts (org_topics references) -- the topic with more org links becomes the winner
  4. Deduplicate pairs: Tracks seen pairs to prevent duplicate cluster entries (A:B and B:A)
  5. Accumulate clusters: Builds up the cluster list across batches, storing intermediate results in the job's result JSONB field

The sweep completes when all topics from the source batch have been checked. The final result contains all discovered duplicate clusters.
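
Steps 3 and 4 above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it emits one cluster per pair, keeps the first topic as winner on a tie (tie-breaking is not specified in this guide), and takes org adoption counts from an in-memory map rather than from org_topics. All type and function names are hypothetical.

```typescript
interface TopicRef { id: string; name: string; }

interface DedupCluster {
  winnerId: string;
  winnerName: string;
  loserIds: string[];
  loserNames: string[];
  similarity: number;
}

// Order-independent key so that A:B and B:A collapse to the same pair.
function pairKey(a: string, b: string): string {
  return a < b ? `${a}:${b}` : `${b}:${a}`;
}

function buildClusters(
  pairs: Array<{ a: TopicRef; b: TopicRef; similarity: number }>,
  orgCounts: Map<string, number>,
): DedupCluster[] {
  const seen = new Set<string>();
  const clusters: DedupCluster[] = [];
  for (const { a, b, similarity } of pairs) {
    const key = pairKey(a.id, b.id);
    if (seen.has(key)) continue; // pair already recorded in the other direction
    seen.add(key);
    const aCount = orgCounts.get(a.id) ?? 0;
    const bCount = orgCounts.get(b.id) ?? 0;
    // The topic with more org links wins; ties keep `a` (an assumption).
    const [winner, loser] = aCount >= bCount ? [a, b] : [b, a];
    clusters.push({
      winnerId: winner.id,
      winnerName: winner.name,
      loserIds: [loser.id],
      loserNames: [loser.name],
      similarity,
    });
  }
  return clusters;
}
```

Keeping the seen set across batches is what prevents the same pair from being reported twice when both topics appear in the source batch.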

Sweep Results

When the sweep completes, the job's result field contains:

{
  "clusters": [
    {
      "winnerId": "topic_abc",
      "winnerName": "Salesforce CRM",
      "loserIds": ["topic_xyz"],
      "loserNames": ["Salesforce CRM Platform"],
      "similarity": 0.97
    }
  ]
}

Reviewing Duplicate Clusters

After a dedup sweep completes, review the clusters in the Dedup tab:

  1. Each cluster shows the winner (kept) and losers (to be merged)
  2. The similarity score is displayed for each pair
  3. Org adoption counts help you understand the impact of merging
  4. You can:
    • Confirm all clusters -- proceed with merge for all detected duplicates
    • Remove clusters -- exclude specific pairs you want to keep separate
    • Swap winner/loser -- if the algorithm chose the wrong winner

Warning: Review clusters carefully before merging. Merges can be rolled back, but rollback does not automatically reverse org topic transfers, so excluding a doubtful cluster is much easier than undoing its merge. Pay special attention to clusters where the similarity is close to the 0.95 threshold.
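
The review actions can be modeled on the cluster shape from the sweep result. The sketch below is illustrative: the helper names are hypothetical, and the 0.97 upper bound for "borderline" is taken from the Best Practices section of this guide rather than from any documented setting.

```typescript
interface DedupCluster {
  winnerId: string;
  winnerName: string;
  loserIds: string[];
  loserNames: string[];
  similarity: number;
}

// Flag clusters that sit close to the 0.95 merge threshold.
function isBorderline(cluster: DedupCluster, upper: number = 0.97): boolean {
  return cluster.similarity >= 0.95 && cluster.similarity <= upper;
}

// Promote one of the losers to winner; the old winner takes its place as a loser.
// Returns a new cluster and leaves the input untouched.
function swapWinner(cluster: DedupCluster, newWinnerId: string): DedupCluster {
  const index = cluster.loserIds.indexOf(newWinnerId);
  if (index === -1) {
    throw new Error(`${newWinnerId} is not a loser in this cluster`);
  }
  const loserIds = [...cluster.loserIds];
  const loserNames = [...cluster.loserNames];
  // Fall back to the ID if the names array is shorter than the IDs array.
  const newWinnerName = loserNames[index] ?? newWinnerId;
  loserIds[index] = cluster.winnerId;
  loserNames[index] = cluster.winnerName;
  return { ...cluster, winnerId: newWinnerId, winnerName: newWinnerName, loserIds, loserNames };
}
```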

Merge Operation

Starting a Merge

After reviewing clusters, initiate the merge:

  1. Confirm the clusters you want to merge
  2. Click Start Merge
  3. The system creates a background job of type merge

API: POST /api/admin/jobs

{
  "jobType": "merge",
  "config": {
    "clusters": [
      {
        "winnerId": "topic_abc",
        "winnerName": "Salesforce CRM",
        "loserIds": ["topic_xyz"],
        "loserNames": ["Salesforce CRM Platform"],
        "similarity": 0.97
      }
    ]
  }
}
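
Before submitting a merge job, a client may want to sanity-check the confirmed clusters. The following sketch builds the request body shown above; the validation rules and all function names are assumptions for illustration, not documented server-side behavior.

```typescript
interface DedupCluster {
  winnerId: string;
  winnerName: string;
  loserIds: string[];
  loserNames: string[];
  similarity: number;
}

interface MergeJobRequest {
  jobType: "merge";
  config: { clusters: DedupCluster[] };
}

// Hypothetical client-side checks; returns a list of problems (empty when valid).
function validateCluster(cluster: DedupCluster): string[] {
  const problems: string[] = [];
  if (cluster.loserIds.length === 0) problems.push("cluster has no losers");
  if (cluster.loserIds.length !== cluster.loserNames.length) {
    problems.push("loserIds and loserNames differ in length");
  }
  if (cluster.loserIds.includes(cluster.winnerId)) {
    problems.push("winner is also listed as a loser");
  }
  return problems;
}

// Build the POST /api/admin/jobs body for a merge job.
function buildMergeRequest(clusters: DedupCluster[]): MergeJobRequest {
  for (const cluster of clusters) {
    const problems = validateCluster(cluster);
    if (problems.length > 0) {
      throw new Error(`Invalid cluster ${cluster.winnerId}: ${problems.join("; ")}`);
    }
  }
  return { jobType: "merge", config: { clusters } };
}
```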

Merge Cascade

For each cluster, the merge handler processes loser topics in batches of 10 (MERGE_BATCH_SIZE). Here is exactly what happens for each loser topic being merged into the winner:

Step 1: Transfer Org Topics

For each org that has adopted the loser topic:

  • If the org does NOT have the winner: The org topic's global_topic_id is repointed from the loser to the winner. The org retains its link seamlessly.
  • If the org ALREADY has the winner: The system consolidates:
    1. Transfers segment_activations from the loser's org topic to the winner's org topic
    2. Transfers org_topic_history entries from the loser's org topic to the winner's org topic
    3. Merges metadata: external_id and source fill gaps via COALESCE (the winner's value is kept when present, otherwise the loser's is used), and performance_score keeps the higher of the two values
    4. Deletes the loser's org topic record
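
The repoint-versus-consolidate decision in Step 1 can be sketched as a planning function over in-memory records. This is an illustration under stated assumptions: the field names are camelCase stand-ins for the columns named above, the winner's value is assumed to take precedence when filling gaps, and the transfer of segment_activations and org_topic_history rows is left out.

```typescript
interface OrgTopic {
  id: string;
  orgId: string;
  globalTopicId: string;
  externalId: string | null;
  source: string | null;
  performanceScore: number | null;
}

type TransferAction =
  | { kind: "repoint"; orgTopicId: string }
  | { kind: "consolidate"; keepOrgTopicId: string; deleteOrgTopicId: string; merged: OrgTopic };

// Keep the higher score, treating null as "no value".
function higherScore(a: number | null, b: number | null): number | null {
  if (a === null) return b;
  if (b === null) return a;
  return Math.max(a, b);
}

function planOrgTransfers(orgTopics: OrgTopic[], loserId: string, winnerId: string): TransferAction[] {
  // Index the orgs that already have the winner.
  const winnerByOrg = new Map<string, OrgTopic>();
  for (const orgTopic of orgTopics) {
    if (orgTopic.globalTopicId === winnerId) winnerByOrg.set(orgTopic.orgId, orgTopic);
  }
  const actions: TransferAction[] = [];
  for (const loserOrgTopic of orgTopics) {
    if (loserOrgTopic.globalTopicId !== loserId) continue;
    const winnerOrgTopic = winnerByOrg.get(loserOrgTopic.orgId);
    if (!winnerOrgTopic) {
      // Org only has the loser: repoint its link to the winner.
      actions.push({ kind: "repoint", orgTopicId: loserOrgTopic.id });
      continue;
    }
    // Org has both: fill gaps on the winner's record, then drop the loser's record.
    actions.push({
      kind: "consolidate",
      keepOrgTopicId: winnerOrgTopic.id,
      deleteOrgTopicId: loserOrgTopic.id,
      merged: {
        ...winnerOrgTopic,
        externalId: winnerOrgTopic.externalId ?? loserOrgTopic.externalId,
        source: winnerOrgTopic.source ?? loserOrgTopic.source,
        performanceScore: higherScore(winnerOrgTopic.performanceScore, loserOrgTopic.performanceScore),
      },
    });
  }
  return actions;
}
```

Separating the plan from its execution makes it easy to preview how many org topics a merge would repoint or delete before anything is written.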

Step 2: Soft-Delete the Global Topic

The loser's global topic gets merged_into set to the winner's ID:

UPDATE topics SET merged_into = $winnerId, updated_at = NOW()
WHERE id = $loserId

The loser topic is not deleted. It remains in the database with merged_into pointing to the winner, allowing rollback and historical reference.

Step 3: Record History

A history entry is created with changeType = 'merged' and metadata = { winnerId }, linking to the merge job's batchId for potential rollback.

Soft-Delete Pattern

All standard queries filter merged topics:

WHERE merged_into IS NULL

This is applied consistently across:

  • Topic browser queries
  • Catalog search
  • Statistics and metrics
  • Dedup sweep neighbor searches
  • Export queries

Note: Merged topics still consume storage space and appear in raw database queries. They are invisible to the application but retained for audit trail and rollback purposes.
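
In application code, the same filter and the merged_into pointer might be used as in the sketch below. Both helpers are hypothetical; in particular, following a chain of merged_into pointers to the final winner is an assumption about how historical references could be resolved, not documented behavior.

```typescript
interface TopicRecord {
  id: string;
  topicName: string;
  mergedInto: string | null;
}

// In-memory equivalent of WHERE merged_into IS NULL.
function visibleTopics(topics: TopicRecord[]): TopicRecord[] {
  return topics.filter((topic) => topic.mergedInto === null);
}

// Follow merged_into pointers until reaching a topic that has not been merged.
// Unknown IDs are returned unchanged; a cycle is reported as an error.
function resolveWinnerId(id: string, byId: Map<string, TopicRecord>): string {
  const visited = new Set<string>();
  let current = id;
  while (true) {
    if (visited.has(current)) {
      throw new Error(`Cycle in merged_into chain at ${current}`);
    }
    visited.add(current);
    const next = byId.get(current)?.mergedInto ?? null;
    if (next === null) return current;
    current = next;
  }
}
```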

Rollback

If a merge produced undesirable results, you can roll it back.

Rolling Back a Merge Job

  1. Navigate to Admin > Background Jobs
  2. Find the completed merge job
  3. Click Rollback

API: POST /api/admin/jobs/[mergeJobId]/rollback

This creates a new reclassify job with config.rollbackOf set to the original merge job's ID.

How Rollback Works

The rollback handler:

  1. Queries topic_history for all entries with batch_id = <mergeJobId> and change_type = 'merged'
  2. For each entry, reads the previous_values snapshot
  3. Restores the loser topic by:
    • Clearing merged_into
    • Restoring all classification fields from the snapshot
    • Regenerating the embedding from restored values
    • Recording a rolled_back history entry

Rollback Limitations

Merge rollback restores the global topic's merged_into flag and classification data. However, org topic transfers are not automatically reversed. If org topics were consolidated (both the winner and loser existed in the same org), the deletion of the loser's org topic is permanent. In practice, this means:

  • Org topics that were repointed (org only had the loser) will remain pointed at the winner
  • Org topics that were deleted during consolidation cannot be automatically restored
  • Manual intervention may be needed to fully reverse complex merges

Rollback Processing

Rollback processes topics in batches of 50 (RECLASSIFY_BATCH_SIZE), using the same cron-driven chunk processing as other admin jobs. Each topic:

  1. Has its previous_values read from the history entry
  2. Gets a new embedding generated from the restored field values
  3. Is updated via updateGlobalTopic() with the restored data
  4. Gets a new history entry with changeType = 'rolled_back'
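
The per-topic steps above can be sketched as a pure function that plans the restore. This is a minimal illustration: the type and function names are hypothetical, the field list mirrors the embedding inputs described earlier in this guide, the separator used for the composite text is an assumption, and the hash embedding itself and the call to updateGlobalTopic() are not reproduced.

```typescript
interface ClassificationFields {
  topicName: string;
  parentCategory: string;
  taxonomyType: string;
  subcategory: string;
  segmentType: string;
  keywords: string[];
}

interface TopicRow extends ClassificationFields {
  id: string;
  mergedInto: string | null;
}

interface MergedHistoryEntry {
  topicId: string;
  batchId: string;
  changeType: "merged";
  previousValues: ClassificationFields;
}

interface RollbackPlan {
  restored: TopicRow;
  embeddingText: string;
  history: { topicId: string; changeType: "rolled_back"; batchId: string };
}

// Composite text that the embedding would be regenerated from (separator assumed).
function compositeEmbeddingText(fields: ClassificationFields): string {
  return [
    fields.topicName,
    fields.parentCategory,
    fields.taxonomyType,
    fields.subcategory,
    fields.segmentType,
    fields.keywords.join(", "),
  ].join(" | ");
}

function planRollback(topic: TopicRow, entry: MergedHistoryEntry, rollbackJobId: string): RollbackPlan {
  if (entry.topicId !== topic.id) {
    throw new Error(`History entry ${entry.topicId} does not belong to topic ${topic.id}`);
  }
  // Restore the snapshot and clear the soft-delete pointer.
  const restored: TopicRow = { ...topic, ...entry.previousValues, mergedInto: null };
  return {
    restored,
    embeddingText: compositeEmbeddingText(entry.previousValues),
    // Assumption: the new history entry is tagged with the rollback job's ID.
    history: { topicId: topic.id, changeType: "rolled_back", batchId: rollbackJobId },
  };
}
```

Note that this only covers the global topic; as described under Rollback Limitations, org topic transfers are not reversed.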

Manual Topic Search (for Merge)

The search endpoint supports finding specific topics for manual merge operations:

API: GET /api/admin/topics/search?q=<term>&limit=25

Returns lightweight results with:

  • id, topicName, parentCategory, taxonomyType, segmentType, subcategory

Minimum search term length is 2 characters. Results are filtered to exclude merged and archived topics.
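
A small client-side helper can enforce the minimum term length before calling the endpoint. The function name is hypothetical, and trimming whitespace before measuring the length is an assumption; the server's exact validation is not described here.

```typescript
// Build the search URL, rejecting terms shorter than 2 characters.
function buildTopicSearchUrl(term: string, limit: number = 25): string {
  const q = term.trim();
  if (q.length < 2) {
    throw new Error("Search term must be at least 2 characters");
  }
  // URLSearchParams handles escaping of spaces and special characters.
  const params = new URLSearchParams({ q, limit: String(limit) });
  return `/api/admin/topics/search?${params.toString()}`;
}
```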

Best Practices

  1. Always run a dedup sweep after bulk reclassification -- reclassification can change embeddings enough to create new near-duplicates
  2. Review clusters with similarity between 0.95 and 0.97 carefully -- these are on the boundary and may represent legitimately distinct topics
  3. Check org adoption counts before merging -- merging a topic adopted by many orgs has wider impact
  4. Prefer the topic with more org adoptions as the winner -- the system does this by default, but verify it makes sense semantically
  5. Document your merge decisions -- if you remove clusters from a sweep, note why for future reference

Next Steps