Overview
Duplicate detection is not a single operation. Exact duplicates — identical rows — are trivial to find. Near-duplicates — same entity, slightly different representation — require fuzzy matching, blocking strategies, and confidence thresholds. Entity-level duplicates — same real-world object represented under different identifiers — require domain knowledge and probabilistic record linkage.
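As a rough illustration of the gap between the first two categories, the sketch below compares an exact equality check with a simple similarity score from Python's standard-library `difflib`. The `normalize` helper, the `similarity` function, and the 0.8 threshold are assumptions made for this example, not settings the prompt prescribes.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off for illustration; real thresholds need calibration on labelled pairs.
THRESHOLD = 0.8


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def is_exact_duplicate(a: str, b: str) -> bool:
    """Exact matching: the two values must be identical."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Fuzzy matching: a score between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Treat a pair as a near-duplicate when its score reaches the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for left, right in [("John Smith", "J. Smith"), ("John Smith", "Mary Jones")]:
        print(left, "|", right,
              "| exact:", is_exact_duplicate(left, right),
              "| near:", is_near_duplicate(left, right))
```

Exact matching misses the "John Smith" / "J. Smith" pair, while the similarity score flags it. Entity-level duplicates usually need more than string similarity, which is why they call for domain knowledge and record linkage.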
Most deduplication implementations stop at exact matching. The result is a dataset that appears clean but still contains the same customer under "John Smith" and "J. Smith", the same product under two slightly different SKUs, and the same transaction recorded twice with a one-second timestamp difference. These duplicates corrupt aggregations, inflate counts, and bias models.
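The transaction case can be sketched with a timestamp tolerance: two rows that agree on their other fields and fall within a small time window are flagged as likely duplicates. The field names (`id`, `account`, `amount`, `timestamp`) and the two-second tolerance are hypothetical choices for this example.

```python
from datetime import datetime, timedelta

# Hypothetical tolerance; the right window depends on how the source system writes timestamps.
TOLERANCE = timedelta(seconds=2)


def find_near_duplicate_transactions(transactions, tolerance=TOLERANCE):
    """Return (earlier_id, later_id) pairs that share account and amount
    and whose timestamps differ by no more than `tolerance`."""
    ordered = sorted(
        transactions,
        key=lambda t: (t["account"], t["amount"], t["timestamp"]),
    )
    pairs = []
    # After sorting, likely duplicates sit next to each other.
    for prev, curr in zip(ordered, ordered[1:]):
        same_key = (prev["account"], prev["amount"]) == (curr["account"], curr["amount"])
        if same_key and curr["timestamp"] - prev["timestamp"] <= tolerance:
            pairs.append((prev["id"], curr["id"]))
    return pairs


sample = [
    {"id": "t1", "account": "A", "amount": 25.00, "timestamp": datetime(2024, 5, 1, 12, 0, 0)},
    {"id": "t2", "account": "A", "amount": 25.00, "timestamp": datetime(2024, 5, 1, 12, 0, 1)},
    {"id": "t3", "account": "A", "amount": 25.00, "timestamp": datetime(2024, 5, 1, 18, 30, 0)},
    {"id": "t4", "account": "B", "amount": 25.00, "timestamp": datetime(2024, 5, 1, 12, 0, 1)},
]

flagged = find_near_duplicate_transactions(sample)
```

An exact-match rule would treat `t1` and `t2` as distinct rows because their timestamps differ. A tolerance window catches them, at the cost of possibly flagging legitimate repeat purchases, so flagged pairs are best reviewed or scored rather than deleted outright.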
The Duplicate Detection & Deduplication Prompt generates a complete deduplication system: exact match rules, fuzzy matching configuration, blocking strategy for scale, confidence scoring, resolution rules, and a survivorship policy that determines which record to keep.
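To make these components concrete, here is a minimal sketch of how blocking, confidence scoring, a survivorship rule, and an audit trail might fit together. It is not the output of the prompt. The record fields, the choice of `zip` as the blocking key, the 0.7 threshold, and the "newest record wins, fill blanks from the older one" policy are all assumptions for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical threshold; calibrate against labelled matches in practice.
MATCH_THRESHOLD = 0.7

records = [
    {"id": 1, "name": "John Smith", "zip": "10001", "email": "john@example.com", "updated": "2024-01-05"},
    {"id": 2, "name": "J. Smith", "zip": "10001", "email": "", "updated": "2024-03-10"},
    {"id": 3, "name": "Mary Jones", "zip": "10001", "email": "mary@example.com", "updated": "2024-02-01"},
    {"id": 4, "name": "John Smith", "zip": "94105", "email": "js@example.com", "updated": "2024-02-20"},
]


def candidate_pairs(rows):
    """Blocking: only compare records that share a blocking key (here, zip)."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row["zip"]].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)


def confidence(a, b):
    """Confidence score: name similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def merge(a, b):
    """Survivorship: keep the most recently updated record,
    filling its empty fields from the older one."""
    newest, older = sorted((a, b), key=lambda r: r["updated"], reverse=True)
    merged = dict(newest)
    for field, value in older.items():
        if not merged[field]:
            merged[field] = value
    return merged


audit_log = []   # one entry per deduplication decision
survivors = []
for a, b in candidate_pairs(records):
    score = confidence(a, b)
    if score >= MATCH_THRESHOLD:
        kept = merge(a, b)
        survivors.append(kept)
        audit_log.append({"pair": (a["id"], b["id"]), "score": round(score, 3), "kept": kept["id"]})
```

Records 1 and 2 share a block and score above the threshold, so record 2 survives as the newer row and inherits the email address from record 1. Record 4 carries the same name but sits in a different block, so it is never compared. That is the trade-off of blocking: far fewer comparisons in exchange for matches that a poorly chosen key can miss.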
What you get:
- Exact match rule specification
- Fuzzy matching configuration by field type
- Blocking strategy for computational efficiency
- Confidence scoring and threshold calibration
- Survivorship policy for record retention
- Audit trail for all deduplication decisions
Built for: data engineers, analysts, and data scientists handling entity resolution, customer deduplication, and record linkage at any scale.