Same contact 4 times with different emails. Companies linked to the wrong domain. Lifecycle stages set in 2022 that nobody updates. Reports your sales meetings argue about. Here's the 6-step framework we use to clean a 4-year-old SMB CRM without losing a single deal record.
Get a free CRM data auditAfter auditing 50+ small-business CRMs across HubSpot, Salesforce, Pipedrive, and Zoho, the same shape keeps appearing on the wall: roughly 30% of contacts are duplicates, 20% are unreachable, lifecycle stage has lost its meaning, and somewhere around year three "the data is wrong" becomes a running joke in the Monday standup. The vendors aren't to blame — HubSpot's built-in duplicate management and Salesforce's duplicate management features both work as advertised. The problem is structural: data decays, reps move on, integrations get bolted on, marketing imports a new list, and four years later you have 18,000 contacts that nobody trusts.
The fix is not "everyone clean up your data this Friday." That has never worked at any company. The fix is a small set of automated cleanup rules — most of them written as SQL with the PostgreSQL pg_trgm extension for trigram similarity — that run continuously, surface real duplicates, preserve every deal and ticket relationship, and stop new junk from entering. Below is the exact 6-step framework, with a stripped-down version of the matching query we actually run. The same continuous-rule mindset applied to revenue data lives in our breakdown of Shopify, Stripe, and QuickBooks reconciliation.
Measure duplicates, blanks, invalid emails, conflicting fields.
Without numbers, every cleanup proposal is a guess.
Clean phones, domains, names, emails before matching.
Most "duplicates" are just one record with a typo.
Exact match on email or phone, then fuzzy match on company name.
Match too aggressively and you merge real customers.
Preserve every deal, ticket, note, and email before merging.
The history is the value of the CRM.
Connect every person to the right account using domain-derived rules.
Stop the "three contacts at three companies" bug.
Validation rules and import gates.
The cleanup has to hold, or you'll be back here in 12 months.
Postgres + pg_trgm fuzzy matching to find duplicate contacts that exact email comparison misses.
SQL · fuzzy duplicate detection-- Find duplicate HubSpot contacts using exact + fuzzy matching
-- Requires the pg_trgm extension for similarity()
WITH normalized AS (
SELECT id,
lower(trim(email)) AS email_norm,
regexp_replace(phone, '[^0-9]', '', 'g') AS phone_norm,
lower(trim(company)) AS company_norm,
created_at
FROM hubspot.contacts
WHERE email IS NOT NULL
)
SELECT a.id AS keep_id,
b.id AS merge_id,
a.email_norm,
similarity(a.company_norm, b.company_norm) AS company_sim,
a.created_at AS kept_since
FROM normalized a
JOIN normalized b
ON a.id < b.id AND (
a.email_norm = b.email_norm
OR (a.phone_norm = b.phone_norm
AND a.phone_norm <> ''
AND similarity(a.company_norm, b.company_norm) > 0.85)
)
WHERE a.created_at <= b.created_at -- keep oldest record
ORDER BY a.created_at;
We see this become urgent if any of these sound familiar:
The framework above isn't theoretical — it's a checklist. Each gate takes one or two days to install, and once installed it runs without you. Under the hood, the matching uses well-studied string-similarity measures like Levenshtein edit distance and Jaro-Winkler similarity — the same algorithms used by every major data-quality tool. The point isn't to make your CRM "perfect" — perfect data doesn't exist. The point is to make the duplicate rate, the orphan rate, and the unreachable rate knowable, monitored, and trending in the right direction.
Reps stop arguing about whose version of the contact is real. Forecasts stop swinging by 30% depending on who pulled the report. The CRM goes from "the place where data goes to die" back to what it was supposed to be: the single source of truth for your pipeline. And the access controls and audit trail you build along the way line up cleanly with what underwriters expect in a cyber insurance security assessment.
We'll profile your contact, company, deal, and ticket data — measure the duplicate rate, the orphan rate, and the unreachable rate — and send you a clear report within 48 hours showing exactly which of the 6 gates above are missing.
Review my CRM data (free)