Why is my CRM full of duplicate contacts and how did it get this bad?

Every SMB CRM gets dirty. It is not a flaw of HubSpot or Salesforce — it is a flaw of the universe: data decays, reps move on, integrations get bolted on, marketing imports a new list, and four years later you have 18,000 contacts where roughly 30% are duplicates, 20% are unreachable, and the lifecycle stage is meaningless. By that point, 'the data is wrong' becomes a running joke in your Monday standup. The fix is not 'everyone clean up your data this Friday' — that has never worked at any company.

How do you clean a 4-year-old CRM without losing deal history?

We use a 6-step framework: profile the mess, normalize contact data, match duplicates carefully, protect deal and ticket history, rebuild company links, and stop new junk from entering. The most important step is preserving every deal, ticket, note, and email before merging — we reassign deal.contact_id to the survivor record before archiving, so no deal ever loses its contact link, even when the original contact disappears. The history is the value of the CRM.

How does fuzzy matching avoid merging real customers by accident?

We normalize first, because most apparent duplicates are just one record with a typo: lowercase every email, strip punctuation from every phone, and drop 'www.' from every company domain. Then we match exactly on email or phone and fuzzily on company name. A pair becomes a merge candidate when it shares the same normalized email, or the same phone plus a company-name similarity above 0.85 using Postgres pg_trgm. Anything below that threshold goes to a human review queue rather than being auto-merged.

How do you stop new duplicates from entering after the cleanup?

Validation rules and import gates. Reject imports where the email domain is gmail or yahoo and the company_name is blank. Flag form-fill submissions with single-character first names. The cleanup has to hold, or you will be back here in 12 months. Every cleanup rule keeps running after we leave, so the CRM stays clean instead of slowly rotting again. The duplicate rate, the orphan rate, and the unreachable rate become knowable, monitored, and trending in the right direction.

CRM · Data Cleanup · HubSpot · Salesforce

Your CRM Is Full of Duplicates, Bad Leads, and Wrong Customers (And Your Reps Stopped Trusting It)

Same contact 4 times with different emails. Companies linked to the wrong domain. Lifecycle stages set in 2022 that nobody updates. Reports your sales meetings argue about. Here's the 6-step framework we use to clean a 4-year-old SMB CRM without losing a single deal record.

Get a free CRM data audit

No commitment. No CRM admin access required. Clear report in 48 hours.

After auditing 50+ small-business CRMs across HubSpot, Salesforce, Pipedrive, and Zoho, we keep seeing the same pattern: roughly 30% of contacts are duplicates, 20% are unreachable, lifecycle stage has lost its meaning, and somewhere around year three "the data is wrong" becomes a running joke in the Monday standup. The vendors aren't to blame: the built-in duplicate management features in HubSpot and Salesforce both work as advertised. The problem is structural. Data decays, reps move on, integrations get bolted on, marketing imports a new list, and four years later you have 18,000 contacts that nobody trusts.

The fix is not "everyone clean up your data this Friday." That has never worked at any company. The fix is a small set of automated cleanup rules — most of them written as SQL with the PostgreSQL pg_trgm extension for trigram similarity — that run continuously, surface real duplicates, preserve every deal and ticket relationship, and stop new junk from entering. Below is the exact 6-step framework, with a stripped-down version of the matching query we actually run. The same continuous-rule mindset applied to revenue data lives in our breakdown of Shopify, Stripe, and QuickBooks reconciliation.

Worked example: a digital agency with 60 employees and 18,000 HubSpot contacts. Reps don't trust lifecycle stage, deal owner, or company linkage. Forecasting meetings have devolved into arguments about "which version of the contact is right." Each card below shows what one gate of the framework would do in this situation.

Profile the Mess

Measure duplicates, blanks, invalid emails, conflicting fields.

Without numbers, every cleanup proposal is a guess.

Example

"We count contacts where email is null, phone is null, or company_id links to multiple domains, and we count the records that share a phone number with another record. Result: '4,103 contacts have 2+ records sharing a phone'."
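The profiling pass can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `contacts` table with `email` and `phone` columns loaded into an in-memory SQLite database; the table layout, the sample rows, and the crude email shape check are all assumptions made for the example, not the production rules.

```python
import sqlite3

# Hypothetical miniature of a CRM export; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT, phone TEXT);
INSERT INTO contacts (id, email, phone) VALUES
  (1, 'ana@acme.com',   '555-0100'),
  (2, 'ANA@acme.com',   '555-0100'),
  (3, NULL,             '555-0199'),
  (4, 'bob@globex.com', NULL),
  (5, 'not-an-email',   '555-0100');
""")


def profile(conn):
    """Return headline counts that size the cleanup before any record changes."""

    def one(sql):
        # Run a query that yields a single number and return that number.
        return conn.execute(sql).fetchone()[0]

    return {
        "total": one("SELECT COUNT(*) FROM contacts"),
        "missing_email": one("SELECT COUNT(*) FROM contacts WHERE email IS NULL"),
        "missing_phone": one("SELECT COUNT(*) FROM contacts WHERE phone IS NULL"),
        # Crude shape check only (something@something.tld), not real validation.
        "invalid_email": one(
            "SELECT COUNT(*) FROM contacts "
            "WHERE email IS NOT NULL AND email NOT LIKE '%_@_%._%'"
        ),
        # Records whose phone number also appears on at least one other record.
        "sharing_a_phone": one(
            "SELECT COALESCE(SUM(n), 0) FROM ("
            "  SELECT COUNT(*) AS n FROM contacts"
            "  WHERE phone IS NOT NULL GROUP BY phone HAVING COUNT(*) > 1)"
        ),
    }


report = profile(conn)
print(report)
```

Running the same counts on a schedule is what turns "the data feels wrong" into a number that can be tracked from week to week.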

Normalize Contact Data

Clean phones, domains, names, emails before matching.

Most "duplicates" are just one record with a typo.

Example

"Lowercase every email. Strip punctuation from every phone. Drop 'www.' from every company domain. Now matching can actually work."
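A minimal sketch of those three normalizers follows. The function names are hypothetical, and stripping a URL scheme or path from the domain goes slightly beyond the rule quoted above; it is included as an assumption about what imported domain fields often contain.

```python
import re


def normalize_email(email):
    """Trim and lowercase an email; return None for blank input."""
    if not email or not email.strip():
        return None
    return email.strip().lower()


def normalize_phone(phone):
    """Keep digits only, so '(555) 010-0100' and '555.010.0100' compare equal."""
    digits = re.sub(r"\D", "", phone or "")
    return digits or None


def normalize_domain(domain):
    """Lowercase a company domain and drop a leading 'www.'.

    Assumption beyond the quoted rule: also drop a URL scheme and any path.
    """
    if not domain or not domain.strip():
        return None
    d = domain.strip().lower()
    d = re.sub(r"^[a-z]+://", "", d)  # drop 'https://' and similar
    d = d.split("/", 1)[0]            # drop '/about' and similar
    if d.startswith("www."):
        d = d[len("www."):]
    return d or None


print(normalize_email("  Ana@Acme.COM "))
print(normalize_phone("(555) 010-0100"))
print(normalize_domain("https://www.Acme.com/about"))
```

Returning None for blanks, rather than an empty string, keeps two empty fields from ever counting as a match in the later steps.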

Match Duplicates Carefully

Exact match on email or phone, then fuzzy match on company name.

Match too aggressively and you merge real customers.

Example

"Merge candidates: same normalized email, OR same phone with company-name similarity above 0.85 (Postgres pg_trgm). Anything below threshold goes to a human review queue."
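To make the 0.85 threshold concrete, here is a rough Python approximation of trigram similarity in the style of pg_trgm: each word is padded with two leading spaces and one trailing space, split into three-character windows, and the score is shared trigrams divided by total distinct trigrams. It is a sketch for intuition, not a replacement for the extension, and the `route` helper reflects one assumed reading of the rule above, in which a shared phone with a low company score goes to review.

```python
import re


def trigrams(text):
    """Distinct 3-character windows per word, padded as pg_trgm describes."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def route(same_email, same_phone, company_a, company_b, threshold=0.85):
    """Classify a pair of already-normalized records (hypothetical helper)."""
    if same_email:
        return "merge_candidate"
    if same_phone and similarity(company_a, company_b) > threshold:
        return "merge_candidate"
    if same_phone:
        return "human_review"  # shared phone, but company names disagree
    return "no_match"


print(similarity("Acme Inc", "Acme Inc."))
print(round(similarity("Acme Inc", "Acme Incorporated"), 3))
print(route(False, True, "Acme Inc", "Acme Incorporated"))
```

In this sketch "Acme Inc" and "Acme Inc." score identically because punctuation is ignored, while "Acme Inc" and "Acme Incorporated" score well under 0.85. That second pair is the kind of case where a human, not a rule, should decide.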

Protect Deal and Ticket History

Preserve every deal, ticket, note, and email before merging.

The history is the value of the CRM.

Example

"Reassign deal.contact_id to the survivor record before archiving. No deal ever loses its contact link, even when the original contact disappears."
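The ordering matters: deals move first, the duplicate is archived second, and both happen in one transaction. The sketch below shows that shape on an in-memory SQLite database. The `contacts` and `deals` tables, the `archived` and `merged_into` columns, and the `merge_contact` helper are all hypothetical; tickets, notes, and emails would need the same treatment and are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
  id INTEGER PRIMARY KEY, email TEXT,
  archived INTEGER DEFAULT 0, merged_into INTEGER);
CREATE TABLE deals (id INTEGER PRIMARY KEY, contact_id INTEGER, amount REAL);
INSERT INTO contacts (id, email) VALUES (1, 'ana@acme.com'), (2, 'ana@acme.com');
INSERT INTO deals (id, contact_id, amount) VALUES (10, 1, 5000), (11, 2, 1200);
""")


def merge_contact(conn, survivor_id, duplicate_id):
    """Move deals to the survivor, then archive the duplicate, atomically."""
    if survivor_id == duplicate_id:
        raise ValueError("survivor and duplicate must be different records")
    with conn:  # commits on success, rolls back if anything below raises
        conn.execute(
            "UPDATE deals SET contact_id = ? WHERE contact_id = ?",
            (survivor_id, duplicate_id),
        )
        conn.execute(
            "UPDATE contacts SET archived = 1, merged_into = ? WHERE id = ?",
            (survivor_id, duplicate_id),
        )
        # Safety check: no deal may point at an archived contact.
        stranded = conn.execute(
            "SELECT COUNT(*) FROM deals d "
            "JOIN contacts c ON c.id = d.contact_id WHERE c.archived = 1"
        ).fetchone()[0]
        if stranded:
            raise RuntimeError(f"{stranded} deal(s) still linked to archived contacts")


merge_contact(conn, survivor_id=1, duplicate_id=2)
deal_links = conn.execute("SELECT id, contact_id FROM deals ORDER BY id").fetchall()
print(deal_links)
```

Archiving with a `merged_into` pointer instead of deleting keeps an audit trail, so a wrong merge can be traced and undone.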

Rebuild Company Links

Connect every person to the right account using domain-derived rules.

Stop the "three contacts at three companies" bug.

Example

"Derive company_domain from email. Join to canonical companies.domain. Re-link any contact whose current company doesn't match its email domain."
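A minimal sketch of that re-linking rule, assuming contacts arrive as dictionaries and canonical companies are available as a domain-to-id lookup. Both shapes are hypothetical. Skipping free-mail domains is also an assumption added here, since a gmail.com address says nothing about the employer.

```python
FREE_MAIL = {"gmail.com", "yahoo.com"}  # assumption: never link on these


def email_domain(email):
    """Return the lowercased part after the last '@', or None."""
    if not email or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].strip().lower() or None


def relink(contacts, companies_by_domain):
    """List (contact_id, old_company_id, new_company_id) for mismatched links."""
    changes = []
    for contact in contacts:
        domain = email_domain(contact["email"])
        if domain is None or domain in FREE_MAIL:
            continue
        target = companies_by_domain.get(domain)
        # Unknown domains are left alone rather than guessed at.
        if target is not None and target != contact["company_id"]:
            changes.append((contact["id"], contact["company_id"], target))
    return changes


companies_by_domain = {"acme.com": 100, "globex.com": 200}
contacts = [
    {"id": 1, "email": "ana@acme.com", "company_id": 100},    # already correct
    {"id": 2, "email": "Bob@Acme.com", "company_id": 200},    # wrong company
    {"id": 3, "email": "cy@gmail.com", "company_id": 200},    # free mail, skipped
    {"id": 4, "email": "dee@unknown.io", "company_id": None},  # no canonical match
]
changes = relink(contacts, companies_by_domain)
print(changes)
```

Producing a list of proposed changes, instead of writing them directly, leaves room for a review step before anything is re-linked.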

Stop New Junk From Entering

Validation rules and import gates.

The cleanup has to hold, or you'll be back here in 12 months.

Example

"Reject imports where email_domain is gmail/yahoo and company_name is blank. Flag form-fill submissions with single-character first names."
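Both rules can be sketched as a single row-level check. The field names (`email`, `company_name`, `first_name`), the exact free-mail domain list, and the decision to run import and form-fill rules through one function are assumptions for illustration; in a real CRM these would typically live in validation rules or an import pre-processor.

```python
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com"}  # assumed spelling of "gmail/yahoo"


def check_row(row):
    """Return (decision, reason), where decision is 'reject', 'flag', or 'accept'."""
    email = (row.get("email") or "").strip().lower()
    company = (row.get("company_name") or "").strip()
    first_name = (row.get("first_name") or "").strip()
    domain = email.rsplit("@", 1)[1] if "@" in email else ""

    if domain in FREE_MAIL_DOMAINS and not company:
        return ("reject", "free-mail address with blank company_name")
    if len(first_name) == 1:
        return ("flag", "single-character first name")
    return ("accept", "")


rows = [
    {"email": "ana@acme.com", "company_name": "Acme", "first_name": "Ana"},
    {"email": "lead@gmail.com", "company_name": "", "first_name": "Sam"},
    {"email": "x@globex.com", "company_name": "Globex", "first_name": "X"},
]
decisions = [check_row(r)[0] for r in rows]
print(decisions)
```

Returning a reason alongside the decision lets the rejected rows go back to whoever ran the import with an explanation attached.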

What this looks like in practice

Postgres + pg_trgm fuzzy matching to find duplicate contacts that exact email comparison misses.

SQL · fuzzy duplicate detection

-- Find duplicate HubSpot contacts using exact + fuzzy matching
-- Requires the pg_trgm extension for similarity()
WITH normalized AS (
  SELECT id,
         lower(trim(email))                            AS email_norm,
         regexp_replace(phone, '[^0-9]', '', 'g')      AS phone_norm,
         lower(trim(company))                          AS company_norm,
         created_at
  FROM hubspot.contacts
  WHERE email IS NOT NULL
)
SELECT a.id AS keep_id,
       b.id AS merge_id,
       a.email_norm,
       similarity(a.company_norm, b.company_norm) AS company_sim,
       a.created_at AS kept_since
FROM normalized a
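JOIN normalized b
  -- Order each pair by (created_at, id) so the older record is always "a".
  -- Pairing on a.id < b.id and then filtering on created_at would silently
  -- drop pairs where the lower id happens to be the newer record.
  ON (a.created_at, a.id) < (b.created_at, b.id) AND (
       a.email_norm = b.email_norm
    OR (a.phone_norm = b.phone_norm
        AND a.phone_norm <> ''
        AND similarity(a.company_norm, b.company_norm) > 0.85)
  )
ORDER BY a.created_at;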

The pattern we keep seeing: every duplicate is found by a deterministic rule, not a hunch. Every merge preserves deal and ticket history. Every cleanup rule keeps running after we leave, so the CRM stays clean instead of slowly rotting again.

Deterministic Matching

History Preserved

Rules That Last

What tells us it's time to step in

We see this become urgent if any of these sound familiar:

Your reports show different totals depending on who runs them.
Sales meetings have devolved into arguments about which contact record is "real".
You haven't trusted lifecycle stage or deal owner in months.
An import or migration broke something nobody has fully traced — the exact pattern our vendor data migration framework exists to prevent.
You'd happily pay to have your CRM make sense by Monday morning.

The framework above isn't theoretical; it's a checklist. Each gate takes one or two days to install, and once installed it runs without you. Under the hood, the fuzzy step uses trigram similarity from pg_trgm, which belongs to the same family of well-studied string-similarity measures as Levenshtein edit distance and Jaro-Winkler similarity that data-quality tools commonly rely on. The point isn't to make your CRM "perfect", because perfect data doesn't exist. The point is to make the duplicate rate, the orphan rate, and the unreachable rate knowable, monitored, and trending in the right direction.

Reps stop arguing about whose version of the contact is real. Forecasts stop swinging by 30% depending on who pulled the report. The CRM goes from "the place where data goes to die" back to what it was supposed to be: the single source of truth for your pipeline. And the access controls and audit trail you build along the way line up cleanly with what underwriters expect in a cyber insurance security assessment.

Get a free 30-minute review of your CRM data

We'll profile your contact, company, deal, and ticket data — measure the duplicate rate, the orphan rate, and the unreachable rate — and send you a clear report within 48 hours showing exactly which of the 6 gates above are missing.

Review my CRM data (free)

No CRM admin access required. No commitment. No pressure.
