What is CRM data normalization?

CRM data normalization is the process of translating disparate data models from different CRM platforms (Salesforce, HubSpot, Pipedrive, etc.) into a single, canonical JSON schema. It abstracts away provider-specific endpoints, field naming conventions, and API behaviors so developers can interact with one standard data model.

What are the biggest differences between Salesforce and HubSpot data models?

Salesforce uses PascalCase fields, a separate Lead/Contact object split, SOQL for filtering, and a `__c` suffix for custom fields. HubSpot uses a flat `properties` object, no separate Lead entity (using a Lifecycle Stage property instead), `filterGroups` for search, and stores custom fields alongside standard properties.

Why do rigid unified APIs fail with custom CRM fields?

Rigid unified APIs rely on static 1-to-1 key-value mapping that only covers standard fields. When a customer uses custom fields or objects unique to their CRM instance, the rigid abstraction drops that data, forcing engineers to build custom point-to-point integrations anyway.

How do you handle custom fields when normalizing CRM data?

Use a multi-level override system that allows mappings to be customized at platform, environment, and individual account levels. Each level deep-merges on top of the previous one, so customers can add their own custom field mappings without requiring code changes or deployments.

How does JSONata help with API integrations?

JSONata is a functional query and transformation language for JSON that acts as a universal transformation engine. It lets you map complex JSON responses and queries declaratively without writing hardcoded execution logic, making integration behavior a data operation rather than a code deployment.

What is the Best Way to Normalize Data Models Across Different CRMs?

CRM data normalization is the practice of translating disparate data structures from platforms like Salesforce, HubSpot, Pipedrive, and Close into a single, canonical schema your application can rely on. If you're a product manager evaluating how to ship integrations across multiple CRMs without drowning your engineering team, the short answer is: treat integration behavior as data configuration, not hardcoded logic. The same principle applies beyond CRMs - if you need to pull a unified list of users from every SaaS app your customers use without building each integration one by one, the normalization architecture is identical. This article breaks down exactly why that matters and how to do it.

Why CRM Data Normalization is a Nightmare

The problem starts with a deceptively simple question: "What is a Contact?"

In HubSpot, a Contact is a flat object. Data lives in a properties bag — properties.firstname, properties.lastname, properties.email. Timestamps are tracked via properties.hs_lastmodifieddate. HubSpot doesn't even have a separate Lead object — it uses a Lifecycle Stage property on the Contact to track progression from lead to customer.

Salesforce takes a fundamentally different approach. A Contact is a PascalCase object (FirstName, LastName, Email) that sits inside a sprawling web of standard and custom object relationships. Salesforce does have a separate Lead object, and when a Lead converts to a Contact, the original Lead record is destroyed. Filtering uses SOQL (a SQL dialect), modification tracking is via LastModifiedDate, and custom fields are identified by a __c suffix.

These aren't cosmetic differences. They represent fundamentally different opinions about how to model customer relationships. And every CRM in the market has its own opinion.

The financial stakes are real. Gartner estimates poor data quality costs the average enterprise $12.9 to $15 million annually. Research suggests duplication rates between 10-30% are normal for companies without active data quality programs, and integrations that don't normalize and deduplicate correctly are a primary driver of that problem. Plauti's analysis of 12 billion Salesforce records found 45% were duplicates across organizations, and that rate jumps to 80% for records created via API integrations.

Meanwhile, the integration surface area keeps expanding. Companies used an average of 106 SaaS applications in 2024, with large enterprises running up to 131 different apps. Your product must integrate with an increasingly fragmented software landscape, and your CRM — supposed to be your system of record — corrodes when integration data flows in without proper normalization.

Understanding why schema normalization is the hardest problem in SaaS integrations is the first step. The second is architecting a system that actually solves it.

Beyond CRMs: Pulling Unified User Data from Every SaaS App

The normalization problem doesn't stop at CRMs. If you're building a B2B product and need to pull a list of users from every SaaS app your customers use - without building each integration individually - you're facing the same structural challenge at a much wider scale.

Think about what "user data" looks like across a typical customer's stack. Their identity provider (Okta, Azure AD) has users with roles and group memberships. Their HRIS (Workday, BambooHR) has employees with departments, managers, and employment status. Their CRM has contacts and account owners. Their ticketing system (Jira, Zendesk) has agents. Every one of these systems stores people records in its own format, with its own ID scheme, and its own API conventions.

The unified API approach described in this article applies directly to this problem. A canonical user schema standardizes the core entities - Users, Groups, Roles, Licenses, and Activities - across all these tools. Per-integration mappings translate each app's native format into that canonical shape. The same configuration-driven engine handles the API differences (auth, pagination, rate limits) without integration-specific code.

The real payoff is operational. With unified user data across SaaS tools, you can programmatically manage onboarding and offboarding across all connected apps, synchronize role-based access control by mapping application permissions to a standard roles model, audit license utilization to identify unused seats, and ingest activity logs into a single dashboard for security monitoring. All from one API surface, regardless of how many SaaS apps are connected underneath.

But pulling user records from 10 different apps into one list creates a new problem: deduplication. The same person exists as john.doe@acme.com in Okta, John Doe in Salesforce, and employee ID EMP-4472 in Workday. Matching these records into a single canonical identity requires more than field mapping - it requires matching logic, normalization rules, and conflict resolution. We cover exactly how to do that in the deduplication appendix below.

The Traditional Approaches (And Why They Break)

When faced with schema normalization, engineering teams usually default to one of two flawed architectures.

Point-to-Point Integrations: The Spaghetti Factory

The most common first attempt is building direct integrations. Your team writes a Salesforce adapter, then a HubSpot adapter, then Pipedrive, then Zoho. Each has its own code path:

// This is how most teams start. It doesn't end well.
if (provider === 'hubspot') {
  contact.firstName = data.properties.firstname;
  contact.email = data.properties.email;
} else if (provider === 'salesforce') {
  contact.firstName = data.FirstName;
  contact.email = data.Email;
} else if (provider === 'pipedrive') {
  contact.firstName = data.first_name;
  contact.email = data.email[0].value;
}

This works for two CRMs. By the fifth, you're maintaining a growing tangle of conditional logic, each branch with its own pagination strategy, auth flow, rate-limit handling, and undocumented edge cases. Every time a vendor deprecates an endpoint, alters a rate limit, or changes a pagination cursor, you have to write code, review it, and deploy it.

The cost isn't trivial. Adding third-party integrations and custom features to a SaaS platform typically pushes development costs into the $50,000 to $150,000+ range per integration when you account for initial build, testing, and ongoing maintenance. You are essentially paying top-tier engineers to read terrible vendor documentation.

Rigid Unified APIs: The Lowest-Common-Denominator Trap

To escape the maintenance burden of custom code, teams turn to standard unified APIs. The promise is alluring: write to one common data model, and the provider handles the translation.

The problem: most of these platforms implement what amounts to a rigid key-value mapping layer. They map vendor_field_A to unified_field_B and call it done. This works for the 20% of fields that are standard across CRMs. But what about the other 80%?

Every serious Salesforce deployment has custom fields (Lead_Score__c, Region__c, Contract_Value__c). Every HubSpot instance has custom properties. A unified API that strips away custom fields to maintain a clean common model is losing the data your customers care about most.

This is what we call the implementation gap — the distance between what a unified API's common model covers and what your customers actually need. If closing that gap requires filing support tickets and waiting for the provider to update their mappings, you've traded one bottleneck (building integrations) for another (waiting on a vendor). Your customers complain that the integration is missing critical data, and you end up building custom point-to-point integrations anyway.

What is the Best Way to Normalize Data Models Across Different CRMs?

The best approach is a declarative, configuration-driven architecture where:

A canonical schema defines what your application sees (the unified interface)
Per-integration mappings define how each CRM's native format translates to and from that schema (the transformation layer)
A generic execution engine processes both without any integration-specific code

This means adding a new CRM integration or modifying how a field maps is a data operation — not a code deployment. This approach decouples the integration logic from your application runtime, allowing you to standardize the core entities while maintaining the flexibility to handle custom fields dynamically. CRM data mapping and schema alignment stop being one-off engineering projects and become a repeatable configuration exercise. Here's how to implement each piece.

1. Standardize the Core CRM Trinity

Before writing a single mapping, you need a well-designed canonical data model. For CRMs, this centers on what we call the Core Trinity: Accounts, Contacts, and Opportunities.

Unified Entity	Salesforce	HubSpot	Pipedrive	Close
Account	Account	Company	Organization	Lead (organization-level)
Contact	Contact (+ Lead)	Contact	Person	Contact
Opportunity	Opportunity	Deal	Deal	Opportunity

Your canonical schema should model the relationships between these entities, not just the fields:

Account: The primary container for customer data (the company).
Contact: The people who work at that Account.
Opportunity: The active deals being negotiated with that Account.
Lead: Unqualified prospects that exist independently at the top of the funnel.

When qualified, a Lead is typically converted into a Contact, an Account, and an Opportunity. Then comes the pipeline model: an Opportunity belongs to a specific Pipeline, and its status is defined by a Stage within that pipeline. Users generate Engagements, Notes, and Tasks that are appended to the relevant entity to create a historical timeline.

erDiagram
    Account ||--o{ Contact : employs
    Account ||--o{ Opportunity : has
    Opportunity }o--|| Pipeline : belongs_to
    Opportunity }o--|| Stage : current_status
    Lead ||--o| Contact : converts_to
    Lead ||--o| Account : converts_to
    Lead ||--o| Opportunity : converts_to
    User ||--o{ Engagement : creates
    User ||--o{ Note : creates
    User ||--o{ Task : creates

Getting this relational model right matters more than getting every field mapped from day one. If your canonical schema has sound entity relationships, you can always add fields later. If the relationships are wrong, every downstream consumer breaks. By standardizing this relationship graph, your application only needs to understand one data shape.

2. Handle Custom Fields and Objects Dynamically

Standardizing the core trinity is the easy part. The real challenge is handling the long tail of custom fields — and it's where most normalization strategies fail.

Salesforce identifies custom fields by the __c suffix. HubSpot stores them as additional keys in the properties object. Pipedrive uses numeric custom field IDs. You need a normalization strategy that handles all three patterns without losing data.

The worst thing you can do is ignore them. Custom fields contain the business-specific data your customers configured their CRM around — deal scoring, contract types, region codes, compliance flags. Dropping them means your integration is only half-useful. Learning how to handle custom fields and custom objects in Salesforce via API highlights why this flexibility is mandatory.

The solution is a multi-layered override system that lets mappings be customized at different levels without requiring your engineers to write custom code:

Platform level — the default mapping that works for 80% of use cases.
Environment level — overrides for specific customer deployments (e.g., a customer's Salesforce instance with a non-standard field layout).
Account level — overrides for individual connected accounts (e.g., one customer's Contract_Value__c field maps to a unified deal_value field).

flowchart TD
    A["Platform Base Mapping<br>(default for all customers)"] --> B["Environment Override<br>(customer-specific config)"]
    B --> C["Account Override<br>(per-connected-account config)"]
    C --> D["Final Resolved Mapping<br>(deep-merged result)"]

Each level deep-merges on top of the previous one. At runtime, the engine resolves the final mapping by combining all three layers:

// How overrides are applied at runtime
private mergeIntegrationMappingConfigs(
  base: IntegrationMappingMethod,
  override?: IntegrationMappingMethod
): IntegrationMappingMethod {
  if (!override) return base;
  return deepmerge(base, override, { arrayMerge: overwriteMerge });
}

This means a customer can add their own custom field mappings to the unified response without anyone modifying source code. No engineering ticket. No deployment.

Tip

Always store your mapping overrides as JSON objects in your database. This allows you to hot-swap configurations without restarting your application servers.

This is a design pattern that pays for itself fast — especially when you consider how wildly different custom field implementations are across CRM platforms.

3. Abstract Away Pagination, Auth, and Rate Limits

True data normalization isn't just about JSON field names. It's about normalizing the behavior of every API your system talks to.

Consider pagination. HubSpot's contacts API uses cursor-based pagination with an after parameter. Salesforce uses a completely different cursor format returned in query results. Pipedrive uses offset-based pagination with a start parameter. Your application should not care. It should send a simple limit and cursor parameter to the unified layer, and the integration configuration should translate that into the provider's specific pagination strategy.

The same goes for authentication. Salesforce uses OAuth 2.0 with a specific token-refresh dance and instance-URL-based routing. HubSpot uses OAuth 2.0 with a different token format. Some CRMs still use API keys. Whether an API uses OAuth2 Bearer tokens, Basic Auth, or custom headers, the credential injection should happen at the proxy layer, entirely abstracted from the unified request. You should never be writing token refresh logic inside your data mapping functions.

And then there are rate limits — the silent integration killer. HubSpot enforces both per-second and daily rate limits. Salesforce has API call limits tied to your license tier. A unified API must detect 429 Too Many Requests responses, respect the Retry-After headers (which are formatted differently by every vendor), and handle the backoff transparently.

API Behavior	Salesforce	HubSpot	Pipedrive
Pagination	SOQL cursor	Cursor (`after` param)	Offset (`start` param)
Auth	OAuth 2.0 (instance URL)	OAuth 2.0 (bearer)	API key or OAuth 2.0
Rate Limits	License-tier based	Per-second + daily	Per-second
Filtering	SOQL `WHERE` clause	`filterGroups` array	Query params
Timestamps	`LastModifiedDate`	`properties.hs_lastmodifieddate`	`update_time`

A well-designed normalization layer captures all of these behavioral differences as configuration data — not as if/else branches in your codebase. Each integration's config describes its pagination strategy, auth scheme, and rate-limit detection rules. The runtime engine reads this config and executes the appropriate strategy generically.

4. Normalize Inbound Events (Unified Webhooks)

Normalization applies to inbound data just as much as outbound requests. When a third-party service fires a webhook, it hits your endpoint with a proprietary payload. Salesforce sends an outbound message; HubSpot sends a specific JSON structure; Pipedrive sends yet another format.

You must verify the webhook's authenticity (signature validation, JWT verification) and transform the raw payload into a standardized event format. A record:created event for a CRM contact should look identical whether it came from Salesforce or Pipedrive.

This requires a unified webhook receiver that:

Evaluates expressions against incoming payloads to extract the event type and entity reference
Enriches the event data by optionally fetching the full resource from the CRM
Delivers the normalized event to your application in a consistent format

Without this layer, your application ends up with the same spaghetti problem on the inbound side — a tangle of provider-specific webhook handlers, each parsing a different payload format. This is how you build reliable real-time CRM syncs for enterprise.

Code vs. Configuration: The Zero-Integration-Code Architecture

Let's make this concrete. The architectural difference between code-per-integration and configuration-per-integration isn't incremental — it's categorical.

In a code-driven approach, adding a new CRM means writing new handler functions, adding database columns or schemas, sprinkling conditional branches into shared logic, writing integration-specific tests, and going through a full deploy cycle.

In a configuration-driven approach, adding a new CRM means:

A JSON config describing how to talk to the API (base URL, endpoints, auth scheme, pagination strategy)
Declarative mapping expressions that transform between the unified schema and the native format

Both are stored as data. No code changes. No deployment.

The mapping layer uses JSONata — a functional query and transformation language for JSON — as the universal transformation engine. There is no if (provider === 'hubspot') anywhere in the runtime. Integration-specific behavior is defined entirely as data using JSONata expressions.

Here's what declarative response mappings look like for a CRM contact across two radically different API shapes:

# Salesforce: flat PascalCase fields
response_mapping: |-
  response.{
    "id": Id,
    "first_name": FirstName,
    "last_name": LastName,
    "email_addresses": [{ "email": Email }],
    "phone_numbers": $filter([
      { "number": Phone, "type": "phone" },
      { "number": MobilePhone, "type": "mobile" }
    ], function($v) { $v.number }),
    "created_at": CreatedDate,
    "updated_at": LastModifiedDate,
    "custom_fields": $sift($, function($v, $k) { $k ~> /__c$/i and $boolean($v) })
  }

# HubSpot: nested properties bag
response_mapping: |-
  {
    "id": response.id,
    "first_name": response.properties.firstname,
    "last_name": response.properties.lastname,
    "email_addresses": [
      response.properties.email
        ? { "email": response.properties.email, "is_primary": true }
    ],
    "phone_numbers": [
      response.properties.phone
        ? { "number": response.properties.phone, "type": "phone" },
      response.properties.mobilephone
        ? { "number": response.properties.mobilephone, "type": "mobile" }
    ],
    "created_at": response.createdAt,
    "updated_at": response.updatedAt
  }

Two completely different API response shapes. Two declarative expressions stored as data. The exact same runtime code processing both. Notice how the Salesforce mapping dynamically extracts custom fields by matching the __c suffix pattern — no hardcoded field list required.

The query mapping layer works the same way. When a unified request comes in to search for contacts, the engine loads the integration mapping from the database. For HubSpot, the query mapping translates unified filter parameters into HubSpot's filterGroups search syntax:

request_body_mapping: >-
  rawQuery.{
    "filterGroups": $firstNonEmpty(first_name, last_name, email_addresses)
      ? [{
        "filters":[
          first_name ? { "propertyName": "firstname", "operator": "CONTAINS_TOKEN", "value": first_name },
          last_name ? { "propertyName": "lastname", "operator": "CONTAINS_TOKEN", "value": last_name }
        ]
      }],
    "query": search_term
  }

For Salesforce, the exact same unified request is translated into a SOQL WHERE clause:

query_mapping: >-
  (
    $whereClause := query
      ? $convertQueryToSql(
        query,
        ["created_at", "updated_at", "email_addresses"],
        {
          "created_at": "CreatedDate",
          "updated_at": "LastModifiedDate",
          "email_addresses": "Email"
        }
      );
    {
      "q": query.search_term
        ? "FIND {" & query.search_term & "} RETURNING Contact(Id, FirstName)",
      "where": $whereClause ? "WHERE " & $whereClause
    }
  )

The execution engine simply evaluates the JSONata expression against the incoming request or response. It doesn't know what the expression does. It just executes it.

Info

Why does this matter for PMs? When a customer requests a new CRM integration, the question shifts from "How many sprints will this take?" to "How long does it take to write the config?" In many cases, that's days instead of weeks — and zero risk of breaking existing integrations.

This means adding a new CRM integration is a data operation, not a code operation. You add a JSON config describing the API and JSONata mapping expressions for the unified resources. The same unified API engine that handles 100 integrations today will handle the 101st without a single line of code being changed, compiled, or deployed.

Let's be honest about the trade-offs, though. A configuration-driven system requires upfront investment in building the generic execution engine. The mapping expressions have their own learning curve — JSONata isn't something most engineers know on day one. And edge cases in vendor APIs (undocumented response formats, inconsistent error codes, pagination bugs) still require investigation time, even if the fix is a config change rather than a code change.

The payoff comes at scale. When every integration flows through the same pipeline, bug fixes and improvements apply universally. Improved pagination logic benefits all 100+ integrations at once. An error-handling improvement works for every CRM simultaneously. The maintenance burden grows with the number of unique API patterns, not the number of integrations.

Appendix: Deduplication Recipes for Cross-App User Data

Once you've normalized user records from multiple SaaS apps into a common schema, you still need to answer: which records represent the same person? This section provides concrete recipes for deterministic matching, probabilistic matching, canonical ID assignment, confidence scoring, human review, and conflict resolution.

Deterministic Matching: Keys and SQL Examples

Start with high-confidence, exact-match keys. These are fields that reliably identify the same person across systems when they match exactly:

Matching Key	Reliability	Availability	Notes
Work email (normalized)	Very high	Almost universal	Best single cross-system matching key
Employee ID	Very high	HRIS, payroll, some IdPs	Not present in CRMs or ticketing tools
SSO subject / IdP ID	Very high	IdP-connected apps only	Exact if apps federate through the same IdP
Phone number (E.164)	Medium-high	Inconsistent across systems	Normalize to E.164 before comparing
Full name + company domain	Medium	Common but ambiguous	Collides at large companies

Email is the strongest cross-system matching key because nearly every SaaS app requires it. But raw email strings vary - the same address might appear as John.Doe@Acme.com, john.doe@acme.com, or john.doe+crm@acme.com. You must normalize before matching.

Normalization Rules and Examples

Apply these transformations to every email address before using it as a matching key:

def normalize_email(raw_email: str) -> str:
    email = raw_email.strip().lower()
    local, domain = email.rsplit('@', 1)
 
    # Strip plus-tag aliases (user+crm@domain.com -> user@domain.com)
    if '+' in local:
        local = local[:local.index('+')]
 
    # Gmail/Google Workspace: dots in local part are ignored
    gmail_domains = {'gmail.com', 'googlemail.com'}
    if domain in gmail_domains:
        local = local.replace('.', '')
 
    return f"{local}@{domain}"

The key rules:

Lowercase everything: User@Domain.com becomes user@domain.com
Trim whitespace: " user@domain.com " becomes user@domain.com
Strip plus-tags: user+salesforce@domain.com becomes user@domain.com
Remove dots for Gmail: j.doe@gmail.com becomes jdoe@gmail.com

For names, apply similar normalization:

def normalize_name(name: str) -> str:
    name = name.strip().lower()
    # Remove punctuation that varies across systems
    name = name.replace('.', '').replace(',', '').replace("'", '')
    # Collapse multiple spaces
    name = ' '.join(name.split())
    return name

For phone numbers, convert to E.164 format before matching: strip parentheses, dashes, spaces, and ensure a country code prefix. Store both the raw and E.164 forms so you can present the original back to source systems while matching on the canonical version.

Warning

Be careful with Gmail-specific dot removal on corporate domains. A company using Google Workspace at @acme.com may or may not ignore dots depending on their admin settings. When in doubt, only apply dot removal for gmail.com and googlemail.com.

Probabilistic Matching: Algorithms, SQL Patterns, and Thresholds

Deterministic matching catches the obvious duplicates. Probabilistic (fuzzy) matching catches the rest - typos, nickname variations, and inconsistent formatting that survived normalization. Different algorithms are suited to different field types:

Algorithm	Best for	Available in Postgres	Notes
Levenshtein distance	Short strings with typos (names, usernames)	`fuzzystrmatch` extension	Counts insertions, deletions, substitutions
Jaro-Winkler	Names (weights matching prefixes higher)	Not native; use extensions or app-side	Good for "Jon" vs. "Jonathan"
Trigram similarity	Longer strings, company names, addresses	`pg_trgm` extension	Compares 3-character n-gram overlap
Soundex / Metaphone	Phonetic name variants ("Smith" vs. "Smyth")	`fuzzystrmatch` extension	Language-specific; English-biased

Levenshtein distance (edit distance) quantifies the difference between two strings by calculating the minimum number of single-character edits - insertions, deletions, or substitutions - needed to transform one string into another. For user deduplication, here are practical thresholds:

Name matching (short strings): Use a Levenshtein distance threshold of ≤ 2 edits for names under 15 characters. This catches typos like "Jonh" vs. "John" or "Micheal" vs. "Michael" without matching genuinely different names.
Longer strings: Use a normalized similarity ratio instead: 1 - (levenshtein(a, b) / max(len(a), len(b))). A threshold of ≥ 0.85 (85% similar) works well for full names.
Trigram similarity: A threshold of ≥ 0.75 for company or account names catches variants like "Acme Corp" vs. "Acme Corporation". Drop to ≥ 0.6 to widen the review queue at the cost of more false positives.
Phonetic (metaphone): Use only as a secondary signal alongside another match. Phonetic-only matches produce too many false positives to trust standalone.

Blocking keys are critical for performance. Don't compare every record against every other record - that's O(n²). Instead, only compare records that share a blocking key:

Same email domain (@acme.com)
Same company or account name
Same first initial + last name
Same normalized phone country code + area code

def find_fuzzy_matches(records, threshold=2):
    """Find likely duplicates using blocking + Levenshtein distance."""
    # Group by blocking key (email domain)
    blocks = defaultdict(list)
    for record in records:
        domain = record.email_normalized.split('@')[1] if record.email_normalized else 'unknown'
        blocks[domain].append(record)
 
    matches = []
    for domain, block in blocks.items():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                name_a = normalize_name(f"{a.first_name} {a.last_name}")
                name_b = normalize_name(f"{b.first_name} {b.last_name}")
                dist = levenshtein(name_a, name_b)
                if dist <= threshold and a.source_app != b.source_app:
                    matches.append((a, b, dist, 'fuzzy_name'))
    return matches

In PostgreSQL, the same pattern using trigram similarity looks like this:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
 
-- Trigram similarity returns a score between 0.0 and 1.0.
-- Use a GIN index on the name column for practical performance.
CREATE INDEX IF NOT EXISTS idx_source_fullname_trgm
  ON user_source_records
  USING gin ((lower(first_name || ' ' || last_name)) gin_trgm_ops);
 
SELECT
  a.id AS record_a,
  b.id AS record_b,
  similarity(
    lower(a.first_name || ' ' || a.last_name),
    lower(b.first_name || ' ' || b.last_name)
  ) AS trigram_score,
  'fuzzy_trigram' AS match_type
FROM user_source_records a
JOIN user_source_records b
  ON a.source_app <> b.source_app
  AND a.id < b.id
WHERE
  -- Blocking key: same email domain
  split_part(a.email_normalized, '@', 2)
    = split_part(b.email_normalized, '@', 2)
  -- Trigram threshold
  AND similarity(
    lower(a.first_name || ' ' || a.last_name),
    lower(b.first_name || ' ' || b.last_name)
  ) >= 0.75
  AND a.canonical_id IS NULL;

Tip

Never auto-merge fuzzy matches in production without human review or a high-confidence secondary signal (like a matching employee ID or phone number). Fuzzy matching on name similarity alone will produce false positives at scale. Narrow the scope of fuzzy matches with exact match conditions - make sure every rule with fuzzy conditions has at least one exact match condition.

Canonical ID Assignment and Schema

Every unique person needs a single canonical ID that your application uses as the primary reference. Source-system IDs are stored as foreign references. Here's a practical schema:

CREATE TABLE canonical_users (
  canonical_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  primary_email VARCHAR(255),
  display_name  VARCHAR(255),
  created_at    TIMESTAMPTZ DEFAULT now(),
  updated_at    TIMESTAMPTZ DEFAULT now()
);
 
CREATE TABLE user_source_records (
  id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  canonical_id     UUID REFERENCES canonical_users(canonical_id),
  source_app       VARCHAR(100) NOT NULL,      -- e.g. 'salesforce', 'okta', 'hubspot'
  source_record_id VARCHAR(255) NOT NULL,       -- the ID in the source system
  email_raw        VARCHAR(255),
  email_normalized VARCHAR(255),
  first_name       VARCHAR(255),
  last_name        VARCHAR(255),
  all_emails       JSONB,                       -- for systems exposing multiple emails
  last_synced_at   TIMESTAMPTZ,
  raw_data         JSONB,                       -- full record from the source
  UNIQUE(source_app, source_record_id)
);
 
CREATE INDEX idx_source_email_norm ON user_source_records(email_normalized);
CREATE INDEX idx_source_canonical ON user_source_records(canonical_id);

The canonical_users table holds one row per real person. The user_source_records table holds every record from every connected app, each linked back to its canonical identity. Store both the raw email and the normalized email - you'll need the raw version for communicating back to source systems.

Sample Deduplication SQL Recipes

Step 1: Find deterministic matches (exact normalized email across different source apps):

SELECT
  a.id AS record_a,
  b.id AS record_b,
  a.email_normalized AS matched_on,
  'exact_email' AS match_type,
  1.00 AS confidence
FROM user_source_records a
JOIN user_source_records b
  ON a.email_normalized = b.email_normalized
  AND a.source_app <> b.source_app
  AND a.id < b.id          -- avoid duplicate pairs
WHERE a.email_normalized IS NOT NULL
  AND a.canonical_id IS NULL;

Step 2a: Fuzzy match on name within the same email domain using Levenshtein (requires the fuzzystrmatch extension):

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
 
SELECT
  a.id AS record_a,
  b.id AS record_b,
  levenshtein(
    lower(a.first_name || ' ' || a.last_name),
    lower(b.first_name || ' ' || b.last_name)
  ) AS edit_distance,
  'fuzzy_name_lev' AS match_type
FROM user_source_records a
JOIN user_source_records b
  ON a.source_app <> b.source_app
  AND a.id < b.id
WHERE
  -- Blocking key: same email domain
  split_part(a.email_normalized, '@', 2) = split_part(b.email_normalized, '@', 2)
  -- Threshold: max 2 edits
  AND levenshtein(
    lower(a.first_name || ' ' || a.last_name),
    lower(b.first_name || ' ' || b.last_name)
  ) <= 2
  AND a.canonical_id IS NULL;

Step 2b: Store all candidate pairs with their raw signals in a staging table so a scoring step can rank them:

CREATE TABLE dedupe_candidate_pairs (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  record_a          UUID REFERENCES user_source_records(id),
  record_b          UUID REFERENCES user_source_records(id),
  email_exact_match BOOLEAN NOT NULL DEFAULT false,
  domain_match      BOOLEAN NOT NULL DEFAULT false,
  name_lev_distance INT,
  name_trgm_score   NUMERIC(4,3),
  phone_exact_match BOOLEAN NOT NULL DEFAULT false,
  employee_id_match BOOLEAN NOT NULL DEFAULT false,
  confidence        NUMERIC(4,3),
  status            VARCHAR(20) DEFAULT 'pending', -- pending | auto_merged | needs_review | rejected
  created_at        TIMESTAMPTZ DEFAULT now(),
  reviewed_by       VARCHAR(255),
  reviewed_at       TIMESTAMPTZ,
  UNIQUE(record_a, record_b)
);
 
CREATE INDEX idx_pairs_status ON dedupe_candidate_pairs(status);
CREATE INDEX idx_pairs_confidence ON dedupe_candidate_pairs(confidence DESC);

Step 3: Assign canonical IDs to matched record clusters:

-- For each cluster of matched records, create a canonical user
-- and link all source records to it.
-- Use a connected-components algorithm to group transitive matches.
 
-- Simplified: for each deterministic match pair, assign or merge canonical IDs
WITH new_canonicals AS (
  INSERT INTO canonical_users (primary_email, display_name)
  SELECT DISTINCT ON (a.email_normalized)
    a.email_normalized,
    a.first_name || ' ' || a.last_name
  FROM dedupe_candidate_pairs m
  JOIN user_source_records a ON a.id = m.record_a
  WHERE a.canonical_id IS NULL
    AND m.status IN ('auto_merged')
  RETURNING canonical_id, primary_email
)
UPDATE user_source_records usr
SET canonical_id = nc.canonical_id
FROM new_canonicals nc
WHERE usr.email_normalized = nc.primary_email
  AND usr.canonical_id IS NULL;

Confidence Scoring and Manual Review UX

A single yes/no match verdict is too blunt for real data. Instead, compute a composite confidence score for each candidate pair by combining the individual signals you gathered above, then route pairs into three buckets: auto-merge, human review, or reject.

A workable weighted model:

SIGNAL_WEIGHTS = {
    'email_exact_match':   0.60,   # normalized email matches exactly
    'employee_id_match':   0.35,   # same employee ID across apps
    'phone_exact_match':   0.25,   # matching E.164 phone
    'domain_match':        0.05,   # same corporate email domain
    'name_similarity':     0.20,   # 0.0-1.0, scaled trigram or Jaro-Winkler
    'title_match':         0.05,   # exact normalized job title
}
 
def score_pair(signals: dict) -> float:
    score = 0.0
    for key, weight in SIGNAL_WEIGHTS.items():
        value = signals.get(key)
        if isinstance(value, bool):
            score += weight if value else 0.0
        elif isinstance(value, (int, float)):
            score += weight * float(value)   # e.g. name_similarity in [0,1]
    return min(score, 1.0)

The corresponding SQL that scores every candidate pair and assigns a routing status:

UPDATE dedupe_candidate_pairs
SET
  confidence = LEAST(1.0,
    (CASE WHEN email_exact_match  THEN 0.60 ELSE 0 END) +
    (CASE WHEN employee_id_match  THEN 0.35 ELSE 0 END) +
    (CASE WHEN phone_exact_match  THEN 0.25 ELSE 0 END) +
    (CASE WHEN domain_match       THEN 0.05 ELSE 0 END) +
    (COALESCE(name_trgm_score, 0) * 0.20)
  ),
  status = CASE
    WHEN email_exact_match AND (employee_id_match OR phone_exact_match)
      THEN 'auto_merged'
    WHEN email_exact_match AND COALESCE(name_trgm_score, 0) >= 0.60
      THEN 'auto_merged'
    WHEN COALESCE(name_trgm_score, 0) >= 0.85 AND domain_match
      THEN 'needs_review'
    WHEN name_lev_distance <= 2 AND domain_match
      THEN 'needs_review'
    ELSE 'rejected'
  END
WHERE status = 'pending';

Suggested confidence bands (tune these against a labelled sample from your own data):

Confidence	Action	Rationale
≥ 0.90	Auto-merge	High-signal deterministic evidence (email + secondary key)
0.70 - 0.89	Send to human review queue	Strong but not certain; false positives too costly to auto-merge
0.50 - 0.69	Show as "possible duplicate" hint only	Low confidence; surface but do not act
< 0.50	Reject	Noise; skip to keep the queue actionable

Manual review UX guidance. Your reviewers are looking at two records side by side and deciding: same person, different people, or need more info. Design the UI around that decision:

Side-by-side layout with both records' fields aligned. Highlight fields that match in green, fields that differ in yellow, and fields present in only one record in gray.
Lead with the confidence score and the signals that produced it ("Email exact match + name 92% similar"). Reviewers should never have to reverse-engineer why the pair was flagged.
Show source context: which apps the records came from, when they were last synced, and their raw source-system IDs. A reviewer needs to know that record A came from Okta yesterday and record B came from Salesforce three months ago.
Three clear actions: Merge, Keep Separate, or Split into New (for cases where an existing merge was wrong). Avoid dropdowns or hidden menus - reviewers should resolve a pair in one click.
Order the queue by confidence descending, so reviewers work through the highest-value decisions first. Add a filter for source-app pair (e.g. "Okta ↔ Salesforce") so reviewers can batch similar decisions.
Keep an audit trail. Log who reviewed a pair, when, and their decision. If an automated merge is later challenged, you need to be able to trace how it happened.
Feed decisions back into thresholds. Track precision and recall on human-reviewed pairs and re-tune the confidence weights and cut-offs monthly. If reviewers are rejecting 80% of pairs at a given score band, that band's threshold is too low.

Tip

Cap the review queue depth. If your queue grows past a few hundred pending pairs, reviewers stop working through it. Tighten thresholds or add a blocking key rather than letting the queue become a graveyard.

Conflict Resolution Policies

When two source systems disagree about the same field - say Salesforce has "Jonathan" and Okta has "Jon" for first name - you need explicit rules for which value wins.

Source priority is the simplest and most predictable strategy. Rank your sources by trustworthiness for each data type:

# Define trust hierarchy per field type
SOURCE_PRIORITY = {
    'email':      ['okta', 'google_workspace', 'azure_ad', 'salesforce', 'hubspot'],
    'name':       ['workday', 'bamboohr', 'okta', 'salesforce', 'hubspot'],
    'title':      ['workday', 'bamboohr', 'salesforce', 'hubspot'],
    'phone':      ['workday', 'salesforce', 'hubspot'],
    'department': ['workday', 'bamboohr', 'okta'],
}
 
def resolve_field(field_name: str, values_by_source: dict) -> str | None:
    """Pick the winning value using source priority."""
    priority = SOURCE_PRIORITY.get(field_name, list(values_by_source.keys()))
    for source in priority:
        value = values_by_source.get(source)
        if value and str(value).strip():
            return value
    return None

The logic: identity providers (Okta, Azure AD) are the most authoritative source for email. HRIS systems (Workday, BambooHR) are the most authoritative for name and department. CRMs fill in gaps.

Other resolution strategies:

Most recently updated: Use the value from whichever source synced most recently. Simple but risky - a bad sync overwrites good data.
Most complete: For array fields (phone numbers, addresses), merge all non-empty values from all sources rather than picking one winner.
Manual review queue: For high-value fields where automated resolution is risky, flag conflicts for human review.

Edge Cases to Handle

Multiple emails per person: A user might have john@acme.com in Salesforce and john.doe@acme.com in Okta. Store all emails in an array on the canonical record and normalize all of them for matching. Use the all_emails JSONB column in user_source_records for systems that expose multiple email addresses per user.

Shared accounts: Some systems have service accounts or shared mailboxes (support@acme.com, billing@acme.com). Exclude these from person-level deduplication. Maintain a blocklist of known shared-account patterns, or flag any email that appears as a primary address on more than N source records from the same app.

Name changes: Employees change names (marriage, legal name change). If you rely on name as a secondary matching signal, a name change can break an existing match. Always prefer email or employee ID as the primary matching key. Treat name as a confirming signal, not a primary one.

Rehires and transfers: Someone leaves and comes back. Their old records might be deactivated across systems while new records are created. Match on email first - if an existing canonical record exists, reactivate it rather than creating a duplicate.

Stop Hardcoding Your CRM Integrations

If you're currently maintaining a pile of if (provider === 'salesforce') branches, or evaluating how to build the CRM integrations your B2B sales team actually asks for, here's the strategic takeaway:

Design your canonical schema first — get the entity relationships right (Accounts, Contacts, Opportunities, Pipelines, Stages) before obsessing over individual field mappings.
Build for custom fields from day one — a normalization layer that drops custom data is only solving the easy half of the problem.
Normalize API behavior, not just data shapes — pagination, auth, rate limiting, webhooks, and error handling all need abstraction.
Prefer configuration over code — every if/else branch you write for a specific CRM is a maintenance liability that grows linearly with your integration count.
Implement a multi-level override system — so customers can customize mappings for their specific CRM setup without filing engineering tickets.
Plan for deduplication from the start — pulling user data from multiple SaaS apps is only half the job. Deterministic matching, probabilistic matching, confidence scoring, and a human review workflow turn raw records into a reliable canonical user directory.

Data normalization should be handled by a configuration-driven unified API. By treating integration behavior as data rather than code, you decouple your application from the chaotic reality of third-party APIs. Your engineering team should be building your core product, not reading terrible API documentation.

What is the Best Way to Normalize Data Models Across Different CRMs?

Why CRM Data Normalization is a Nightmare

Beyond CRMs: Pulling Unified User Data from Every SaaS App

The Traditional Approaches (And Why They Break)

Point-to-Point Integrations: The Spaghetti Factory

Rigid Unified APIs: The Lowest-Common-Denominator Trap

What is the Best Way to Normalize Data Models Across Different CRMs?

1. Standardize the Core CRM Trinity

2. Handle Custom Fields and Objects Dynamically

4. Normalize Inbound Events (Unified Webhooks)

Code vs. Configuration: The Zero-Integration-Code Architecture

Appendix: Deduplication Recipes for Cross-App User Data

Deterministic Matching: Keys and SQL Examples

Normalization Rules and Examples

Probabilistic Matching: Algorithms, SQL Patterns, and Thresholds

Canonical ID Assignment and Schema

Sample Deduplication SQL Recipes

Confidence Scoring and Manual Review UX

Conflict Resolution Policies

Edge Cases to Handle

Stop Hardcoding Your CRM Integrations

FAQ

More from our Blog

Why Schema Normalization is the Hardest Problem in SaaS Integrations

What is a Unified API?

How to Handle Custom Fields and Custom Objects in Salesforce via API

Architecting Real-Time CRM Syncs for Enterprise: A Technical Guide

Building Native CRM Integrations Without Draining Engineering in 2026

How to Build Integrations Your B2B Sales Team Actually Asks For

What is a Common Data Model in APIs? (2026 Architecture Guide)