What is the Best Way to Normalize Data Models Across Different CRMs?
Learn how to normalize data models across CRMs and SaaS apps into a unified schema. Covers declarative mappings, custom fields, deduplication recipes with SQL examples, and canonical ID architecture.
CRM data normalization is the practice of translating disparate data structures from platforms like Salesforce, HubSpot, Pipedrive, and Close into a single, canonical schema your application can rely on. If you're a product manager evaluating how to ship integrations across multiple CRMs without drowning your engineering team, the short answer is: treat integration behavior as data configuration, not hardcoded logic. The same principle applies beyond CRMs - if you need to pull a unified list of users from every SaaS app your customers use without building each integration one by one, the normalization architecture is identical. This article breaks down exactly why that matters and how to do it.
Why CRM Data Normalization is a Nightmare
The problem starts with a deceptively simple question: "What is a Contact?"
In HubSpot, a Contact is a flat object. Data lives in a properties bag — properties.firstname, properties.lastname, properties.email. Timestamps are tracked via properties.hs_lastmodifieddate. HubSpot doesn't even have a separate Lead object — it uses a Lifecycle Stage property on the Contact to track progression from lead to customer.
Salesforce takes a fundamentally different approach. A Contact is a PascalCase object (FirstName, LastName, Email) that sits inside a sprawling web of standard and custom object relationships. Salesforce does have a separate Lead object, and when a Lead converts to a Contact, the original Lead record is destroyed. Filtering uses SOQL (a SQL dialect), modification tracking is via LastModifiedDate, and custom fields are identified by a __c suffix.
These aren't cosmetic differences. They represent fundamentally different opinions about how to model customer relationships. And every CRM in the market has its own opinion.
The financial stakes are real. Gartner estimates poor data quality costs the average enterprise $12.9 to $15 million annually. Research suggests duplication rates between 10-30% are normal for companies without active data quality programs, and integrations that don't normalize and deduplicate correctly are a primary driver of that problem. Plauti's analysis of 12 billion Salesforce records found 45% were duplicates across organizations, and that rate jumps to 80% for records created via API integrations.
Meanwhile, the integration surface area keeps expanding. Companies used an average of 106 SaaS applications in 2024, with large enterprises running up to 131 different apps. Your product must integrate with an increasingly fragmented software landscape, and your CRM — supposed to be your system of record — corrodes when integration data flows in without proper normalization.
Understanding why schema normalization is the hardest problem in SaaS integrations is the first step. The second is architecting a system that actually solves it.
Beyond CRMs: Pulling Unified User Data from Every SaaS App
The normalization problem doesn't stop at CRMs. If you're building a B2B product and need to pull a list of users from every SaaS app your customers use - without building each integration individually - you're facing the same structural challenge at a much wider scale.
Think about what "user data" looks like across a typical customer's stack. Their identity provider (Okta, Azure AD) has users with roles and group memberships. Their HRIS (Workday, BambooHR) has employees with departments, managers, and employment status. Their CRM has contacts and account owners. Their ticketing system (Jira, Zendesk) has agents. Every one of these systems stores people records in its own format, with its own ID scheme, and its own API conventions.
The unified API approach described in this article applies directly to this problem. A canonical user schema standardizes the core entities - Users, Groups, Roles, Licenses, and Activities - across all these tools. Per-integration mappings translate each app's native format into that canonical shape. The same configuration-driven engine handles the API differences (auth, pagination, rate limits) without integration-specific code.
The real payoff is operational. With unified user data across SaaS tools, you can programmatically manage onboarding and offboarding across all connected apps, synchronize role-based access control by mapping application permissions to a standard roles model, audit license utilization to identify unused seats, and ingest activity logs into a single dashboard for security monitoring. All from one API surface, regardless of how many SaaS apps are connected underneath.
But pulling user records from 10 different apps into one list creates a new problem: deduplication. The same person exists as john.doe@acme.com in Okta, John Doe in Salesforce, and employee ID EMP-4472 in Workday. Matching these records into a single canonical identity requires more than field mapping - it requires matching logic, normalization rules, and conflict resolution. We cover exactly how to do that in the deduplication appendix below.
The Traditional Approaches (And Why They Break)
When faced with schema normalization, engineering teams usually default to one of two flawed architectures.
Point-to-Point Integrations: The Spaghetti Factory
The most common first attempt is building direct integrations. Your team writes a Salesforce adapter, then a HubSpot adapter, then Pipedrive, then Zoho. Each has its own code path:
// This is how most teams start. It doesn't end well.
if (provider === 'hubspot') {
contact.firstName = data.properties.firstname;
contact.email = data.properties.email;
} else if (provider === 'salesforce') {
contact.firstName = data.FirstName;
contact.email = data.Email;
} else if (provider === 'pipedrive') {
contact.firstName = data.first_name;
contact.email = data.email[0].value;
}This works for two CRMs. By the fifth, you're maintaining a growing tangle of conditional logic, each branch with its own pagination strategy, auth flow, rate-limit handling, and undocumented edge cases. Every time a vendor deprecates an endpoint, alters a rate limit, or changes a pagination cursor, you have to write code, review it, and deploy it.
The cost isn't trivial. Adding third-party integrations and custom features to a SaaS platform typically pushes development costs into the $50,000 to $150,000+ range per integration when you account for initial build, testing, and ongoing maintenance. You are essentially paying top-tier engineers to read terrible vendor documentation.
Rigid Unified APIs: The Lowest-Common-Denominator Trap
To escape the maintenance burden of custom code, teams turn to standard unified APIs. The promise is alluring: write to one common data model, and the provider handles the translation.
The problem: most of these platforms implement what amounts to a rigid key-value mapping layer. They map vendor_field_A to unified_field_B and call it done. This works for the 20% of fields that are standard across CRMs. But what about the other 80%?
Every serious Salesforce deployment has custom fields (Lead_Score__c, Region__c, Contract_Value__c). Every HubSpot instance has custom properties. A unified API that strips away custom fields to maintain a clean common model is losing the data your customers care about most.
This is what we call the implementation gap — the distance between what a unified API's common model covers and what your customers actually need. If closing that gap requires filing support tickets and waiting for the provider to update their mappings, you've traded one bottleneck (building integrations) for another (waiting on a vendor). Your customers complain that the integration is missing critical data, and you end up building custom point-to-point integrations anyway.
What is the Best Way to Normalize Data Models Across Different CRMs?
The best approach is a declarative, configuration-driven architecture where:
- A canonical schema defines what your application sees (the unified interface)
- Per-integration mappings define how each CRM's native format translates to and from that schema (the transformation layer)
- A generic execution engine processes both without any integration-specific code
This means adding a new CRM integration or modifying how a field maps is a data operation — not a code deployment. This approach decouples the integration logic from your application runtime, allowing you to standardize the core entities while maintaining the flexibility to handle custom fields dynamically. Here's how to implement each piece.
1. Standardize the Core CRM Trinity
Before writing a single mapping, you need a well-designed canonical data model. For CRMs, this centers on what we call the Core Trinity: Accounts, Contacts, and Opportunities.
| Unified Entity | Salesforce | HubSpot | Pipedrive | Close |
|---|---|---|---|---|
| Account | Account | Company | Organization | Lead (organization-level) |
| Contact | Contact (+ Lead) | Contact | Person | Contact |
| Opportunity | Opportunity | Deal | Deal | Opportunity |
Your canonical schema should model the relationships between these entities, not just the fields:
- Account: The primary container for customer data (the company).
- Contact: The people who work at that Account.
- Opportunity: The active deals being negotiated with that Account.
- Lead: Unqualified prospects that exist independently at the top of the funnel.
When qualified, a Lead is typically converted into a Contact, an Account, and an Opportunity. Then comes the pipeline model: an Opportunity belongs to a specific Pipeline, and its status is defined by a Stage within that pipeline. Users generate Engagements, Notes, and Tasks that are appended to the relevant entity to create a historical timeline.
erDiagram
Account ||--o{ Contact : employs
Account ||--o{ Opportunity : has
Opportunity }o--|| Pipeline : belongs_to
Opportunity }o--|| Stage : current_status
Lead ||--o| Contact : converts_to
Lead ||--o| Account : converts_to
Lead ||--o| Opportunity : converts_to
User ||--o{ Engagement : creates
User ||--o{ Note : creates
User ||--o{ Task : createsGetting this relational model right matters more than getting every field mapped from day one. If your canonical schema has sound entity relationships, you can always add fields later. If the relationships are wrong, every downstream consumer breaks. By standardizing this relationship graph, your application only needs to understand one data shape.
2. Handle Custom Fields and Objects Dynamically
Standardizing the core trinity is the easy part. The real challenge is handling the long tail of custom fields — and it's where most normalization strategies fail.
Salesforce identifies custom fields by the __c suffix. HubSpot stores them as additional keys in the properties object. Pipedrive uses numeric custom field IDs. You need a normalization strategy that handles all three patterns without losing data.
The worst thing you can do is ignore them. Custom fields contain the business-specific data your customers configured their CRM around — deal scoring, contract types, region codes, compliance flags. Dropping them means your integration is only half-useful. Learning how to handle custom fields and custom objects in Salesforce via API highlights why this flexibility is mandatory.
The solution is a multi-layered override system that lets mappings be customized at different levels without requiring your engineers to write custom code:
- Platform level — the default mapping that works for 80% of use cases.
- Environment level — overrides for specific customer deployments (e.g., a customer's Salesforce instance with a non-standard field layout).
- Account level — overrides for individual connected accounts (e.g., one customer's
Contract_Value__cfield maps to a unifieddeal_valuefield).
flowchart TD
A["Platform Base Mapping<br>(default for all customers)"] --> B["Environment Override<br>(customer-specific config)"]
B --> C["Account Override<br>(per-connected-account config)"]
C --> D["Final Resolved Mapping<br>(deep-merged result)"]Each level deep-merges on top of the previous one. At runtime, the engine resolves the final mapping by combining all three layers:
// How overrides are applied at runtime
private mergeIntegrationMappingConfigs(
base: IntegrationMappingMethod,
override?: IntegrationMappingMethod
): IntegrationMappingMethod {
if (!override) return base;
return deepmerge(base, override, { arrayMerge: overwriteMerge });
}This means a customer can add their own custom field mappings to the unified response without anyone modifying source code. No engineering ticket. No deployment.
Always store your mapping overrides as JSON objects in your database. This allows you to hot-swap configurations without restarting your application servers.
This is a design pattern that pays for itself fast — especially when you consider how wildly different custom field implementations are across CRM platforms.
3. Abstract Away Pagination, Auth, and Rate Limits
True data normalization isn't just about JSON field names. It's about normalizing the behavior of every API your system talks to.
Consider pagination. HubSpot's contacts API uses cursor-based pagination with an after parameter. Salesforce uses a completely different cursor format returned in query results. Pipedrive uses offset-based pagination with a start parameter. Your application should not care. It should send a simple limit and cursor parameter to the unified layer, and the integration configuration should translate that into the provider's specific pagination strategy.
The same goes for authentication. Salesforce uses OAuth 2.0 with a specific token-refresh dance and instance-URL-based routing. HubSpot uses OAuth 2.0 with a different token format. Some CRMs still use API keys. Whether an API uses OAuth2 Bearer tokens, Basic Auth, or custom headers, the credential injection should happen at the proxy layer, entirely abstracted from the unified request. You should never be writing token refresh logic inside your data mapping functions.
And then there are rate limits — the silent integration killer. HubSpot enforces both per-second and daily rate limits. Salesforce has API call limits tied to your license tier. A unified API must detect 429 Too Many Requests responses, respect the Retry-After headers (which are formatted differently by every vendor), and handle the backoff transparently.
| API Behavior | Salesforce | HubSpot | Pipedrive |
|---|---|---|---|
| Pagination | SOQL cursor | Cursor (after param) |
Offset (start param) |
| Auth | OAuth 2.0 (instance URL) | OAuth 2.0 (bearer) | API key or OAuth 2.0 |
| Rate Limits | License-tier based | Per-second + daily | Per-second |
| Filtering | SOQL WHERE clause |
filterGroups array |
Query params |
| Timestamps | LastModifiedDate |
properties.hs_lastmodifieddate |
update_time |
A well-designed normalization layer captures all of these behavioral differences as configuration data — not as if/else branches in your codebase. Each integration's config describes its pagination strategy, auth scheme, and rate-limit detection rules. The runtime engine reads this config and executes the appropriate strategy generically.
4. Normalize Inbound Events (Unified Webhooks)
Normalization applies to inbound data just as much as outbound requests. When a third-party service fires a webhook, it hits your endpoint with a proprietary payload. Salesforce sends an outbound message; HubSpot sends a specific JSON structure; Pipedrive sends yet another format.
You must verify the webhook's authenticity (signature validation, JWT verification) and transform the raw payload into a standardized event format. A record:created event for a CRM contact should look identical whether it came from Salesforce or Pipedrive.
This requires a unified webhook receiver that:
- Evaluates expressions against incoming payloads to extract the event type and entity reference
- Enriches the event data by optionally fetching the full resource from the CRM
- Delivers the normalized event to your application in a consistent format
Without this layer, your application ends up with the same spaghetti problem on the inbound side — a tangle of provider-specific webhook handlers, each parsing a different payload format. This is how you build reliable real-time CRM syncs for enterprise.
Code vs. Configuration: The Zero-Integration-Code Architecture
Let's make this concrete. The architectural difference between code-per-integration and configuration-per-integration isn't incremental — it's categorical.
In a code-driven approach, adding a new CRM means writing new handler functions, adding database columns or schemas, sprinkling conditional branches into shared logic, writing integration-specific tests, and going through a full deploy cycle.
In a configuration-driven approach, adding a new CRM means:
- A JSON config describing how to talk to the API (base URL, endpoints, auth scheme, pagination strategy)
- Declarative mapping expressions that transform between the unified schema and the native format
Both are stored as data. No code changes. No deployment.
The mapping layer uses JSONata — a functional query and transformation language for JSON — as the universal transformation engine. There is no if (provider === 'hubspot') anywhere in the runtime. Integration-specific behavior is defined entirely as data using JSONata expressions.
Here's what declarative response mappings look like for a CRM contact across two radically different API shapes:
# Salesforce: flat PascalCase fields
response_mapping: |-
response.{
"id": Id,
"first_name": FirstName,
"last_name": LastName,
"email_addresses": [{ "email": Email }],
"phone_numbers": $filter([
{ "number": Phone, "type": "phone" },
{ "number": MobilePhone, "type": "mobile" }
], function($v) { $v.number }),
"created_at": CreatedDate,
"updated_at": LastModifiedDate,
"custom_fields": $sift($, function($v, $k) { $k ~> /__c$/i and $boolean($v) })
}# HubSpot: nested properties bag
response_mapping: |-
{
"id": response.id,
"first_name": response.properties.firstname,
"last_name": response.properties.lastname,
"email_addresses": [
response.properties.email
? { "email": response.properties.email, "is_primary": true }
],
"phone_numbers": [
response.properties.phone
? { "number": response.properties.phone, "type": "phone" },
response.properties.mobilephone
? { "number": response.properties.mobilephone, "type": "mobile" }
],
"created_at": response.createdAt,
"updated_at": response.updatedAt
}Two completely different API response shapes. Two declarative expressions stored as data. The exact same runtime code processing both. Notice how the Salesforce mapping dynamically extracts custom fields by matching the __c suffix pattern — no hardcoded field list required.
The query mapping layer works the same way. When a unified request comes in to search for contacts, the engine loads the integration mapping from the database. For HubSpot, the query mapping translates unified filter parameters into HubSpot's filterGroups search syntax:
request_body_mapping: >-
rawQuery.{
"filterGroups": $firstNonEmpty(first_name, last_name, email_addresses)
? [{
"filters":[
first_name ? { "propertyName": "firstname", "operator": "CONTAINS_TOKEN", "value": first_name },
last_name ? { "propertyName": "lastname", "operator": "CONTAINS_TOKEN", "value": last_name }
]
}],
"query": search_term
}For Salesforce, the exact same unified request is translated into a SOQL WHERE clause:
query_mapping: >-
(
$whereClause := query
? $convertQueryToSql(
query,
["created_at", "updated_at", "email_addresses"],
{
"created_at": "CreatedDate",
"updated_at": "LastModifiedDate",
"email_addresses": "Email"
}
);
{
"q": query.search_term
? "FIND {" & query.search_term & "} RETURNING Contact(Id, FirstName)",
"where": $whereClause ? "WHERE " & $whereClause
}
)The execution engine simply evaluates the JSONata expression against the incoming request or response. It doesn't know what the expression does. It just executes it.
Why does this matter for PMs? When a customer requests a new CRM integration, the question shifts from "How many sprints will this take?" to "How long does it take to write the config?" In many cases, that's days instead of weeks — and zero risk of breaking existing integrations.
This means adding a new CRM integration is a data operation, not a code operation. You add a JSON config describing the API and JSONata mapping expressions for the unified resources. The same unified API engine that handles 100 integrations today will handle the 101st without a single line of code being changed, compiled, or deployed.
Let's be honest about the trade-offs, though. A configuration-driven system requires upfront investment in building the generic execution engine. The mapping expressions have their own learning curve — JSONata isn't something most engineers know on day one. And edge cases in vendor APIs (undocumented response formats, inconsistent error codes, pagination bugs) still require investigation time, even if the fix is a config change rather than a code change.
The payoff comes at scale. When every integration flows through the same pipeline, bug fixes and improvements apply universally. Improved pagination logic benefits all 100+ integrations at once. An error-handling improvement works for every CRM simultaneously. The maintenance burden grows with the number of unique API patterns, not the number of integrations.
Appendix: Deduplication Recipes for Cross-App User Data
Once you've normalized user records from multiple SaaS apps into a common schema, you still need to answer: which records represent the same person? This section provides concrete recipes for deterministic matching, fuzzy matching, canonical ID assignment, and conflict resolution.
Deterministic Matching Keys
Start with high-confidence, exact-match keys. These are fields that reliably identify the same person across systems when they match exactly:
| Matching Key | Reliability | Availability | Notes |
|---|---|---|---|
| Work email (normalized) | Very high | Almost universal | Best single cross-system matching key |
| Employee ID | Very high | HRIS, payroll, some IdPs | Not present in CRMs or ticketing tools |
| SSO subject / IdP ID | Very high | IdP-connected apps only | Exact if apps federate through the same IdP |
| Phone number (E.164) | Medium-high | Inconsistent across systems | Normalize to E.164 before comparing |
| Full name + company domain | Medium | Common but ambiguous | Collides at large companies |
Email is the strongest cross-system matching key because nearly every SaaS app requires it. But raw email strings vary - the same address might appear as John.Doe@Acme.com, john.doe@acme.com, or john.doe+crm@acme.com. You must normalize before matching.
Normalization Rules and Examples
Apply these transformations to every email address before using it as a matching key:
def normalize_email(raw_email: str) -> str:
email = raw_email.strip().lower()
local, domain = email.rsplit('@', 1)
# Strip plus-tag aliases (user+crm@domain.com -> user@domain.com)
if '+' in local:
local = local[:local.index('+')]
# Gmail/Google Workspace: dots in local part are ignored
gmail_domains = {'gmail.com', 'googlemail.com'}
if domain in gmail_domains:
local = local.replace('.', '')
return f"{local}@{domain}"The key rules:
- Lowercase everything:
User@Domain.combecomesuser@domain.com - Trim whitespace:
" user@domain.com "becomesuser@domain.com - Strip plus-tags:
user+salesforce@domain.combecomesuser@domain.com - Remove dots for Gmail:
j.doe@gmail.combecomesjdoe@gmail.com
For names, apply similar normalization:
def normalize_name(name: str) -> str:
name = name.strip().lower()
# Remove punctuation that varies across systems
name = name.replace('.', '').replace(',', '').replace("'", '')
# Collapse multiple spaces
name = ' '.join(name.split())
return nameFor phone numbers, convert to E.164 format before matching: strip parentheses, dashes, spaces, and ensure a country code prefix.
Be careful with Gmail-specific dot removal on corporate domains. A company using Google Workspace at @acme.com may or may not ignore dots depending on their admin settings. When in doubt, only apply dot removal for gmail.com and googlemail.com.
Fuzzy Matching Algorithms and Thresholds
Deterministic matching catches the obvious duplicates. Fuzzy matching catches the rest - typos, nickname variations, and inconsistent formatting that survived normalization.
Levenshtein distance (edit distance) quantifies the difference between two strings by calculating the minimum number of single-character edits - insertions, deletions, or substitutions - needed to transform one string into another. For user deduplication, here are practical thresholds:
- Name matching (short strings): Use a Levenshtein distance threshold of ≤ 2 edits for names under 15 characters. This catches typos like "Jonh" vs. "John" or "Micheal" vs. "Michael" without matching genuinely different names.
- Longer strings: Use a normalized similarity ratio instead:
1 - (levenshtein(a, b) / max(len(a), len(b))). A threshold of ≥ 0.85 (85% similar) works well for full names.
Blocking keys are critical for performance. Don't compare every record against every other record - that's O(n²). Instead, only compare records that share a blocking key:
- Same email domain (
@acme.com) - Same company or account name
- Same first initial + last name
def find_fuzzy_matches(records, threshold=2):
"""Find likely duplicates using blocking + Levenshtein distance."""
# Group by blocking key (email domain)
blocks = defaultdict(list)
for record in records:
domain = record.email_normalized.split('@')[1] if record.email_normalized else 'unknown'
blocks[domain].append(record)
matches = []
for domain, block in blocks.items():
for i, a in enumerate(block):
for b in block[i + 1:]:
name_a = normalize_name(f"{a.first_name} {a.last_name}")
name_b = normalize_name(f"{b.first_name} {b.last_name}")
dist = levenshtein(name_a, name_b)
if dist <= threshold and a.source_app != b.source_app:
matches.append((a, b, dist, 'fuzzy_name'))
return matchesNever auto-merge fuzzy matches in production without human review or a high-confidence secondary signal (like a matching employee ID or phone number). Fuzzy matching on name similarity alone will produce false positives at scale. Narrow the scope of fuzzy matches with exact match conditions - make sure every rule with fuzzy conditions has at least one exact match condition.
Canonical ID Assignment and Schema
Every unique person needs a single canonical ID that your application uses as the primary reference. Source-system IDs are stored as foreign references. Here's a practical schema:
CREATE TABLE canonical_users (
canonical_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
primary_email VARCHAR(255),
display_name VARCHAR(255),
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE user_source_records (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_id UUID REFERENCES canonical_users(canonical_id),
source_app VARCHAR(100) NOT NULL, -- e.g. 'salesforce', 'okta', 'hubspot'
source_record_id VARCHAR(255) NOT NULL, -- the ID in the source system
email_raw VARCHAR(255),
email_normalized VARCHAR(255),
first_name VARCHAR(255),
last_name VARCHAR(255),
all_emails JSONB, -- for systems exposing multiple emails
last_synced_at TIMESTAMPTZ,
raw_data JSONB, -- full record from the source
UNIQUE(source_app, source_record_id)
);
CREATE INDEX idx_source_email_norm ON user_source_records(email_normalized);
CREATE INDEX idx_source_canonical ON user_source_records(canonical_id);The canonical_users table holds one row per real person. The user_source_records table holds every record from every connected app, each linked back to its canonical identity. Store both the raw email and the normalized email - you'll need the raw version for communicating back to source systems.
Sample SQL for Deduplication
Step 1: Find deterministic matches (exact normalized email across different source apps):
SELECT
a.id AS record_a,
b.id AS record_b,
a.email_normalized AS matched_on,
'exact_email' AS match_type
FROM user_source_records a
JOIN user_source_records b
ON a.email_normalized = b.email_normalized
AND a.source_app <> b.source_app
AND a.id < b.id -- avoid duplicate pairs
WHERE a.email_normalized IS NOT NULL
AND a.canonical_id IS NULL;Step 2: Fuzzy match on name within the same email domain (requires the fuzzystrmatch extension in PostgreSQL):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT
a.id AS record_a,
b.id AS record_b,
levenshtein(
lower(a.first_name || ' ' || a.last_name),
lower(b.first_name || ' ' || b.last_name)
) AS edit_distance,
'fuzzy_name' AS match_type
FROM user_source_records a
JOIN user_source_records b
ON a.source_app <> b.source_app
AND a.id < b.id
WHERE
-- Blocking key: same email domain
split_part(a.email_normalized, '@', 2) = split_part(b.email_normalized, '@', 2)
-- Threshold: max 2 edits
AND levenshtein(
lower(a.first_name || ' ' || a.last_name),
lower(b.first_name || ' ' || b.last_name)
) <= 2
AND a.canonical_id IS NULL;Step 3: Assign canonical IDs to matched record clusters:
-- For each cluster of matched records, create a canonical user
-- and link all source records to it.
-- Use a connected-components algorithm to group transitive matches.
-- Simplified: for each deterministic match pair, assign or merge canonical IDs
WITH new_canonicals AS (
INSERT INTO canonical_users (primary_email, display_name)
SELECT DISTINCT ON (a.email_normalized)
a.email_normalized,
a.first_name || ' ' || a.last_name
FROM dedupe_matches m
JOIN user_source_records a ON a.id = m.record_a
WHERE a.canonical_id IS NULL
RETURNING canonical_id, primary_email
)
UPDATE user_source_records usr
SET canonical_id = nc.canonical_id
FROM new_canonicals nc
WHERE usr.email_normalized = nc.primary_email
AND usr.canonical_id IS NULL;Conflict Resolution Policies
When two source systems disagree about the same field - say Salesforce has "Jonathan" and Okta has "Jon" for first name - you need explicit rules for which value wins.
Source priority is the simplest and most predictable strategy. Rank your sources by trustworthiness for each data type:
# Define trust hierarchy per field type
SOURCE_PRIORITY = {
'email': ['okta', 'google_workspace', 'azure_ad', 'salesforce', 'hubspot'],
'name': ['workday', 'bamboohr', 'okta', 'salesforce', 'hubspot'],
'title': ['workday', 'bamboohr', 'salesforce', 'hubspot'],
'phone': ['workday', 'salesforce', 'hubspot'],
'department': ['workday', 'bamboohr', 'okta'],
}
def resolve_field(field_name: str, values_by_source: dict) -> str | None:
"""Pick the winning value using source priority."""
priority = SOURCE_PRIORITY.get(field_name, list(values_by_source.keys()))
for source in priority:
value = values_by_source.get(source)
if value and str(value).strip():
return value
return NoneThe logic: identity providers (Okta, Azure AD) are the most authoritative source for email. HRIS systems (Workday, BambooHR) are the most authoritative for name and department. CRMs fill in gaps.
Other resolution strategies:
- Most recently updated: Use the value from whichever source synced most recently. Simple but risky - a bad sync overwrites good data.
- Most complete: For array fields (phone numbers, addresses), merge all non-empty values from all sources rather than picking one winner.
- Manual review queue: For high-value fields where automated resolution is risky, flag conflicts for human review.
Edge Cases to Handle
Multiple emails per person: A user might have john@acme.com in Salesforce and john.doe@acme.com in Okta. Store all emails in an array on the canonical record and normalize all of them for matching. Use the all_emails JSONB column in user_source_records for systems that expose multiple email addresses per user.
Shared accounts: Some systems have service accounts or shared mailboxes (support@acme.com, billing@acme.com). Exclude these from person-level deduplication. Maintain a blocklist of known shared-account patterns, or flag any email that appears as a primary address on more than N source records from the same app.
Name changes: Employees change names (marriage, legal name change). If you rely on name as a secondary matching signal, a name change can break an existing match. Always prefer email or employee ID as the primary matching key. Treat name as a confirming signal, not a primary one.
Rehires and transfers: Someone leaves and comes back. Their old records might be deactivated across systems while new records are created. Match on email first - if an existing canonical record exists, reactivate it rather than creating a duplicate.
Stop Hardcoding Your CRM Integrations
If you're currently maintaining a pile of if (provider === 'salesforce') branches, or evaluating how to build the CRM integrations your B2B sales team actually asks for, here's the strategic takeaway:
- Design your canonical schema first — get the entity relationships right (Accounts, Contacts, Opportunities, Pipelines, Stages) before obsessing over individual field mappings.
- Build for custom fields from day one — a normalization layer that drops custom data is only solving the easy half of the problem.
- Normalize API behavior, not just data shapes — pagination, auth, rate limiting, webhooks, and error handling all need abstraction.
- Prefer configuration over code — every
if/elsebranch you write for a specific CRM is a maintenance liability that grows linearly with your integration count. - Implement a multi-level override system — so customers can customize mappings for their specific CRM setup without filing engineering tickets.
- Plan for deduplication from the start — pulling user data from multiple SaaS apps is only half the job. Deterministic matching, normalization rules, and conflict resolution turn raw records into a reliable canonical user directory.
Data normalization should be handled by a configuration-driven unified API. By treating integration behavior as data rather than code, you decouple your application from the chaotic reality of third-party APIs. Your engineering team should be building your core product, not reading terrible API documentation.
FAQ
- What is CRM data normalization?
- CRM data normalization is the process of translating disparate data models from different CRM platforms (Salesforce, HubSpot, Pipedrive, etc.) into a single, canonical JSON schema. It abstracts away provider-specific endpoints, field naming conventions, and API behaviors so developers can interact with one standard data model.
- What are the biggest differences between Salesforce and HubSpot data models?
- Salesforce uses PascalCase fields, a separate Lead/Contact object split, SOQL for filtering, and a `__c` suffix for custom fields. HubSpot uses a flat `properties` object, no separate Lead entity (using a Lifecycle Stage property instead), `filterGroups` for search, and stores custom fields alongside standard properties.
- Why do rigid unified APIs fail with custom CRM fields?
- Rigid unified APIs rely on static 1-to-1 key-value mapping that only covers standard fields. When a customer uses custom fields or objects unique to their CRM instance, the rigid abstraction drops that data, forcing engineers to build custom point-to-point integrations anyway.
- How do you handle custom fields when normalizing CRM data?
- Use a multi-level override system that allows mappings to be customized at platform, environment, and individual account levels. Each level deep-merges on top of the previous one, so customers can add their own custom field mappings without requiring code changes or deployments.
- How does JSONata help with API integrations?
- JSONata is a functional query and transformation language for JSON that acts as a universal transformation engine. It lets you map complex JSON responses and queries declaratively without writing hardcoded execution logic, making integration behavior a data operation rather than a code deployment.