What Should Live in the MDM Hub, and What Shouldn’t
What belongs in an MDM hub?
Most master data hubs fail for a simple reason: teams treat the hub like a storage problem, assuming more fields means more value.
That mindset creates a bloated hub that is slow, confusing, and hard to trust.
A strong MDM hub has one job.
It delivers clean, trusted, reusable entity data that works for operations and analytics. It does that by drawing hard boundaries.
This post gives you a formal way to draw those boundaries.
You will get:
- A decision matrix you can use in design reviews
- Clear rules for gray area attributes
- Patterns that keep your hub lean
- A list of common anti patterns you can spot fast
- Tips for early design and mid implementation cleanup
Last week, we explored why operational and analytical models should stay separate. Operational models capture what happens. Analytical models explain what it means. When those roles blur, performance drops and logic gets messy.
Now we move one layer deeper. Even with clean operational and analytical separation, many teams turn the master data hub into a catch-all. Transactions sneak in. Metrics creep in. App-specific fields pile up. This week, we draw clear boundaries around the hub so it stays lean, trusted, and usable.
What the hub is, in plain terms
An MDM hub is the system that manages your core business entities:
- Customer
- Product
- Vendor
- Employee
- Location
- Asset
The hub exists to do three things well:
- Resolve identity across sources
- Apply survivorship rules to select the winning value for key attributes
- Distribute governed master data to consumers
That is it.
The hub is not a data warehouse.
It is not a staging area.
It is not an audit store.
It is not a transaction ledger.
It is not your analytics engine.
When you ask the hub to be those things, you trade focus for sprawl.
The real cost of hub bloat
Hub bloat is not just messy modeling.
It creates technical and org damage.
Here is what it looks like in the real world.
Performance drops first
Wide tables, constant updates, and heavy payloads slow everything down:
- Matching takes longer
- Merge jobs back up
- APIs start timing out
- UI screens load slowly
- Indexing and storage costs climb
Then teams start tuning around the problem instead of fixing scope.
That buys time, but it does not solve the design issue.
Survivorship becomes hard to explain
Survivorship works best on stable identity and classification fields.
When you add volatile fields, survivorship turns into guesswork:
- Which system wins for “last activity date”?
- Which value is correct for “lifetime spend”?
- What does “preferred product” mean this month?
The hub can store a value.
It cannot magically supply the business meaning.
Trust takes a hit
Once users find one wrong field, they doubt the rest.
That is how a hub becomes underused.
People go back to spreadsheets.
Teams build shadow masters in downstream apps.
Your “single source of truth” becomes a slogan.
Integrations get brittle
Every extra field becomes part of someone’s contract.
Then scope creep creates churn:
- More mappings
- More schema changes
- More edge case logic
- More rework during releases
A lean hub makes integrations stable.
A bloated hub makes integrations fragile.
The hub inclusion decision matrix
Use this matrix in every modeling session.
It gives you a consistent way to decide.
Step 1: Classify the data element
Every candidate field falls into one of these buckets:
- Core identifiers
- Canonical attributes
- Status and lifecycle attributes
- Classification attributes
- Relationship and hierarchy data
- Reference code values
- Transactional event data
- Derived metrics and model outputs
- Application specific fields
Step 2: Apply the matrix
| Data element type | Include in hub | Why |
|---|---|---|
| Core identifiers (global ID, source IDs, alternate keys) | Yes | Required for identity, matching, and linkages |
| Canonical attributes (legal name, status, type, key descriptors) | Yes | Defines the entity and is reused across systems |
| Survivorship metadata (source trust, timestamps, stewardship flags) | Yes | Explains why a value exists and how it was chosen |
| Relationships and hierarchies (parent child, legal entity links) | Yes | Needed for rollups, grouping, and cross entity rules |
| Reference data definitions (country list, status code list) | No | Manage separately, link via codes and lookups |
| Transactional data (orders, invoices, events) | No | High volume and volatile, belongs in systems of record or warehouses |
| Derived metrics (LTV, spend, averages, model scores) | No | Depends on rules and time windows, changes often |
| App specific fields (UI settings, workflow flags for one tool) | No | Not broadly reusable, creates noise and ownership confusion |
If you want one line to remember, use this:
Store “who or what it is.”
Do not store “what happened” or “what we calculated.”
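The matrix above can be sketched as a simple lookup. The bucket names here are illustrative labels for this post's categories, not fields from any specific MDM product:

```python
# Sketch of the inclusion matrix as a lookup table.
# Bucket names are illustrative, not from any specific MDM product.
HUB_SCOPE = {
    "core_identifier": True,        # global ID, source IDs, alternate keys
    "canonical_attribute": True,    # legal name, status, type, key descriptors
    "survivorship_metadata": True,  # source trust, timestamps, steward flags
    "relationship": True,           # parent-child links, hierarchies
    "reference_definition": False,  # manage code lists separately
    "transactional": False,         # orders, invoices, events
    "derived_metric": False,        # LTV, spend, model scores
    "app_specific": False,          # one tool's UI or workflow fields
}

def belongs_in_hub(element_type: str) -> bool:
    """Return True if this element type belongs in the hub core."""
    # Default to exclusion: an unclassified field stays out until reviewed.
    return HUB_SCOPE.get(element_type, False)
```

Defaulting unknown types to "out" mirrors the scope-gate mindset: a field earns its way in.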
The five question test for gray areas
Some fields do not fit cleanly.
That is normal.
Gray area examples:
- Customer tier
- Risk level
- Last interaction date
- Preferred store
- Account health rating
Use these five questions.
Score each as Yes or No.
- Is it governed by the business?
- Is it used across more than one system or domain?
- Does it change slowly, not daily?
- Is it essential to identity, classification, or workflow rules?
- Would the enterprise break if this value is inconsistent?
How to interpret the result
- 4 to 5 Yes answers: candidate for the hub
- 2 to 3 Yes answers: use an extension or satellite
- 0 to 1 Yes answers: keep it out of the hub
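The scoring rule is mechanical enough to sketch in code. The question keys below are shorthand I made up for the five questions above:

```python
def hub_fit(answers: dict[str, bool]) -> str:
    """Score the five-question test and return a placement recommendation."""
    questions = [
        "business_governed",               # governed by the business?
        "cross_system",                    # used across more than one system?
        "slowly_changing",                 # changes slowly, not daily?
        "identity_or_workflow_critical",   # essential to identity or rules?
        "breaks_if_inconsistent",          # enterprise breaks if inconsistent?
    ]
    score = sum(bool(answers.get(q)) for q in questions)
    if score >= 4:
        return "hub"
    if score >= 2:
        return "extension_or_satellite"
    return "out"

# Example: a centrally defined customer tier used by several systems
tier = {
    "business_governed": True,
    "cross_system": True,
    "slowly_changing": True,
    "identity_or_workflow_critical": True,
    "breaks_if_inconsistent": False,
}
print(hub_fit(tier))  # hub
```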
This is not theory.
It stops the endless debate.
How to handle common gray area attributes
Let’s make this practical.
Customer tier
Store it in the hub only if:
- The business defines it centrally
- It drives major workflows
- Many systems need the same tier label
If tier is calculated from spend, do not store raw spend.
Store the tier label, plus metadata about who assigns it and when.
Risk level
If risk level is a stable category used across teams, it can belong.
If it is a model score that changes often, keep it out.
Point consumers to where the scoring logic lives instead of storing the numeric score.
A simple compromise that works well:
- Hub stores a coarse category, like Low, Medium, High
- Analytics stores the detailed score and features
Last interaction date
This is an event.
It changes constantly.
Do not store the raw timestamp in the core master record.
If a team needs “recency” for segmentation, use one of these:
- A derived view outside the hub
- A satellite table updated on a known cadence
- A flag like “Active in last 30 days” owned by marketing
Pick one pattern and document it.
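As a minimal sketch of the third option, the coarse flag can be derived outside the hub core on a known cadence. Names and the 30-day window are illustrative:

```python
from datetime import date, timedelta

def recency_flag(last_interaction: date, today: date, window_days: int = 30) -> bool:
    """Derive an 'active in last N days' flag outside the core master record.

    The raw event timestamp stays in the event store; only this coarse
    flag is published to consumers on an agreed cadence.
    """
    return (today - last_interaction) <= timedelta(days=window_days)

print(recency_flag(date(2024, 5, 20), today=date(2024, 6, 1)))  # True (12 days)
```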
Master data vs reference data
Teams mix these up all the time.
It causes bloat and weak governance.
Master data is entity centric.
Reference data is code and value centric.
A clean way to separate them
Master data:
- Customer
- Vendor
- Product
- Location
Reference data:
- Country codes
- Currency codes
- Status codes
- Product category lists
- Payment terms
Your master record should store reference codes, not reference definitions.
Example:
- Customer.CountryCode = “US” belongs in the hub
- The full list of countries does not belong in the customer master table
Reference data still needs governance.
It just needs different governance.
It usually needs:
- Version control
- Approval to add or retire values
- Mapping rules across systems
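A hedged sketch of the code-versus-definition split: the master record carries only the code, and validation happens against a governed, versioned list managed elsewhere. The list contents and version tag are illustrative:

```python
# Governed reference list, managed outside the customer master table.
# Values and the version tag are illustrative.
COUNTRY_CODES = {"version": "2024-06", "values": {"US", "CA", "MX"}}

def validate_country_code(code: str) -> bool:
    """The hub stores the code; validation runs against the reference list."""
    return code in COUNTRY_CODES["values"]

# The master record holds the code only, never the country definition.
customer = {"customer_id": "C-1001", "country_code": "US"}
print(validate_country_code(customer["country_code"]))  # True
```

Versioning the list separately lets you approve, add, or retire values without touching the master schema.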
Patterns that prevent hub bloat
Scope rules are not enough.
You also need modeling patterns that enforce boundaries.
Pattern 1: Core plus satellites
Keep the core entity lean.
Push optional groups into satellite tables.
Good satellite candidates:
- Contact details
- Alternate addresses
- Communications preferences
- Secondary classifications
This improves performance and keeps the core model readable.
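One way to sketch the core-plus-satellite split, using hypothetical entity and field names:

```python
from dataclasses import dataclass, field

@dataclass
class CustomerCore:
    """Lean core entity: identity and canonical attributes only."""
    customer_id: str
    legal_name: str
    status: str

@dataclass
class ContactSatellite:
    """Optional attribute group in a satellite, linked back by the master ID."""
    customer_id: str
    emails: list[str] = field(default_factory=list)
    phones: list[str] = field(default_factory=list)

core = CustomerCore("C-1001", "Acme Corp", "Active")
contacts = ContactSatellite("C-1001", emails=["ap@acme.example"])
```

Consumers that only need identity and status read the core and never pay for the optional groups.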
Pattern 2: Extension tables for edge teams
Some teams need local fields.
That is fine.
Do not pollute the canonical model to support one workflow.
Use extension tables linked to the master ID.
That gives you:
- Flexibility for local needs
- Clear ownership boundaries
- No impact on other consumers
Pattern 3: Canonical model discipline
A canonical model is a contract.
Treat it like one.
If a new attribute is proposed, require:
- A business owner
- A definition
- A reason it is shared
- A decision on survivorship
- A distribution plan
If those do not exist, the field does not enter the hub.
Pattern 4: Keep analytics in analytics
Your warehouse, lakehouse, or marts should own:
- Metrics
- Aggregates
- Model outputs
- Time window calculations
The hub should publish stable IDs and core attributes.
Analytics joins those to facts and metrics.
That is a clean separation.
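The separation looks like this in practice: the hub side publishes IDs and core attributes, the analytics side owns the facts and does the math. Table and field names below are illustrative:

```python
# Hub publishes stable IDs and core attributes; analytics owns the metrics.
# Tables and field names are illustrative.
hub_customers = [
    {"customer_id": "C-1001", "legal_name": "Acme Corp"},
    {"customer_id": "C-1002", "legal_name": "Globex Inc"},
]
order_facts = [  # lives in the warehouse, not the hub
    {"customer_id": "C-1001", "amount": 120.0},
    {"customer_id": "C-1001", "amount": 80.0},
]

# Aggregate on the analytics side, then join on the governed ID.
spend: dict[str, float] = {}
for fact in order_facts:
    spend[fact["customer_id"]] = spend.get(fact["customer_id"], 0.0) + fact["amount"]

report = [
    {**c, "total_spend": spend.get(c["customer_id"], 0.0)} for c in hub_customers
]
print(report[0])  # {'customer_id': 'C-1001', 'legal_name': 'Acme Corp', 'total_spend': 200.0}
```

Note that "total_spend" exists only in the report, never in the master record.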
Anti patterns you should watch for
These show up in real programs, often in year one.
Anti pattern 1: The kitchen sink hub
This is the hub that tries to answer every question.
It usually starts with good intent.
Then every team adds “just one more field.”
Six months later, nobody knows what is authoritative.
Anti pattern 2: Metrics in the golden record
LTV, spend, propensity, and risk scores in the hub create confusion.
They go stale.
They depend on context.
They change when logic changes.
Users will treat them as truth anyway.
That is why this is dangerous.
Anti pattern 3: App flags in the global model
If a field exists only for one tool, it does not belong in the shared hub.
It creates:
- Ownership fights
- Update conflicts
- Dead fields that never get cleaned up
Anti pattern 4: Transaction stuffing
Orders, invoices, shipments, and interactions do not belong in the hub.
They belong in systems of record and analytics stores.
The hub should reference them by ID when needed.
Do not duplicate the records.
Anti pattern 5: No gate for scope changes
If anyone can add fields, you will get bloat.
You need a scope gate.
Make it formal.
Make it routine.
Make it fast.
Early design checklist
If you are starting fresh, do these before you build.
- Define the hub purpose in one paragraph
- List in scope domains and out of scope data types
- Build the canonical model for one domain first
- Create an attribute intake process
- Define system of record per attribute
- Decide how reference data will be managed
- Decide how extensions will be handled
- Decide how metrics will be exposed to consumers
This keeps you from designing a hub that is hard to unwind later.
Mid implementation cleanup tips
If you are already building, you can still fix scope.
1. Run a field usage audit
Find:
- Fields that are always null
- Fields used by only one consumer
- Fields sourced from only one system
- Fields that change constantly
- Fields that users complain about
Those are scope red flags.
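The audit itself can be sketched as a pass over per-field usage stats. The stat names and thresholds here are illustrative assumptions, not a standard:

```python
def audit_fields(field_stats: list[dict]) -> list[str]:
    """Flag fields whose usage profile suggests they may not belong in the hub.

    Each entry is one field with illustrative stats: null_rate,
    consumer_count, source_count, and daily_change_rate.
    """
    flagged = []
    for f in field_stats:
        if (
            f["null_rate"] > 0.95             # always (or nearly always) null
            or f["consumer_count"] <= 1       # used by only one consumer
            or f["source_count"] <= 1         # sourced from only one system
            or f["daily_change_rate"] > 0.5   # changes constantly
        ):
            flagged.append(f["name"])
    return flagged

stats = [
    {"name": "legal_name", "null_rate": 0.01, "consumer_count": 7,
     "source_count": 3, "daily_change_rate": 0.001},
    {"name": "last_click_ts", "null_rate": 0.10, "consumer_count": 4,
     "source_count": 2, "daily_change_rate": 0.9},
]
print(audit_fields(stats))  # ['last_click_ts']
```

The output is a review queue, not an automatic delete list; a steward still decides.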
2. Split the model instead of tuning forever
If your core tables are wide and slow, refactor:
- Move optional groups to satellites
- Move app fields to extensions
- Move metrics out to marts or views
This is real work, but it pays off.
3. Deprecate before you delete
If a field must go, do it in stages:
- Stop populating it
- Mark it deprecated in docs
- Update consumers
- Remove it after a safe window
That reduces risk.
4. Rebuild trust with a smaller promise
If your hub lost credibility, narrow the promise:
- We master identity
- We master core descriptors
- We master hierarchies
Then deliver that cleanly for 60 days.
Trust comes back when the data stays right.
Example: vendor hub scope
Here is a clean vendor scope most teams can agree on.
In the hub
- Global vendor ID
- Source system IDs
- Legal entity name
- Tax classification
- Status (Active, Suspended)
- Country and currency codes
- Primary contact email
- Parent child vendor relationships
Out of the hub
- Open invoices
- Payment history
- Last order date
- Spend to date
- Discount logic
- Email open and click behavior
- Campaign participation details
That vendor master is both useful and stable.
Downstream systems can join to spend and invoices elsewhere.
Your next step
If you want a hub that stays healthy, do two things.
First, decide scope with a repeatable test.
Second, enforce scope with a model pattern.
Use the decision matrix above in every design review.
Use satellites and extensions to handle edge needs.
Keep metrics and transactions out of the golden record.
Your hub will run faster, match better, and earn trust.


