Learn how to scope an MDM hub the right way. Use a clear decision matrix, avoid hub bloat, handle gray areas, and keep performance and trust high.

What Should Live in the MDM Hub, and What Shouldn’t

What belongs in an MDM hub?

Most master data hubs fail for a simple reason: teams treat the hub like a storage problem, assuming more fields means more value.

That mindset creates a bloated hub that is slow, confusing, and hard to trust.

A strong MDM hub has one job.

It delivers clean, trusted, reusable entity data that works for operations and analytics. It does that by drawing hard boundaries.

This post gives you a formal way to draw those boundaries.

You will get:

  • A decision matrix you can use in design reviews
  • Clear rules for gray area attributes
  • Patterns that keep your hub lean
  • A list of common anti patterns you can spot fast
  • Tips for early design and mid implementation cleanup

Last week, we explored why operational and analytical models should stay separate. Operational models capture what happens. Analytical models explain what it means. When those roles blur, performance drops and logic gets messy.

Now we move one layer deeper. Even with clean operational and analytical separation, many teams turn the master data hub into a catch-all. Transactions sneak in. Metrics creep in. App-specific fields pile up. This week, we draw clear boundaries around the hub so it stays lean, trusted, and usable.

What the hub is, in plain terms

An MDM hub is the system that manages your core business entities:

  • Customer
  • Product
  • Vendor
  • Employee
  • Location
  • Asset

The hub exists to do three things well:

  1. Resolve identity across sources
  2. Choose surviving values for key attributes
  3. Distribute governed master data to consumers

That is it.

The hub is not a data warehouse.

It is not a staging area.

It is not an audit store.

It is not a transaction ledger.

It is not your analytics engine.

When you ask the hub to be those things, you trade focus for sprawl.

The real cost of hub bloat

Hub bloat is not just messy modeling.

It creates technical and org damage.

Here is what it looks like in the real world.

Performance drops first

Wide tables, constant updates, and heavy payloads slow everything down:

  • Matching takes longer
  • Merge jobs back up
  • APIs start timing out
  • UI screens load slowly
  • Indexing and storage costs climb

Then teams start tuning around the problem instead of fixing scope.

That buys time, but it does not solve the design issue.

Survivorship becomes hard to explain

Survivorship works best on stable identity and classification fields.

When you add volatile fields, survivorship turns into guesswork:

  • Which system wins for “last activity date”?
  • Which value is correct for “lifetime spend”?
  • What does “preferred product” mean this month?

The hub can store a value.

It cannot magically supply the business meaning.
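To see why stable fields are easier, here is a minimal sketch of a trust-ranked survivorship rule. The source names, the trust ranking, and the `pick_survivor` helper are assumptions made up for illustration, not a feature of any specific MDM product. The rule is easy to explain for a legal name. For something like "lifetime spend" there is no ranking that makes one source's number the right one.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical source trust ranking: a higher number wins.
SOURCE_TRUST = {"erp": 3, "crm": 2, "web_form": 1}


@dataclass(frozen=True)
class SourceValue:
    source: str
    value: str
    updated_on: date


def pick_survivor(candidates):
    """Return the value from the most trusted source; the newest wins ties."""
    non_empty = [c for c in candidates if c.value.strip()]
    if not non_empty:
        return None
    return max(
        non_empty,
        key=lambda c: (SOURCE_TRUST.get(c.source, 0), c.updated_on),
    )


legal_name = pick_survivor([
    SourceValue("crm", "Acme Corp", date(2024, 5, 1)),
    SourceValue("erp", "Acme Corporation", date(2023, 11, 12)),
    SourceValue("web_form", "acme", date(2024, 6, 30)),
])
print(legal_name.value)  # Acme Corporation
```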

Trust takes a hit

Once users find one wrong field, they doubt the rest.

That is how a hub becomes underused.

People go back to spreadsheets.

Teams build shadow masters in downstream apps.

Your “single source of truth” becomes a slogan.

Integrations get brittle

Every extra field becomes part of someone’s contract.

Then scope creep creates churn:

  • More mappings
  • More schema changes
  • More edge case logic
  • More rework during releases

A lean hub makes integrations stable.

A bloated hub makes integrations fragile.

The hub inclusion decision matrix

Use this matrix in every modeling session.

It gives you a consistent way to decide.

Step 1: Classify the data element

Every candidate field falls into one of these buckets:

  • Core identifiers
  • Canonical attributes
  • Status and lifecycle attributes
  • Classification attributes
  • Relationship and hierarchy data
  • Reference code values
  • Transactional event data
  • Derived metrics and model outputs
  • Application specific fields

Step 2: Apply the matrix

Data element type | Include in hub | Why
Core identifiers (global ID, source IDs, alternate keys) | Yes | Required for identity, matching, and linkages
Canonical attributes (legal name, status, type, key descriptors) | Yes | Defines the entity and is reused across systems
Survivorship metadata (source trust, timestamps, stewardship flags) | Yes | Explains why a value exists and how it was chosen
Relationships and hierarchies (parent child, legal entity links) | Yes | Needed for rollups, grouping, and cross entity rules
Reference data definitions (country list, status code list) | No | Manage separately, link via codes and lookups
Transactional data (orders, invoices, events) | No | High volume and volatile, belongs in systems of record or warehouses
Derived metrics (LTV, spend, averages, model scores) | No | Depends on rules and time windows, changes often
App specific fields (UI settings, workflow flags for one tool) | No | Not broadly reusable, creates noise and ownership confusion

If you want one line to remember, use this:

Store “who or what it is.”

Do not store “what happened” or “what we calculated.”
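If you want the matrix available in tooling, such as a design review checklist script, one option is to encode it as a lookup. This is a rough sketch. The `ElementType` names and the `hub_decision` helper are illustrative assumptions, and the reasons are shortened from the matrix.

```python
from enum import Enum


class ElementType(Enum):
    CORE_IDENTIFIER = "core identifier"
    CANONICAL_ATTRIBUTE = "canonical attribute"
    SURVIVORSHIP_METADATA = "survivorship metadata"
    RELATIONSHIP_OR_HIERARCHY = "relationship or hierarchy"
    REFERENCE_DEFINITION = "reference data definition"
    TRANSACTIONAL = "transactional data"
    DERIVED_METRIC = "derived metric"
    APP_SPECIFIC = "app specific field"


# (include in hub?, short reason) for each row of the matrix.
HUB_MATRIX = {
    ElementType.CORE_IDENTIFIER: (True, "required for identity and matching"),
    ElementType.CANONICAL_ATTRIBUTE: (True, "defines the entity, reused across systems"),
    ElementType.SURVIVORSHIP_METADATA: (True, "explains how a value was chosen"),
    ElementType.RELATIONSHIP_OR_HIERARCHY: (True, "needed for rollups and grouping"),
    ElementType.REFERENCE_DEFINITION: (False, "manage separately, link via codes"),
    ElementType.TRANSACTIONAL: (False, "high volume and volatile"),
    ElementType.DERIVED_METRIC: (False, "depends on rules and time windows"),
    ElementType.APP_SPECIFIC: (False, "not broadly reusable"),
}


def hub_decision(element_type):
    """Return a one-line verdict for a classified data element."""
    include, reason = HUB_MATRIX[element_type]
    verdict = "Include" if include else "Exclude"
    return f"{verdict}: {reason}"


print(hub_decision(ElementType.DERIVED_METRIC))  # Exclude: depends on rules and time windows
```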

The five question test for gray areas

Some fields do not fit cleanly.

That is normal.

Gray area examples:

  • Customer tier
  • Risk level
  • Last interaction date
  • Preferred store
  • Account health rating

Use these five questions.

Score each as Yes or No.

  1. Is it governed by the business?
  2. Is it used across more than one system or domain?
  3. Does it change slowly, not daily?
  4. Is it essential to identity, classification, or workflow rules?
  5. Would the enterprise break if this value were inconsistent across systems?

How to interpret the result

  • 4 to 5 Yes answers: candidate for the hub
  • 2 to 3 Yes answers: use an extension or satellite
  • 0 to 1 Yes answers: keep it out of the hub
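The scoring rule above is simple enough to write down as a function. Here is a minimal sketch. The `place_attribute` name and the sample answers are assumptions for illustration, and your own answers for these attributes may differ.

```python
def place_attribute(answers):
    """Map five Yes/No answers to a placement using the thresholds above."""
    if len(answers) != 5:
        raise ValueError("expected exactly five answers")
    yes_count = sum(1 for answer in answers if answer)
    if yes_count >= 4:
        return "candidate for the hub"
    if yes_count >= 2:
        return "extension or satellite"
    return "keep it out of the hub"


# Illustrative answers in question order (1 through 5).
customer_tier = [True, True, True, True, False]
preferred_store = [False, True, True, False, False]
last_interaction_date = [False, True, False, False, False]

print(place_attribute(customer_tier))          # candidate for the hub
print(place_attribute(preferred_store))        # extension or satellite
print(place_attribute(last_interaction_date))  # keep it out of the hub
```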

This is not theory.

It stops the endless debate.

How to handle common gray area attributes

Let’s make this practical.

Customer tier

Store it in the hub only if:

  • The business defines it centrally
  • It drives major workflows
  • Many systems need the same tier label

If tier is calculated from spend, do not store raw spend.

Store the tier label, plus metadata about who assigns it and when.

Risk level

If risk level is a stable category used across teams, it can belong.

If it is a model score that changes often, keep it out.

Store a reference to where the scoring logic lives, not the numeric score.

A simple compromise that works well:

  • Hub stores a coarse category, like Low, Medium, High
  • Analytics stores the detailed score and features
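As a sketch of that compromise, assume the analytics side produces a score between 0 and 1. The thresholds and function names below are hypothetical, and the real cut-offs belong to whoever owns the risk model. The point is that the hub record only changes when the coarse category changes, not every time the score moves.

```python
def risk_category(score, medium_at=0.4, high_at=0.7):
    """Map a 0-to-1 model score to the coarse label the hub stores."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high_at:
        return "High"
    if score >= medium_at:
        return "Medium"
    return "Low"


def hub_update_needed(previous_score, new_score):
    """The hub only needs an update when the coarse category changes."""
    return risk_category(previous_score) != risk_category(new_score)


print(risk_category(0.82))            # High
print(hub_update_needed(0.41, 0.55))  # False: still Medium
print(hub_update_needed(0.55, 0.75))  # True: Medium to High
```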

Last interaction date

This is an event.

It changes constantly.

Do not store the raw timestamp in the core master record.

If a team needs “recency” for segmentation, use one of these:

  • A derived view outside the hub
  • A satellite table updated on a known cadence
  • A flag like “Active in last 30 days” owned by marketing

Pick one pattern and document it.
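For the flag option, the calculation can live outside the hub and run on a known cadence. Here is a minimal sketch. The function name and the 30-day default are assumptions based on the example flag above.

```python
from datetime import date, timedelta


def active_in_last_30_days(last_interaction, as_of, window_days=30):
    """Derive a recency flag without storing the raw timestamp in the hub."""
    if last_interaction is None:
        return False
    age = as_of - last_interaction
    # A future-dated interaction is treated as bad data, not as activity.
    return timedelta(0) <= age <= timedelta(days=window_days)


as_of = date(2024, 6, 20)
print(active_in_last_30_days(date(2024, 6, 1), as_of))  # True
print(active_in_last_30_days(date(2024, 4, 1), as_of))  # False
```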

Master data vs reference data

Teams mix these up all the time.

It causes bloat and weak governance.

Master data is entity centric.

Reference data is code and value centric.

A clean way to separate them

Master data:

  • Customer
  • Vendor
  • Product
  • Location

Reference data:

  • Country codes
  • Currency codes
  • Status codes
  • Product category lists
  • Payment terms

Your master record should store reference codes, not reference definitions.

Example:

  • Customer.CountryCode = “US” belongs in the hub
  • The full list of countries does not belong in the customer master table
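In code terms, the split might look like the sketch below. The record layout, the tiny country table, and the `country_name` helper are all hypothetical. The master record carries only the code, and the definition is looked up from a separately governed reference set.

```python
# Hypothetical reference data, managed outside the customer master.
COUNTRY_REFERENCE = {"US": "United States", "DE": "Germany", "JP": "Japan"}

# The master record stores the code only.
customer = {
    "customer_id": "C-1001",
    "legal_name": "Acme Corporation",
    "country_code": "US",
}


def country_name(record, reference):
    """Resolve a record's country code against the reference set."""
    code = record["country_code"]
    if code not in reference:
        raise KeyError(f"unknown country code: {code}")
    return reference[code]


print(country_name(customer, COUNTRY_REFERENCE))  # United States
```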

Reference data still needs governance.

It just needs different governance.

It usually needs:

  • Version control
  • Approval to add or retire values
  • Mapping rules across systems

Patterns that prevent hub bloat

Scope rules are not enough.

You also need modeling patterns that enforce boundaries.

Pattern 1: Core plus satellites

Keep the core entity lean.

Push optional groups into satellite tables.

Good satellite candidates:

  • Contact details
  • Alternate addresses
  • Communications preferences
  • Secondary classifications

This improves performance and keeps the core model readable.

Pattern 2: Extension tables for edge teams

Some teams need local fields.

That is fine.

Do not pollute the canonical model to support one workflow.

Use extension tables linked to the master ID.

That gives you:

  • Flexibility for local needs
  • Clear ownership boundaries
  • No impact on other consumers
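Patterns 1 and 2 share one mechanic: everything hangs off the master ID, and the core record never grows. A rough sketch follows, with hypothetical class and field names. `SupportDeskExtension` stands in for any single-team extension.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CustomerCore:
    master_id: str
    legal_name: str
    status: str
    country_code: str


@dataclass(frozen=True)
class ContactSatellite:
    master_id: str
    email: str
    phone: str


@dataclass(frozen=True)
class SupportDeskExtension:
    master_id: str
    escalation_queue: str


def assemble_view(core, *parts):
    """Build a consumer-specific view by joining parts on master_id."""
    view = asdict(core)
    for part in parts:
        if part.master_id != core.master_id:
            raise ValueError("part belongs to a different master record")
        extra = asdict(part)
        extra.pop("master_id")
        view.update(extra)
    return view


core = CustomerCore("C-1001", "Acme Corporation", "Active", "US")
contact = ContactSatellite("C-1001", "ap@acme.example", "+1-555-0100")
support = SupportDeskExtension("C-1001", "tier-2")

billing_view = assemble_view(core, contact)           # no support fields
support_view = assemble_view(core, contact, support)  # adds escalation_queue
```

Consumers that do not ask for the extension never see its fields, which is the ownership boundary the pattern is meant to give you.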

Pattern 3: Canonical model discipline

A canonical model is a contract.

Treat it like one.

If a new attribute is proposed, require:

  • A business owner
  • A definition
  • A reason it is shared
  • A decision on survivorship
  • A distribution plan

If those do not exist, the field does not enter the hub.
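The gate can be as small as a completeness check on the proposal. This is a minimal sketch with hypothetical field names that mirror the five requirements above. A real intake process would add review and approval steps on top of it.

```python
from dataclasses import dataclass, fields


@dataclass
class AttributeProposal:
    name: str
    business_owner: str = ""
    definition: str = ""
    shared_reason: str = ""
    survivorship_rule: str = ""
    distribution_plan: str = ""


def missing_requirements(proposal):
    """List the required items that are still blank on a proposal."""
    return [
        f.name
        for f in fields(proposal)
        if not getattr(proposal, f.name).strip()
    ]


proposal = AttributeProposal(
    name="customer_tier",
    business_owner="Head of Sales Ops",
    definition="Centrally assigned service tier",
)
print(missing_requirements(proposal))
# ['shared_reason', 'survivorship_rule', 'distribution_plan']
```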

Pattern 4: Keep analytics in analytics

Your warehouse, lakehouse, or marts should own:

  • Metrics
  • Aggregates
  • Model outputs
  • Time window calculations

The hub should publish stable IDs and core attributes.

Analytics joins those to facts and metrics.

That is a clean separation.
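Here is a small sketch of that join, using in-memory data in place of real hub extracts and warehouse tables. All names and values are made up for illustration. Spend is computed on the analytics side and never written back to the hub record.

```python
from collections import defaultdict

# Hypothetical hub extract: stable IDs and core attributes only.
hub_customers = {
    "C-1": {"legal_name": "Acme Corporation", "country_code": "US"},
    "C-2": {"legal_name": "Globex GmbH", "country_code": "DE"},
}

# Hypothetical order facts owned by the warehouse: (master_id, amount).
order_facts = [("C-1", 120.0), ("C-2", 80.0), ("C-1", 30.0)]


def spend_report(customers, facts):
    """Join hub attributes to warehouse facts and total spend per customer."""
    totals = defaultdict(float)
    for master_id, amount in facts:
        if master_id in customers:  # skip facts with unknown master IDs
            totals[master_id] += amount
    return [
        {
            "master_id": master_id,
            "legal_name": attrs["legal_name"],
            "total_spend": totals[master_id],
        }
        for master_id, attrs in customers.items()
    ]


report = spend_report(hub_customers, order_facts)
print(report[0])
# {'master_id': 'C-1', 'legal_name': 'Acme Corporation', 'total_spend': 150.0}
```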

Anti patterns you should watch for

These show up in real programs, often in year one.

Anti pattern 1: The kitchen sink hub

This is the hub that tries to answer every question.

It usually starts with good intent.

Then every team adds “just one more field.”

Six months later, nobody knows what is authoritative.

Anti pattern 2: Metrics in the golden record

LTV, spend, propensity, and risk scores in the hub create confusion.

They go stale.

They depend on context.

They change when logic changes.

Users will treat them as truth anyway.

That is why this is dangerous.

Anti pattern 3: App flags in the global model

If a field exists only for one tool, it does not belong in the shared hub.

It creates:

  • Ownership fights
  • Update conflicts
  • Dead fields that never get cleaned up

Anti pattern 4: Transaction stuffing

Orders, invoices, shipments, and interactions do not belong in the hub.

They belong in systems of record and analytics stores.

The hub should reference them by ID when needed.

Do not duplicate the records.

Anti pattern 5: No gate for scope changes

If anyone can add fields, you will get bloat.

You need a scope gate.

Make it formal.

Make it routine.

Make it fast.

Early design checklist

If you are starting fresh, do these before you build.

  1. Define the hub purpose in one paragraph
  2. List in scope domains and out of scope data types
  3. Build the canonical model for one domain first
  4. Create an attribute intake process
  5. Define system of record per attribute
  6. Decide how reference data will be managed
  7. Decide how extensions will be handled
  8. Decide how metrics will be exposed to consumers

This keeps you from designing a hub that is hard to unwind later.

Mid implementation cleanup tips

If you are already building, you can still fix scope.

1. Run a field usage audit

Find:

  • Fields that are always null
  • Fields used by only one consumer
  • Fields sourced from only one system
  • Fields that change constantly
  • Fields that users complain about

Those are scope red flags.
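A first pass at the audit can be scripted. The sketch below covers two of the checks, always-null fields and single-consumer fields, and assumes you can export sample records plus a field-to-consumer mapping. Both inputs and the `audit_fields` helper are hypothetical.

```python
def audit_fields(records, consumers_by_field):
    """Flag fields that are always null or read by only one consumer."""
    field_names = sorted({name for record in records for name in record})
    findings = {}
    for name in field_names:
        flags = []
        if all(record.get(name) is None for record in records):
            flags.append("always null")
        if len(consumers_by_field.get(name, ())) == 1:
            flags.append("single consumer")
        if flags:
            findings[name] = flags
    return findings


records = [
    {"legal_name": "Acme Corporation", "fax_number": None, "crm_ui_theme": "dark"},
    {"legal_name": "Globex GmbH", "fax_number": None, "crm_ui_theme": "light"},
]
consumers_by_field = {
    "legal_name": {"erp", "crm", "warehouse"},
    "fax_number": {"erp", "crm"},
    "crm_ui_theme": {"crm"},
}

findings = audit_fields(records, consumers_by_field)
print(findings)
# {'crm_ui_theme': ['single consumer'], 'fax_number': ['always null']}
```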

2. Split the model instead of tuning forever

If your core tables are wide and slow, refactor:

  • Move optional groups to satellites
  • Move app fields to extensions
  • Move metrics out to marts or views

This is real work, but it pays off.

3. Deprecate before you delete

If a field must go, do it in stages:

  • Stop populating it
  • Mark it deprecated in docs
  • Update consumers
  • Remove it after a safe window

That reduces risk.

4. Rebuild trust with a smaller promise

If your hub lost credibility, narrow the promise:

  • We master identity
  • We master core descriptors
  • We master hierarchies

Then deliver that cleanly for 60 days.

Trust comes back when the data stays right.

Example: vendor hub scope

Here is a clean vendor scope most teams can agree on.

In the hub

  • Global vendor ID
  • Source system IDs
  • Legal entity name
  • Tax classification
  • Status (Active, Suspended)
  • Country and currency codes
  • Primary contact email
  • Parent child vendor relationships

Out of the hub

  • Open invoices
  • Payment history
  • Last order date
  • Spend to date
  • Discount logic
  • Email open and click behavior
  • Campaign participation details

That vendor master is both useful and stable.

Downstream systems can join to spend and invoices elsewhere.

Your next step

If you want a hub that stays healthy, do two things.

First, decide scope with a repeatable test.

Second, enforce scope with a model pattern.

Use the decision matrix above in every design review.

Use satellites and extensions to handle edge needs.

Keep metrics and transactions out of the golden record.

Your hub will run faster, match better, and earn trust.