What Should Live in the MDM Hub, and What Shouldn’t
What belongs in an MDM hub?
Most master data hubs fail for a simple reason: teams treat the hub like a storage problem, assuming more fields means more value.
That mindset creates a bloated hub that is slow, confusing, and hard to trust.
A strong MDM hub has one job.
It delivers clean, trusted, reusable entity data that works for operations and analytics. It does that by drawing hard boundaries.
This post gives you a formal way to draw those boundaries.
You will get:
- A decision matrix you can use in design reviews
- Clear rules for gray area attributes
- Patterns that keep your hub lean
- A list of common anti patterns you can spot fast
- Tips for early design and mid implementation cleanup
Last week, we explored why operational and analytical models should stay separate. Operational models capture what happens. Analytical models explain what it means. When those roles blur, performance drops and logic gets messy.
Now we move one layer deeper. Even with clean operational and analytical separation, many teams turn the master data hub into a catch-all. Transactions sneak in. Metrics creep in. App-specific fields pile up. This week, we draw clear boundaries around the hub so it stays lean, trusted, and usable.
What the hub is, in plain terms
An MDM hub is the system that manages your core business entities:
- Customer
- Product
- Vendor
- Employee
- Location
- Asset
The hub exists to do three things well:
- Resolve identity across sources
- Apply survivorship rules to select the winning value for key attributes
- Distribute governed master data to consumers
That is it.
The hub is not a data warehouse.
It is not a staging area.
It is not an audit store.
It is not a transaction ledger.
It is not your analytics engine.
When you ask the hub to be those things, you trade focus for sprawl.
The real cost of hub bloat
Hub bloat is not just messy modeling.
It creates technical and org damage.
Here is what it looks like in the real world.
Performance drops first
Wide tables, constant updates, and heavy payloads slow everything down:
- Matching takes longer
- Merge jobs back up
- APIs start timing out
- UI screens load slowly
- Indexing and storage costs climb
Then teams start tuning around the problem instead of fixing scope.
That buys time, but it does not solve the design issue.
Survivorship becomes hard to explain
Survivorship works best on stable identity and classification fields.
When you add volatile fields, survivorship turns into guesswork:
- Which system wins for “last activity date”?
- Which value is correct for “lifetime spend”?
- What does “preferred product” mean this month?
The hub can store a value.
It cannot magically supply the business meaning.
Trust takes a hit
Once users find one wrong field, they doubt the rest.
That is how a hub becomes underused.
People go back to spreadsheets.
Teams build shadow masters in downstream apps.
Your “single source of truth” becomes a slogan.
Integrations get brittle
Every extra field becomes part of someone’s contract.
Then scope creep creates churn:
- More mappings
- More schema changes
- More edge case logic
- More rework during releases
A lean hub makes integrations stable.
A bloated hub makes integrations fragile.
The hub inclusion decision matrix
Use this matrix in every modeling session.
It gives you a consistent way to decide.
Step 1: Classify the data element
Every candidate field falls into one of these buckets:
- Core identifiers
- Canonical attributes
- Status and lifecycle attributes
- Classification attributes
- Relationship and hierarchy data
- Reference code values
- Transactional event data
- Derived metrics and model outputs
- Application specific fields
Step 2: Apply the matrix
| Data element type | Include in hub | Why |
|---|---|---|
| Core identifiers (global ID, source IDs, alternate keys) | Yes | Required for identity, matching, and linkages |
| Canonical attributes (legal name, status, type, key descriptors) | Yes | Defines the entity and is reused across systems |
| Survivorship metadata (source trust, timestamps, stewardship flags) | Yes | Explains why a value exists and how it was chosen |
| Relationships and hierarchies (parent child, legal entity links) | Yes | Needed for rollups, grouping, and cross entity rules |
| Reference data definitions (country list, status code list) | No | Manage separately, link via codes and lookups |
| Transactional data (orders, invoices, events) | No | High volume and volatile, belongs in systems of record or warehouses |
| Derived metrics (LTV, spend, averages, model scores) | No | Depends on rules and time windows, changes often |
| App specific fields (UI settings, workflow flags for one tool) | No | Not broadly reusable, creates noise and ownership confusion |
If you want one line to remember, use this:
Store “who or what it is.”
Do not store “what happened” or “what we calculated.”
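The matrix above can be sketched as a simple lookup. The bucket names here are illustrative labels for this post's categories, not fields from any specific MDM product:

```python
# Sketch of the inclusion matrix as a lookup table.
# Bucket names are illustrative, not from any specific MDM product.
HUB_SCOPE = {
    "core_identifier": True,        # global ID, source IDs, alternate keys
    "canonical_attribute": True,    # legal name, status, type, key descriptors
    "survivorship_metadata": True,  # source trust, timestamps, steward flags
    "relationship": True,           # parent-child links, hierarchies
    "reference_definition": False,  # manage code lists separately
    "transactional": False,         # orders, invoices, events
    "derived_metric": False,        # LTV, spend, model scores
    "app_specific": False,          # one tool's UI or workflow fields
}

def belongs_in_hub(element_type: str) -> bool:
    """Return True if this element type belongs in the hub core."""
    # Default to exclusion: an unclassified field stays out until reviewed.
    return HUB_SCOPE.get(element_type, False)
```

Defaulting unknown types to "out" mirrors the scope-gate mindset: a field earns its way in.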
The five question test for gray areas
Some fields do not fit cleanly.
That is normal.
Gray area examples:
- Customer tier
- Risk level
- Last interaction date
- Preferred store
- Account health rating
Use these five questions.
Score each as Yes or No.
- Is it governed by the business?
- Is it used across more than one system or domain?
- Does it change slowly, not daily?
- Is it essential to identity, classification, or workflow rules?
- Would the enterprise break if this value is inconsistent?
How to interpret the result
- 4 to 5 Yes answers: candidate for the hub
- 2 to 3 Yes answers: use an extension or satellite
- 0 to 1 Yes answers: keep it out of the hub
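The scoring rule is mechanical enough to sketch in code. The question keys below are shorthand I made up for the five questions above:

```python
def hub_fit(answers: dict[str, bool]) -> str:
    """Score the five-question test and return a placement recommendation."""
    questions = [
        "business_governed",               # governed by the business?
        "cross_system",                    # used across more than one system?
        "slowly_changing",                 # changes slowly, not daily?
        "identity_or_workflow_critical",   # essential to identity or rules?
        "breaks_if_inconsistent",          # enterprise breaks if inconsistent?
    ]
    score = sum(bool(answers.get(q)) for q in questions)
    if score >= 4:
        return "hub"
    if score >= 2:
        return "extension_or_satellite"
    return "out"

# Example: a centrally defined customer tier used by several systems
tier = {
    "business_governed": True,
    "cross_system": True,
    "slowly_changing": True,
    "identity_or_workflow_critical": True,
    "breaks_if_inconsistent": False,
}
print(hub_fit(tier))  # hub
```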
This is not theory.
It stops the endless debate.
How to handle common gray area attributes
Let’s make this practical.
Customer tier
Store it in the hub only if:
- The business defines it centrally
- It drives major workflows
- Many systems need the same tier label
If tier is calculated from spend, do not store raw spend.
Store the tier label, plus metadata about who assigns it and when.
Risk level
If risk level is a stable category used across teams, it can belong.
If it is a model score that changes often, keep it out.
Point consumers to where the scoring logic lives instead of storing the numeric score.
A simple compromise that works well:
- Hub stores a coarse category, like Low, Medium, High
- Analytics stores the detailed score and features
Last interaction date
This is an event.
It changes constantly.
Do not store the raw timestamp in the core master record.
If a team needs “recency” for segmentation, use one of these:
- A derived view outside the hub
- A satellite table updated on a known cadence
- A flag like “Active in last 30 days” owned by marketing
Pick one pattern and document it.
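As a minimal sketch of the third option, the coarse flag can be derived outside the hub core on a known cadence. Names and the 30-day window are illustrative:

```python
from datetime import date, timedelta

def recency_flag(last_interaction: date, today: date, window_days: int = 30) -> bool:
    """Derive an 'active in last N days' flag outside the core master record.

    The raw event timestamp stays in the event store; only this coarse
    flag is published to consumers on an agreed cadence.
    """
    return (today - last_interaction) <= timedelta(days=window_days)

print(recency_flag(date(2024, 5, 20), today=date(2024, 6, 1)))  # True (12 days)
```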
Master data vs reference data
Teams mix these up all the time.
It causes bloat and weak governance.
Master data is entity centric.
Reference data is code and value centric.
A clean way to separate them
Master data:
- Customer
- Vendor
- Product
- Location
Reference data:
- Country codes
- Currency codes
- Status codes
- Product category lists
- Payment terms
Your master record should store reference codes, not reference definitions.
Example:
- Customer.CountryCode = “US” belongs in the hub
- The full list of countries does not belong in the customer master table
Reference data still needs governance.
It just needs different governance.
It usually needs:
- Version control
- Approval to add or retire values
- Mapping rules across systems
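A hedged sketch of the code-versus-definition split: the master record carries only the code, and validation happens against a governed, versioned list managed elsewhere. The list contents and version tag are illustrative:

```python
# Governed reference list, managed outside the customer master table.
# Values and the version tag are illustrative.
COUNTRY_CODES = {"version": "2024-06", "values": {"US", "CA", "MX"}}

def validate_country_code(code: str) -> bool:
    """The hub stores the code; validation runs against the reference list."""
    return code in COUNTRY_CODES["values"]

# The master record holds the code only, never the country definition.
customer = {"customer_id": "C-1001", "country_code": "US"}
print(validate_country_code(customer["country_code"]))  # True
```

Versioning the list separately lets you approve, add, or retire values without touching the master schema.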
Patterns that prevent hub bloat
Scope rules are not enough.
You also need modeling patterns that enforce boundaries.
Pattern 1: Core plus satellites
Keep the core entity lean.
Push optional groups into satellite tables.
Good satellite candidates:
- Contact details
- Alternate addresses
- Communications preferences
- Secondary classifications
This improves performance and keeps the core model readable.
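One way to sketch the core-plus-satellite split, using hypothetical entity and field names:

```python
from dataclasses import dataclass, field

@dataclass
class CustomerCore:
    """Lean core entity: identity and canonical attributes only."""
    customer_id: str
    legal_name: str
    status: str

@dataclass
class ContactSatellite:
    """Optional attribute group in a satellite, linked back by the master ID."""
    customer_id: str
    emails: list[str] = field(default_factory=list)
    phones: list[str] = field(default_factory=list)

core = CustomerCore("C-1001", "Acme Corp", "Active")
contacts = ContactSatellite("C-1001", emails=["ap@acme.example"])
```

Consumers that only need identity and status read the core and never pay for the optional groups.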
Pattern 2: Extension tables for edge teams
Some teams need local fields.
That is fine.
Do not pollute the canonical model to support one workflow.
Use extension tables linked to the master ID.
That gives you:
- Flexibility for local needs
- Clear ownership boundaries
- No impact on other consumers
Pattern 3: Canonical model discipline
A canonical model is a contract.
Treat it like one.
If a new attribute is proposed, require:
- A business owner
- A definition
- A reason it is shared
- A decision on survivorship
- A distribution plan
If those do not exist, the field does not enter the hub.
Pattern 4: Keep analytics in analytics
Your warehouse, lakehouse, or marts should own:
- Metrics
- Aggregates
- Model outputs
- Time window calculations
The hub should publish stable IDs and core attributes.
Analytics joins those to facts and metrics.
That is a clean separation.
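The separation looks like this in practice: the hub side publishes IDs and core attributes, the analytics side owns the facts and does the math. Table and field names below are illustrative:

```python
# Hub publishes stable IDs and core attributes; analytics owns the metrics.
# Tables and field names are illustrative.
hub_customers = [
    {"customer_id": "C-1001", "legal_name": "Acme Corp"},
    {"customer_id": "C-1002", "legal_name": "Globex Inc"},
]
order_facts = [  # lives in the warehouse, not the hub
    {"customer_id": "C-1001", "amount": 120.0},
    {"customer_id": "C-1001", "amount": 80.0},
]

# Aggregate on the analytics side, then join on the governed ID.
spend: dict[str, float] = {}
for fact in order_facts:
    spend[fact["customer_id"]] = spend.get(fact["customer_id"], 0.0) + fact["amount"]

report = [
    {**c, "total_spend": spend.get(c["customer_id"], 0.0)} for c in hub_customers
]
print(report[0])  # {'customer_id': 'C-1001', 'legal_name': 'Acme Corp', 'total_spend': 200.0}
```

Note that "total_spend" exists only in the report, never in the master record.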
Anti patterns you should watch for
These show up in real programs, often in year one.
Anti pattern 1: The kitchen sink hub
This is the hub that tries to answer every question.
It usually starts with good intent.
Then every team adds “just one more field.”
Six months later, nobody knows what is authoritative.
Anti pattern 2: Metrics in the golden record
LTV, spend, propensity, and risk scores in the hub create confusion.
They go stale.
They depend on context.
They change when logic changes.
Users will treat them as truth anyway.
That is why this is dangerous.
Anti pattern 3: App flags in the global model
If a field exists only for one tool, it does not belong in the shared hub.
It creates:
- Ownership fights
- Update conflicts
- Dead fields that never get cleaned up
Anti pattern 4: Transaction stuffing
Orders, invoices, shipments, and interactions do not belong in the hub.
They belong in systems of record and analytics stores.
The hub should reference them by ID when needed.
Do not duplicate the records.
Anti pattern 5: No gate for scope changes
If anyone can add fields, you will get bloat.
You need a scope gate.
Make it formal.
Make it routine.
Make it fast.
Early design checklist
If you are starting fresh, do these before you build.
- Define the hub purpose in one paragraph
- List in scope domains and out of scope data types
- Build the canonical model for one domain first
- Create an attribute intake process
- Define system of record per attribute
- Decide how reference data will be managed
- Decide how extensions will be handled
- Decide how metrics will be exposed to consumers
This keeps you from designing a hub that is hard to unwind later.
Mid implementation cleanup tips
If you are already building, you can still fix scope.
1. Run a field usage audit
Find:
- Fields that are always null
- Fields used by only one consumer
- Fields sourced from only one system
- Fields that change constantly
- Fields that users complain about
Those are scope red flags.
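The audit itself can be sketched as a pass over per-field usage stats. The stat names and thresholds here are illustrative assumptions, not a standard:

```python
def audit_fields(field_stats: list[dict]) -> list[str]:
    """Flag fields whose usage profile suggests they may not belong in the hub.

    Each entry is one field with illustrative stats: null_rate,
    consumer_count, source_count, and daily_change_rate.
    """
    flagged = []
    for f in field_stats:
        if (
            f["null_rate"] > 0.95             # always (or nearly always) null
            or f["consumer_count"] <= 1       # used by only one consumer
            or f["source_count"] <= 1         # sourced from only one system
            or f["daily_change_rate"] > 0.5   # changes constantly
        ):
            flagged.append(f["name"])
    return flagged

stats = [
    {"name": "legal_name", "null_rate": 0.01, "consumer_count": 7,
     "source_count": 3, "daily_change_rate": 0.001},
    {"name": "last_click_ts", "null_rate": 0.10, "consumer_count": 4,
     "source_count": 2, "daily_change_rate": 0.9},
]
print(audit_fields(stats))  # ['last_click_ts']
```

The output is a review queue, not an automatic delete list; a steward still decides.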
2. Split the model instead of tuning forever
If your core tables are wide and slow, refactor:
- Move optional groups to satellites
- Move app fields to extensions
- Move metrics out to marts or views
This is real work, but it pays off.
3. Deprecate before you delete
If a field must go, do it in stages:
- Stop populating it
- Mark it deprecated in docs
- Update consumers
- Remove it after a safe window
That reduces risk.
4. Rebuild trust with a smaller promise
If your hub lost credibility, narrow the promise:
- We master identity
- We master core descriptors
- We master hierarchies
Then deliver that cleanly for 60 days.
Trust comes back when the data stays right.
Example: vendor hub scope
Here is a clean vendor scope most teams can agree on.
In the hub
- Global vendor ID
- Source system IDs
- Legal entity name
- Tax classification
- Status (Active, Suspended)
- Country and currency codes
- Primary contact email
- Parent child vendor relationships
Out of the hub
- Open invoices
- Payment history
- Last order date
- Spend to date
- Discount logic
- Email open and click behavior
- Campaign participation details
That vendor master is both useful and stable.
Downstream systems can join to spend and invoices elsewhere.
Your next step
If you want a hub that stays healthy, do two things.
First, decide scope with a repeatable test.
Second, enforce scope with a model pattern.
Use the decision matrix above in every design review.
Use satellites and extensions to handle edge needs.
Keep metrics and transactions out of the golden record.
Your hub will run faster, match better, and earn trust.


