When the Catalog Became the Enemy

Streaming catalogs keep growing. Discovery keeps getting worse. The real problem is not the number of titles available but the fragmented data infrastructure underneath them. When behavioral signals, content metadata, rights availability, and subscription context live in disconnected systems, even the best recommendation model is working blind. This post breaks down how a unified data foundation on Databricks changes that equation and what it means for platforms competing on relevance rather than inventory.

Streaming platforms have more content than ever. The problem is no one can find it.

There’s a specific kind of frustration that doesn’t show up in any product roadmap. It happens around 8pm, after a long day, when someone opens a streaming app with twenty thousand titles and can’t find a single thing they want to watch. They spend eight minutes scrolling. Then they close the app and watch something on YouTube.

That moment is costing the industry billions, and it’s not a content problem. It’s a data problem.

The content arms race of the last decade was logical at the time. More titles meant more reasons to subscribe. But the math has flipped. Inventory is no longer the differentiator. Relevance is. And relevance, it turns out, requires something most media companies haven’t built yet: a unified view of who their audience is, what they want, and what the platform is actually allowed to show them right now.

The wrong metric built the wrong system

Most recommendation systems were designed to maximize the wrong thing. Watch time sounds like a good proxy for value. It isn’t. A user who watches ninety minutes of content they didn’t particularly enjoy and feels vaguely worse about the platform is not a success story. Neither is the user who starts three things in a row and finishes none of them.

The deeper problem is that these systems are usually built on fragmented data. Behavioral events live in one place. Content metadata lives in another. Rights and licensing restrictions live somewhere else entirely, often in a system that’s never been connected to anything near the recommendation pipeline. Subscription and billing data is a different silo. Ad exposure logs are a different vendor.

When each of those data sources operates independently, the recommendation system is essentially making decisions with a fraction of the picture. It finds patterns in what it can see, which tends to be clickstream and completion data, and ignores everything else. The result is a system that’s good at serving what’s already popular and bad at everything that actually differentiates a platform: long-tail discovery, contextual relevance, new release momentum, and content that fits the mood of this session rather than the average of all sessions.

What changes when the data is unified

This is where Databricks changes the equation. The Data Intelligence Platform isn’t a recommendation engine. It’s the infrastructure layer that makes a good recommendation engine possible by solving the underlying data problem first.

When behavioral signals, content metadata, rights availability, subscription status, ad exposure history, and identity resolution all live on a single lakehouse architecture, the system can ask questions that siloed data makes impossible. Not just “what has this user watched?” but “what is this user likely to want right now, given their plan, their device, their viewing history from the last two weeks, and what’s actually available in their territory tonight?”

That sounds like a small upgrade. In practice it changes the entire shape of the problem.

Behavioral signals stop being misread. A session that ended at minute four because of a buffering event stops looking identical to a session that ended because the content disappointed. Quality-of-experience data, ingested alongside playback events, separates intent signals from infrastructure failures, and that separation is what makes the training data honest.
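A minimal sketch of that separation, assuming each playback session already carries two quality-of-experience fields from the player (total rebuffering time and a fatal-error flag). The field names and both thresholds are hypothetical and would need tuning against a platform’s own data.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    watched_seconds: int
    rebuffer_seconds: float  # assumed QoE field: total stall time in the session
    fatal_error: bool        # assumed QoE field: playback ended on a player error


# Hypothetical thresholds, chosen only for illustration.
EARLY_EXIT_SECONDS = 300
REBUFFER_RATIO_LIMIT = 0.05


def label_exit(s: Session) -> str:
    """Classify why a session ended so only real disinterest becomes a negative label."""
    if s.watched_seconds >= EARLY_EXIT_SECONDS:
        return "engaged"
    # Share of the session's wall-clock time that was spent stalled.
    stall_ratio = s.rebuffer_seconds / max(s.watched_seconds + s.rebuffer_seconds, 1)
    if s.fatal_error or stall_ratio > REBUFFER_RATIO_LIMIT:
        return "infrastructure_failure"
    return "content_rejected"


sessions = [
    Session("a", 240, 45.0, False),   # left at minute four after heavy buffering
    Session("b", 240, 0.0, False),    # left at minute four with clean playback
    Session("c", 2700, 2.0, False),   # watched most of an episode
]
labels = {s.session_id: label_exit(s) for s in sessions}
```

Sessions `a` and `b` have identical watch time, but only `b` would be treated as a signal about the content. Sessions labeled `infrastructure_failure` can be excluded from training rather than counted against the title.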

Content cold start becomes tractable. New releases and licensed acquisitions with thin viewing history can be represented through content embeddings (semantic representations built from synopsis, cast, genre, and editorial tags) rather than waiting weeks for behavioral data to accumulate. The same representations help older long-tail titles, so the catalog stops being a graveyard for anything released more than two years ago.
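The idea can be sketched with a deliberately simple stand-in: tag-count vectors compared by cosine similarity, rather than learned embeddings. The titles and tags below are invented for illustration. A production system would more likely use a text-embedding model over the synopsis and metadata.

```python
import math
from collections import Counter


def tag_vector(tags):
    """Represent a title as a bag of metadata tags (stand-in for a learned embedding)."""
    return Counter(tags)


def cosine(a, b):
    """Cosine similarity between two sparse tag vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_titles(new_tags, catalog, k=1):
    """Return the k established titles whose metadata is closest to a new release."""
    new_vec = tag_vector(new_tags)
    scored = [(cosine(new_vec, tag_vector(tags)), title) for title, tags in catalog.items()]
    return [title for _, title in sorted(scored, reverse=True)[:k]]


# Hypothetical catalog entries with existing viewing history.
catalog = {
    "established_heist_drama": ["heist", "crime", "ensemble", "thriller"],
    "established_baking_show": ["cooking", "competition", "family"],
}
# A new release with no viewing history yet.
new_release_tags = ["heist", "thriller", "ensemble", "comedy"]

neighbors = nearest_titles(new_release_tags, catalog)
```

The new release can then borrow the audience of its nearest neighbors until it has behavioral data of its own.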

Rights constraints become first-class citizens in the serving layer, not afterthoughts. A recommendation that surfaces a title the user can’t access on their plan, or that expired from the catalog yesterday, doesn’t just fail; it erodes trust in the system. On Databricks, rights and availability data can be joined at query time, so the candidate generation layer never surfaces content that can’t be served.
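As an illustration of the filter itself (not of any Databricks API), here is a plain-Python sketch of the availability check that such a join would express. The `RightsWindow` shape, the plan ranks, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RightsWindow:
    title_id: str
    territory: str
    min_plan_rank: int  # hypothetical tiers: 0 = ad-supported, 1 = standard, 2 = premium
    start: date
    end: date           # last day the title may be shown (inclusive)


def servable(candidates, rights, territory, plan_rank, today):
    """Keep only candidates with a rights window valid for this user today."""
    allowed = {
        r.title_id
        for r in rights
        if r.territory == territory
        and r.min_plan_rank <= plan_rank
        and r.start <= today <= r.end
    }
    # Preserve the candidates' original order.
    return [t for t in candidates if t in allowed]


rights = [
    RightsWindow("t1", "US", 0, date(2024, 1, 1), date(2024, 12, 31)),
    RightsWindow("t2", "US", 0, date(2024, 1, 1), date(2024, 6, 14)),   # expired yesterday
    RightsWindow("t3", "US", 2, date(2024, 1, 1), date(2024, 12, 31)),  # premium only
    RightsWindow("t4", "GB", 0, date(2024, 1, 1), date(2024, 12, 31)),  # other territory
]

shown = servable(["t3", "t2", "t1", "t4"], rights, "US", plan_rank=1, today=date(2024, 6, 15))
```

For a standard-plan user in the US, only `t1` survives. The other three titles are dropped before ranking ever sees them.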

Commercial context gets integrated, not bolted on. As ad-supported tiers have grown to represent close to half of new streaming subscriptions in the US, the recommendation system can no longer optimize purely for content engagement. Ad load tolerance, frequency caps, and inventory value need to inform the ranking logic. Databricks makes it possible to run that optimization in a single pipeline rather than stitching together outputs from disconnected systems.

The two-stage architecture that scales

The pattern that works at production scale, and that Databricks’ lakehouse architecture is built to support, separates candidate generation from contextual ranking.

Candidate generation is retrieval: pulling a manageable set of potentially relevant titles from a large catalog efficiently. Ranking is ordering: taking those candidates and scoring them for a specific user in a specific context, with all available signals applied. Keeping these stages separate makes each one independently debuggable, independently tunable, and independently testable in controlled experiments.

It also makes it possible to inject hard constraints (rights restrictions, parental controls, brand safety rules) without contaminating the core model. Databricks Feature Store and Model Serving provide the infrastructure for both stages to share the same feature definitions, so the representations used in training match what’s served in production. That alignment is surprisingly hard to maintain in fragmented architectures, and losing it is where a significant share of offline-to-online performance gaps originate.
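The shape of that pipeline can be sketched in a few functions. This is an illustration under simplifying assumptions, not a production design: retrieval is genre overlap ordered by popularity, ranking is a hand-written score, and the catalog, the context fields, and the parental-control rule are all hypothetical. Feature sharing through Feature Store and Model Serving is not shown.

```python
def generate_candidates(catalog, liked_genres, limit):
    """Stage 1 (retrieval): cheap, broad selection from the full catalog."""
    matches = [t for t in catalog if liked_genres & t["genres"]]
    return sorted(matches, key=lambda t: t["popularity"], reverse=True)[:limit]


def apply_constraints(candidates, constraints):
    """Hard rules sit between the stages, outside any model."""
    return [t for t in candidates if all(rule(t) for rule in constraints)]


def rank(candidates, context):
    """Stage 2 (ranking): score each candidate for this user in this session."""
    def score(t):
        genre_fit = len(context["liked_genres"] & t["genres"])
        fits_session = 1.0 if t["runtime_min"] <= context["minutes_available"] else 0.0
        return genre_fit + fits_session
    return sorted(candidates, key=score, reverse=True)


catalog = [
    {"id": "A", "genres": {"thriller"}, "popularity": 0.9, "runtime_min": 140, "rating": "R"},
    {"id": "B", "genres": {"thriller", "comedy"}, "popularity": 0.5, "runtime_min": 45, "rating": "PG"},
    {"id": "C", "genres": {"documentary"}, "popularity": 0.8, "runtime_min": 60, "rating": "PG"},
    {"id": "D", "genres": {"comedy"}, "popularity": 0.4, "runtime_min": 30, "rating": "PG"},
]
context = {"liked_genres": {"thriller", "comedy"}, "minutes_available": 60}
kids_profile_rules = [lambda t: t["rating"] != "R"]  # hypothetical parental control

candidates = generate_candidates(catalog, context["liked_genres"], limit=3)
allowed = apply_constraints(candidates, kids_profile_rules)
ranked_ids = [t["id"] for t in rank(allowed, context)]
```

Because the constraint step is a separate function, changing a parental-control or brand-safety rule requires no retraining, and each stage can be tested on its own.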

What it looks like in practice

The pattern across media companies that have done this on Databricks is consistent: the outcome isn’t a better algorithm. It’s a better data foundation that makes every downstream system (recommendation, personalization, audience analytics, content performance) more accurate and faster to iterate on. Platforms that have consolidated onto the lakehouse report higher revenue, lower infrastructure costs, and faster delivery of new models by their data teams.

The practical implication is that the first investment question isn’t “which model should we use?” It’s “can our data actually support a model that works?” Auditing event logging completeness, catalog metadata quality, identity resolution across devices, and rights data freshness typically surfaces more actionable improvements than any model comparison. On Databricks, that audit has a clear path to remediation: not a multi-year migration project, but an incremental consolidation that starts delivering value as each data source comes online.
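A first pass at such an audit can be very small. The sketch below computes three of the checks named above over in-memory rows. The field names (`user_id`, `synopsis`, `updated_at`), the one-day freshness limit, and the sample data are assumptions for illustration, and a real audit would run the same checks against full tables.

```python
from datetime import datetime, timedelta


def null_rate(rows, field):
    """Share of rows where a required field is missing or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)


def stale_share(rows, field, now, max_age):
    """Share of rows whose timestamp is older than the allowed age."""
    if not rows:
        return 0.0
    stale = sum(1 for r in rows if now - r[field] > max_age)
    return stale / len(rows)


now = datetime(2024, 6, 15, 12, 0)

# Hypothetical samples from three of the silos.
events = [
    {"user_id": "u1", "title_id": "t1"},
    {"user_id": None, "title_id": "t2"},   # event logged without an identity
    {"user_id": "u2", "title_id": "t1"},
    {"user_id": "u3", "title_id": "t3"},
]
titles = [
    {"title_id": "t1", "synopsis": "A crew plans one last job."},
    {"title_id": "t2", "synopsis": ""},    # metadata gap
]
rights = [
    {"title_id": "t1", "updated_at": now - timedelta(hours=2)},
    {"title_id": "t2", "updated_at": now - timedelta(days=3)},
]

report = {
    "events_missing_user_id": null_rate(events, "user_id"),
    "titles_missing_synopsis": null_rate(titles, "synopsis"),
    "rights_rows_stale": stale_share(rights, "updated_at", now, timedelta(days=1)),
}
```

Numbers like these give the consolidation work an order: the source with the worst rate is usually the first one worth fixing.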

The governance piece that belongs in the architecture

There’s a dimension to personalization in media and entertainment that tends to get treated as a legal department problem. It isn’t.

Identity resolution and cross-device tracking are what make personalization useful. They’re also what create real obligations around consent, purpose limitation, and data minimization, obligations that are increasingly enforced. When the audience includes children, which in most household streaming environments it does, the data practices around commercial use and age-appropriate targeting need to be architectural decisions from day one.

Databricks’ data governance layer (Unity Catalog) provides the lineage, access controls, and audit trail that make compliance not just enforceable but demonstrable. That matters when the regulator asks questions, and recent history suggests they will.

The platforms solving the relevance problem aren’t doing it with a breakthrough algorithm. They’re doing it by building the data foundation that makes every algorithm better, and by having the operational discipline to measure what actually changes when the recommendations improve.

The catalog was never the problem. It was always about whether the platform could help someone find what they were looking for before they gave up and went somewhere else. Databricks is where that capability gets built.