With all those kick-ass tools in place, why, then, has it been so hard to turn daily data into actual services? Getting your 8am insight? Easy. Automating the resolution? A 6-month roadmap and 3 headcounts… The fact that you can’t seem to get serviceable insights out of your store ultimately boils down to its “at rest” nature. So let’s drill into that.
Data retrieved directly from sources is considered raw: it is often loosely structured, highly specific to one dimension of one source, lacks overall context and awareness, and may be incomplete or just plain unintelligible (to name a fraction of its properties). It is unexploitable as-is. It needs to be modeled. And that’s where the complexity lies.
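To make that concrete, here is a small hedged sketch of what "modeling" means at the smallest possible scale: turning one raw source record into a typed, usable row. The payload and every field name here are hypothetical, not from any particular source.

```python
from datetime import datetime, timezone

# Hypothetical raw record as it might arrive from a source: stringly-typed,
# source-specific keys, fused fields, missing values.
raw_event = {
    "ts": "1700000000",   # Unix epoch... as a string
    "uid": "u-42",        # source-specific user key, no cross-source context
    "amt": "19.99USD",    # amount and currency fused into one field
    "sku": None,          # incomplete: the product reference is missing
}

def model_purchase(event: dict) -> dict:
    """One 'modeling' step: parse, type, and normalize a raw record."""
    amt = event["amt"] or ""
    has_usd = amt.endswith("USD")
    return {
        "occurred_at": datetime.fromtimestamp(int(event["ts"]), tz=timezone.utc),
        "user_id": event["uid"],
        "amount": float(amt.removesuffix("USD")) if has_usd else None,
        "currency": "USD" if has_usd else None,
        "sku": event["sku"],  # modeling can't invent missing data
    }

modeled = model_purchase(raw_event)
```

Multiply this by every field, every source, and every business rule, and you get the layered modeling process described next.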
In the previous part we touched upon ETL and ELT services for data transformation: these are our modeling steps. At a regular interval (typically hourly or daily), these services retrieve a batch of data from a previous period, apply a defined transformation (aka model) to it, and write the output back to disk. You now have usable data for that period. But modeling is often a multi-layer process where one model depends on another’s results, and so on: the first transform may clean and aggregate, the second joins multiple sources, the third applies business rules from one team, the fourth applies rules from another… These dependency chains form DAGs (Directed Acyclic Graphs), and in complex data models a full run easily takes hours - imagine all the modeling it takes to get to a construct as simple as customer-level profit. Unlike the person reading reports, no production service can realistically wait hours between “go” and “get it”... And that’s where the gap between insights and automation lies.
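As a hedged sketch (the step names and runtimes below are made up for illustration), such a dependency chain can be expressed as a tiny DAG, and the latency of the final tables computed along its critical path:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical four-layer model chain: each node maps to its upstream dependencies.
dag = {
    "clean": set(),               # 1) clean and aggregate raw data
    "join": {"clean"},            # 2) join multiple sources
    "finance_rules": {"join"},    # 3) one team's business rules
    "marketing_rules": {"join"},  # 4) another team's rules
}
runtime_min = {"clean": 30, "join": 20, "finance_rules": 45, "marketing_rules": 45}

# Earliest finish time of each model = its own runtime plus its slowest upstream.
finish = {}
for node in TopologicalSorter(dag).static_order():
    finish[node] = runtime_min[node] + max((finish[d] for d in dag[node]), default=0)

end_to_end_min = max(finish.values())  # freshness of the final, usable tables
```

Even with only four modest steps, the final tables come out ~95 minutes behind the source - and that clock restarts at every scheduled batch.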
Many things! Typically Analytics and BI Engineers (even Data Engineers) will spend days fine-tuning individual steps to squeeze out the slightest run-time improvement and allow for more frequent scheduling. But ETLs and ELTs aren’t designed for continuous availability, so these efforts just can’t yield the expected results. Current “real-time” proxies for analytics mostly rely on materializing certain steps. This involves a lot of obscure caching, redundant scanning and recomputation - and still relies on batch queries and ETLs as inputs. Some more modern options like Time-Series Databases and Differential Dataflow databases do a great job of addressing this issue at the database level, but require migrating your entire OLAP layer to a new technology. That may be fine for some team-level services, but migrating an entire company to support a few marginal use-cases is extreme. And why should you? Your data warehouse isn’t broken; it’s doing exactly what it’s supposed to do: provide blazing-fast query performance.
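To see why materialization is only a proxy for real-time, here is a deliberately naive sketch of the pattern (all names are hypothetical): a step serves a precomputed result and only recomputes on a timer, so every read in between is stale and every refresh is a full rescan.

```python
import time

class MaterializedStep:
    """Serve a cached result; fully recompute it on a fixed schedule."""

    def __init__(self, compute, refresh_every_s: float):
        self.compute = compute
        self.refresh_every_s = refresh_every_s
        self.cached = compute()               # initial full scan
        self.refreshed_at = time.monotonic()

    def read(self):
        if time.monotonic() - self.refreshed_at >= self.refresh_every_s:
            self.cached = self.compute()      # full recomputation, not incremental
            self.refreshed_at = time.monotonic()
        return self.cached                    # possibly stale between refreshes

# Hypothetical usage: a revenue total materialized over an append-only source.
events = [10.0, 20.0]
step = MaterializedStep(lambda: sum(events), refresh_every_s=3600)
first = step.read()   # 30.0
events.append(5.0)    # new data arrives...
stale = step.read()   # ...but the materialization still serves 30.0
```

Tighter schedules narrow the staleness window but multiply the redundant scans; they never remove the batch input underneath.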
In the end, few companies have mastered the art of building data products and what it really means to be data-driven - it’s not just about data for decisions, it’s also about data for operations. Batch services may be fine for reporting but they hardly generalize beyond it and breed dashboard organizations (heavily reliant on human ops and after-the-fact insights). Ultimately the most efficient strategy these days is for data teams to trim down requirements until what's left of the use-case fits into the tools at hand.
Mastering continuous operations comes at a tremendous cost and usually involves setting up a dedicated engineering organization, building a specialized toolbox and iterating over it. Fine if you have a billion-dollar problem on your hands; unrealistic for most. Even then, building data services that go beyond reporting often remains locked in the hands of users with scripting knowledge - what of Analytics Engineers and their self-service needs? The truth is that data stores have been oversold as miracle solutions and, though great sources of truth, fail organizations at serving data as a product. And while continuous services do properly address the latter, their technical bar remains too high: slow data is a technical choice, and data services generally end up reverse-engineered from the toolbox. With Popsink we aim at lowering the technical bar until slow data is just a tooling choice.
In the next part, we'll see how Popsink complements existing modern data stacks to enable the delivery of product-ready data models.