A look at the current state of data platforms and services, including architectures, integrations, transformations, stores, and dashboards.
Feb 23, 2022
We often pitch Popsink as “the easiest way to build real-time data applications”. Although accurate, it fails to convey the transformative nature of the underlying capabilities this enables. Every now and then we go as far as calling ourselves “a real-time ETL solution” but that’s an aberration: it does provide a better mental model of Popsink’s capabilities but real-time and ETL couldn’t be more opposed. So I figured now would be as good a time as any to unpack what it is we REALLY do. A good way to start this exercise is with an assessment of current data architectures.
If you’ve seen any Modern Data Stack diagram (MDS, 2021 buzzword #2, second only to Data Mesh), you probably expected something similar. I have intentionally minimized the number of SaaS names on that picture to focus on logical aggregates rather than an ecosystem overview. So, what are we seeing here?
Data Sources are legion and heterogeneous. Anything that produces a new set of data and that you want to access is by definition a data source, from a flying csv to a microservice’s event stream. It’s this diversity that makes the next job complex. The little caveat here is that in the world of data, source services are often also consumers but we’ll expand on that in a moment.
Integrations are components that get or receive data from a source. These processes either extract data from a store located somewhere (aka “data at rest”, this is the E in ELT) or listen to a feed that continuously publishes discrete units of data called events (aka “data in motion”). Due to the complexity of integrating with every single source out there, a generation of awesome SaaS tools arose that is (or at least was) dedicated to that task: Airbyte, Fivetran, Stitch to name just a few.
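To make the “at rest” versus “in motion” distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: an in-memory SQLite table stands in for a source store, a standard-library queue stands in for an event feed, and the function names `extract_at_rest` and `listen_in_motion` are made up. Real integrations deal with authentication, pagination, retries and schemas, none of which appear here.

```python
import queue
import sqlite3


def extract_at_rest(store: sqlite3.Connection) -> list:
    """Pull whatever currently sits in the store: one query, one batch."""
    return store.execute("SELECT id, status FROM orders ORDER BY id").fetchall()


def listen_in_motion(feed: queue.Queue, handle) -> int:
    """Handle events one by one as they are published, until a None sentinel."""
    handled = 0
    while True:
        event = feed.get()
        if event is None:  # sentinel used here to end the demo feed
            break
        handle(event)
        handled += 1
    return handled


# Data at rest: a hypothetical "orders" table in a source database.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
store.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "paid"), (2, "shipped")])
batch = extract_at_rest(store)

# Data in motion: the same kind of information, published as events.
feed = queue.Queue()
for event in ({"id": 3, "status": "paid"}, {"id": 3, "status": "shipped"}, None):
    feed.put(event)
received = []
count = listen_in_motion(feed, received.append)
```

The batch call returns a snapshot and is done, whereas the listener keeps consuming for as long as the feed keeps publishing.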
Transformations are a bit tricky as there are two schools out there: ETL and ELT (possibly the worst data typo you can make). Where exactly the difference lies can be subject to debate but conceptually ETL refers to the use of a dedicated data processing service whereas ELT refers to using the compute capabilities of your Data Warehouse (DWH).
In process terms, the first approach translates into the use of a service to Extract data from a source, execute a Transformation job using that service’s resources and only then Load its output into a destination. Most integration service vendors are also ETL services as they not only get the data and write it somewhere but also offer the possibility to process datasets in transit.
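As a rough illustration of that order of operations, here is a small Python sketch of an ETL job. The CSV payload, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database plays the role of the destination. The point is only that the cleaning happens in the job's own process, before anything reaches the destination.

```python
import csv
import io
import sqlite3

# Hypothetical source export; amounts are in cents, one row has a missing amount.
RAW_CSV = """order_id,amount_cents,currency
1,1050,eur
2,320,usd
3,,eur
"""


def extract(raw: str) -> list:
    """Extract: read rows out of the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list) -> list:
    """Transform: clean and type the rows using this job's own compute."""
    cleaned = []
    for row in rows:
        if not row["amount_cents"]:
            continue  # drop incomplete records before they reach the destination
        cleaned.append(
            (int(row["order_id"]), int(row["amount_cents"]), row["currency"].upper())
        )
    return cleaned


def load(rows: list, destination: sqlite3.Connection) -> None:
    """Load: write the already-transformed output into the destination."""
    destination.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount_cents INTEGER, currency TEXT)"
    )
    destination.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    destination.commit()


destination = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), destination)
loaded = destination.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Note that the destination never sees the incomplete third row: in ETL, whatever the transformation step discards is gone by the time data is loaded.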
ELT is different. You use a service that sits on top of, or is built into, your data warehouse to execute transformations on data already available there. Namely you first Extract everything from the source, Load everything into your DWH and only then Transform. ELT has been having a moment as it allows analysts to express complex chains of transformations using only SQL and CTAS (CREATE TABLE AS SELECT) statements and makes use of the compute capabilities of your DWH rather than a dedicated service. The new generation of tools has been building off those capabilities.
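Here is what such a chain of CTAS statements can look like, as a minimal sketch. An in-memory SQLite database stands in for the warehouse purely so the example is runnable, and the table names (`raw_orders`, `stg_orders`, `revenue_by_currency`) are invented for the occasion. The Python part only submits SQL; all the transformation work is done by the database engine.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")  # stand-in for a data warehouse

# Extract + Load: raw records land untouched, strings and bad rows included.
dwh.execute("CREATE TABLE raw_orders (order_id TEXT, amount_cents TEXT, currency TEXT)")
dwh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "1050", "eur"), ("2", "320", "usd"), ("3", "", "eur"), ("4", "200", "EUR")],
)

# Transform, step 1: a staging model that cleans and types the raw data.
dwh.execute("""
    CREATE TABLE stg_orders AS
    SELECT CAST(order_id AS INTEGER)     AS order_id,
           CAST(amount_cents AS INTEGER) AS amount_cents,
           UPPER(currency)               AS currency
    FROM raw_orders
    WHERE amount_cents <> ''
""")

# Transform, step 2: a metrics model built on top of the staging model.
dwh.execute("""
    CREATE TABLE revenue_by_currency AS
    SELECT currency, SUM(amount_cents) AS revenue_cents
    FROM stg_orders
    GROUP BY currency
""")

revenue = dwh.execute(
    "SELECT currency, revenue_cents FROM revenue_by_currency ORDER BY currency"
).fetchall()
```

Unlike the ETL case, the raw table is still there after the transformations run, so a model can be rewritten and rebuilt later without going back to the source.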
I’m spending a bit more time here since this is where Popsink comes in. Overall, my experience is that nearly every organization leverages both approaches in some capacity: ETL to pre-process data and do the heavy-lifting - traditionally in the hands of engineers - and ELT to refine models and produce metrics datasets - owned by analysts and product teams (ETLT anyone?).
Stores are where data is laid to rest (hence “data at rest”). Want to see data from last year? This is where you can go and get it. Stores are the source of truth of the Batch world (things that happen based on a schedule or trigger). They come in various formats: collections of files of heterogeneous nature and their indexes (data lakes), vast collections of key-tuple relations (most Data Warehouses) or complex index and metadata abstractions aimed at reproducing the convenience of the latter on the former (lakehouses, the buzzword of 2020). Just as OLTP or graph services provide persistent data for back-end services, stores now act like a true back-end state for many MDS and enterprise services that constantly call them to retrieve their precious content. Stores can also act as sources in a process called Reverse ETL, which is nothing more than extracting data from a resting place and sending it back to a system (for instance one of the data sources mentioned in Integrations). These days E(T)L and Reverse ETL are merging, with vendors building both source and destination connectors across multiple services. That’s for the best.
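Reverse ETL is simple enough to sketch in a few lines of Python. The assumptions are loud ones: `FakeCrm` is a made-up stand-in for an operational system's API client, `customer_metrics` is a hypothetical table, and an in-memory SQLite database plays the store. A real sync would also handle batching, rate limits and failures.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class FakeCrm:
    """Hypothetical operational system, standing in for a real API client."""
    contacts: dict = field(default_factory=dict)

    def upsert_contact(self, email: str, attributes: dict) -> None:
        # Create the contact if needed, then merge in the new attributes.
        self.contacts.setdefault(email, {}).update(attributes)


def reverse_etl(store: sqlite3.Connection, crm: FakeCrm) -> int:
    """Read a modeled table from the store and push each row back to the CRM."""
    rows = store.execute(
        "SELECT email, lifetime_value FROM customer_metrics"
    ).fetchall()
    for email, lifetime_value in rows:
        crm.upsert_contact(email, {"lifetime_value": lifetime_value})
    return len(rows)


# The store holds a metric computed earlier, e.g. by an ELT model.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE customer_metrics (email TEXT, lifetime_value INTEGER)")
store.executemany(
    "INSERT INTO customer_metrics VALUES (?, ?)",
    [("ada@example.com", 420), ("bob@example.com", 75)],
)

crm = FakeCrm()
synced = reverse_etl(store, crm)
```

Structurally this is the same extract-and-load motion as before, only with the store as the source and an operational system as the destination, which is why the two families of tools converge so naturally.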
DWH Tools are products that hook onto your data store and provide a number of capabilities. This is where most of the awesome Modern Data Stack solutions kick in: anomaly detection, metric stores, data catalogs, ELT modeling… It would take us over a day to do a review of that so I just settled for the heresy of aggregating everything under one label. Sorry.
BI / DS Tools mostly apply in the context of data lakes and lakehouses as these provide the missing compute resources necessary to perform queries, provide the oh-so-convenient SQL and no/low-code abstractions and maintain the indexing necessary to organize and retrieve from immense file repositories. These are collections of tools that typically also enable advanced processing capabilities beyond “just” transforming and extracting data from stores (think statistical modeling, virtualization and the like).
Consumers aren’t quite the end when thinking about data strategy (they themselves have consumers and can also act as producers) but we’ll stop there from a data flow perspective. This is where the data is used by internal or external users and services, where it is converted into value. A dashboard that helps a customer perform an optimized action, hyperparameters that fine-tune a prediction or an alert over API that pulls the brake on a disastrous customer experience event… The possibilities are endless.
In the next part we'll see how these architectures, though great at enabling data for everyone, reach their limit when attempting to build serviceable data products.