Deliver Your Data as a Product, But Not as an Application
Delivering your data as a product, rather than just as a table or file without further business context, is a key principle of the data mesh framework. Whether to deliver data as a product through an application service (API) or as a pure data structure is an important design decision. I've previously examined this specific challenge in part 2 of my three-part series on data mesh. However, this post discusses the issue beyond the data mesh concept, because I think it is so fundamental. I will outline the key differences and then argue why you should prefer "Data as a Pure Structure" over "Data as an Application".
Data as a Product
The concept of transforming your data into a product isn't new in the realm of data engineering and was used even before the data mesh framework was defined. However, there is an important distinction between creating a product powered by data and treating the data itself as a product – here is a good explanation of the subtle difference between "Data as a Product" and "Data Product". In this post, I focus on "Data as a Product", even though I also use the term data product for brevity.
The vast majority of articles on the concept of data products describe the "Data as an Application" approach. This is unfortunate, as this method has significant drawbacks compared to the "Data as a Pure Structure" approach. Independently of the concept of data products, Yehonathan Sharvit described the principles of using pure data structures in his book "Data-Oriented Programming" (DOP):
- Separating code (behavior) from data.
- Treating data as immutable.
- Separating data schema from data representation.
- Representing data with generic data structures.
While the latter three principles are highly recommended practices at the enterprise level, adherence to the first principle is crucial to ensuring that data products can exist without applications. But let's first explore the differences in detail.
Data as an Application

In the "Data as an Application" approach, data is accessed throug an interface (API) that allows clients to retrieve data by making API calls to an application instance. This application instance can be a bespoke enterprise application delivering specific data or even a full-fledged database or AI model (LLMs are increasingly popular) offering a rather generic data abstraction. Regardless, a running application instance (the "server") is required for client access to the data. This instance manages the data stored in the underlying files, prohibiting direct file access through operating system calls or custom library functions. Instead, you must use the predefined interfaces provided by the application instance.
Data as a Pure Structure
Separating code (behavior) from data requires the definition of data products as pure data structures.

"Data as a Pure Structure" refers to data that exists independently and outside of any application. It includes all necessary metadata to transform raw data into a product, but remains pure data structures independent of an accompanying application. This data structure can be accessed without making API calls to an application instance. For example, in Unix-like systems, such a structure would be stored as a file (essentially a byte stream), which can be directly accessed via operating system calls. Although these are technically system calls, they differ from API calls since you can't, quite frankly, do anything at all without a running operating system.
Often, we maintain a distributed version of the hierarchical file system at a low level in the operating system to enable distributed file access – HDFS in Hadoop, for example. This layer (a data product storage infrastructure, if you like) integrates closely with the operating system rather than functioning as a separate application.
It's important to note that applications can optionally include their own data stores in addition to the data products they produce and consume (Application 4 and 5 in the image). Also note, these data products contain comprehensive information about their usage and lineage. For instance, Pure Structure 2 includes details about being created by Application 2 using Pure Structures 1 and 3 as inputs.
The Other Highly Recommended Principles
Okay, we've decided to separate code from data. Without going into detail about how the remaining principles can be implemented specifically – part 2 of the data mesh series has more details on this, and I'll certainly write about it in more detail in a future post – let's explore the additional value we gain when we apply the other principles.
- Treating data as immutable: Treating your data as immutable pieces of information does not mean that your data cannot be changed. It means that any change or transformation applied to your data is preserved. The added value is complete lineage of all applied logic back to the source of information.
- Separating data schema from data representation: If we do not predefine the schema and model (data representation) but instead derive it from the data itself, we gain extreme versatility and flexibility in our data products. We can, for example, reuse logic on different contents to automatically derive up-to-date data models from the data, including the full history of changes to the model.
- Representing data with generic data structures: Even highly complex data structures can always be assembled from simpler, more generic data structures through incremental application of transformation logic. The most generic structures are data atoms – an intriguing concept for defining a data atom is the Posit, as suggested by Lars Rönnbäck. If we also describe the business context of raw data using the same fundamental data atoms, we can significantly streamline and generalize data and metadata management within a data product (see the sketch after this list).
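To make these principles a bit more tangible, here is a minimal sketch in Python; all record fields, names, and the lineage convention are invented for illustration, not taken from any particular framework:

```python
from types import MappingProxyType

# Generic structures: plain maps and lists instead of bespoke classes.
# Immutability: transformations append new records; nothing is overwritten.
event_log = []

def append_record(record: dict, derived_from: list = None) -> int:
    """Append a read-only record plus minimal lineage to the log."""
    entry = MappingProxyType({"data": record, "derived_from": derived_from or []})
    event_log.append(entry)
    return len(event_log) - 1  # the position serves as a simple record id

def derive_schema(records: list) -> dict:
    """The schema is derived from the data itself, not predefined."""
    return {key: type(value).__name__ for r in records for key, value in r.items()}

src = append_record({"customer": "C42", "revenue": 1250.0})
agg = append_record({"customer": "C42", "revenue_band": "high"}, derived_from=[src])

print(derive_schema([event_log[src]["data"], event_log[agg]["data"]]))
# {'customer': 'str', 'revenue': 'float', 'revenue_band': 'str'}
```

The point of the sketch is that nothing here depends on an application class: the records are generic maps, every derivation is appended rather than overwritten, and the schema is computed from the data on demand.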
What is the Application Trap?
Data and applications work together to effectively digitize our business processes. Applications implement the necessary business logic, while the business objects are stored as data. Data without an application is like crude oil – full of potential value, but not yet usable. Conversely, applications without data are like engines without fuel – full of potential, but unable to function.
Because both parts obviously cannot function without the other, we often bundle them together. Another reason for this bundling could be that most of today's IT engineers were trained in object-oriented programming (OOP). OOP teaches us to encapsulate data with the functions that operate on it within an object. These functions add the business context to the raw data hidden inside the object. The business context includes semantic information about the schema, content, data lineage, frequency of delivery, quality metrics, and other valuable metadata that explain data characteristics and how to use the data. To exchange data together with its business context, we need the complete object; otherwise we lose crucial business information.
Since applications behave like instantiated objects in OOP, we tend to design "Data as an Application".
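A small sketch of this pattern in Python (the class and its fields are invented for illustration): the business context exists only through the object's methods, so the raw data is of little use without the code around it:

```python
class CustomerData:
    """OOP style: raw data is hidden inside the object; behavior adds the context."""

    def __init__(self, raw_rows: list):
        self._rows = raw_rows  # raw data, encapsulated behind the class interface

    def monthly_revenue(self, month: str) -> float:
        # The business meaning of the fields lives here, in code, not in the data.
        return sum(r["amount"] for r in self._rows if r["month"] == month)

    def lineage(self) -> str:
        return "Loaded from billing system, deduplicated, currency-normalized"

obj = CustomerData([{"month": "2024-01", "amount": 99.0}])
raw = obj._rows  # extract the raw rows and the meaning and lineage are left behind
```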
Why Data is Different
Let's examine the impact of this design decision. Applications can be viewed as instantiated objects running in a specific environment. For instance, applications written in Java, Scala, or Kotlin run within the Java Virtual Machine (JVM), while for those programmed with system calls in languages like C, C++, or Rust, the execution environment is the operating system. Whenever we need to interact across these environments – because they are of different types or because they run on separate nodes – we typically do not transfer entire applications between the environments. Instead, we use infrastructure services (like networks and protocols such as HTTP) that allow one application to call functions (API calls) from another application in a different execution environment.

We have functions that return business information and functions that return raw data. Hence, we can easily extract or separate the raw data from the application and exchange it with others. However, doing so strips away the crucial business context. The receiving application gets only the raw data, lacking information about its origin and detailed meaning.
To prevent this loss of information, we must also include a reference to the original application from which the data was extracted. While theoretically possible, this approach presents several practical challenges:
- Data quickly loses its connection to the applications that provide the business context because it is so fluid. It's very easy to aggregate and transform raw data with BI applications or derive something intelligent with AI models, but it's much harder to also maintain the full business context, including data lineage, when data flows through many different transformation steps. Most applications cannot handle the additional reference to the source application, and the business context is lost.
- Data has a significantly longer lifespan than applications. It's the rule rather than the exception that data is stored for indefinitely long periods of time, whereas applications are replaced perhaps every 5 to 10 years. It is very likely that you have data in your storage systems that was provided by applications for which neither the operating system nor the required hardware is still available to execute them.
- It is very hard to transport application logic between environments, whereas it's extremely easy to do so with data. Yes, you can serialize, say, Java or Python applications and deserialize them in another environment of the same type and version. But try serializing an old COBOL program running on z/OS and deserializing it on a modern Unix-like system – you'll have a challenge.
- The practice of tracking history and different versions of data is quite common and widespread. However, doing the same with running application instances is the exception rather than the rule. Yes, you can track the source code in a code repository. With modern container technology, you can even track all compiled application versions and store them in image registries. But what about all your older and conventional applications that are still widely used? In practice, I have never seen a system that continuously tracked all application versions over time and was effectively able to run an old application version whenever a client needed the business context for the data at a certain point in time.
- An API effectively predefines the way you can retrieve information from the application. For data requirements that are not provided for in your original API definition, you will need to implement additional API calls. Alternatively, you could try to design your API so generically that every unpredictable future type of data usage is covered by it. This ultimately leads to applications becoming like generic databases. But even a database of a specific type (e.g. a relational database or a vector database) will not be generic enough to cover all of today's data requirements – the popularity of NoSQL databases is a clear sign of this fact.
Data should not be treated like an application, because it has these significantly different characteristics. While applications are essential for formatting, aggregating, and transforming data to derive value, the data product itself is not the application. It can be beneficial to supplement your data products with additional applications or library functions to simplify their usage. But keep in mind that the unpredictable variety of requirements can quickly force you to develop "jack of all trades" databases or AI applications that no one has really been able to deliver yet.
Therefore, the "Data as a Pure Structure" approach appears to be the most promising. It maintains data as independent structures with rich business context, allowing for flexible and direct access without the need for a specific application instance. It prevents the complications associated with trying to create all-encompassing server applications and ensures that data remains versatile and usable across various environments and contexts.
To ensure that your data products are universally accessible, the use of open data formats is essential. Popular formats include:
- Arrow: A column-oriented in-memory data format that is becoming increasingly popular for exchanging data between different programming languages and environments.
- Parquet: A column-oriented data format optimized for storage on disks and distributed storage systems like HDFS.
- Avro: A row-oriented data format optimized for storage on disks and distributed storage systems like HDFS.
- JSON (JavaScript Object Notation): A very lightweight, text-based, human-readable data format mainly used for data exchange between application servers and web clients – but not the best solution for big data volumes.
- Protocol Buffers (Protobuf): Google's take on a row-oriented data format optimized for storage on disks and distributed systems.
Depending on your use case, there are many more formats available – just keep in mind that the format should be open and commonly used.
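To close the loop with a concrete sketch: using Parquet via the pyarrow library, a data product can carry its business context as metadata inside the pure structure itself. The file path, metadata keys, and lineage string below are invented for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build the data product: raw data plus its business context as metadata,
# all inside one pure structure – no server is needed to produce or consume it.
table = pa.table({"customer": ["C42", "C7"], "revenue": [1250.0, 310.5]})
table = table.replace_schema_metadata({
    "owner": "sales-domain-team",                      # illustrative metadata
    "lineage": "billing_raw -> dedup -> fx_normalize",
    "delivery_frequency": "daily",
})
pq.write_table(table, "/data/products/revenue.parquet")

# Any consumer with file access can read the data *and* its context directly:
product = pq.read_table("/data/products/revenue.parquet")
print(product.schema.metadata)  # the business context travels with the data
```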
If you want to read more about Data Engineering challenges and solutions specifically related to the data mesh (data products are discussed in part 2), see my three-part series on this topic: