Yusuf Aytas - Data as a Product: A New Frontier

Domain-driven design (DDD) has been around for a long time. In short, DDD models software around the business domain so that the software matches domain requirements. One of the pillars of DDD is the bounded context, a strategic pattern for large teams that cover significant domain areas. A bounded context separates one domain from another. It helps shape teams and responsibilities, and it ensures clear, well-defined contracts between application domains. The microservice architecture builds on bounded contexts to ensure that services align with domains, and many companies rely on bounded contexts to define service boundaries.

It was only a matter of time before bounded contexts arrived on the data engineering scene. Zhamak Dehghani brought them there with her well-known article, in which she coined the term Data as a Product. In that article, she describes data as a product as follows.

For a distributed data platform to be successful, domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML and data engineers as their customers.
Zhamak Dehghani

The question, though, is what data as a product means in practical terms. Zhamak Dehghani described the features of data as a product as discoverable, trustworthy, and secure. We will go over all these features, but let's take a step back first. Microservices and data as a product come from the same roots. Shouldn't they share some properties? I think they do. A microservice provides an API for talking to the service. What is the API for data as a product? It's the table schema. I believe all properties of data as a product closely align with those of microservices. In the end, we want to provide self-serve APIs: you shouldn't need to nudge a team to get more information about the API, because it's all there. Now, let's go through these properties.

Discoverable

Data discovery isn't only about plugging in a search engine that indexes all available data products along with their information. It's also about making sure the data carries correct and up-to-date metadata. The metadata includes information such as ownership, origin, lineage, partitioning, location, quality, SLAs/SLOs, and so forth. For any search engine to make sense of the data, we need to supply correct metadata. Hence, discovery starts at the inception of the data. Teams document the data in a common format so that the documentation can move from one system to another.
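To make the idea that discovery starts at inception a bit more concrete, here is a minimal sketch of a catalog that refuses to register a data product until its metadata is complete. Everything in it is an assumption made for illustration: the key names in REQUIRED_METADATA, the in-memory dictionary standing in for a catalog, and the example product orders_daily. A real platform would use a proper metadata store and search engine.

```python
# Hypothetical metadata keys, loosely based on the list in the paragraph above.
REQUIRED_METADATA = ("owner", "origin", "lineage", "partition", "location", "quality", "slo")


def missing_metadata(entry: dict) -> list[str]:
    """Return the required keys that are absent or empty in a metadata entry."""
    return [key for key in REQUIRED_METADATA if not entry.get(key)]


def register(catalog: dict, name: str, entry: dict) -> None:
    """Add a data product to the catalog only if its metadata is complete."""
    missing = missing_metadata(entry)
    if missing:
        raise ValueError(f"{name} is missing metadata: {', '.join(missing)}")
    catalog[name] = entry


def search(catalog: dict, term: str) -> list[str]:
    """Return the names of products whose name or metadata mentions the term."""
    needle = term.lower()
    return sorted(
        name
        for name, entry in catalog.items()
        if needle in name.lower()
        or any(needle in str(value).lower() for value in entry.values())
    )


# Illustrative product; all values are made up.
catalog: dict = {}
register(catalog, "orders_daily", {
    "owner": "orders-team",
    "origin": "orders-service",
    "lineage": ["orders_raw"],
    "partition": "order_date",
    "location": "warehouse.sales.orders_daily",
    "quality": "completeness >= 0.99",
    "slo": "available by 06:00 UTC",
})
print(search(catalog, "orders-team"))  # ['orders_daily']
```

The search is deliberately naive. The point is that it can only be as good as the metadata that register enforces.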

Data discovery is the oil to reduce friction to get data insights.
My Tweet

Trustworthy and Truthful

We need to trust data to make data-driven decisions. Produced data should adhere to the data quality requirements that we advertise, so defining SLAs/SLOs for different data quality measures becomes essential. In design reviews, teams should explicitly state their SLAs/SLOs based on customer expectations. The data shouldn't be marked as ready to consume unless it passes quality checks. Consumers shouldn't have to think about the truthfulness of the data. Moreover, the quality checks themselves should be trustworthy: they should measure the right metrics.
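One way to picture the rule that data is not marked ready unless it passes quality checks is a small gate that compares measured quality metrics against the advertised SLOs. This is only a sketch under assumptions: the Slo class, the metric names, and the thresholds are hypothetical, and each metric is assumed to be a ratio where higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An advertised quality objective (hypothetical shape)."""
    metric: str
    minimum: float  # lowest acceptable value, assumed to lie between 0.0 and 1.0


def failed_slos(measured: dict[str, float], slos: list[Slo]) -> list[str]:
    """Return the metrics that are missing or below their advertised minimum."""
    return [slo.metric for slo in slos if measured.get(slo.metric, 0.0) < slo.minimum]


def is_ready_to_consume(measured: dict[str, float], slos: list[Slo]) -> bool:
    """Data is ready only when every advertised SLO is met."""
    return not failed_slos(measured, slos)


# Illustrative objectives and measurements; the numbers are made up.
slos = [Slo("completeness", 0.99), Slo("freshness", 0.95)]
measured = {"completeness": 0.995, "freshness": 0.90}

print(failed_slos(measured, slos))          # ['freshness']
print(is_ready_to_consume(measured, slos))  # False
```

A metric that was never measured counts as a failure here, so a missing check cannot silently let data through.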

Self-describing

Self-describing data reveals all relevant information about itself, from its basic description to its journey through the platform. A good data set provides information about the following items:

data location
data ownership
data format
access permissions
field descriptions
table description
data lineage
data quality metrics
partitioning information
example queries
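The items above could be captured in a single descriptor that travels with the data set. The sketch below is one possible shape, not a standard: the class name DataProductDescriptor, its field names, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataProductDescriptor:
    """Hypothetical descriptor holding the self-describing items listed above."""
    location: str
    owner: str
    data_format: str
    access_permissions: list[str]
    table_description: str
    field_descriptions: dict[str, str]
    lineage: list[str]
    quality_metrics: dict[str, float]
    partitioning: str
    example_queries: list[str]


def undocumented_fields(descriptor: DataProductDescriptor, columns: list[str]) -> list[str]:
    """Return the columns that have no (or an empty) field description."""
    return [column for column in columns if not descriptor.field_descriptions.get(column)]


# Illustrative descriptor; every value is made up.
orders = DataProductDescriptor(
    location="warehouse.sales.orders_daily",
    owner="orders-team",
    data_format="parquet",
    access_permissions=["analytics:read"],
    table_description="One row per order, refreshed daily.",
    field_descriptions={"order_id": "Unique order identifier", "amount": "Order total"},
    lineage=["orders_raw"],
    quality_metrics={"completeness": 0.995},
    partitioning="order_date",
    example_queries=["SELECT order_id, amount FROM orders_daily LIMIT 10"],
)

print(undocumented_fields(orders, ["order_id", "amount", "currency"]))  # ['currency']
```

A check like undocumented_fields can run when the table is published, so gaps in the description surface before consumers run into them.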

Secure

One of the common antipatterns in data processing is granting global access to a data set. One team may overwrite the data of another team, and that opens up many security concerns. Data access permissions have to be managed globally, and the underlying governance patterns have to be established. It's hard to have a global access permission tool that spans multiple lakes, warehouses, and marts. Nevertheless, it's essential to provide access controls for compliance and security.
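As a rough sketch of the alternative to global access, the example below gives write access only to the owning team and lets other teams be granted read access explicitly. The AccessPolicy class and the team names are hypothetical, and a real deployment would delegate this to the governance tooling of the lake or warehouse rather than to application code.

```python
from enum import Enum


class Permission(Enum):
    READ = "read"
    WRITE = "write"


class AccessPolicy:
    """Hypothetical per-product policy: the owner reads and writes, others only read."""

    def __init__(self, owner_team: str) -> None:
        self.owner_team = owner_team
        # Only the owning team starts with any permissions at all.
        self._grants: dict[str, set[Permission]] = {
            owner_team: {Permission.READ, Permission.WRITE}
        }

    def grant_read(self, team: str) -> None:
        """Explicitly allow another team to read the data product."""
        self._grants.setdefault(team, set()).add(Permission.READ)

    def is_allowed(self, team: str, permission: Permission) -> bool:
        """Deny by default: unknown teams have no permissions."""
        return permission in self._grants.get(team, set())


# Illustrative teams; the names are made up.
policy = AccessPolicy(owner_team="orders-team")
policy.grant_read("analytics-team")

print(policy.is_allowed("analytics-team", Permission.READ))   # True
print(policy.is_allowed("analytics-team", Permission.WRITE))  # False
print(policy.is_allowed("marketing-team", Permission.READ))   # False
```

Because no method hands out write access to non-owners, one team cannot overwrite another team's data through this policy.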

API-Like

APIs are a standard way to scale functionality. They allow both external and internal systems to consume that functionality. I also see data as a product from an API perspective: data products scale out the consumption of data. As with APIs, data as a product comes with responsibilities such as versioning, availability, reliability, and so forth. In the world of big data, an API maps to a data schema. Once data is delivered, it has to conform to API-like features, so we need to benchmark our data products against APIs. The communication method differs from that of APIs, but the consumption is similar. Publishing features or data through an API requires critical thinking, and we need a similar thinking process for data as a product. One shouldn't publish additional information unless it's necessary. Note that a field is easy to add but hard to remove, because consumers depend on the published version of the schema.
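The asymmetry between adding and removing can be sketched as a schema compatibility check. The example assumes a schema is a plain mapping from field name to type name and uses a hypothetical major/minor version scheme in which additions bump the minor version and removals or type changes bump the major version. Neither assumption comes from a specific tool.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List the changes that would break consumers of the old schema."""
    problems = []
    for name, dtype in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != dtype:
            problems.append(f"changed type of {name}: {dtype} -> {new[name]}")
    return problems


def next_version(version: tuple[int, int], old: dict[str, str], new: dict[str, str]) -> tuple[int, int]:
    """Bump major on breaking changes, minor on pure additions, else keep the version."""
    major, minor = version
    if breaking_changes(old, new):
        return (major + 1, 0)
    if new != old:
        # No field was removed or retyped, so the only difference is added fields.
        return (major, minor + 1)
    return version


# Illustrative schemas; field names and types are made up.
current = {"order_id": "string", "amount": "decimal"}
with_currency = {**current, "currency": "string"}
without_amount = {"order_id": "string"}

print(next_version((1, 2), current, with_currency))   # (1, 3)
print(next_version((1, 2), current, without_amount))  # (2, 0)
print(breaking_changes(current, without_amount))      # ['removed field: amount']
```

Existing consumers keep working after the addition, while the removal forces every consumer of amount to migrate, which is why it pays to be conservative about what gets published in the first place.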

Feature	Microservice	Data as a Product
Observability	Access to the metrics to identify service problems	Access to the metrics to identify data quality problems
Discoverability	A way to locate services in a network	A way to locate data by stakeholders
Availability	Available to function even if there are failures	Available to the stakeholders at any time through querying mechanisms
Security	Access to the API functionality authenticated and authorized	Access to the data goes through authentication and authorization
Trustworthiness	When a service call succeeds, it executes the underlying procedure successfully	Data complies with data quality requirements
Design for Failure	Failures are part of the design	Failures don’t halt data processing
Data Management	Each microservice ideally manages its own data	Data management from inception to consumption of data
Domain	Each microservice serves functionality for a domain	Each data as a product covers a portion of a domain or an entire domain
Consumption Method	Each microservice has an API for clients to consume its functionality	Each data as a product provides a schema for data stakeholders to gain data insights

Microservice vs. Data as a Product

Summary

Data as a Product is a new way of thinking about data. It shifts the thinking from data at rest in a table in some data source to API-like concerns for data. The API-like approach requires initiatives that hold producers of data accountable for its quality, advertisement, security, and communication.

References

Book: Domain-Driven Design: Tackling Complexity in the Heart of Software

Article: BoundedContext

Article: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

Book: Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems
