Data Mesh or Data Lakehouse, which one to choose?

Comparing Data Mesh to Data Lakehouse is like comparing apples to oranges. Data Mesh is an architectural approach and organizational concept, whereas Lakehouse architecture is a data solution concept that combines the best of both the data warehouse and the data lake.

Too simple? Not really. There are tons of articles and books defining every aspect of these two topics, but let’s try to keep them simple enough to start the journey.

Data Lakehouse

The data warehouse concept goes back to the 1980s; warehouses stored structured data for reporting and analysis purposes. Products and methodologies for data warehouses were focused on the challenges of that era: slow, expensive storage. Fast forward to around 2010, and the data lake was born, storing unstructured, semi-structured and structured data in open, standardized formats on cheap storage. Unfortunately, many data lakes turned into data swamps, primarily because of coarse-grained access control, poor query performance and inadequate controls on schema changes, which sometimes led to data corruption.

Although a few vendors had already started using the term Lakehouse, it was a 2020 blog post written by senior leaders at Databricks that defined the architecture. They defined Lakehouse as an open data management architecture that combines data warehouses and data lakes, with the following key features:

  1. Transaction support, i.e. support for ACID transactions
  2. Schema enforcement and governance
  3. BI tools query the source data directly, instead of going through a data warehouse
  4. Storage is decoupled from compute
  5. Uses open standards
  6. Support for diverse data types, ranging from unstructured to structured data
  7. Support for diverse workloads, including data science, machine learning, and SQL and analytics
  8. End-to-end streaming

A paper presented by Databricks leadership at the Conference on Innovative Data Systems Research (CIDR) 2021 provided an example of a Lakehouse system design. The system centers on a metadata layer, such as Delta Lake, that adds transactions, versioning, and auxiliary data structures over files in an open format, and that can be queried with diverse APIs and engines.
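To make the idea of a metadata layer more concrete, here is a minimal sketch in Python. It is a toy, in-memory illustration only and is not how Delta Lake or any other product is implemented: the class name ToyTable, its fields and its methods are all hypothetical. It assumes that a "table" is simply a set of data files plus an append-only log, and shows how such a log can provide schema enforcement, all-or-nothing commits and versioned reads.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table with a metadata layer."""
    schema: dict                                # column name -> Python type
    files: dict = field(default_factory=dict)   # file name -> rows (the "data files")
    log: list = field(default_factory=list)     # one entry per committed version

    def commit(self, rows):
        """Validate rows against the schema, then record them as a new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match the schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"column {col!r} expects {typ.__name__}")
        # Only reached when every row is valid, so a bad batch changes nothing.
        name = f"part-{len(self.log):05d}"
        self.files[name] = list(rows)
        self.log.append({"version": len(self.log), "add": name})
        return len(self.log) - 1

    def read(self, version=None):
        """Return the rows visible at a given version (latest by default)."""
        if version is None:
            version = len(self.log) - 1
        rows = []
        for entry in self.log[: version + 1]:
            rows.extend(self.files[entry["add"]])
        return rows


table = ToyTable(schema={"order_id": int, "amount": float})
table.commit([{"order_id": 1, "amount": 9.5}])    # becomes version 0
table.commit([{"order_id": 2, "amount": 20.0}])   # becomes version 1
latest_rows = table.read()            # rows from both commits
first_version = table.read(version=0) # only the rows from the first commit
```

Readers only ever see files that the log references, which is why a rejected batch leaves the table untouched and why an older version can still be read after newer commits.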

Credit: Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics by Michael Armbrust, Ali Ghodsi, Reynold Xin, Matei Zaharia (http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)

Lakehouse implementations typically use Delta Lake, Apache Hudi, Apache Hive 2+, Apache Iceberg or any other format that supports ACID transactions.

Data Mesh

What is Data Mesh? Zhamak Dehghani, who originated the concept, describes it as a decentralized sociotechnical approach to managing and accessing analytical data at scale. She coined the term in her 2019 article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, where she presents the case for Data Mesh.

Her argument rests on the observation that while operational systems have successfully applied domain-driven design and bounded contexts, domain concepts have been disregarded in data platforms. Operational systems have, for example, moved to a microservices architecture by decomposing systems into distributed services built around business domain capabilities and ownership, but data is still stored in one central repository, whether it is called an enterprise data warehouse or a data lake. Data teams consist of hyper-specialized data engineers siloed from the operational units of the organization; they are grouped into a team based on their technical expertise in data tooling, and often lack business and domain knowledge. They become overwhelmed by the amount of data they need to look after, leading to bottlenecks and delays in implementation and support.

So, how do we resolve this? We distribute data products, oriented around domains and owned by independent cross-functional teams that have embedded data engineers and data product owners, using common data infrastructure as a platform to host, prepare and serve their data assets.

Data mesh is an intentionally designed distributed data architecture, under centralised governance and standardisation for interoperability, enabled by a shared and harmonised self-serve data infrastructure.

Data Mesh Principles

Zhamak Dehghani, in her article Data Mesh Principles and Logical Architecture, defines four principles that any data mesh implementation should follow:

1. Domain Ownership

Organisational units, based on business domains, take ownership of providing analytical data. This enables the organisation to scale out, as each unit owns and delivers the analytical data that it produces rather than relying on a centralised data warehousing team. Every node on the mesh corresponds to one domain, which owns the delivery of its analytical data.

2. Data as a product

Domain teams provide data assets or analytical data as a product, and the rest of the organisation’s data scientists, ML engineers and data engineers are customers of that product. This allows consumers to discover, understand and securely use high-quality data that is distributed across many domains. In data mesh, the data product is the architectural quantum, and is the node on the mesh that encapsulates three structural components: a) Code: for data processing, access and governance, b) Data and metadata: analytical and historical data in polyglot form, and c) Infrastructure: execution engine and storage. Following is the logical representation of a data product by Dehghani:

Credit: Zhamak Dehghani, Data Mesh Principles and Logical Architecture (https://martinfowler.com/articles/data-mesh-principles.html)
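As a rough illustration of those three structural components, the following Python sketch models a data product as a single object. This is an assumption-laden simplification and not Dehghani's specification: the class DataProduct, its field names, the output_port method and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataProduct:
    """Hypothetical model of a data product as one deployable unit."""
    name: str
    domain: str
    owner: str                                  # the accountable domain team
    # a) Code: the logic that turns raw domain data into served data
    transform: Callable[[list], list]
    # b) Data and metadata: what consumers need in order to discover and use it
    metadata: dict = field(default_factory=dict)
    # c) Infrastructure: where the product is stored and how it is executed
    infrastructure: dict = field(default_factory=dict)
    _data: list = field(default_factory=list, init=False, repr=False)

    def ingest(self, raw_rows):
        """Run the product's own code over raw rows and keep the result."""
        self._data = self.transform(raw_rows)

    def output_port(self):
        """Serve a copy of the prepared data to consumers."""
        return list(self._data)


orders = DataProduct(
    name="daily_orders",
    domain="sales",
    owner="sales-data-team",
    transform=lambda rows: [r for r in rows if r["amount"] > 0],
    metadata={"schema": {"order_id": "int", "amount": "float"}},
    infrastructure={"storage": "object-store", "engine": "sql"},
)
orders.ingest([{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}])
served = orders.output_port()   # only the row with a positive amount is served
```

The point of bundling code, metadata and infrastructure together is that the owning domain team can change any of them without waiting on a central data team, as long as the output port keeps its contract.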

3. Self-serve data platform

Self-serve data infrastructure is provided as a platform to enable domain autonomy. This platform should be domain-agnostic and provide data infrastructure components in a self-service manner, enabling data teams to quickly create and consume data products.

4. Federated computational governance

Data mesh follows a distributed system architecture consisting of independent data products managed by independent teams. It therefore requires a federated governance model that supports decentralisation and the independence of data domains, while still following interoperability standards that are built into the platform.
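One way to read "computational" governance is that the global rules are expressed as code and checked automatically against every data product. The sketch below assumes that each product publishes a small descriptor as a dictionary; the policy names, the descriptor fields and the check_product function are hypothetical and only illustrate the idea.

```python
# Global policies agreed by the federated governance group (hypothetical rules).
GLOBAL_POLICIES = {
    "has_owner": lambda p: bool(p.get("owner")),
    "has_schema": lambda p: bool(p.get("schema")),
    # Every column must state explicitly whether it holds personal data.
    "pii_is_classified": lambda p: all(
        "pii" in col for col in p.get("schema", {}).values()
    ),
}


def check_product(product):
    """Return the names of the global policies that the product violates."""
    return [name for name, rule in GLOBAL_POLICIES.items() if not rule(product)]


products = [
    {"name": "daily_orders", "owner": "sales-data-team",
     "schema": {"order_id": {"type": "int", "pii": False}}},
    {"name": "customer_emails", "owner": "",
     "schema": {"email": {"type": "str"}}},
]

# Each domain keeps its autonomy; the platform runs the same checks for all.
report = {p["name"]: check_product(p) for p in products}
```

In this sketch, daily_orders passes every check, while customer_emails is flagged for a missing owner and an unclassified column. The domain team still decides how to fix it; the shared rules only define what "compliant" means.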

These four principles are the cornerstone of a data mesh architecture that aims to co-locate analytical data with operational data by domain. Analytical data has different technical requirements, including data processing and storage requirements, but these can still be catered for while remaining under the umbrella of the respective domain. Dehghani depicts the data mesh logical architecture as follows:

So, Data Mesh or Lakehouse?

Lakehouse architecture is a data solution concept that tries to address all data requirements in one solution, by offering the best of the data warehouse and the data lake. If the requirement is to build one centralized enterprise data solution, Lakehouse architecture is a good option, as it caters for most use cases.

The Data Mesh approach requires decomposing centralized data teams into domain-specific data teams, which own the analytical data, aka the data product, in their domain. Governance and interoperability standards could be a centralized function, and data infrastructure as a platform could perhaps be supported by a centralized team, but the development and maintenance of data products would be decentralized. These domain data teams may therefore require a data warehouse, data lake or lakehouse for the delivery and maintenance of each specific data product.

In a Data Mesh logical architecture, a centralized Lakehouse, data lake or data warehouse would not exist. These data processing and storage tools are used for the internal implementation of data products or as part of the shared data infrastructure. Performance, size, format types, and complexity will drive the tool selection, which could vary by domain unless a shared data infrastructure is agreed with all domain owners. Shared data infrastructure allows technical knowledge to be shared and brings cost benefits for the organisation.

Most importantly, Data Mesh requires a change of mindset in how you view the data ecosystem. Dehghani emphasizes that the domain data product should be treated as a first-class concern, and data tools and solutions as a second-class concern, i.e., an implementation detail. “This inverts the current mental model from a centralized data lake to an ecosystem of data products that play nicely together, a data mesh. The same principle applies to the data warehouse for business reporting and visualization. It’s simply a node on the mesh, and possibly on the consumer-oriented edge of the mesh.”

Muhammad Sunil Bokhari – Head of Data and Analytics