Designing a Data Lake on Azure

Many companies have embarked on the journey of establishing a data lake as an essential component of their data strategy. As organisations become more data centric and data driven, it is essential to have data infrastructure that enables data driven decision making. Designing that data infrastructure well is key to the successful adoption of data in an organisation’s day-to-day decision making. This article discusses the questions, factors and reasoning that should be applied when designing a data lake on the Azure cloud.

Data Lake vs Data Warehouse

The first question that a data strategist should ask is: do we need a data lake? Organisations have depended on data warehouses for decades to address their reporting needs. These were traditional data warehouses set up in relational databases, housing all the relevant data needed for reporting. A data lake’s main purpose is to drive innovation and facilitate exploration. Gartner recognises the data lake as an extension of the logical data warehouse design, with the difference that it tries to answer unknown questions with unknown data.

Image credit: https://blogs.gartner.com/adam-ronthal/files/2018/05/DAIM_DMSA.png

A data lake is a repository containing both raw and curated data. It gives data scientists and data analysts fast access to explore patterns and relationships within the lake, enabling them to drive innovation and unlock insights. Inherently, data lakes are expected to be big, use cheaper storage than a cloud data warehouse, contain different versions of the same data and allow analytics workloads to run directly against the lake. More importantly, a data lake ensures that the data warehouse doesn’t unnecessarily hold vast quantities of raw, transient and curated data; the lake takes care of those, so the warehouse houses cleansed, modelled data.

Azure Blob Storage vs Azure Data Lake Store

In Azure, you have two options for data lake storage: Azure Blob Storage and Azure Data Lake Storage (ADLS). Knowing the difference between these two options is important, as they differ in a few features that can impact your architectural decisions.

Azure Blob and ADLS Gen2 are provisioned through an Azure Storage account. An Azure Storage account contains four different storage services – Blobs, Queues, Tables and Files. For designing a data lake, we will only evaluate Blob storage.

Azure Blob Storage

Azure Blob storage is optimized for storing massive amounts of unstructured data. After creating an Azure Storage account, you can create containers, which are used to organize a set of blobs, like a directory in a file system – similar, but not the same! Blob storage accounts are only capable of mimicking a hierarchical folder structure; they do not support true directories. Blob storage also offers the following benefits:

  • Tiered access – Hot, Cool, Archive
  • Low-cost storage option for unstructured data
  • Built-in high-availability and disaster recovery

Azure Data Lake Store (ADLS)

Azure originally launched ADLS Gen1, which has now been superseded by Gen2. Gen1 provided HDFS compatibility, ACL support, unlimited storage and access via a file system. ADLS Gen2 converges the capabilities of ADLS Gen1 with Azure Blob Storage: it provides file system semantics, file-level security and scale. Because these capabilities are built on Blob storage, it also offers low-cost, tiered storage with high-availability and disaster-recovery capabilities.

So how is ADLS Gen2 provisioned? In the Azure portal, when provisioning an Azure Storage account, checking “Enable hierarchical namespace” provisions ADLS Gen2. Remember, the hierarchical namespace setting cannot be changed later!
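
Provisioning can also be scripted. Below is a minimal sketch using the azure-mgmt-storage and azure-identity Python packages; the subscription ID, resource group, account name and region are placeholders, and exact parameter shapes can differ slightly between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

params = StorageAccountCreateParameters(
    location="australiaeast",            # placeholder region
    kind="StorageV2",                    # general-purpose v2 account
    sku=Sku(name="Standard_LRS"),
    is_hns_enabled=True,                 # hierarchical namespace = ADLS Gen2
)

poller = client.storage_accounts.begin_create("rg-datalake", "contosodatalake", params)
account = poller.result()                # wait for provisioning to complete
print(account.name, account.is_hns_enabled)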

Additional Features in ADLS

The most crucial benefit of using ADLS Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access. A common object-store naming convention uses slashes in object names to mimic a hierarchical directory structure; with ADLS Gen2 this structure becomes real. Operations such as renaming or deleting a directory become single, atomic metadata operations on the directory, with no need to enumerate and process every object that shares the directory’s name prefix. A hierarchy of directories can then efficiently represent the various zones, business units, projects and provenance of the data in the lake.
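
For example, a directory rename through the azure-storage-file-datalake Python SDK is a single metadata operation; the account, container and directory names below are placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)

# Rename a whole directory atomically; no per-blob copy/delete is needed because
# the hierarchical namespace treats the directory as a real object.
directory = service.get_file_system_client("raw").get_directory_client("sales/2023")
directory.rename_directory(new_name="raw/sales/archive-2023")  # "<filesystem>/<new path>"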

Secondly, ADLS Gen2 gives users of Azure Blob Storage a new driver, the Azure Blob File System driver (ABFS), which is part of Apache Hadoop. For connectivity to ADLS Gen2, the ABFS driver uses the Distributed File System (DFS) endpoint to access data in Azure Blob Storage without any code explicitly referencing Data Lake Storage Gen2. The ABFS driver defines its own URI scheme so that resources (directories and files) can be distinctly addressed. The structure of the URI is:

abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>
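
For example, a PySpark job can address a dataset directly through an abfss URI. The container, account and path below are placeholders, and the cluster is assumed to already be configured with credentials for the storage account.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-adls").getOrCreate()

# The abfs[s] URI is resolved by the ABFS driver against the DFS endpoint.
path = "abfss://curated@contosodatalake.dfs.core.windows.net/finance/sales_model"
df = spark.read.format("parquet").load(path)
df.show(10)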

Thirdly, ADLS Gen2 implements an access control model that supports both Azure role-based access control (Azure RBAC) and POSIX-like access control lists (ACLs). Both require the user (or application) to have an identity in Azure AD. Azure RBAC lets you grant “coarse-grain” access to storage account data, such as read or write access to all of the data in a storage account, while ACLs let you grant “fine-grained” access, such as write access to a specific directory or file. ACLs are not available on plain Azure Blob storage, which can only be secured down to container level.

Data Lake Design Considerations

As you progress on your journey of building a data lake strategy, one of the value propositions of a data lake is to serve as a single data store for all analytics scenarios. What needs to be understood is that this single data store is a logical concept rather than a physical realisation: it can be physically implemented as a single ADLS Gen2 account or as multiple storage accounts, possibly in different regions. So what are the factors to consider when designing a data lake?

Centralised vs Federated Data Lakes

Many organisations are moving towards establishing data products, an implementation of the Data Mesh approach, which requires a federated data infrastructure. Other organisations operate in multiple countries and must adhere to data residency laws, ensuring data is stored within the relevant jurisdiction. A few organisations run a multi-tenant analytics platform serving multiple customers; in this scenario there is a contractual obligation to keep customer data and the associated analytics workloads isolated from other customers, and to manage cost and billing per customer. Some organisations simply want to keep storage accounts separate for billing or decentralised management reasons.

The choice between a centralised and a federated model will be determined by the factors above. However, these requirements can be implemented with a single storage account or with multiple storage accounts. Access can be controlled at subscription, account, container or folder level. Having multiple subscriptions, accounts, containers or folders does not in itself incur additional cost; storage and compute determine the total cost. This is primarily an administrative, security and cost-management decision based on the factors the organisation values most. The following factors determine how the data lake is set up using Azure components:

  • Azure subscription limits & quotas on other resources, and service limits (link)
  • Resource groups (RGs) should be in the correct region to ensure compliance with data sovereignty laws.
  • Capability of the organisation to provide Azure administration support, including managing security. A single storage account requires less overhead in managing and supporting operations such as RBAC assignments, firewall settings and data lifecycle management policies for all the data in the account. Different accounts could have different security policies and data lake structures, which may require multiple support functions or a large centralised support group.
  • Contractual or legal obligations that dictate restricted, limited or controlled access to data. These would require a separate subscription or resource group to ensure that the customer data and the associated analytics workloads are isolated, and would also allow the organisation to fulfil data governance requirements.
  • Each environment (e.g., development, test, production) should have a separate storage account and ideally a different subscription.
  • Validate any data isolation requirement that would genuinely require separate subscriptions, resource groups and accounts. Data isolation should not be confused with role-based restricted access. For example, the finance department may want to restrict data to finance personnel only; however, that is an RBAC requirement and can be managed at container or folder level rather than through a separate subscription, resource group or account.
  • Watch for duplication of data across multiple subscriptions, resource groups or storage accounts. If the same data is stored in multiple ADLS accounts, the data isolation requirements should be revalidated.

Data Lake Structure

Data in an ADLS Gen2 account is organised in a hierarchy of containers, directories and files, in that order. The following are a few approaches to organising the data in a data lake.

Zones

Zones define phases in the data lifecycle. Physically, they can be implemented using containers, separate storage accounts or multiple folders within a container. A simple approach is to start with a few generic zones (or layers) and then build them out organically as more sophisticated use cases arise. The following are some zones commonly used in implementations; a sketch of creating them as containers follows the list.

  • Raw Zone: This zone stores the data as received. The ingestion process does not clean or filter data. Some implementations store it in its original format (such as JSON or CSV); however, a better implementation stores it in compressed formats such as Avro, Parquet or Databricks Delta Lake to manage storage costs better. The data is usually categorised by Data Source > Data Set > Processing Date. Access to this zone is read-only, with write access limited to the ingestion process’s service principal. This zone usually contains the largest amount of data, so its data can be moved to the cool tier, either programmatically or through a lifecycle management policy.
  • Enriched/Cleansed/Staging Zone: This zone stores the data with native data types, applies data validation, removes duplicate or derivable columns and applies column standardisation. It may also add new datasets that combine existing datasets to link or aggregate data for performance or update efficiency. Data is always stored in compressed formats such as Avro, Parquet or Databricks Delta tables, and the ingestion process applies insert, update and merge logic. Datasets are usually partitioned by a data value such as transaction date, location or range, rather than by something tied to the ingestion process (e.g., ingestion date). The ingestion process also uses datasets in this zone to identify deltas or CDC, which is useful if another ingestion process branches off this zone and updates a cloud data warehouse, e.g., Snowflake. Data scientists and engineers may have read-only access to this zone.
  • Modelled/Curated Zone: This zone is equivalent to the main facts and dimensions layer in a data warehouse. Data in this zone is fit for consumption by data scientists or BI/data analysts. Data is modelled either as denormalised data marts, star schemas or One Big Tables (OBT) addressing specific reporting/analytics requirements. Data is referenced, validated and well documented, with governance applied to these datasets. This zone acts as the single source of truth for enterprise data lakes. As data lake costs are lower than those of a cloud data warehouse, this zone can retain a granular, linked version of the oldest data, while the cloud data warehouse stores aggregated, recent data. Data in this zone can be organised by business unit, data product or consumer.
  • Sandbox/Workspace/Laboratory Zone: This is an optional zone, most popular with data scientists. It allows data scientists, data engineers and analysts to bring in other datasets and use them alongside curated data to explore and prototype insights and models. It is an exploratory area that uses production data and production governance; it is not equivalent to test/UAT environments, which cannot hold production data. Data is typically organised by data source, project and team. Users typically have read-write access to this zone.

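The zone containers themselves can be created up front. The following is a minimal sketch using the azure-storage-file-datalake Python SDK; the account name is a placeholder and the zone names follow the list above.

from azure.core.exceptions import ResourceExistsError
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)

# One container (file system) per zone; ignore zones that already exist.
for zone in ["raw", "cleansed", "curated", "sandbox"]:
    try:
        service.create_file_system(zone)
    except ResourceExistsError:
        pass
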
Folder Structure/Hierarchy

Folders define how data is organized within a zone. Factors that determine the folder hierarchy include:

  • Logical separation e.g., transaction vs ingestion date
  • Ingestion pattern e.g., daily, weekly, monthly
  • Partition strategy
  • Access control model
  • Data sources
  • Zones
  • Sensitive sub-zones e.g., General, Sensitive

Let’s see what a sample folder hierarchy might look like in each zone (a path-building sketch follows these examples):

  • Raw Zone: Folder hierarchy in this zone is typically based on data source, dataset and ingestion pattern. There may optionally be a landing folder within this zone, organized by data source or by the current date timestamp; the landing area is temporary storage for any data received. In the raw zone, data is typically organized in the following hierarchy:

\Raw\<Sensitivity level>\<Data Source>\<Dataset>\<Ingestion Date YYYY>\<Ingestion Date MM>\<Ingestion Date DD>

  • Enriched/Cleansed/Staging Zone: Folder hierarchy in this zone can vary based on the organisation’s data lake maturity. For organisations with a dedicated, mature data lake capability, the folder hierarchy can be based on subject area or business unit. For organisations with less maturity, or which want to use this zone for CDC to push updates to a cloud data warehouse or other systems, the folder hierarchy should preferably be based on data source and dataset. Within a dataset, the folder hierarchy depends on the partition strategy, which in most cases is managed automatically by the ingestion process. In most implementations, this zone acts as the single source of cleansed records, hosting data from inception. The following are sample folder hierarchies:

Option 1: \cleansed\<Sensitivity level>\<Subject area/BU>\Dataset\<Partition key e.g., transaction date>

Option 2: \cleansed\<Sensitivity level>\<Data Source>\Dataset\<Partition key e.g., transaction date>

  • Modelled/Curated Zone: Folder hierarchy in this zone is determined by the modelling methodology. The zone can store data as models, either star-schema based or in Operational Data Store (ODS) normalized form. These models have a high-level categorisation, often referred to as a subject area (e.g., procurement, asset management), or are organised by business-unit-specific data. This zone hosts the single source of truth, incorporating any data treatment that gives it its correct definition. The following is a sample folder hierarchy:

\curated\<Subject Area/BU>\<Domain>\<Model>\<Partition key>

  • Sandbox/Workspace/Laboratory Zone: This zone is more of a “play” area for analysts, so the folder hierarchy is focused on teams and users. Additionally, it can have extra folders for storing new external data that data scientists would like to experiment with. The following are sample folder hierarchies:

Option 1: \sandbox\<Team>\<User>

Option 2: \sandbox\<External Data Source>\<Dataset>
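
As an illustration, the raw-zone convention above can be encoded in a small helper that derives the path and writes an ingested file to it. This is a sketch only; the account, container, data source, dataset and sensitivity level are placeholders.

from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient


def raw_zone_path(sensitivity: str, source: str, dataset: str, ingested_at: datetime) -> str:
    """Render <Sensitivity>\\<Data Source>\\<Dataset>\\<YYYY>\\<MM>\\<DD> as a Gen2 path."""
    return (f"{sensitivity}/{source}/{dataset}/"
            f"{ingested_at:%Y}/{ingested_at:%m}/{ingested_at:%d}")


service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)
raw_fs = service.get_file_system_client("raw")                   # raw zone container

path = raw_zone_path("general", "salesforce", "opportunities", datetime.now(timezone.utc))
directory = raw_fs.create_directory(path)                        # creates the hierarchy if missing
file_client = directory.create_file("opportunities_001.parquet")
file_client.upload_data(b"...", overwrite=True)                  # payload elided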

File Formats

Data can be ingested in various formats. Data files can be in human-readable formats such as JSON, CSV or XML, or in compressed binary formats such as .tar.gz. Data lakes also contain unstructured data such as images, audio and video files.

While ADLS Gen2 supports storing all kinds of data without imposing any restrictions, it is better to think about data formats to maximize efficiency of your processing pipelines and optimize costs. Choosing the most appropriate format will often be a trade-off between storage cost, performance and the tools used to process and consume data in the lake. The type of workloads may also influence the decision, such as real-time/streaming, append-only or DML heavy.

For Event Hubs or Kafka based data sources that are write-heavy, or where query patterns favour retrieving multiple rows of records in their entirety, the Avro file format is better. In read-heavy scenarios, where queries return a subset of columns instead of entire records, Parquet or Delta Lake tables are better. Delta Lake tables are the preferred choice for the curated and enriched/staging zones as they provide ACID properties, and can deliver better performance through predicate pushdown/file skipping and column pruning. Choosing the right partition key is very important for ensuring good performance of Delta tables or Parquet files.
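
The following PySpark sketch writes a cleansed dataset as a Delta table partitioned by a business value (transaction date). The paths, column names and cleansing steps are illustrative, and a cluster with Delta Lake available (e.g., Databricks) is assumed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansed-load").getOrCreate()

raw_path = "abfss://raw@contosodatalake.dfs.core.windows.net/salesforce/opportunities"
cleansed_path = "abfss://cleansed@contosodatalake.dfs.core.windows.net/sales/opportunities"

df = (spark.read.parquet(raw_path)
          .dropDuplicates(["opportunity_id"])                      # illustrative cleansing step
          .withColumn("transaction_date", F.to_date("closed_at"))) # illustrative business date

# Partition by a business value rather than ingestion date so that read-heavy
# queries can prune files effectively.
(df.write.format("delta")
   .mode("append")
   .partitionBy("transaction_date")
   .save(cleansed_path))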

Raw zone data can be saved either in its original format or in Parquet. Parquet supports efficient data compression and encoding schemes that can lower storage costs. Costs are also reduced through shorter compute (Spark or Data Factory) times and more optimal read operations: for files greater than 4 MB in size, a lower price applies for every 4 MB block of data read beyond the first 4 MB, so reading a single 16 MB file is cheaper than reading four files of 4 MB each.

When processing data with Spark, the typical guidance is around 64 MB to 1 GB per file. It is well known in the Spark community that thousands of small files (KBs in size) are a performance nightmare. In the raw zone this can be a challenge, particularly for streaming data, which typically arrives as smaller files/messages at high velocity. Files will need to be regularly compacted/consolidated; for those using the Databricks Delta Lake format, OPTIMIZE or even auto optimize can help.
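
For example, on Databricks a Delta table in the raw zone can be compacted with OPTIMIZE, and auto optimize can be switched on through table properties. The path below is a placeholder and these table properties are Databricks-specific.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = "abfss://raw@contosodatalake.dfs.core.windows.net/telemetry/events"

# One-off compaction of small files into larger ones.
spark.sql(f"OPTIMIZE delta.`{events}`")

# Opt the table into optimized writes and automatic compaction going forward.
spark.sql(f"""
  ALTER TABLE delta.`{events}`
  SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")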

Another option is to introduce an intermediate data lake zone/layer between raw and cleansed, which periodically takes uncompressed and/or small files from raw and compacts them into larger, compressed files in this new layer. If raw data ever needs to be extracted or analysed, those processes can run more efficiently against the intermediate layer rather than against raw.
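
A compaction step of that kind might look like the following PySpark sketch, which reads small raw JSON files for one day and rewrites them as fewer, larger compressed Parquet files in an intermediate layer (zone and layer names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-compaction").getOrCreate()

raw_path = "abfss://raw@contosodatalake.dfs.core.windows.net/telemetry/events/2023/06/01"
compacted_path = "abfss://standardised@contosodatalake.dfs.core.windows.net/telemetry/events/2023/06/01"

small_files = spark.read.json(raw_path)

# coalesce() reduces the number of output files; size the target count so each
# file lands roughly in the 64 MB - 1 GB range recommended above.
(small_files.coalesce(8)
    .write.mode("overwrite")
    .option("compression", "snappy")
    .parquet(compacted_path))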

Use lifecycle management to archive raw data to reduce long-term storage costs without having to delete data.
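
The following sketch applies such a policy with the azure-mgmt-storage Python SDK: raw-zone blobs move to the cool tier after 90 days and to archive after 365 days. The thresholds, prefix, resource group and account name are illustrative, and depending on the SDK version you may need to build the rule with model classes rather than a plain dictionary.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

lifecycle_policy = {
    "policy": {
        "rules": [
            {
                "name": "age-off-raw-zone",
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": 90},
                            "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                        }
                    },
                },
            }
        ]
    }
}

# The lifecycle policy resource of a storage account is always named "default".
client.management_policies.create_or_update(
    "rg-datalake", "contosodatalake", "default", lifecycle_policy
)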

In summary, with larger data volumes and greater data velocity, file formats play a crucial role in ingestion and analytical performance. In the raw zone, where there is a greater likelihood of an accumulation of smaller files, particularly in IoT-scale scenarios, compression is another important consideration. Leaving files in raw formats such as JSON or CSV may incur a performance or cost overhead.

Identity and Access Management

ADLS Gen2 supports the following authorization mechanisms:

  • Shared Key authorization
  • Shared access signature (SAS) authorization
  • Role-based access control (Azure RBAC)
  • Access control lists (ACL)

Shared Key and SAS authorization should only be used in exceptional cases, as they grant access to identities outside AAD (Azure Active Directory), and Azure RBAC and ACLs then have no effect.

As noted earlier, Azure RBAC and ACLs both require the user (or application) to have an identity in AAD. Azure RBAC grants “coarse-grain” access, such as read or write access to all of the data in a storage account, while ACLs grant “fine-grained” access, such as write access to a specific directory or file. Azure RBAC is evaluated first: if a role assignment grants the requested access, ACLs are not evaluated. You cannot use an ACL to restrict access that has already been granted by a role assignment.

RBAC

Azure role-based access control (Azure RBAC) helps you manage who has access to Azure resources, what they can do with those resources, and what areas they have access to. Azure RBAC is an authorisation system built on Azure Resource Manager that provides fine-grained access management of Azure resources. Access is controlled by assigning Azure roles; this is how permissions are enforced. A role assignment consists of three elements: a security principal, a role definition and a scope.

  • Security principal: An object that represents a user, group, service principal, or managed identity that is requesting access to Azure resources
  • Role definition: A collection of permissions, also referred to simply as a role. It lists the actions that can be performed, such as read, write and delete.
  • Scope: Set of resources that the access applies to. Scope can be specified at four levels: management group, subscription, resource group, or resource.

Azure RBAC is usually used at either storage account or container level in ADLS, although it can also be applied at resource group or subscription level.

Roles such as Owner, Contributor, Reader, and Storage Account Contributor permit a security principal to manage a storage account, but do not provide access to the data within that account. However, these roles (excluding Reader) can obtain access to the storage keys, which can be used in various client tools to access the data.

The following are the commonly used built-in RBAC roles (a role-assignment sketch follows the list):

  • Storage Blob Data Owner: Use to set ownership and manage POSIX access control for Azure Data Lake Storage Gen2.
  • Storage Blob Data Contributor: Use to grant read/write/delete permissions to Blob storage resources.
  • Storage Blob Data Reader: Use to grant read-only permissions to Blob storage resources.
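
As a sketch, the following assigns Storage Blob Data Reader to an Azure AD security group at the scope of a single container, using the azure-mgmt-authorization Python SDK. The subscription, resource group, account, container and group object ID are placeholders, and parameter shapes can differ between SDK versions; the GUID is the built-in role definition ID for Storage Blob Data Reader.

import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Scope narrowed to a single container of the data lake account.
scope = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/rg-datalake"
    "/providers/Microsoft.Storage/storageAccounts/contosodatalake"
    "/blobServices/default/containers/curated"
)

# Built-in role definition ID for Storage Blob Data Reader.
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
    "roleDefinitions/2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"
)

client.role_assignments.create(
    scope,
    str(uuid.uuid4()),                                   # role assignment name is a new GUID
    RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<aad-group-object-id>",            # prefer a security group over users
        principal_type="Group",
    ),
)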

ACL

In ADLS Gen2, each file and directory has a corresponding access control list (ACL). Each access level granted to a security principal on a file or directory is captured as an entry in that ACL, and an ACL check determines whether the security principal has the permission level required to perform the operation.

Access to the data is implemented using a combination of execute, read and write permissions at the appropriate folder and file level. Execute is only used in the context of folders and can be thought of as search or list permission for that folder. To access a folder or file using read or write permissions, every parent folder in the hierarchy up to the root also needs execute permission; a worked sketch follows the list of ACL types below.

There are two types of ACLs:

  • Access ACL: Control access to a file or a folder
  • Default ACL: Template of ACLs associated with a directory that determines the access ACLs for any child items created under that directory. Files do not have default ACLs.
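
For example, the following sketch uses the azure-storage-file-datalake Python SDK to grant an Azure AD security group read and execute access to a curated-zone directory, together with a default ACL so that new child items inherit the same entry. The account, container, directory and group object ID are placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)
finance_dir = service.get_file_system_client("curated").get_directory_client("finance")

group_id = "<finance-analysts-group-object-id>"

# Base owner/group/other entries, an access ACL for the group (r-x = read + list),
# and a default ACL so items created under this directory inherit the same entry.
acl = (
    "user::rwx,group::r-x,other::---,"
    f"group:{group_id}:r-x,"
    f"default:group:{group_id}:r-x"
)
finance_dir.set_access_control(acl=acl)

# Existing child items are not updated retroactively; for those, the SDK also
# offers set_access_control_recursive on the directory client.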

Security Group

As a best practice, ACLs should be assigned to Azure AD security groups rather than individual users or service principals. Users and Service Principals can then be efficiently added and removed from groups in the future without the need to reapply ACLs to an entire directory structure.

Granting or revoking access to directories and files becomes easy, as users just need to be added to or removed from the security groups: a dedicated ACL entry per user is no longer required, and granting or revoking access no longer means reapplying ACLs to directories or files. Note that changes to ACLs can take time to propagate when there is a large amount of data.

Conclusion

Designing a data lake has many phases. The first is to determine the purpose of the data lake and the use cases it will address. Clarity on this aspect is critical for the remaining phases, including the Azure resource structure for the data lake and the security and data access patterns. It is always important to consider current and future ingestion and consumption patterns, the consumers of the data in the lake, and their security and governance requirements. To avoid turning the data lake into a data swamp, a governance process should be put in place to ensure that all existing and new data sources and datasets go through a verification process confirming adherence to the data lake design. The design document should define all the patterns the data lake caters for and the qualification criteria for each pattern.

Muhammad Sunil Bokhari – Head of Data and Analytics