Data is king in today’s world: companies need data to gain insights and an edge over their competitors. A common challenge is that there is simply too much of it, and companies do not know where to start. Data can be spread across various disconnected systems and technologies, from databases to spreadsheets to file systems. To make matters worse, the formats vary widely, mixing structured and unstructured data, which is another challenge!
To have any hope of getting this data together and analyzing it, a centralized solution is required, and this is where the concept of a Data Lake comes in.
A Data Lake is often defined as:
“ .. a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.”
A data lake offers many advantages, such as:
- Improved insight into customers, allowing companies to focus their strategies better.
- The ability to test products using huge amounts of historical data: companies can gauge a product’s chances of success by checking historical data via analysis tools.
- Reduced costs, as a Data Lake enables companies to identify inefficiencies in their operational processes.
- Empowered Machine Learning and A.I. systems, which require large amounts of data to train their models.
How do Data Lakes work?
Before we get into securing a Data Lake, let us take a look at its essential components:
- Data movement and ingestion: A Data Lake consumes and ingests data from various sources in real time, and this process is usually automated without human intervention. These data sources can include mobile devices, apps, databases, and even social media analytics!
- Data Storage: The data is stored in massive data stores and tagged, cataloged, and indexed so it can be easily queried later.
- Data Analytics: The key benefit of a data lake is that massive amounts of data can be analyzed across the company by using a variety of tools to gain deep insights into how the company is performing. Additionally, machine learning tools can leverage this massive data to automate decision-making and provide recommendations to management.
Due to the massive amount of processing and storage required for data lakes to function properly, cloud environments are typically best suited for such solutions. A data lake can be hosted on-prem; however, the cost and logistics can be prohibitive for most companies. Additionally, cloud providers offer managed Data Lake services in which most of the complexity is hidden away from the customer, who can instead focus on data ingestion and analysis, which can be a huge time saver.
Let us now look at the security aspects of Cloud Data Lakes. We will use AWS as an example, as it is the most widely used platform; however, most of these suggestions can be applied to any cloud environment.
AWS Cloud Data Lake Security
The key driving force behind any data lake is, of course, data, and this is where security efforts should be focused. Data should be secure across all touchpoints, from ingestion to processing to storage, so that no unauthorized access can occur.
Cloud security professionals should begin with a risk assessment of the data lake environment to identify the entry points where data is being ingested. Special care must also be given to what type of data is being stored: because data lakes can ingest raw data without any formatting required, sensitive information such as PII might get stored without the company’s knowledge. From an AWS perspective, the concept of availability zones should also be understood, as data lakes typically store data in S3, which spreads it across multiple availability zones for maximum redundancy.
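To make the PII concern concrete, a risk assessment can be backed by a simple scan of raw records before they land in the lake. The sketch below is purely illustrative — the patterns, field contents, and the `find_pii` helper are assumptions, not part of any AWS service:

```python
import re

# Hypothetical sketch: flag raw records that contain common PII patterns
# (email addresses, US SSNs) before they are stored in the lake.
# The patterns and record format here are illustrative assumptions.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(record: str) -> list[str]:
    """Return the names of the PII patterns found in a raw record."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(record)]

# Example: a raw ingested record with an embedded email address
hits = find_pii("user=jane, contact=jane@example.com")
```

In practice, a scan like this would run in the ingestion pipeline so that flagged records can be quarantined or tagged before reaching storage.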
Once a high-level risk model has been generated, the next focus has to be on access control. Only authorized staff should be able to view, access, and modify aspects of the cloud data lake, and the principle of least privilege has to be enforced across the board.
In an AWS Data Lake, S3 is used as the primary storage component, and it has a rich array of features to protect the stored data against unauthorized modification or access, such as access controls, KMS encryption, versioning, and bucket-based policies.
Let us look at a few in detail:
Access Control: All resources in S3, such as buckets and objects, are private by default, allowing only the owner to access them. Permission has to be explicitly granted via a resource policy or a user policy; usually, a combination of the two is used to give access to users and services. Additionally, other AWS accounts can be granted cross-account access via bucket policies, which can be fine-tuned to allow access by operation, user, resource, time frame, etc. As a cloud security professional, you should dive deep into user, resource, and bucket policies so that the principle of least privilege can be enforced on the stored data via the AWS Identity and Access Management (IAM) service. As per AWS recommendations, user policies should be used for data lake environments so that access is tied to user roles and permissions.
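As a minimal sketch of what least privilege looks like in policy form, the snippet below builds a bucket policy document that grants a single role read-only access to one prefix. The bucket name, role ARN, and prefix are hypothetical placeholders; in practice the resulting JSON would be attached to the bucket via the S3 console, CLI, or API:

```python
import json

# Illustrative sketch: a least-privilege bucket policy granting one
# analytics role read-only (GetObject) access to a single prefix.
# The bucket name and role ARN below are hypothetical.
def read_only_policy(bucket: str, role_arn: str, prefix: str) -> str:
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AnalyticsReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["s3:GetObject"],
            # Scope the grant to one prefix, not the whole bucket
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    }
    return json.dumps(policy)

policy_json = read_only_policy(
    "example-data-lake",
    "arn:aws:iam::123456789012:role/AnalyticsRole",
    "curated")
```

Cross-account access works the same way: the Principal simply names a role or account outside the bucket owner’s account.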
Encryption: Once access control is finalized, cloud security professionals should enforce encryption so that defense in depth is built into the AWS data lake. This ensures that even someone who gains unauthorized access to the data lake will not be able to view the data, which is crucial as most Data Lakes store sensitive data such as PII or cardholder information. An AWS Data Lake can be secured using the KMS service, which uses encryption keys to secure the underlying data. KMS is a fully managed service that is natively integrated with AWS S3 and simplifies all encryption activities, such as the creation, rotation, and auditing of encryption keys; you can also track the usage of encryption keys via AWS CloudTrail. As a cloud security professional, you can also specify whether to use server-side encryption (S3 encrypts upon storage) or client-side encryption (data is encrypted before it is stored in S3). AWS best practice is to combine both and use AWS KMS to manage the keys for the highest level of protection.
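To illustrate the server-side option, the sketch below assembles the request parameters that ask S3 to encrypt an object at rest with a customer-managed KMS key. The bucket, key, and KMS key ID are hypothetical; in practice the dict would be passed to boto3’s `s3.put_object`, which is why no AWS call is made here:

```python
# Sketch of the request parameters that enforce SSE-KMS on upload;
# in a real pipeline these would be passed to boto3's s3.put_object
# (e.g. s3.put_object(**params)). The bucket, object key, and KMS
# key ID below are hypothetical placeholders.
def sse_kms_put_params(bucket: str, key: str, body: bytes,
                       kms_key_id: str) -> dict:
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Ask S3 to encrypt the object at rest with a customer-managed KMS key
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }

params = sse_kms_put_params(
    "example-data-lake", "raw/events.json", b"{}",
    "1234abcd-12ab-34cd-56ef-1234567890ab")
```

A bucket-level default encryption rule can enforce the same setting even when an individual upload omits these parameters.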
Data Tagging: Given the different types of data stored and accessed across a company, we need a way of tracking it properly. One of the best controls AWS provides is tagging, which allows us to categorize and label stored data and associate controls with each tag. For example, any data that contains PII or PCI-DSS-related data can automatically receive a “data:sensitive” tag, and further controls are enforced based on it. If any AWS user tries to access this data, IAM can automatically enforce permissions based on this tag.
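As a sketch of tag-based enforcement, the snippet below builds an IAM policy statement that denies GetObject on any object carrying the data=sensitive tag, using S3’s `s3:ExistingObjectTag` condition key. The bucket name and statement layout are illustrative assumptions:

```python
# Illustrative sketch of an IAM policy statement that denies reads of
# any object tagged data=sensitive, via the s3:ExistingObjectTag
# condition key. The bucket name below is a hypothetical placeholder.
def deny_sensitive_statement(bucket: str) -> dict:
    return {
        "Sid": "DenySensitiveReads",
        "Effect": "Deny",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
        "Condition": {
            # Match objects whose "data" tag has the value "sensitive"
            "StringEquals": {"s3:ExistingObjectTag/data": "sensitive"}
        },
    }

stmt = deny_sensitive_statement("example-data-lake")
```

An explicit Deny like this overrides any Allow, so even broadly privileged roles are blocked from tagged objects unless the statement is scoped to exclude them.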
AWS Lake Formation
If you are serious about data lakes on AWS, then one service that might be of interest is Lake Formation, a fully managed data lake service by AWS. It automates many of the tasks associated with building a data lake and provides a central location for configuring granular data access policies. It is always recommended to use managed services as much as possible so that most operational activities are moved to the cloud provider, and the company can focus on security and business value. Lake Formation makes it easy to enforce the best practices discussed in the previous section.
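To show what granular access configuration looks like in practice, the sketch below assembles the arguments for Lake Formation’s GrantPermissions API, giving one analyst role SELECT on a single table. The role ARN, database, and table names are hypothetical; in practice the dict would be passed to boto3’s `lakeformation.grant_permissions`:

```python
# Sketch of the arguments for Lake Formation's GrantPermissions API,
# granting an analyst role SELECT on one catalog table; in practice
# this dict would be passed to boto3's lakeformation.grant_permissions
# (e.g. client.grant_permissions(**grant)). All names are hypothetical.
def select_grant(role_arn: str, database: str, table: str) -> dict:
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        # Least privilege: SELECT only, no ALTER/DROP/INSERT
        "Permissions": ["SELECT"],
    }

grant = select_grant("arn:aws:iam::123456789012:role/AnalystRole",
                     "sales_db", "orders")
```

Centralizing grants like this in Lake Formation keeps data permissions in one place instead of scattering them across S3 bucket policies and IAM user policies.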
A data lake is a huge advantage for companies, giving them the ability to exploit the massive amounts of data they have stored and gain insights from it. We looked at how AWS enables us to secure these data lakes, and at the rich features available to protect this data from internal and external threats while enabling business productivity.