Designing and Deploying an AWS Data Lake Formation Workflow
eSage Group performed an initial research phase to understand the business and technical requirements, the potential impact of the move and to evaluate the appropriate technologies to retire Alooma.
Key points from the initial research:
- Source database is a MariaDB hosted in Rackspace, but the database engine or the hosting could change in the future.
- ~300 tables needed to be in sync or replicated from MariaDB to Redshift.
- New tables could be added at any time.
- Schema of existing tables could change.
- All tables should be set to sync at least once a day and only one set of tables should be sync’d every 15 min, during workdays.
- Several tables have no “last updated at” column or an identity column.
- Alooma keeps raw data in Redshift which is difficult to consume for analysis and it’s expensive.
- Raw data or a full copy of the tables should be maintained in S3 instead of Redshift.
- SQL experts and users with less SQL experience consume these tables.
- Limit user access to only data that is relevant for their role. ie. Manufacturing team can’t access Finance data.
eSage Group selected AWS Lake Formation to replace Alooma. Lake Formation creates Glue workflows that crawl source tables, extracts the data, and loads it to S3. You can define JDBC connections to allow Lake Formation to access your AWS databases and on-premises databases including Oracle, MySQL, Postgres, SQL Server and MariaDB. Once your data is in your S3 data lake, you can define access policies, including table and column level access controls and enforce encryption for data at rest. You can then use a wide variety of AWS analytic and machine learning services to access your data lake as well as third-party business applications, such as Tableau and Looker. All access is secured, governed, and auditable.
An AWS Lake Formation workflow was created to perform incremental loads, although Lake Formation has some limitations that required custom logic.
- AWS Glue Jobs keep track of the data loaded during the incremental loads. Thus creating a Workflow for the tables with high-frequency load and one for the daily load is not an option. Because the Glue Job keeps bookmarks of the loaded data independently, this method would create duplicate data because some tables are in both types of Jobs.
- By default, Lake Formation crawls all tables in the source database, even if the Job will load just a few of them. Crawling all the source tables during the high-frequency loads causes unwanted latency.
- There’s no option to add new tables to the workflow once it’s created.
- Glue Jobs fail when more than one table starts with the same name. e.g. customer and customer_address.
To address Lake Formation limitations, we built a custom Python script to be executed via an AWS Lambda function. The Lambda function was triggered by a CloudWatch event depending on the time of the day that the load is needed–either daily or every 15 min. Depending on the type of load, the Python script configured the Glue Jobs at runtime by setting the tables that needs to be loaded, turning on/off crawler, and other necessary tasks based on a configuration file stored in S3.
- Reduced Redshift cluster footprint, raw data in S3.
- AWS cloud-native solution, no need for third-party tools like Alooma.
- Enhanced security by limiting users’ access.
- Centralized location of raw data.
- Provided self-service access to data.
- With Lake Formation we built a data catalog that describes the different data sets that are available, making it easier for analysts and data scientists to use their preferred analytics service. They can now use Redshift Spectrum, Amazon Athena etc.
- AWS Lake Formation Workflow with custom AWS Glue Jobs.
- Data Lake in AWS S3.
- Data Catalog.