AWS touts a "zero-ETL" future but has more than a little work to do
Aurora + Redshift + Spark...
Extract, Transform, Load (ETL) jobs – combining data from multiple sources into a large, central repository – can be a colossal headache, even between AWS services. At a keynote on Tuesday, AWS CEO Adam Selipsky announced a “zero-ETL” drive to help tackle that pain point, including a preview that will enable “near real-time analytics and machine learning using Amazon Redshift on petabytes of data from Aurora.”
The big idea: Obviating the need to build and maintain data pipelines for ETL operations; connecting the two services so that “within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift.” (Aurora is a managed relational database service; Redshift is Amazon’s widely used data warehouse.)
This, said Selipsky, “helps solve one of the greatest ETL pain points for our customers”. (That two AWS services interacting sub-optimally was one of the greatest customer ETL pain points is something it would be rude to dwell on.)
AWS eyes zero-ETL future, new Redshift integration for Apache Spark
AWS also announced a new Redshift integration for Apache Spark, designed to help developers build and run Apache Spark applications on Amazon Redshift data. Amazon EMR (a cloud big data platform for running large-scale distributed data processing jobs) is also integrated into this set of “zero-ETL” releases.
For Amazon EMR 6.9, the integration is available across all three deployment models for EMR: EC2, EKS, and Serverless. AWS says customers can use these new services to build applications that write directly to Redshift tables as part of their ETL workflows, or to combine data in Redshift with data from other sources.
Developers can load data from Redshift tables to Spark data frames or write data to Redshift tables: “Developers don’t have to worry about downloading open source connectors to connect to Redshift.”
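For a rough idea of what that looks like in practice, here is a minimal PySpark sketch. It assumes the data source name and option names used by the open source spark-redshift community connector ("url", "dbtable", "tempdir", "aws_iam_role"); whether the bundled EMR integration accepts exactly the same settings should be checked against AWS's documentation. The helper functions, cluster endpoint, bucket, role ARN and table name are all hypothetical placeholders.

```python
# Sketch only: the format string and option names follow the open source
# spark-redshift community connector and are an assumption here, not a
# confirmed description of the EMR integration. All values are placeholders.

REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_options(jdbc_url, temp_dir, iam_role_arn, table):
    """Build the connector options for a single Redshift table."""
    return {
        "url": jdbc_url,               # JDBC endpoint of the Redshift cluster
        "dbtable": table,              # table to read from or write to
        "tempdir": temp_dir,           # S3 location used to stage data
        "aws_iam_role": iam_role_arn,  # role Redshift assumes to reach S3
    }


def read_table(spark, options):
    """Load a Redshift table into a Spark DataFrame."""
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()


def write_table(df, options, mode="append"):
    """Write a Spark DataFrame to a Redshift table."""
    df.write.format(REDSHIFT_FORMAT).options(**options).mode(mode).save()


# Hypothetical example values; none of these resources exist.
example_options = redshift_options(
    jdbc_url="jdbc:redshift://example-cluster.example.com:5439/dev",
    temp_dir="s3://example-bucket/redshift-staging/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    table="public.sales",
)
```

On an EMR cluster with an active SparkSession named spark, read_table(spark, example_options) would return a DataFrame, and write_table(df, example_options) would append one back to the same table.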
AWS, said Selipsky, is working aggressively to optimise end-to-end data strategy (from ingesting, to storing, to querying data) for customers, some of whom, like Pinterest, have over an exabyte of data on S3.
Many will need to wait to take this for a spin: Amazon Aurora zero-ETL integration with Amazon Redshift is only available in limited preview (for Amazon Aurora MySQL 3 with MySQL 8.0 compatibility) in US East.
Amazon Redshift integration for Apache Spark, however, is now available in all regions where Amazon EMR, Amazon EMR on EKS and Amazon EMR Serverless are available. That will let developers build applications that write directly to Redshift tables as part of ETL workflows, or combine data in Redshift with data from other sources.
AWS says that its Redshift integration for Apache Spark enables applications on Amazon EMR that access Redshift data to run up to 10x faster than with existing Redshift-Spark connectors. It also supports pushing down relational operations such as joins, aggregations, sorts and scalar functions from Spark to Redshift to improve query performance.