Introducing Apache Hop: Matt Casters gets the Kettle team back together...
Kettle gets forked, dismantled, rebuilt for 2021.
Not every open source project in the Apache Software Foundation (ASF) incubator has the established history of Apache Hop: its genesis came 20 years ago as the Extract, Transform, Load (ETL) tool Kettle.
Reborn recently as an ASF project, Apache Hop includes a graphical user interface (GUI) editor for building data pipelines and workflows -- allowing users to build complex ETL jobs without writing any code.
(While the project has roots going back decades, it is being thoroughly reworked for the workflow needs that data professionals have in 2021. What started as a Kettle fork is being turned into a new creature entirely, with a lightweight architecture that supports Kafka natively and runs on Spark, Flink and Google Dataflow via Apache Beam.)
Apache Hop also features a standalone CLI utility and "Hop Server", a web container that executes pipelines and workflows on a remote server, with a REST API to invoke them remotely.
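For a sense of what that remote API looks like in practice, here is a minimal sketch of polling a Hop Server from Python. It assumes a server running locally on port 8080 with the default basic-auth credentials and a `/hop/status/` endpoint; check your own hop-server configuration, as these details may differ from install to install.

```python
# Minimal sketch: query a running Hop Server over its REST API.
# Assumptions: the server listens on localhost:8080, uses the default
# basic-auth credentials, and exposes a status endpoint at /hop/status/
# -- verify these against your own hop-server configuration.
import requests

HOP_SERVER = "http://localhost:8080"   # assumed host/port
AUTH = ("cluster", "cluster")          # assumed default credentials


def server_status() -> str:
    """Fetch the server status document (XML) listing known pipelines and workflows."""
    resp = requests.get(f"{HOP_SERVER}/hop/status/", params={"xml": "Y"}, auth=AUTH)
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    print(server_status())
```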
Common use cases include:
- Loading large data sets into databases
- Combining data from relational databases, files, and NoSQL databases such as Neo4j, MongoDB and Cassandra
- Data migration between different databases and applications
- Data profiling and data cleansing
It has been designed to work in "any scenario, from IoT to huge volumes of data, on-prem, in the cloud, on a bare OS or in containers and Kubernetes". Apache Hop plays well with Azure, Apache Cassandra, Neo4j and Google.
Among Apache Hop's leading lights is Matt Casters, who is currently Chief Solutions Architect at Neo4j, one of the world's leading graph database platforms. He's also the original author of Kettle and now a key member of the small but active Apache Hop team. (When we checked in on the GitHub repo, there had been 2,241 commits, the most recent just hours ago. The project is very much alive and kicking.)
Casters told The Stack: “Apache Hop originated over 20 years ago when I started the software under the name ‘Kettle’. Back then I simply wanted to give business intelligence professionals such as myself a flexible way to manipulate data. In 2006, we joined forces with Pentaho, which in turn was acquired by Hitachi Vantara. A number of years ago a group of people from the Kettle community started to clean up this old codebase, which meant that we re-wrote all of the tools and completely reworked the software architecture. To completely open up the development to anyone, we joined the ASF incubator program.”
Apache Hop: Over 250 plugins.
Apache Hop, short for Hop Orchestration Platform, is an open source, metadata-infused, flexible and sturdy data orchestration and data engineering platform that ships with over 250 plugins.
Commenting on the huge bank of plugins, Casters revealed, “The data orchestration landscape has become quite complex. We believe that Apache Hop needs to play well with all systems and needs to be capable of fitting into that digital ecosystem instead of the other way around. To do this you indeed need a lot of flexibility for example to implement the last missing piece of custom-built software or file format. This flexibility is provided in part by the plugin system but also by the toolset, the libraries, a simple programming API, documentation, docker containers, flexible execution engines to support Apache Spark, Apache Flink, GCP DataFlow and so on. We'll continue to make it as easy as possible for you to get your work done.”
Apache Hop can be integrated with a range of existing architectures, running in the cloud or on-premises. Data can be processed in batches, as streams, or in a hybrid batch/streaming model, and the open source tool can be used to support a range of front-end tasks.
Apache Hop GUI
Data professionals use Apache Hop's GUI, essentially a drag-and-drop interface, to build, run, edit, preview and debug workflows and pipelines. These can also run on Apache Spark, Apache Flink, Google Dataflow and AWS EMR through Apache Beam.
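As a rough illustration of how a pipeline built in the GUI might then be launched on one of those engines, the sketch below shells out to Hop's command-line runner from Python. The `hop-run.sh` script and its `--file`/`--runconfig` flags reflect Hop's CLI runner, but the project name, pipeline path and "DataflowRunConfig" run configuration are hypothetical placeholders for whatever is defined in your own Hop environment.

```python
# Minimal sketch: launch a Hop pipeline on a remote engine via the CLI runner,
# driven here with Python's subprocess module. The project name, pipeline file
# and run configuration below are hypothetical -- substitute the ones defined
# in your own Hop environment.
import subprocess

result = subprocess.run(
    [
        "./hop-run.sh",
        "--project", "my-project",               # hypothetical project name
        "--file", "pipelines/load_orders.hpl",   # hypothetical pipeline file
        "--runconfig", "DataflowRunConfig",      # hypothetical Beam/Dataflow run config
    ],
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
```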
The Apache Hop GUI also plugs into Neo4j, a native graph database platform that deals not only with data but also with the relationships between that data.
Casters tells The Stack: “Apache Hop is indeed one of the first tools to deliver an extensive set of plugins for Neo4j graph databases and we do allow you to update complex graphs.
"Not only that, but we realized that complex data orchestration architectures are very graph-y in nature as well; so providing execution lineage, easy error tracing and so on are provided automatically on top of Neo4j, if you run your workflows and pipelines with Hop.”
What's the plan for the project? "The near-term goals are to get a rock-solid 0.99 release out of the door which can then be tested further by the larger community," Casters says modestly. "A release of version 1.0 should be the result of that not too long after. Graduation as a top-level Apache project is also important to us. After the 1.0 release we have a long list of cool enhancements in mind, in a lot of areas like improving the execution-preview-debug experience of users. Fixing the last couple of small issues in Hop Web will also be a priority for us. After that we're thinking of pluggable field-level expressions, extra GUI plugin features, dockable dialogs, further cloud integration and much more."
Learn more and download Apache Hop at hop.apache.org.