The data plumbing for real-time machine learning is becoming more accessible

“It extracts the most relevant data so you only compute what’s necessary”

Real-time Machine Learning (RTML) is everywhere these days; there is even a pretty good chance you will use it today. It is that minor miracle where you are recommended the right product without having searched for it. Or the satisfaction you feel when you get an alert to say that the car you booked is around the corner and will take you off the cold, rain-drenched streets in two minutes. Across enterprises that take customer-centricity seriously, the technology ingredients needed for this special sauce are understandably in increasingly high demand.

DataStax’s Chief Product Officer Ed Anuff puts it like this: “RTML is anything making recommendations or predictions as the data is happening. You perceive it all the time in Uber telling you how long the driver is going to be; in package tracking, product recommendations… these are all forms of retail at the point of interaction.

“With RTML, you have this set of data coming in with prediction scores. The data you’re running off becomes the most important thing. [To make this work] you need large sets of event data with some time element to it, such as your Uber was there five minutes ago, and now it’s five minutes away. If you turn that data into a format your ML models can process, then you can get a result from that in terms of the value, distance or price.”
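Concretely, "turning that data into a format your ML models can process" typically means aggregating an entity's event history into a flat feature vector. The sketch below is purely illustrative (the event schema and function names are our own, not DataStax's or Kaskada's API), but it shows the shape of the problem: time-stamped events in, point-in-time features out.

```python
from datetime import datetime

# Hypothetical event stream: (entity_id, event_time, location) tuples.
events = [
    ("driver-42", datetime(2023, 3, 1, 12, 0), "downtown"),
    ("driver-42", datetime(2023, 3, 1, 12, 5), "riverside"),
    ("driver-7",  datetime(2023, 3, 1, 12, 3), "airport"),
]

def features_as_of(events, entity_id, now):
    """Turn raw events into a flat feature vector a model can consume."""
    history = sorted(
        (e for e in events if e[0] == entity_id and e[1] <= now),
        key=lambda e: e[1],
    )
    if not history:
        return None
    _, last_time, last_loc = history[-1]
    return {
        "minutes_since_last_seen": (now - last_time).total_seconds() / 60,
        "event_count": len(history),
        "last_location": last_loc,
    }

now = datetime(2023, 3, 1, 12, 10)
print(features_as_of(events, "driver-42", now))
# {'minutes_since_last_seen': 5.0, 'event_count': 2, 'last_location': 'riverside'}
```

The "time element" Anuff describes is exactly the `minutes_since_last_seen` style of feature: it only makes sense if every event carries a timestamp and features are computed relative to a moment in time.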

DataStax already provides critical plumbing for many organisations running data-centric workloads, from storage through to streaming. Earlier this year it also agreed to buy Kaskada, a Seattle company that markets itself as “the first feature engine with time travel”. More practically, Kaskada cracked the code on managing and storing event-based data in order to train behavioural ML models and deliver insights or nudges in near real-time.
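The "time travel" claim refers to point-in-time correctness: when building a training set from event data, each example's features must be computed as of the moment being predicted, or later events leak into training and inflate accuracy. A minimal illustration of the idea, with made-up data (this is not Kaskada's query language):

```python
# Each purchase event: (user, timestamp, amount). For a training example
# labelled at time t, features must use only events at or before t --
# summing the full history would leak future information.
purchases = [
    ("alice", 1, 10.0),
    ("alice", 3, 25.0),
    ("alice", 7, 5.0),
]

def spend_as_of(purchases, user, t):
    """Total spend for `user` using only events at or before time `t`."""
    return sum(amt for (u, ts, amt) in purchases if u == user and ts <= t)

# A training example labelled at t=4 sees only the first two purchases,
# not the one at t=7.
print(spend_as_of(purchases, "alice", 4))   # 35.0
print(spend_as_of(purchases, "alice", 10))  # 40.0
```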

Real-time machine learning

So, what makes DataStax and Kaskada a match? It’s back to that time-based data and where it sits…

“All that data sits within a database and that’s where Cassandra has historically played,” says Anuff.

“Uber uses Cassandra, Netflix uses Cassandra, FedEx uses Cassandra because it can handle limitless scale and all these events you need to store, and that’s what ML needs. These customers have very large amounts of data on the internet but they need some tools to make it usable within the ML algorithm. We looked around and that’s what Kaskada specialises in. It extracts the most relevant data so you only compute what’s necessary, then it maps it very efficiently and that reduces your cost significantly,” he tells The Stack.
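One way to read "only compute what's necessary" is incremental aggregation: update per-entity aggregates as each event arrives, rather than rescanning the full event history every time a feature is requested. The sketch below is our own illustration of that cost argument, not a description of Kaskada's internals:

```python
from collections import defaultdict

class RunningFeatures:
    """Incrementally maintained per-entity aggregates: each new event
    costs O(1), instead of a full rescan of the event history."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, entity, amount):
        self.count[entity] += 1
        self.total[entity] += amount

    def features(self, entity):
        n = self.count[entity]
        return {
            "count": n,
            "total": self.total[entity],
            "mean": self.total[entity] / n if n else 0.0,
        }

rf = RunningFeatures()
for entity, amount in [("u1", 10.0), ("u1", 30.0), ("u2", 5.0)]:
    rf.update(entity, amount)

print(rf.features("u1"))  # {'count': 2, 'total': 40.0, 'mean': 20.0}
```

At Uber or Netflix scale, the difference between per-event updates and repeated full scans is exactly the compute cost Anuff is pointing at.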

DataStax’s Kaskada acquisition has its genesis in a pivot DataStax made in 2019: “We had decided three years ago that our strategy in the near term was to get Cassandra into the cloud and make DataStax a cloud company. We also knew that we needed to connect with all these great things happening in the AI world,” says Anuff.


That subsequent evolution allows DataStax to support clients across a tranche of use cases: “anytime you have those streams of data where things happen and you want to guess what happens next.”

That covers the retail, commerce and search examples noted above, but also cybersecurity, compliance and governance, where suspicious actions could be flagged faster and more efficiently.

“You may not know what those signals are,” says Anuff, “but ML can detect things humans can’t and if the probability score indicates better than, say, a 5% chance that an activity is fraudulent, you can check.”
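In code, that kind of check is just a threshold over the model's probability score. The model below is a stubbed-out stand-in and the 5% figure is Anuff's example; everything else here is a hypothetical sketch:

```python
FRAUD_THRESHOLD = 0.05  # flag anything the model scores above 5%

def score_transaction(txn):
    """Stand-in for a real ML model returning P(fraud)."""
    # Toy heuristic purely for illustration.
    return 0.9 if txn["amount"] > 10_000 else 0.01

def review_queue(transactions):
    """Route suspicious transactions to a human or automated check."""
    return [t for t in transactions if score_transaction(t) > FRAUD_THRESHOLD]

txns = [
    {"id": "t1", "amount": 50},
    {"id": "t2", "amount": 25_000},
]
print(review_queue(txns))  # [{'id': 't2', 'amount': 25000}]
```

The "real-time" part is that scoring happens as each transaction arrives, against features kept current by the event pipeline, rather than in an overnight batch.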

Race is off

What next? Integrations are at an early stage, but DataStax already has double-digit POCs with Kaskada underway. Some present this sort of scenario as a race between companies to get to real-time ML and AI, where the winner gains a first-mover advantage and a chance to rule the roost by providing a keystone technology.

Does he see it that way? “It’s going to be necessary for database vendors to be thinking about RTML and incorporating it. The technology itself is changing very radically and it’s very disruptive.

“You can’t get an RTML stack off the shelf; there are vendors who claim they can do it but they’re all assembling it themselves. You have Spark, DIY models or limited high-level ‘batteries included’ products,” Anuff notes to The Stack. He sees “more commonality” in RTML arriving around five to seven years out for most organisations.

For now though, “the most important thing is the data and how you ship your data to ML, and that’s expensive, error-prone and cumbersome. When I look at DIY projects, those are trade-offs.

“People know how to do really accurate models but it’s slow and super-expensive.

“The reason why it’s so hard is they’re trying to reduce the complexity and reduce the costs and that requires all sorts of optimisations to increase accuracy and move faster.

“It’s better to follow data gravity and move ML to the data, which is what we’re doing here.”

DataStax will now add Kaskada to its own cloud services, including Astra DB, built on Apache Cassandra, and Astra Streaming, its event-streaming service built on Apache Pulsar. This, claims DataStax, will give organisations a single environment for delivering applications infused with real-time AI, democratising access to the kind of technology used by firms including Netflix and Uber as a cost-effective managed service built on open source.

Both DataStax and Kaskada have a track record of contributing to open source communities such as Apache Cassandra, Apache Pulsar and Apache Beam. DataStax plans to open source the core Kaskada technology initially and will offer a new machine learning cloud service in the first half of 2023.

Sponsored by DataStax

See also: Using real-time data requires organisational changes