Bloomberg, Man Group team up to develop open source "ArcticDB" database

"Build it or buy it?" That's been answered here...

Bloomberg, Man Group team up to develop open source "ArcticDB" database

Bloomberg and investment management firm Man Group are teaming up to co-develop an open source database dubbed ArcticDB – which Bloomberg will use to underpin its interactive BQuant analytics platform.

The unique agreement will see the two build on Man Group’s existing Arctic project – a high performance datastore for numeric data developed under a GNU LGPL licence since 2012 and used internally by the firm. (London-headquartered Man Group is an active investment management firm with $143 billion AUM.)

“ArcticDB deals with individual data elements spanning hundreds of millions of rows, or hundreds of thousands of columns, powering use cases such as deep tick [market trade] history analysis or modelling of large corporate bond universes” the two said. The joint project, added Bloomberg CTO Shawn Edwards on LinkedIn, will give “clients the ability to process, analyze, and backtest using billions of rows of time series data in seconds…”

Bloomberg CTO Shawn Edwards

Edwards said that ArcticDB will enable “key functionality for BQuant users who need a smooth and scalable way to access (read/write), transform, and version high-volume timeseries data, while they build, test, and deploy quant models" and ships with a “Python-friendly API backed by a new C++ engine.. designed to leverage modern cloud object storage and complement any existing data science tech stack.”

“This initiative also reinforces the ‘open source first’ philosophy my team has been promoting within Bloomberg over the past 15 years, in which our software engineers both use and contribute to the broader open source ecosystem” the Bloomberg CTO added.

See The Stack's earlier interview with Shawn Edwards here.

Arctic is a dataframe database: e.g. follows a table-like data structure and is capable of serialising a range of data types eg. Pandas DataFrames, Numpy arrays, Python objects so users don't have to handle different datatypes manually. (Pandas is a powerful open source data analysis and manipulation tool, built on Python.)

It was originally built to help Man Group deliver insight from the firehose of tick data (trade pricing data with details of bid and ask) and alternative data: As James Blackburn, CTO, Alpha Platform at Man Group, said some years back: “Researchers care about accessing data quite quickly, from a cluster of machines, and distributing that to do their simulations... this data is [typically in] a different shape. So if you're trading, say futures or or forwards versus options, or cash equities, the data is structured, but it isn't all of the same shapes that traditionally finance companies have modelled them in, painstakingly. This actually is quite slow. There's a strong impedance mismatch between what people want to do with data and how you model it in say, a relational store."

He added: "[It's] a never ending stream of messages. The messages are essentially unstructured documents. [Any two]ticks have different metadata in them [in the order of hundreds of fields] that might change or stay static."

Bloomberg’s BQuant meanwhile, where ArcticDB will be deployed, is an analytics platform for quantitative analysts and data scientists in the financial markets to quickly build, test, and deploy models for alpha generation, risk, and trading. These models often rely on high volume timeseries data that include Bloomberg’s range of multi-asset-class financial and alternative linked datasets, as well as a firm’s own internal data.

Man Group CTO: ArcticDB has no dependency on MongoDB...

Gary Collier, CTO of Man Group Alpha Technology, told The Stack in response to our emailed questions -- starting with what the challenges were with existing alternatives for time series databases in the market -- that "it’s worth clarifying that we think of ArcticDB as a 'DataFrame Database', not just a Time Series database. Time Series data is, of course, a type of data frame and ArcticDB natively supports many Time Series operations, but it deals with them in new ways as well as supporting features like ultra-wide Time Series, which we’ve found is quite rare."

Man Group CTO Gary Collier

He added: "At Man Group we’re heavy users of many open-source data science packages and are always careful to evaluate our internal tools against what’s out there -- either as a commercial or open source offering.

"When we started on the journey to what is now known as ArcticDB many years ago, there were no database offerings that could provide the scalability and integration with Python that our business demanded. Although we’ve seen a lot of growth in time-series database solutions in recent years, we believe that none support modern data science workflows, deal with the size and shape of real world data, or integrate as seamlessly with the Python data science ecosystem as ArcticDB, and that’s why we’ve focused on furthering it. The feature set and client-side architecture enable it to stand out against other products."

There are some unique aspects to working with time series data at scale, Man Group's CTO emphasised.

Coller said: "Frankly, there are a lot of challenges to manage when it comes to working with Time Series data at scale: firstly the size of data sets and the rate of data arrival. In our world, single data sets can easily be tens or hundreds of terabytes, and the arrival rate of market tick data is hundreds of thousands of messages per second.Another factor is that the “shape” of data is constantly evolving and changing, and data itself can be error prone and messy. Historic data can change, numbers can be restated, and it’s crucial to be able to manage this. For example when we’re building trading models, we need to know what the market/issuer/trader behaviour would have been with the data that was available at the time, not the restated data we might see now.

"Don’t forget that once you’ve got the data, there’s a complex value extraction process to be undertaken. Each step of our investment processes – whether alpha extraction, portfolio construction, execution or risk management – typically generates an even larger amount of 'derived' data. These could be features we think are important to model on a standalone basis, or other by-products we believe are important for tracing and explainability. Data this large and complex necessitates industrial scale compute, used efficiently and with minimal bottlenecks. We believe the way we have built ArcticDB allows true horizontal scale-out, and the most efficient use of resources possible. We want to ensure our team members can expend maximum cognitive load on dealing with business problems, rather than thinking about technology or accessing data. Too many technologies require a cognitive mindset shift into a different query language for data access, potentially reducing the productivity of these professionals. ArcticDB allows them to still think in Python and Pandas, even when dealing they’re with the largest data sets."

Predecessor Arctic itself was originally built with open source MongoDB as its underlying backend.

This won't be the case with ArcticDB the Man Group CTO confirmed to The Stack, saying that it is "ArcticDB has no dependency on MongoDB.ArcticDB is compatible with any S3 or S3-like storage backend. S3 is now pretty ubiquitous and hence a common backbone for interacting with the underlying storage, and provides a good starting point for resiliency, security, bandwidth and low latency. We were determined that ArcticDB should be easy to use, offer high performance and be efficient to run; this meant re-writing the database engine from the ground up, all in-house, with Data Frames, key data science operations and workflows as first order concerns."

We'll be sharing more when the ArcticDB GitHub repository is available.

Follow The Stack on LinkedIn