AI data storage strategies: A planetary and CTO challenge


In terms of its likely impact on the technology sector and society in general, AI can be likened to the introduction of the relational database: the spark that ignited a widespread appreciation for large data sets, resonating with both end users and software developers. With exploration of its capabilities having gone into overdrive, it is time for CTOs and IT teams to consider the wider implications for their stack.

To a certain extent, AI and ML can be viewed in the same terms, as they provide a formative foundation for not only building powerful new applications, but also enhancing and improving the way we engage with groundbreaking technology alongside large and disparate datasets. We’re already seeing how these developments can help us solve complex problems much faster than was previously possible.

Understanding AI data storage challenges


To understand the challenges that AI presents from a data storage perspective, we need to go back to first principles: Any machine learning capability requires a training data set. In the case of generative AI, the data sets need to be very large and complex, including different types of unstructured data. Generative AI also relies on complex models, and the algorithms on which it is based can include a very large number of parameters that it is tasked with learning.

The greater the number of features, and the greater the size and variability of the anticipated output, the larger the data batch size and the number of epochs needed in the training runs before inference can begin.
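As a rough illustration of how batch size and epochs translate into work for the storage layer, the sketch below counts optimiser steps and the total volume of data read. The sample count, batch size, epoch count and data set size are hypothetical figures chosen for the arithmetic, and it assumes every epoch re-reads the full data set from storage with no caching.

```python
import math


def training_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimiser steps needed to pass over the data set `epochs` times."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return epochs * steps_per_epoch


def terabytes_read(dataset_tb: float, epochs: int) -> float:
    """Total data read, assuming each epoch re-reads the whole data set."""
    return dataset_tb * epochs


# Hypothetical workload, not taken from the article.
NUM_SAMPLES = 1_000_000
BATCH_SIZE = 256
EPOCHS = 3
DATASET_TB = 2.0

steps = training_steps(NUM_SAMPLES, BATCH_SIZE, EPOCHS)
read_tb = terabytes_read(DATASET_TB, EPOCHS)
print(f"{steps} steps, {read_tb} TB read from storage")
```

With these illustrative numbers the run takes 11,721 steps and reads 6TB, three times the size of the data set itself.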

Generative AI is in essence being tasked with making an educated guess or running an extrapolation, regression or a classification based on the data set. The more data the model has to work with, the greater the chance of an accurate outcome or minimising the error/cost function. Over the last few years, AI has steadily driven the size of these datasets upwards, but the introduction of large language models, upon which ChatGPT and the other generative AI platforms rely, has seen their size and complexity increase by an order of magnitude.

Storage in memory…

The learned knowledge patterns that emerge during the AI model training process need to be stored in memory, which can become a real challenge with larger models. Checkpointing large and complex models also puts huge pressure on the underlying network and storage infrastructure, as the model cannot continue until the internal data has all been saved in the checkpoint. These checkpoints act as restart or recovery points if the job crashes or the error gradient is not improving.
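To see why checkpointing leans so heavily on write performance, consider a back-of-the-envelope sketch. It assumes a synchronous checkpoint (training pauses until the write completes, as described above) and uses hypothetical figures: a 70-billion-parameter model, 14 bytes of saved state per parameter (weights plus optimiser state; the real figure depends on precision and optimiser), and two illustrative write bandwidths.

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def stall_seconds(size_gb: float, write_gb_per_s: float) -> float:
    """Time training is paused while a synchronous checkpoint is written."""
    return size_gb / write_gb_per_s


# Hypothetical model and storage figures, not taken from the article.
NUM_PARAMS = 70e9        # 70 billion parameters
BYTES_PER_PARAM = 14     # assumed: weights plus optimiser state

size_gb = checkpoint_size_gb(NUM_PARAMS, BYTES_PER_PARAM)
for bandwidth in (10.0, 50.0):  # assumed sustained write bandwidth, GB/s
    pause = stall_seconds(size_gb, bandwidth)
    print(f"{size_gb:.0f} GB at {bandwidth:.0f} GB/s: {pause:.1f} s stall")
```

Under these assumptions each checkpoint is roughly 980GB, and the stall drops from about 98 seconds to under 20 as write bandwidth rises from 10GB/s to 50GB/s.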

Given the connection between data volumes and the accuracy of AI platforms, it follows that organisations investing in AI will want to build their own large data sets to take advantage of the opportunities that AI affords. With data volumes increasing exponentially, it’s more important than ever that organisations can utilise the densest, most efficient data storage possible to limit sprawling data centre footprints and the spiralling power and cooling costs that go with them.

This presents another challenge that is beginning to surface as a significant issue: the implications that massively scaled-up storage requirements have for achieving net zero carbon targets by 2030-2040. It’s clear that AI will have an impact on sustainability commitments because of the extra demands it places on data centres, at a time when CO2 footprints and power consumption are already a major issue. This is only going to increase pressure on organisations. The latest GPU servers, for example, consume 6-10kW each, and most existing data centres are not designed to deliver more than 15kW per rack. That represents a large and looming challenge for data centre professionals as GPU deployments increase in scale.
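The rack power arithmetic can be made concrete with a short sketch. It uses the 6-10kW per-server and 15kW per-rack figures quoted above; the 64-server deployment is a hypothetical example, and the calculation ignores the power drawn by networking, storage and other equipment sharing the rack.

```python
import math


def servers_per_rack(rack_budget_kw: float, server_draw_kw: float) -> int:
    """Whole GPU servers that fit within a rack's power budget."""
    return math.floor(rack_budget_kw / server_draw_kw)


def racks_needed(num_servers: int, rack_budget_kw: float, server_draw_kw: float) -> int:
    """Racks required to host `num_servers` under the power budget."""
    per_rack = servers_per_rack(rack_budget_kw, server_draw_kw)
    if per_rack == 0:
        raise ValueError("a single server exceeds the rack power budget")
    return math.ceil(num_servers / per_rack)


RACK_BUDGET_KW = 15.0    # per-rack ceiling quoted in the text
LOW_DRAW_KW = 6.0        # lower end of the quoted per-server range
HIGH_DRAW_KW = 10.0      # upper end of the quoted per-server range
NUM_SERVERS = 64         # hypothetical deployment size

for draw in (LOW_DRAW_KW, HIGH_DRAW_KW):
    fit = servers_per_rack(RACK_BUDGET_KW, draw)
    racks = racks_needed(NUM_SERVERS, RACK_BUDGET_KW, draw)
    print(f"{draw:.0f} kW servers: {fit} per rack, {racks} racks for {NUM_SERVERS}")
```

On the figures quoted above, a 15kW rack holds only one or two such servers, so the hypothetical 64-server deployment would need between 32 and 64 racks.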

Flash optimal for AI

Some vendors are already addressing sustainability in their product design. For example, all-flash storage solutions are considerably more efficient than their spinning disk (HDD) counterparts. At Pure Storage, we’re even going beyond off-the-shelf SSDs, creating our own flash modules, which allow all-flash arrays to communicate directly with raw flash storage. This maximises the capabilities of flash and provides better performance, power utilisation and efficiency.

Flash storage is better suited to running AI projects for reasons beyond the critical area of sustainability too. This is because the key to results is connecting AI models or AI-powered applications to data. Doing this successfully requires support for large and varied data types, streaming bandwidth for training jobs, write performance for checkpointing (and checkpoint restores) and random read performance for inference. Crucially, it all needs to be reliable 24x7 and easily accessible across silos and applications. This set of characteristics isn’t possible with HDD-based storage underpinning your operations; all-flash is needed.

Data centres are now facing a secondary but equally important challenge that will be exacerbated by the continued rise of AI and ML.

That is water consumption, which is set to become an even bigger problem, especially when you take into consideration the continued rise in global temperatures. Many data centres utilise evaporative cooling, which works by spraying fine mists of water onto cloth strips, with the ambient heat being absorbed by the water, thus cooling the air around it. It’s a smart idea but it’s problematic, given the added strain that climate change is placing on water resources, especially in built-up areas. As a result, this method of cooling has fallen out of favour, resulting in a reliance on more traditional, power-intensive cooling methods like air conditioning. This is yet another reason to move to all-flash data centres, which consume far less power and don’t have the same intensive cooling requirements as HDD and hybrid storage.

The road ahead for AI and data storage


As AI and ML continue to evolve rapidly, the focus will increase on data security (to ensure that rogue or adversarial inputs can’t change the output), model repeatability (using techniques like Shapley values to gain a better understanding of how inputs alter the model) and stronger ethics (to ensure this very powerful technology is used to actually benefit humanity). All these worthy goals will increasingly place new demands on data storage. Storage vendors are already factoring this into their product development roadmaps, knowing that CTOs will be looking for secure, high-performance, scalable, efficient storage solutions that help them towards these goals. The focus should therefore not be entirely on the capabilities of data storage hardware and software; the big picture in this case is very big indeed.


Editor's note: This guest piece is by Alex McMullan, CTO, International, Pure Storage and a guest contributor to The Stack. This is not a "paid for" article: We periodically open our doors to columnists and if we find their input thought-provoking, allow them to return to our "Opinions" section. We limit this to four pieces per person. If you'd like to contribute, pitch your ideas to our editor Ed Targett. We cast a broad net; be bold.