Microsoft unveils a large language model that excels at encoding spreadsheets
New LLM has the "potential to transform data management and analysis, paving the way for more intelligent and efficient user interactions."
Microsoft has built a new large language model that might make accountants and data analysts start to feel a little nervous about their future job prospects.
It has released the first details of SpreadsheetLLM - a new model that is "highly effective across a variety of spreadsheet tasks" and, it's claimed, has the "potential to transform spreadsheet data management and analysis, paving the way for more intelligent and efficient user interactions."
After a pre-print paper about the model was quietly released at the end of last week, X started to fill up with jokes warning that "Karen might be out of a job soon".
One user claimed "SaaS is in deep, deep trouble." Another wrote, "It's going to be huge for the finance world."
Ethan Mollick, Associate Professor at the Wharton School of the University of Pennsylvania, tweeted: "This is another sign that LLMs are going to be able to work with structured & unstructured spreadsheet data soon. This will unlock a lot of use cases (projections, financials, valuations, etc.) and having a spreadsheet source of truth will tend to lower hallucinations."
So far, LLMs have been ill-equipped to deal with spreadsheets, which are "characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs)," Microsoft's team wrote.
"In response, we introduce SpreadsheetLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs’ powerful understanding and reasoning capability on spreadsheets," it announced.
Tackling tokens: A new approach to spreadsheets
One of the problems with applying LLMs to spreadsheets is that they get bogged down by too many tokens (the basic units of text a model processes). To tackle this, Microsoft developed SheetCompressor, an "innovative encoding framework that compresses spreadsheets effectively for LLMs."
"It significantly improves performance in spreadsheet table detection tasks, outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting," Microsoft added.
SheetCompressor is made up of three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation.
The first of these modules places "structural anchors" throughout the spreadsheet to help the LLM better understand its layout. It then removes "distant, homogeneous rows and columns" to produce a condensed "skeleton" version of the table.
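As a rough illustration of the skeleton idea, the Python sketch below assumes the anchor rows are already known and keeps only the rows that sit within k rows of an anchor. The function names, the sample grid and the value of k are illustrative assumptions rather than details from Microsoft's paper, and columns could be handled in the same way.

```python
def rows_near_anchors(n_rows, anchors, k):
    """Return the indices of rows that lie within k rows of any anchor row."""
    return [r for r in range(n_rows) if any(abs(r - a) <= k for a in anchors)]


def skeleton(grid, anchors, k=1):
    """Keep only rows close to an anchor; distant rows are dropped."""
    return [grid[r] for r in rows_near_anchors(len(grid), anchors, k)]


# Hypothetical sheet: a header row, repetitive data rows, and a total row.
grid = [
    ["Region", "Sales"],
    ["North", "100"],
    ["North", "110"],
    ["North", "120"],
    ["North", "130"],
    ["South", "90"],
    ["Total", "550"],
]

# Assume the header (row 0) and the total (row 6) were picked as anchors.
compact = skeleton(grid, anchors=[0, 6], k=1)
for row in compact:
    print(row)
```

Here the three middle rows, which look much like their neighbours, are dropped, while the rows around the header and the total survive.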
The inverse index translation module addresses the challenge posed by spreadsheets with numerous empty cells and repetitive values, which use up too many tokens.
"To improve efficiency, we depart from traditional row-by-row and column-by-column serialization and employ a lossless inverted index translation in JSON format," Microsoft wrote. "This method creates a dictionary that indexes non-empty cell texts and merges addresses with identical text, optimizing token usage while preserving data integrity."
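The dictionary Microsoft describes can be pictured with a minimal Python sketch. It assumes the sheet is available as a mapping from cell address to text, which is a simplification chosen for illustration. The function names are hypothetical, and the sketch lists addresses individually rather than merging them into ranges.

```python
import json
from collections import defaultdict


def inverted_index(cells):
    """Map each non-empty cell text to the list of addresses that hold it."""
    index = defaultdict(list)
    for address, text in cells.items():
        if text == "":
            continue  # empty cells are simply left out of the encoding
        index[text].append(address)
    return dict(index)


def restore(index):
    """Rebuild the address -> text mapping for all non-empty cells."""
    return {address: text for text, addresses in index.items() for address in addresses}


# Hypothetical sheet with repeated values and one empty cell.
cells = {
    "A1": "Region", "B1": "Sales",
    "A2": "North", "B2": "100",
    "A3": "North", "B3": "",
    "A4": "South", "B4": "100",
}

index = inverted_index(cells)
print(json.dumps(index))
```

Repeated values such as "North" and "100" appear once as keys, and the empty cell costs nothing. Because `restore` recovers every non-empty cell, no information is lost in this toy version.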
Another obstacle for LLMs comes when adjacent numerical cells share similar number formats.
"Recognizing that exact numerical values are less crucial for grasping spreadsheet structure, we extract number format strings and data types from these cells," Microsoft continued. "Then adjacent cells with the same formats or types are clustered together... streamlining the understanding of numerical data distribution without excessive token expenditure."
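A simplified Python sketch of this clustering step follows. It assumes cell values arrive as strings and guesses a coarse type with regular expressions, whereas the approach Microsoft describes draws on the spreadsheet's own number format strings. The type labels and function names are hypothetical.

```python
import re
from itertools import groupby


def cell_type(value):
    """Guess a coarse data type for one cell value (illustrative rules only)."""
    if value == "":
        return "Empty"
    if re.fullmatch(r"-?\d+", value):
        return "IntNum"
    if re.fullmatch(r"-?\d+\.\d+", value):
        return "FloatNum"
    if re.fullmatch(r"-?\d+(\.\d+)?%", value):
        return "Percentage"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "YearMonthDay"
    return "Text"


def aggregate_column(column, values, first_row=1):
    """Merge vertically adjacent cells of the same type into one (range, type) pair."""
    runs = []
    row = first_row
    for kind, group in groupby(values, key=cell_type):
        count = len(list(group))
        start = f"{column}{row}"
        end = f"{column}{row + count - 1}"
        runs.append((start if count == 1 else f"{start}:{end}", kind))
        row += count
    return runs


# Hypothetical column B: a header, three integers, then two percentages.
print(aggregate_column("B", ["Sales", "100", "250", "75", "12.5%", "13.0%"]))
```

Six cells collapse into three entries, so the model sees where the numbers sit and what kind they are without spending tokens on each exact value.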
After conducting a "comprehensive evaluation of our method on a variety of LLMs," Microsoft found that SheetCompressor reduces token usage for spreadsheet encoding by 96%.
Moreover, SpreadsheetLLM shows "exceptional performance in spreadsheet table detection," which is the "foundational task of spreadsheet understanding."
The new LLM builds on the Chain of Thought methodology to introduce a framework called "Chain of Spreadsheet" (CoS), which can "decompose" spreadsheet reasoning into a table detection-match-reasoning pipeline.
"Chain of Spreadsheet, the framework’s extension to spreadsheet downstream tasks illustrates its broad applicability and potential to transform spreadsheet data management and analysis, paving the way for more intelligent and efficient user interactions," Microsoft said.