Why LLMs struggle with your spreadsheet data

Wed, Sep 17, 2025

When not to use Large Language Models (LLMs) for structured data, and what to use instead.

LLMs changed how we work with text. But if your most valuable data sits in rows and columns, such as prices, risk scores, transactions, and inventories, then an LLM is usually the wrong tool for classification or regression on that tabular data. You’ll likely trade away the accuracy, cost-efficiency, and determinism that classic Machine Learning (ML) delivers out of the box.

For many business use cases, like predicting customer churn, assessing credit risk or forecasting sales, traditional ML approaches still deliver better, faster, and more reliable results.

We’ll unpack why LLMs are not the best fit for most tabular data problems, when they might still make sense, and how to choose the right tool for the job with our problem-solving cheat sheet.

What makes tabular data different

Tabular (or structured) data organizes information into rows (records) and columns (features). Each column has a defined meaning, such as “Age,” “Income,” or “Subscription Status,” and each row represents a single instance.

For example, a loan default table might look like this:

| Name  | Age | Income | Default? |
|-------|-----|--------|----------|
| Alice | 45  | 80,000 | No       |
| Bob   | 32  | 30,000 | Yes      |

This structure is explicit and consistent, and that’s exactly what gets diluted when you feed it to an LLM, which sees everything as a linear stream of text.

LLMs read sequences of tokens. To feed a table to an LLM, you serialize it into text (e.g., "Age: 45, Income: 80000, ..."). That step:

  • Blurs the boundary between rows and columns.
  • Makes the model order-sensitive to row/column shuffles that shouldn’t matter.
  • Runs into context-window limits on larger datasets.
  • Risks “bleeding” information from one row into the interpretation of the next.

Classic tabular models (trees/GBMs, linear models) consume rows independently, respect feature types, and scale to millions of rows without prompt gymnastics.
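
To make the contrast concrete, here is a minimal sketch of the two paths (column names and values are illustrative): flattening rows into prompt text for an LLM versus fitting a gradient-boosted model on typed columns directly.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.DataFrame({
    "age": [45, 32, 51, 28],
    "income": [80_000, 30_000, 62_000, 41_000],
    "default": ["No", "Yes", "No", "Yes"],
})

# LLM route: every row must first be serialized into one token stream,
# which is exactly where row/column boundaries start to dissolve.
prompt_rows = [f"Age: {r.age}, Income: {r.income}" for r in df.itertuples()]

# Tabular route: the model consumes typed columns directly, row by row.
X = df[["age", "income"]]
y = (df["default"] == "Yes").astype(int)
model = GradientBoostingClassifier().fit(X, y)
```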

Context window limitations

LLMs have a finite context length. Large tables can exceed this limit, forcing you to truncate or summarize data. Both approaches risk losing important information, while traditional tabular models handle thousands of rows effortlessly.
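
A rough back-of-envelope shows how quickly the ceiling is hit; the ~4 tokens per serialized cell is an assumption, but the order of magnitude is the point:

```python
# Back-of-envelope: tokens needed to serialize a whole table into one prompt.
rows, cols = 100_000, 10
tokens_per_cell = 4  # assumed average, including column name and separators

total_tokens = rows * cols * tokens_per_cell
print(f"~{total_tokens:,} tokens")  # ~4,000,000: far beyond typical context windows
```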

LLMs don’t like messy data

Real-world datasets are rarely perfect. You might have:

  • Missing values (N/A, nan)
  • Inconsistent formats ($40k vs. 40000)
  • Outliers or rare categories

Many traditional ML algorithms have explicit strategies for these issues, from imputation to special handling of “missing” as a category. Tree-based models natively branch on “missing,” while LLMs see “N/A” as just another token unless you hand-engineer prompts.

LLMs, unless carefully instructed, may misinterpret placeholders or treat similar values as unrelated. Robustness is earned through preprocessing and algorithm design, not prompts.
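
As a sketch of what those explicit strategies look like in practice (the parsing rules are illustrative): normalize inconsistent formats once, then let a NaN-aware model branch on missingness natively.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

def parse_amount(raw) -> float:
    """Normalize '$40k', '40,000', 'N/A', etc. into a float (NaN if unknown)."""
    s = str(raw).strip().lower()
    if s in {"n/a", "nan", "none", ""}:
        return np.nan
    s = s.replace("$", "").replace(",", "")
    return float(s[:-1]) * 1_000 if s.endswith("k") else float(s)

incomes = ["$40k", "40,000", "N/A", "62000"]
X = pd.DataFrame({"income": [parse_amount(v) for v in incomes]})
y = [1, 0, 1, 0]

# Histogram-based GBMs in scikit-learn handle NaN natively: missing values
# are routed down a learned branch instead of requiring imputation.
model = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
```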

Encoding bias: numbers aren’t words

To fit structured features into an LLM prompt, you often encode them as text. This creates pitfalls:

  • Numbers are treated as strings, not quantities. When values like “80,000” and “30,000” are converted into tokens, the numerical information is lost: tokenization captures the character sequence, but it does not inherently represent magnitude, relative size, or numerical order.

  • Ordinal encodings (e.g., Red=1, Blue=2) can introduce false hierarchies.

All of this can skew predictions in ways you don’t intend.
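
You can see the string-versus-quantity problem directly with a tokenizer. This sketch uses the open-source tiktoken library; exact splits vary by tokenizer, but the conclusion holds: no token encodes magnitude.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for value in ["80000", "80,000", "$80k"]:
    token_ids = enc.encode(value)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{value!r} -> {len(token_ids)} tokens: {pieces}")

# The same quantity yields different token sequences depending on formatting,
# and no individual token carries magnitude, ordering, or numeric distance.
```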

Pre-existing bias from text training

Since LLMs are trained on massive text datasets, they carry prior associations between words and contexts. When those words appear as features in tabular data, the model may project spurious or irrelevant correlations learned from text, rather than relying solely on the dataset at hand. This differs from specialized models, which only learn patterns directly from the provided data.

Lower accuracy in classification/regression tasks

On purely structured tasks, traditional tabular ML generally outperforms LLM-prompting, while also being faster and cheaper to run. LLM outputs can vary with temperature and slight prompt changes; regulated decisions often require deterministic behavior and auditability.

Across benchmarks, specialized tabular models (like XGBoost, LightGBM, or Random Forests) consistently outperform LLMs on purely structured tasks. And since many LLMs may have been exposed to well-known benchmark datasets during training, the performance gap is likely even greater on unseen, domain-specific data.

Example results:

| Dataset          | Model Type         | Accuracy |
|------------------|--------------------|----------|
| Titanic survival | Random Forest      | 78%      |
| Titanic survival | LLM-based approach | 71%      |
| Credit default   | Random Forest      | 97.5%    |
| Credit default   | LLM-based approach | ~91%     |

Note: The metrics presented in this table were obtained from the case study on tabular data published by Ikigai Labs, available at the Ikigai Labs Blog.


LLMs only start to close the gap when the dataset contains significant unstructured text, for instance, product reviews alongside numeric data.

Your problem-solving cheat sheet

Use this matrix to choose the right approach for your problem.

| Scenario | Data shape | Human-readable text involved? | Need for determinism / audit? | Primary goal | Recommended approach | Why |
|---|---|---|---|---|---|---|
| Credit risk scoring, pricing, churn prediction | Mostly numeric / categorical columns | No | High (governance, fairness) | Highest predictive accuracy with stability | Classic tabular ML (GBMs / trees, linear models) | Handles missingness / types; deterministic; strong accuracy on structured signals. |
| Sales forecasting with product descriptions | Structured + short text fields | Some | Medium–High | Blend structured signals with text | Hybrid: classic ML on tabular + LLMs for feature extraction | Keep the predictor tabular; use the LLM only to extract features from docs. |
| Support triage / sentiment on tickets | Predominantly unstructured text | Yes | Medium | Understand language, summarize, route | LLM-first + guardrails | Native LLM territory; add guardrails for consistency & safety. |
| KPI dashboards that must exactly match rules | Structured, stable schema | No | Very High | Repeatability, explainability | Classic tabular ML or rule-based | Deterministic behavior and easy review win here. |
| Exploratory analytics with analysts in the loop | Mixed | Maybe | Low–Medium | Speed of iteration | Hybrid | Use the LLM for exploration & documentation; keep the final predictive model tabular. |

Design patterns that work in practice

  1. Classic tabular Machine Learning first

    Start with models built for tables: gradient-boosted trees (e.g., XGBoost/LightGBM/CatBoost) or regularized linear models. They:

    • respect feature types,
    • are robust to missingness and outliers,
    • and provide deterministic, explainable predictions (feature importance, SHAP); a baseline sketch follows this list.
  2. Tabular deep learning when warranted

    With very large, rich tables (and well-designed embeddings for categoricals), tabular DL can be competitive. Use when classical baselines plateau and data scale/features justify extra complexity.

  3. Hybridize instead of forcing

    If your decision needs external context (contracts, emails, manuals), use LLMs to create features, then feed those features into a tabular model. Reserve the LLM for language; keep the final decision maker structured.

  4. Guardrails for LLM-adjacent workflows

    Wrap prompts with constraints, validation, and policy checks to improve reliability wherever LLMs touch the pipeline (classification labels, summaries, routing).
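
A minimal baseline sketch for pattern 1, assuming a CSV with numeric feature columns and a binary target column named “churn” (the file name and schema are illustrative):

```python
import pandas as pd
import xgboost as xgb
import shap  # pip install shap

df = pd.read_csv("customers.csv")   # hypothetical file; numeric features assumed
X = df.drop(columns=["churn"])      # "churn" is an assumed binary target column
y = df["churn"]

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

# Deterministic, inspectable explanations: per-row, per-feature contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
```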
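And a sketch of pattern 3, the hybrid split: the LLM only turns free text into validated features, and the tabular model makes the prediction. Here `call_llm` is a hypothetical stand-in for your LLM client, and the file name and column names are assumptions.

```python
import json
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def extract_text_features(description: str) -> dict:
    """Use the LLM solely for feature extraction, never for the final decision."""
    prompt = (
        "Return JSON with keys 'sentiment' (-1 to 1) and 'is_seasonal' (0 or 1) "
        f"for this product description:\n{description}"
    )
    feats = json.loads(call_llm(prompt))               # call_llm is hypothetical
    assert set(feats) == {"sentiment", "is_seasonal"}  # guardrail: validate shape
    return feats

df = pd.read_csv("sales.csv")  # assumed columns: description, price, units_sold
text_feats = pd.DataFrame([extract_text_features(d) for d in df["description"]])
X = pd.concat([df[["price"]], text_feats], axis=1)

model = HistGradientBoostingRegressor().fit(X, df["units_sold"])
```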

When not to use an LLM and what to do instead

Avoid LLMs as the primary predictor when:

  • Your data is mostly structured (numbers, categories) and the task is classification/regression.
  • You require determinism, audit trails, or compliance-grade consistency.
  • There’s no conversational component and no need to generate free-form text.
  • The problem is narrow (purely numerical or categorical inputs), repetitive, and performance-sensitive (latency/cost).

Prefer classic tabular ML (or tabular DL) when:

  • Accuracy matters: If your KPI depends on precise numeric or categorical predictions, specialized models are more reliable.
  • Interpretability and reproducibility matter: Business stakeholders often want clear reasoning, which tabular models can provide through feature importance metrics.
  • Cost matters: LLMs are resource-intensive; tabular ML models are cheaper to train and serve at scale, with predictable costs.

Consider a hybrid approach when:

  • Valuable context lives in documents, emails, or manuals.
  • You can extract that context into features and let the tabular model decide.
  • You want analysts to interact with insights, but governance demands a structured decision engine.
Bottom line:

If your data looks like a spreadsheet and your goal is a numeric or categorical prediction, don’t start with an LLM. Start with tabular Machine Learning. Add LLMs around the decision (for enrichment, explanations, or ops), not as the decision.


Conclusion

LLMs are powerful tools for natural language problems, but they’re not a universal solution. For most structured data tasks, traditional ML approaches will give you better results, faster and at a lower cost.

Before committing to an LLM-based strategy, ask: Is this really a language problem? If the answer is “no”, you might be better off with proven tabular modeling techniques.

Want a second opinion on architecture or a quick bake-off (GBM vs. LLM-prompting vs. hybrid)? At Tryolabs, we help businesses choose the right AI approach for their data and objectives, validate fit, design for governance, and ship a solution that balances accuracy, cost, maintainability and scalability.

If you want to explore the best path for your next ML project, let’s talk!
