Antti Rauhala
Co-founder
May 15, 2026 • 13 min read
When a SaaS team evaluates Aito, the first technical question is rarely about accuracy. It is about scale. Will it still work at 1 million invoices? At 10 million? With dozens of tenants on shared infrastructure, all querying at the same time?
These are fair questions. A predictive database that delivers great accuracy on a 10,000-row demo but melts at 5 million rows is not a product, it is a science project. So we put Aito on a bench, ran a multi-tenant invoice routing workload from 1,000 rows up to 10 million, measured what actually happens, and made sure to measure the realistic thing: a real HTTP request hitting Aito's API the way a production app would hit it. Not an in-process call that skips the parser, the protocol, and the text-tokenisation work. Those numbers are easy to make look good. They are also, as we found out, not what production looks like.
This post walks through what the realistic numbers actually are.
Throughput benchmarks for normal databases focus on inserts per second and query latency. A predictive database has the same dimensions plus a few extra:
The benchmark below targets all five. The headline result is the one most people care about: steady-state predict latency at 10 million rows, end-to-end through HTTP, stays under 300 milliseconds.
The dataset is a synthetic invoice routing workload. The schema mirrors what we see in production: a 4-table linked structure with companies, employees, GL codes, and invoices. Each invoice has three text fields (sender, product, free-text description) and three prediction targets: processor, acceptor, and GL code.
The query is the same one our customers actually run in production:
```json
{
  "from": "invoices",
  "where": {
    "sender": "...",
    "product": "...",
    "description": "...",
    "processor.company": 13
  },
  "predict": "processor",
  "limit": 144
}
```
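As a rough illustration of what "a real HTTP request" means here, the sketch below builds the query above and times one round trip using only the Python standard library. The base URL, the API key, the `/api/v1/_predict` path, and the `x-api-key` header are assumptions for illustration, so check them against your own instance. The text fields are taken as parameters because the post elides them.

```python
import json
import time
import urllib.request


def build_predict_query(sender, product, description, company_id, limit=144):
    """Return the predict query from the post as a plain dict."""
    return {
        "from": "invoices",
        "where": {
            "sender": sender,
            "product": product,
            "description": description,
            "processor.company": company_id,
        },
        "predict": "processor",
        "limit": limit,
    }


def build_request(base_url, api_key, query):
    """Wrap the query in an HTTP POST. Path and header name are assumptions."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/_predict",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def timed_predict(request, opener=urllib.request.urlopen):
    """Send the request and return (parsed JSON body, elapsed milliseconds)."""
    start = time.perf_counter()
    with opener(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return body, elapsed_ms
```

The timer wraps serialisation on the wire, the server's parsing and scoring, and the response decode, which is the end-to-end figure the tables in this post report. The `opener` parameter exists only so the function can be exercised without a network.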
The test loads a fresh database, optimises the state (a one-time merge step that reduces query latency at the cost of a couple of extra minutes during ingestion), and then fires 63 random predict queries at the HTTP API in batches of 1, 2, 4, 8, 16, and 32. The first batch is one query against a cold state. The later batches are run back-to-back so the JVM is hot, the on-disk pages are paged in, and the per-token caches are populated.
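The doubling batch sizes map directly onto the query-index ranges used in the results table. A small sketch of that bookkeeping, with hypothetical helper names that are not taken from the benchmark sources:

```python
from statistics import mean


def batch_ranges(batch_sizes=(1, 2, 4, 8, 16, 32)):
    """Turn batch sizes into half-open (start, end) query-index ranges."""
    ranges, start = [], 0
    for size in batch_sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


def mean_latency_per_batch(latencies_ms, ranges):
    """Average the per-query latencies that fall inside each batch range."""
    return [mean(latencies_ms[start:end]) for start, end in ranges]


ranges = batch_ranges()
total_queries = ranges[-1][1]
print(ranges)         # [(0, 1), (1, 3), (3, 7), (7, 15), (15, 31), (31, 63)]
print(total_queries)  # 63
```

Doubling the batch size puts half of all queries into the last batch, so the final row of the table is the best estimate of steady-state behaviour, while the first row isolates the cold-start cost.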
We ran this on a single developer workstation with JDK 17 and a 12 GB heap. No special tuning. The booktest sources are public in the Aito core repository.
Mean predict latency in milliseconds per query, on the optimised 10-million-row state, by batch:
| Batch | Processor | Acceptor | GL code |
|---|---|---|---|
| 0–1 (cold) | 18,246 ms | 1,720 ms | 907 ms |
| 1–3 | 289 ms | 237 ms | 122 ms |
| 3–7 | 235 ms | 215 ms | 124 ms |
| 7–15 | 263 ms | 147 ms | 199 ms |
| 15–31 | 243 ms | 180 ms | 133 ms |
| 31–63 (steady) | 213 ms | 154 ms | 124 ms |
Two stories in one table.
The cold first query is multi-second. When Aito first opens a 10-million-row state and runs its first predict, it pays an 18-second tax on the processor query. This is not a bug. It is the cost of warming up: the JVM JIT has to compile the predict path, the per-text-field token analysis structures need to be materialised in memory, and the on-disk index pages need to be paged in by the OS. None of this happens for free. The acceptor and GL code first queries are smaller (1.7 seconds and 0.9 seconds) because by the time they run, most of that work is already done. Bringing the processor cold cost down further is on the engineering roadmap.
Once warm, latency settles in the low hundreds of milliseconds. From batch 1–3 onward, every batch mean lands between roughly 120 and 290 ms across all three target fields. Variance is tight. There is no thermal runaway, no GC stall pile-up, no memory growth that gradually slows the system down across 63 queries.
For a SaaS use case, this means one thing: pre-warm Aito on deploy by firing a representative query against each tenant's state, and from then on you are in the predictable 120-to-220 ms band. That is comfortable territory for a predict-on-button-click UI feature, and it is fine for a workflow step in an automated pipeline.
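A pre-warm step can be as small as a loop in the deploy script. The sketch below assumes a hypothetical `run_query(tenant, query)` callable that sends a predict query to one tenant's database; it is not part of any Aito client library, and the error handling is only one possible choice.

```python
import time


def prewarm(tenants, run_query, representative_query):
    """Fire one representative query per tenant and record cold latency in ms.

    `run_query(tenant, query)` is a hypothetical callable supplied by the
    caller. A tenant whose warm-up fails is recorded as None.
    """
    cold_ms = {}
    for tenant in tenants:
        start = time.perf_counter()
        try:
            run_query(tenant, representative_query)
        except Exception as error:  # keep warming the remaining tenants
            cold_ms[tenant] = None
            print(f"warm-up failed for {tenant}: {error}")
            continue
        cold_ms[tenant] = (time.perf_counter() - start) * 1000.0
    return cold_ms
```

Since the table shows a separate, smaller cold cost for each target field, it may be worth calling this once per prediction target rather than once overall. Logging the returned cold latencies also gives a cheap regression signal across deploys.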
The bigger question is what the scaling curve looks like across orders of magnitude. We ran the same benchmark at 1k, 10k, and 100k rows, and at 1M and 10M (post-optimisation). Steady-state mean predict latency in milliseconds:
| Scale | Processor | Acceptor | GL code |
|---|---|---|---|
| 1 k | 40 ms | 33 ms | 19 ms |
| 10 k | 37 ms | 40 ms | 20 ms |
| 100 k | 41 ms | 45 ms | 20 ms |
| 1 M | 110 ms | 65 ms | 44 ms |
| 10 M | 213 ms | 154 ms | 124 ms |
A 10,000-fold increase in row count, from 1k to 10M, takes processor predict from 40 ms to 213 ms. That is roughly a 5x latency increase for a 10,000x data increase. The scaling is dramatically sub-linear because Aito's per-row cost amortises against bitset operations and disk-resident indexes that scale logarithmically (or are constant) for most of the work.
Predictions stay fast enough for an interactive UI even as tenant data grows from thousands to tens of millions of rows. Under the hood, this comes from disk-resident bitset indexes that scale logarithmically with state size, a persistent cache for cross-table linkage construction so the same evidence is not rebuilt on every query, and lazy candidate scoring whose cost depends on the candidate set per query rather than the total row count. There is no separate retraining step to schedule, no model versioning to manage, and no infrastructure that scales differently from the rest of the database.
The shape of the curve is best read as two regimes. Below ~100k rows, latency is essentially flat (19 to 45 ms across all three targets) because per-token caches and per-field statistics dominate the cost and the dataset is small enough that they barely move. Above 100k, latency grows but slowly: the per-query cost is dominated by candidate-scoring math that depends on the candidate set size (around 144 employees per company), not the total invoice count. Going from 100k to 10M is a 100x data increase that takes processor predict from 41 ms to 213 ms, roughly a 5x latency multiplier for 100x more data.
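The ratios quoted above can be checked directly from the scaling table. The sketch below also fits a power-law exponent k in latency ~ rows^k between two points. That fit is an illustrative summary added here, not a model the benchmark itself claims; an exponent of 1 would mean linear scaling.

```python
import math

# Steady-state processor latency from the scaling table (rows -> ms).
processor_ms = {
    1_000: 40,
    10_000: 37,
    100_000: 41,
    1_000_000: 110,
    10_000_000: 213,
}


def scaling(rows_a, rows_b, table):
    """Return (data ratio, latency ratio, exponent k in latency ~ rows**k)."""
    data_ratio = rows_b / rows_a
    latency_ratio = table[rows_b] / table[rows_a]
    exponent = math.log(latency_ratio) / math.log(data_ratio)
    return data_ratio, latency_ratio, exponent


for low, high in [(1_000, 10_000_000), (100_000, 10_000_000)]:
    data_ratio, latency_ratio, k = scaling(low, high, processor_ms)
    print(f"{low:,} -> {high:,}: {data_ratio:,.0f}x data, "
          f"{latency_ratio:.1f}x latency, exponent {k:.2f}")
```

Over the full 1k-to-10M range the exponent comes out at about 0.18, and over the 100k-to-10M range at about 0.36, both well below 1. The higher figure for the upper range reflects the two-regime shape: nearly all of the growth happens above 100k rows.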
This is the property that makes Aito viable behind a SaaS UI. You do not have to design around prediction latency. You ship a query like any other.
Latency means nothing if accuracy collapses with scale. Top-1 and top-3 accuracy across the 63-query test set, by target, at 10 million rows:
| Field | Top-1 accuracy | Top-3 accuracy |
|---|---|---|
| Processor | 78% | 87% |
| Acceptor | 94% | 97% |
| GL code | 87% | 97% |
The synthetic data is intentionally noisy (random syllable combinations rather than realistic supplier names, probabilistic routing rules rather than deterministic ones), which makes the absolute numbers a conservative floor for what real customer data would deliver. The point is the shape: more data does not break Aito's predictions, it improves them, and there is no regime where accuracy degrades as the dataset grows.
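For readers reproducing these figures, top-k accuracy is the share of test queries whose true value appears among the first k ranked predictions. A minimal sketch, assuming each prediction result has already been reduced to an ordered list of candidate labels (the data below is made up):

```python
def top_k_accuracy(ranked_predictions, actual, k):
    """Fraction of cases whose true label appears in the first k predictions."""
    if len(ranked_predictions) != len(actual):
        raise ValueError("predictions and labels must have the same length")
    if not actual:
        raise ValueError("need at least one test case")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, actual)
        if truth in ranked[:k]
    )
    return hits / len(actual)


# Made-up example: four queries, candidates ordered best-first.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["a", "c", "b"]]
truth = ["a", "a", "a", "d"]
print(top_k_accuracy(ranked, truth, 1))  # 0.25
print(top_k_accuracy(ranked, truth, 3))  # 0.75
```

Top-3 is the relevant number when the UI shows a short list of suggestions rather than a single pre-filled value.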
Two operational numbers worth knowing.
Optimisation wall. After bulk-loading data, Aito has an optional optimize step that merges segment files for faster steady-state predict. It is a one-time cost that pays itself back many times over.
| Scale | Optimise wall |
|---|---|
| 1 M invoices | 37 s |
| 10 M invoices | 535 s (~8.9 min) |
For a tenant that loads data once and then queries forever, optimisation is run once at onboarding. For a tenant that grows continuously, optimisation runs as a background maintenance task. Either way, predict latency stays in the steady-state band reported above. Without optimisation, the same query path works but with more variance and higher mean latency.
Heap and disk. At 1 million rows, the JVM heap settles around 800 MB during ingestion and drops back to a few hundred MB after optimisation. At 10 million, the heap profile is similar (Aito leans on memory-mapped on-disk indexes rather than caching everything in heap). Memory does not grow query-to-query at any scale we have tested. There is no leak. Capacity planning works the way it does for a normal database: look at row counts and disk size, not at how much GPU memory the next training run will need.
Production data does not arrive in a single bulk load. It arrives a row at a time, in small batches, as new invoices are processed. Single-row HTTP commits at the API level take a bit under a second on the legacy path. Bulk batched commits land much faster: a 100,000-row batch loads in 16 seconds (around 160 microseconds per row, end-to-end through the HTTP API including JSON parsing and disk write).
For a multi-tenant SaaS workflow where new invoices arrive a few at a time and need to be queryable immediately, the practical pattern is to batch incoming inserts into windows of a few hundred rows and commit each window in one HTTP call. That keeps the per-row cost amortised and lets predict queries in the same JVM stay in the steady-state band.
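The windowing pattern can be sketched in a few lines. `commit_batch` below is a hypothetical callable that uploads one list of rows in a single HTTP call, and the window size of 200 is only an example of "a few hundred".

```python
def windows(rows, window_size=200):
    """Yield consecutive slices of at most `window_size` rows."""
    if window_size < 1:
        raise ValueError("window_size must be positive")
    for start in range(0, len(rows), window_size):
        yield rows[start:start + window_size]


def commit_in_windows(rows, commit_batch, window_size=200):
    """Send each window through `commit_batch` (hypothetical, one HTTP call
    per window) and return the number of calls made."""
    calls = 0
    for window in windows(rows, window_size):
        commit_batch(window)
        calls += 1
    return calls


# Per-row cost implied by the bulk figure above: 16 s for 100,000 rows.
per_row_us = 16 * 1_000_000 / 100_000
print(per_row_us)  # 160.0
```

In a live system the window would usually also flush on a timer, so that a quiet tenant's rows do not wait indefinitely for the window to fill. That part is left out here to keep the sketch short.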
Two of our customers run roughly opposite shapes of this workload in production.
A high-volume accounting-automation partner routes invoices for accounting clients across many tenants on a dedicated Aito node. Each tenant has its own database with its own employees, its own accounts, and its own historical patterns. Aito sees query traffic that looks essentially like the benchmark: a where-clause over text fields, a predict on a categorical target, scoped to one tenant's candidate set. Steady-state latencies on real customer data are in the same band as the synthetic benchmark above.
Helsingö runs on a shared multitenant Aito instance with smaller per-tenant data. Same engine, same query interface, no per-tenant ML pipeline.
The same predictive engine runs in shared multitenant deployments, dedicated single-tenant nodes, and pure on-premise IP-licensed installations (used by Q-Automate and Sisua Digital). The latency numbers above are not "shared cluster" numbers and "dedicated cluster" numbers. They are the engine's behaviour, which is the same wherever it runs.
A few things this benchmark does not measure that you might want to know:
Translating the numbers into practical points:
You can ship predictions in your UI. Steady-state latency in the 120-to-220 ms band, and under 300 ms for every warm batch, is fast enough to put a prediction behind a button click without a loading spinner. No background queue, no asynchronous workflow, no "we will email you when it is ready."
You do not need a per-tenant ML pipeline. One Aito node serves many tenants. The engine handles tenant isolation at the database level. Onboarding a new tenant is a database creation, not a model training job.
You do not pay for retraining. Predictions come from queries, which run on whatever data is in the database at query time. New rows are immediately visible. Schema changes do not invalidate a model, because there is no model to invalidate.
You do need to think about warm-up. First query after open is slow. Pre-warm on deploy. After that, latency is predictable.
Capacity planning is normal-database planning. Look at row counts, disk size, and concurrent query rate. Not GPU memory.
The buyer concern that started this post (will it scale?) has a short answer: yes, on the production engine, end-to-end through the HTTP API, up to at least 10 million rows on a developer workstation with realistic SaaS query traffic.
A next-generation storage and query engine is in active development, currently pre-beta and hidden behind a feature flag. On text-free workloads it already runs predict around 25% faster than the production engine in steady state, and the hard case (cross-table link workloads) has closed most of the gap to the production path over the spring. We expect parity-and-faster across the board over the coming quarter. The numbers in this post are from the production engine.
Free trial is at aito.ai. The full benchmark code is open and runs on a developer workstation. If you want to discuss a specific deployment shape (multi-tenant economics, on-premise constraints, or your particular query patterns), reach us at hello@aito.ai. We read every inbound message.
Antti Rauhala is co-founder of Aito.ai. Aito is a predictive database for B2B SaaS platforms, headquartered in Helsinki.