Antti Rauhala, Co-founder

March 8, 2026 • 12 min read

Full disclosure: I am the co-founder of Aito, which falls in the predictive database category discussed below. I have tried to present all four categories fairly, but the reader should be aware of my position.

AI capabilities are migrating into the database layer. Not as separate services called from application code, and not as external pipelines that read and write data, but as native capabilities of the query interface itself.

Several distinct approaches have emerged. Some databases add similarity search over embeddings. Others bring model training into SQL. Still others route queries to large language models. And a fourth approach, the youngest and smallest of the four, makes the database itself capable of statistical inference without a separate model.

This article surveys four architecturally distinct categories of AI databases as they exist in 2026, and briefly discusses related approaches that fall outside the taxonomy.

[Diagram: the four AI database categories and their member implementations]
The four AI database categories as of 2026, with representative implementations in each.

What are vector databases?

Vector databases store and retrieve high-dimensional embeddings. Text, images, or other unstructured data is converted into numerical vectors using an embedding model, stored, and then queried by semantic similarity. The primary use cases are semantic search and retrieval-augmented generation (RAG).

The ecosystem has matured rapidly. pgvector brings vector similarity search to Postgres as an extension. Purpose-built systems include Pinecone, Weaviate, Qdrant, Milvus, and Chroma.

Multimodal variants like LanceDB extend the vector database concept by storing raw data (images, audio, documents) alongside their embeddings and structured metadata in a unified format, enabling hybrid queries across all three.
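What "queried by semantic similarity" means can be sketched in a few lines. This is an illustrative brute-force version in plain NumPy, not any vendor's API; production systems replace the full scan with approximate nearest-neighbor indexes such as HNSW:

```python
import numpy as np

def most_similar(query: np.ndarray, index: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k stored vectors most similar to the query.

    Brute-force cosine similarity; real vector databases use approximate
    nearest-neighbor indexes (e.g. HNSW) to avoid scanning every vector.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k].tolist()

# Toy 4-dimensional "embeddings" standing in for real model output.
stored = np.array([
    [1.0, 0.0, 0.0, 0.0],   # doc 0
    [0.9, 0.1, 0.0, 0.0],   # doc 1: close to doc 0
    [0.0, 0.0, 1.0, 0.0],   # doc 2: unrelated
])
print(most_similar(np.array([1.0, 0.05, 0.0, 0.0]), stored, k=2))  # → [0, 1]
```

The intelligence is in the embedding model that produced the vectors; the query itself is geometry.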

Strengths: semantic search where keyword matching fails, RAG pipelines for grounding LLM responses in data, similarity matching and deduplication.

Limitations: designed for unstructured data retrieval, not for prediction over structured tabular data. If the task is to predict an expense category from a Postgres table of transaction records, the predictive signal lives in column correlations and statistical relationships between fields, not in an embedding space.

What are ML-in-database platforms?

ML-in-database platforms bring the machine learning workflow (model training and inference) inside the database's SQL interface. Rather than exporting data to an external ML platform, models are defined and trained using SQL statements.

Representative implementations:

  • MindsDB: CREATE MODEL in SQL, with automated feature engineering and model selection. Integrates with over 100 data sources.
  • PostgresML: a Postgres extension providing pgml.train() and pgml.predict(). Supports scikit-learn, XGBoost, LightGBM, and deep learning directly inside Postgres.
  • BigQuery ML: Google's approach. CREATE MODEL trains, ML.PREDICT() serves predictions. Supports logistic regression, random forests, boosted trees, and neural networks.
  • Redshift ML: AWS's equivalent. Uses SageMaker Autopilot for training, exposed as SQL functions.
  • ClickHouse: integrates CatBoost models for inference on analytical queries.

The common pattern is train first, predict later, but do both in SQL. The database becomes the ML platform.
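The lifecycle these platforms wrap in SQL is the familiar train-then-predict one. A hedged scikit-learn sketch of the same steps, with invented toy rows (roughly, CREATE MODEL maps onto fit and ML.PREDICT onto predict):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy transaction rows: (amount, weekday) -> expense category.
X_train = [[12.0, 1], [14.5, 2], [640.0, 3], [720.0, 4], [11.0, 5], [655.0, 1]]
y_train = ["meals", "meals", "travel", "travel", "meals", "travel"]

# "CREATE MODEL": training produces a model artifact...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# "ML.PREDICT": ...which then serves predictions until it is retrained.
print(model.predict([[13.0, 3]])[0])  # → meals
```

The artifact is the operational cost: when the data distribution shifts, the fit step must be re-run and the new model redeployed, which is exactly the lifecycle these platforms automate but do not remove.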

Strengths: teams already invested in a specific database can add prediction without new infrastructure. Data stays inside the database boundary, which simplifies governance. SQL-fluent teams can use existing skills.

Graph databases with ML capabilities follow the same architectural pattern. Neo4j Graph Data Science trains graph neural networks and computes graph embeddings inside the database, while Amazon Neptune ML uses GNNs for link prediction and node classification. The difference is that the model operates on graph topology rather than tabular data, but the lifecycle is the same: train, deploy, retrain.

Limitations: the train-then-predict pattern persists. Models must be retrained when data distributions shift, versioned, and monitored. The database automates the execution of the ML workflow, but the workflow itself, with its retraining schedules and drift monitoring, remains.

What are LLM-augmented databases?

LLM-augmented databases route SQL queries to large language models. Instead of training a model on historical data, the database sends row data to an LLM and returns the LLM's response as a query result. This category has grown rapidly in 2025-2026 as major cloud providers shipped native LLM integration.

Representative implementations:

  • Snowflake Cortex: CLASSIFY_TEXT(), COMPLETE(), SUMMARIZE(), SENTIMENT() as SQL functions. LLM runs inside Snowflake infrastructure.
  • Databricks AI Functions: ai_classify(), ai_query(), ai_extract(), ai_summarize() calling Foundation Models.
  • BigQuery AI Functions: AI.CLASSIFY(), AI.SCORE(), AI.IF() routing to Gemini models.
  • AlloyDB AI: ai.if(), ai.rank(), ai.classify() in a Postgres-compatible transactional database, calling Vertex AI.
  • SingleStore Aura: routes data through LLMs at query time for classification and enrichment.

The pattern: no training, no model management. The LLM is the model, accessed as a SQL function. The database builds a prompt with the relevant data and the LLM returns a classification, extraction, or score.
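The plumbing behind these functions is straightforward to sketch. A hedged illustration with an invented prompt template and a stubbed LLM call standing in for the provider API the database invokes internally; nothing here is a real vendor interface:

```python
def classify_row(row: dict, labels: list[str], call_llm) -> str:
    """Build a prompt from a table row and return the LLM's label choice.

    `call_llm` stands in for the provider call the database makes
    per row; the round-trip is where the latency and cost come from.
    """
    fields = ", ".join(f"{k}={v!r}" for k, v in row.items())
    prompt = (
        f"Classify this record into one of {labels}.\n"
        f"Record: {fields}\n"
        "Answer with the label only."
    )
    return call_llm(prompt).strip()

# Stub LLM so the example runs offline.
fake_llm = lambda prompt: " office_supplies "
row = {"merchant": "Staples", "amount": 84.50}
print(classify_row(row, ["office_supplies", "travel", "meals"], fake_llm))
# → office_supplies
```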

Strengths: zero setup for text tasks like sentiment analysis, entity extraction, and classification. World knowledge for cold-start scenarios where no historical patterns exist. Flexibility: the "model" is the prompt, changeable without retraining.

Limitations: cost scales linearly with prediction volume (each query is an API call). Latency of hundreds of milliseconds to seconds per call. Non-deterministic outputs. No calibrated confidence scores: the LLM does not produce probability estimates reliable enough for tiered automation (e.g., auto-process at 95% confidence, route to human review at 60%).

What is a predictive database?

A predictive database makes statistical inference a native capability of the query layer. There is no explicit model training lifecycle: no CREATE MODEL, no training job, no model artifact to version or retrain. Instead, the database generates predictions directly from the stored data at query time.

This is the smallest and youngest of the four categories. It has fewer production implementations than the other three, but its architectural properties are distinct enough to warrant separate treatment.

The theoretical foundation comes from probabilistic programming research. MIT CSAIL's BayesLite demonstrated the approach: a Bayesian inference engine on top of SQLite, using CrossCat (a nonparametric Bayesian model for tabular data, Mansinghka et al. 2016) to serve arbitrary predictive queries through an elegant BQL (Bayesian Query Language). PredictiveDB (Jain et al., CIDR 2011) explored a related concept as a research prototype but was not continued.

The defining properties of a predictive database:

  • No explicit model training lifecycle. There is no separate training pipeline, no model artifact to deploy, version, or retrain. Predictions are generated at query time using lazy learning: the system selects relevant features and computes a query-specific model on the fly. This does involve real computation (feature selection, Bayesian inference over indexed statistics), but the operational burden of model lifecycle management is eliminated.
  • Query predictions like data. The query interface for predictions is the same as for data retrieval. Results include a confidence score for each prediction alternative.
  • Structured data native. Purpose-built for tabular data, the kind stored in relational databases.
  • Cold-start capable. Bayesian priors and columnar inference produce useful predictions even with small datasets (hundreds of rows), where traditional ML models struggle to converge.

As an example, BayesLite's BQL allows querying predictions directly:

SIMULATE salary FROM employees
  GIVEN occupation = 'data scientist', years_experience = 5
  LIMIT 5;

Aito provides a similar capability through a JSON query interface, returning ranked predictions with probability scores:

{
  "from": "transactions",
  "where": { "merchant": "Staples", "amount": 84.50 },
  "predict": "expense_category"
}

In both cases, the predict operation returns confidence-scored alternatives. Aito's predictions are designed to be calibrated: a returned confidence of 0.89 should correspond to approximately 89% actual accuracy at that confidence level. (A detailed analysis of calibration properties is covered in a later post in this series.)
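That claim is empirically checkable. A hedged sketch of the standard procedure: bucket predictions by reported confidence and compare each bucket's observed accuracy against its confidence level (the evaluation pairs below are invented):

```python
from collections import defaultdict

def calibration_table(predictions, bucket_width=0.1):
    """Group (confidence, was_correct) pairs into confidence buckets and
    report observed accuracy per bucket. For a calibrated predictor,
    accuracy in each bucket should track the bucket's confidence."""
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        lo = int(confidence / bucket_width) * bucket_width
        buckets[round(lo, 2)].append(correct)
    return {lo: sum(hits) / len(hits) for lo, hits in sorted(buckets.items())}

# Invented evaluation data: (reported confidence, prediction was correct).
preds = [(0.95, True), (0.92, True), (0.91, False),
         (0.65, True), (0.62, False), (0.61, False)]
print(calibration_table(preds))
```

With real evaluation data, a large gap between a bucket's confidence and its observed accuracy indicates miscalibration.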

Strengths: structured data prediction without model lifecycle management. No model to train, version, retrain, or monitor for drift. The multi-tenant advantage is concrete: in a SaaS product with hundreds of customer accounts, each with different data patterns, an ML-in-database approach requires training and maintaining a separate model per customer. A predictive database generates per-query predictions from each tenant's data without any per-tenant model management. Cold-start and low-data situations benefit from Bayesian priors. Predictions reflect new data immediately without retraining.

Limitations: because lazy learning computes at query time, latency scales with dataset size. On small-to-medium datasets (up to tens of thousands of rows), query-time inference runs in milliseconds; on larger datasets, latency increases, and for very large tables pre-trained models will offer better inference latency. The approach is not designed for unstructured data. Accuracy on large, homogeneous datasets, where traditional ML models have ample training data, is competitive but less extensively benchmarked than mature ML-in-database implementations. This is the youngest of the four categories, with the smallest ecosystem.

Known implementations: BayesLite (MIT CSAIL, research), Aito (production), PredictiveDB (Jain et al. 2011, research prototype, not continued).

How inference works in each category

The architectural differences between the four categories become clearer when examining how each handles a prediction query:

[Diagram: inference flow in the four AI database types]
How inference works in each AI database category. Vector databases and predictive databases serve queries directly from internal data structures. ML-in-database platforms route through a separately trained model. LLM-augmented databases make an external API call per query.

Vector databases perform a direct lookup: the query embedding is compared against stored embeddings using a similarity metric. No model is involved at query time. The intelligence is in the embedding model used at indexing time.

ML-in-database platforms route the query through a pre-trained model that lives inside the database. This model has its own lifecycle: it must be created, retrained when data drifts, versioned, and eventually deprecated. The prediction is fast (the model is pre-computed), but the model management overhead persists.

LLM-augmented databases build a prompt from the row data and send it to an external LLM API. The round-trip adds latency and per-call cost. No model lives inside the database; the intelligence is entirely external.

Predictive databases run inference directly from the database's own data structures and statistical caches. No external call, no pre-trained model artifact. The Bayesian inference engine reads indexed data and pre-computed statistics to construct a query-specific model and generate predictions at query time. This involves real computation (feature selection, posterior calculation) but without a separate model lifecycle. Architecturally, this is closest to how a vector database serves a similarity query: the intelligence is embedded in the data structures themselves, applied to structured prediction rather than retrieval.
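The query-time flow can be made concrete with a minimal sketch. This is an illustrative naive Bayes over in-memory rows, not Aito's or BayesLite's actual engine; real implementations work from indexes and pre-computed statistics and use richer models, but the shape is the same: no training step, and the query-specific model is computed from the stored rows when the query arrives.

```python
from collections import Counter

def predict(rows, target, given, alpha=1.0):
    """Query-time naive Bayes: score each target value directly from the
    stored rows, with Laplace smoothing `alpha`. There is no pre-trained
    artifact; newly inserted rows change the next prediction immediately."""
    classes = Counter(r[target] for r in rows)
    scores = {}
    for c, n_c in classes.items():
        # Prior P(class), smoothed.
        score = (n_c + alpha) / (len(rows) + alpha * len(classes))
        for feat, value in given.items():
            n_match = sum(1 for r in rows if r[target] == c and r[feat] == value)
            n_values = len({r[feat] for r in rows})
            # Likelihood P(feature = value | class), smoothed.
            score *= (n_match + alpha) / (n_c + alpha * n_values)
        scores[c] = score
    total = sum(scores.values())
    # Normalize into confidence-scored alternatives, best first.
    return sorted(((c, s / total) for c, s in scores.items()),
                  key=lambda cs: -cs[1])

rows = [
    {"merchant": "Staples", "category": "office_supplies"},
    {"merchant": "Staples", "category": "office_supplies"},
    {"merchant": "Hilton",  "category": "travel"},
]
print(predict(rows, "category", {"merchant": "Staples"}))
```

Note that the result is a ranked list of alternatives with probabilities, mirroring the confidence-scored output described above, and that appending a row to `rows` changes the next call without any retraining step.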

Comparison of AI database categories

| | Vector databases | ML-in-database | LLM-augmented | Predictive databases |
|---|---|---|---|---|
| Core mechanism | Embedding similarity search | Pre-trained model in SQL | LLM API call per query | Bayesian inference at query time |
| Primary data type | Unstructured (text, images) | Structured (tabular) | Any (via prompts) | Structured (tabular) |
| Training required? | Embedding model | Yes (CREATE MODEL) | No | No (query-time computation) |
| Retraining needed? | Re-embed on changes | Yes, on data drift | No | No (reflects live data) |
| Inference latency | Low (ms-scale) | Low (model pre-computed) | High (100 ms to seconds) | Low on small-to-medium data; scales with dataset size |
| Cost model | Infrastructure | Infrastructure + ML ops | Per-call | Infrastructure |
| Confidence scores | Similarity score | Model-dependent | No | Calibrated (Bayesian) |
| Cold start / low data | N/A | Needs training data | Works (world knowledge) | Works (Bayesian priors) |
| Accuracy profile | N/A (retrieval) | Strong on large, stable datasets | Strong on text; variable on structured | Competitive; less extensively benchmarked |
| Ecosystem maturity | Mature | Mature | Growing rapidly | Emerging |
| Best suited for | Semantic search, RAG | Stable datasets, ML-capable teams | Text tasks, zero setup | Structured prediction, multi-tenant, low data |
| Implementations | pgvector, Pinecone, Weaviate, Qdrant, Milvus, Chroma, LanceDB | MindsDB, PostgresML, BigQuery ML, Redshift ML, ClickHouse, Neo4j GDS, Neptune ML | Snowflake Cortex, Databricks AI, BigQuery AI, AlloyDB AI, SingleStore Aura | BayesLite, Aito, PredictiveDB |

What this taxonomy excludes

Several related approaches fall outside the four categories above, either because they are not databases or because the AI capability serves a different purpose.

AI data platforms such as Spark MLlib, Databricks Lakehouse, and Ray provide distributed ML training and batch inference over data lakes. These are where a large share of production ML happens, but they are compute platforms rather than databases. The data lives in a separate storage layer (Delta Lake, Iceberg, S3), and the ML runs as distributed jobs over it. Some of these platforms also appear inside the taxonomy above: Databricks AI Functions is an LLM-augmented database feature, while Databricks MLflow is a platform capability.

AI-autonomous databases such as Oracle 26ai use machine learning to optimize the database itself: learned query optimizers, auto-tuning, self-patching. The AI is inward-facing (it serves the database) rather than outward-facing (it serves the application). This is a different axis entirely.

Feature stores such as Feast, Tecton, and Hopsworks manage the engineered data inputs that feed ML models. They sit on top of databases (using Redis, DynamoDB, or BigQuery as backing stores) and are part of ML infrastructure, not the database layer.

Standalone AutoML platforms such as Auto-sklearn, AutoGluon, and H2O AutoML automate model selection and hyperparameter tuning for tabular data. They overlap in use case with both ML-in-database platforms and predictive databases, but they are ML tools, not databases. The data must be exported to the platform for training.

Document databases with vector search such as MongoDB Atlas Vector Search add embedding similarity as a new index type alongside their existing document model. Architecturally, the database remains a document store; vector search is a feature, not a category change.

Outlook

An important caveat: these categories describe inference architectures, not vendor boundaries. In practice, the major platforms span multiple categories. Databricks offers both ML-in-database (MLflow) and LLM-augmented (AI Functions) capabilities. BigQuery has both BigQuery ML and AI Functions. Snowflake has Cortex ML and Cortex LLM functions. A team choosing a platform will often get access to several categories at once. The taxonomy is useful for understanding what kind of inference you are running, not for choosing a vendor.

The broader pattern is directional: AI capabilities are migrating from standalone services into the database layer. Each category represents a different architectural bet on how that migration should work. The three mature categories (vector, ML-in-database, LLM-augmented) have proven their value in production at scale. The fourth (predictive databases) is the newest entrant, architecturally distinct but with a smaller ecosystem and less production mileage. Whether it grows into a major category or remains a niche depends on whether the architectural advantages translate into adoption.

What is clear is that "AI database" is no longer a single concept. It is at least four distinct approaches, each with different trade-offs.


Next in this series: What is a predictive database? A precise definition, how columnar inference works technically, and when to use one.

