Documentation

April 25, 2019

A glossary of common terms used in our documentation.

Bayesian inference is the core of Aito's machine learning functions. It is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available. This method can be described by the following formula:

$P(H|E) = \frac{P(E|H) * P(H)}{P(E)}$

- *H* stands for any hypothesis whose probability may be affected by the data (called evidence below).
- *E* stands for any evidence or known data.
- *P(H|E)*, the *posterior probability*, is the estimated probability of a hypothesis given the observed evidence.
- *P(E|H)*, the *likelihood*, is the estimated probability of observing evidence E given the hypothesis H is true.
- *P(H)*, the *prior probability*, is the estimated probability of the hypothesis H before the evidence E is observed.
- *P(E)*, the *marginal likelihood*, is the estimated probability that the evidence E is true.
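As a rough illustration of the formula, the sketch below computes a posterior for a single hypothesis and a single piece of evidence. The function name `posterior` and all numbers are made up for this example; the marginal likelihood P(E) is obtained from the law of total probability over "H is true" and "H is false".

```python
def posterior(prior: float, likelihood: float, likelihood_if_false: float) -> float:
    """Return P(H|E) from P(H), P(E|H) and P(E|not H) using Bayes' theorem."""
    # Marginal likelihood P(E) via the law of total probability.
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Made-up numbers: 20% of games are simulations; a given word appears in 30% of
# simulation descriptions and in 1% of all other descriptions.
p = posterior(prior=0.2, likelihood=0.3, likelihood_if_false=0.01)
print(round(p, 3))  # 0.882
```

Observing the word raises the probability of the hypothesis from the prior of 0.2 to roughly 0.88.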

Let's take a look at an example: *Predict the genre of a game given its description*.
Given the description of the game "Cities: Skylines": "Cities: Skylines is a city-building game developed by Colossal Order and published by Paradox Interactive. Players engage in urban planning by controlling zoning, road placement, taxation, public services, and public transportation of an area". This will be the **evidence**.
There are 5 available genres: "Action", "Fighting", "Puzzle", "Simulation", and "Strategy". These are the **hypotheses**.

In Aito, the likelihood `P("Cities:Skylines..." | Action)` is estimated by breaking the description down into features and using these features as multiple pieces of evidence for the inference.
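One way to picture how several features can act as multiple pieces of evidence is a naive Bayes style combination. The sketch below is an assumption for illustration only: it treats features as conditionally independent given the genre, which the text above does not claim Aito does, and the function name `genre_posteriors` and all probabilities are made up.

```python
import math


def genre_posteriors(features, priors, feature_likelihoods, floor=1e-6):
    """Combine per-feature likelihoods into P(genre | features).

    Assumes features are conditionally independent given the genre
    (the naive Bayes assumption), which is a simplification.
    """
    log_scores = {}
    for genre, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            # Unseen (genre, feature) pairs fall back to a small floor value.
            score += math.log(feature_likelihoods[genre].get(feature, floor))
        log_scores[genre] = score
    # Normalise so the posteriors sum to 1 (this plays the role of dividing by P(E)).
    top = max(log_scores.values())
    weights = {g: math.exp(s - top) for g, s in log_scores.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Made-up numbers for two of the five genres, purely for illustration.
priors = {"Simulation": 0.5, "Action": 0.5}
feature_likelihoods = {
    "Simulation": {"city-building": 0.20, "zoning": 0.10},
    "Action": {"city-building": 0.01, "zoning": 0.01},
}
result = genre_posteriors(["city-building", "zoning"], priors, feature_likelihoods)
best = max(result, key=result.get)
print(best)  # Simulation
```

Working in log space avoids numerical underflow when a description yields many features.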

Bayesian inference translates to an Aito query as follows:

```
EVIDENCE => PROPOSITION
HYPOTHESIS => Target of an operation
```

For instance, we can ask Aito to solve the genre prediction problem by using the **MATCH** API, assuming that we have a `game_data` table with the fields `description` and `genre`:

```
{
  "from": "game_data",
  "where": {
    "description": {"$match": "Cities: Skylines is a city-building game developed by Colossal Order and published by Paradox Interactive. Players engage in urban planning by controlling zoning, road placement, taxation, public services, and public transportation of an area"}
  },
  "match": "genre"
}
```
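The mapping from evidence and hypothesis to a query body can also be written as a small helper. This is only a sketch: `build_match_query` is a hypothetical name, not part of any Aito client, it only builds the JSON body and does not send it anywhere, and the description is shortened for readability.

```python
import json


def build_match_query(table, evidence, hypothesis_field):
    """Build a query body shaped like the example above.

    `evidence` maps field names to known text (the "where" proposition);
    `hypothesis_field` is the field whose value should be inferred.
    """
    return {
        "from": table,
        "where": {field: {"$match": text} for field, text in evidence.items()},
        "match": hypothesis_field,
    }


query = build_match_query(
    table="game_data",
    evidence={"description": "Cities: Skylines is a city-building game"},
    hypothesis_field="genre",
)
print(json.dumps(query, indent=2))
```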

To analyze the data better, Aito splits fields into features under the hood.
How the featurization is done depends on the field type defined in the database schema.
For example, the `Text` type supports an *"analyzer"* option, which allows you to control
how a text field is split into features.

Some queries, for example Relate, return the features instead of the actual values of the field.

- If defined as `String`, the textual data is kept as it is and is not featurized. The whole textual data is counted as a single feature.
- If defined as `Text` with `"analyzer": "Whitespace"`:
  - Split the text into features by white space
  - "aito database" -> 2 features: "aito", "database"
- If defined as `Text` with `"analyzer": "English"`:
  - Analyze the text into English stems
  - "aito database" -> 2 features: "aito", "databas"
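The difference between the `String` and whitespace-analyzed `Text` cases can be mimicked in a few lines. This is an illustrative sketch, not Aito's implementation: `featurize` is a hypothetical helper, and the `English` analyzer is deliberately left out because it needs a real stemmer.

```python
from typing import List


def featurize(value: str, field_type: str, analyzer: str = "Whitespace") -> List[str]:
    """Mimic the featurization rules listed above for two of the three cases."""
    if field_type == "String":
        # The whole value is kept as one feature.
        return [value]
    if field_type == "Text" and analyzer == "Whitespace":
        # Split on runs of white space.
        return value.split()
    # Stemming analyzers such as "English" are not sketched here.
    raise NotImplementedError(f"{field_type} with analyzer {analyzer!r} is not sketched")


print(featurize("aito database", "String"))  # ['aito database']
print(featurize("aito database", "Text"))    # ['aito', 'database']
```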

Lift is a ratio that measures how much a feature enhances or diminishes a response compared with the average for the population. For example, if a population has an average lung cancer risk of 1%, but people who smoke in that population have a risk of 50%, then smoking has a lift of 50 (50/1). This can be interpreted as follows: in that population, smoking increases the risk of lung cancer 50-fold. The lift is calculated by the following formula:

$lift = \frac{P(A\cap B)}{P(A) * P(B)}$

Since $P(A\cap B) = P(B|A) * P(A)$, this is the same as $\frac{P(B|A)}{P(B)}$, which is how the 50/1 in the smoking example is obtained.

Aito uses term frequency-inverse document frequency (tf-idf) for scoring. In short, tf-idf is a numerical statistic that combines term frequency, which is the number of times a term occurs in a document, and inverse document frequency, which is a measure of how much information a term provides.
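To make the two ingredients of tf-idf concrete, here is a minimal sketch using one common textbook weighting, raw term count multiplied by `log(N / df)`. The exact variant Aito uses is not stated here, so treat the formula, the function name `tf_idf`, and the toy corpus as assumptions.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score `term` in `document` (a list of tokens) against `corpus` (a list of documents)."""
    tf = Counter(document)[term]                      # times the term occurs in the document
    df = sum(1 for doc in corpus if term in doc)      # documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)                  # rarer terms carry more information
    return tf * idf


corpus = [
    ["aito", "database"],
    ["aito", "query"],
    ["aito", "database", "database"],
]
print(tf_idf("aito", corpus[0], corpus))  # 0.0, the term appears in every document
print(tf_idf("query", corpus[1], corpus) > tf_idf("database", corpus[0], corpus))  # True
```

A term that occurs in every document gets a score of zero, while a term confined to few documents scores higher for the same count.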

**Note:**
This is different from typo and synonym suggestion.
We are planning to add these features to the similarity API soon.