# Glossary

Documentation  June 25, 2019

Common terms used in our documentation.

## Bayesian inference

Bayesian inference is the core of Aito's machine learning functions. It is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence becomes available. The method is described by the following formula:

$P(H|E) = \frac{P(E|H) * P(H)}{P(E)}$
• H stands for any hypothesis whose probability may be affected by the data (called evidence below).
• E stands for any evidence or known data.
• P(H|E), the posterior probability, is the estimated probability of the hypothesis H given the observed evidence E.
• P(E|H), the likelihood, is the estimated probability of observing the evidence E given that the hypothesis H is true.
• P(H), the prior probability, is the estimated probability of the hypothesis H before the evidence E is observed.
• P(E), the marginal likelihood, is the estimated probability of observing the evidence E regardless of which hypothesis is true.
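As a minimal sketch of the formula, the snippet below computes a posterior from a prior and two likelihoods. All numbers are hypothetical and chosen only for illustration, and P(E) is obtained with the law of total probability over H and its complement, which is an assumption about how the marginal is derived rather than a description of Aito's internals.

```python
from fractions import Fraction


def posterior(prior, likelihood, marginal):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / marginal


# Hypothetical numbers, for illustration only.
p_h = Fraction(1, 5)                # P(H): prior
p_e_given_h = Fraction(1, 2)        # P(E|H): likelihood
p_e_given_not_h = Fraction(1, 10)   # P(E|not H)

# P(E) by the law of total probability over H and not-H.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

p_h_given_e = posterior(p_h, p_e_given_h, p_e)
print(p_h_given_e)  # 5/9
```

Observing the evidence raises the probability of H from 1/5 to 5/9 because E is five times more likely under H than under its complement.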

Let's look at an example: predicting the genre of a game from its description. The evidence is the description of the game "Cities: Skylines": "Cities: Skylines is a city-building game developed by Colossal Order and published by Paradox Interactive. Players engage in urban planning by controlling zoning, road placement, taxation, public services, and public transportation of an area". There are 5 available genres: "Action", "Fighting", "Puzzle", "Simulation", and "Strategy". These are the hypotheses. We can use Bayesian inference to solve the problem by finding the probability of each genre given the description. For example, for the "Action" genre:

$P(\text{Action}|\text{Cities: Skylines}) = \frac{P(\text{Cities: Skylines}|\text{Action}) * P(\text{Action})}{P(\text{Cities: Skylines})}$

In Aito, the likelihood P("Cities: Skylines..." | Action) is estimated by breaking the description down into features and using these features as multiple pieces of evidence for the inference.
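The idea of treating each feature as a separate piece of evidence can be sketched with a small naive Bayes classifier. This is not Aito's implementation: the training data, the whitespace featurization, the add-one smoothing, and the names `TRAINING`, `features`, and `predict` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical training data: (genre, description) pairs.
TRAINING = [
    ("Simulation", "city building game with zoning and road planning"),
    ("Simulation", "manage public services and transportation of a city"),
    ("Action", "fast shooter game with enemies and weapons"),
    ("Strategy", "turn based game about planning armies"),
]


def features(text):
    """Split text into lowercase, whitespace-separated word features."""
    return text.lower().split()


def predict(description):
    """Return an estimate of P(genre | description) for every genre."""
    genre_counts = Counter(genre for genre, _ in TRAINING)
    word_counts = {genre: Counter() for genre in genre_counts}
    for genre, text in TRAINING:
        word_counts[genre].update(features(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}

    log_scores = {}
    for genre, n_docs in genre_counts.items():
        # log P(H): prior estimated from how often the genre occurs.
        score = math.log(n_docs / len(TRAINING))
        total = sum(word_counts[genre].values())
        for word in features(description):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # log P(feature | H), with add-one smoothing.
            count = word_counts[genre][word]
            score += math.log((count + 1) / (total + len(vocabulary)))
        log_scores[genre] = score

    # Normalizing over all genres plays the role of dividing by P(E).
    top = max(log_scores.values())
    weights = {g: math.exp(s - top) for g, s in log_scores.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


probabilities = predict(
    "city building game with road placement and public transportation"
)
best_genre = max(probabilities, key=probabilities.get)
print(best_genre)
```

Multiplying per-feature likelihoods assumes the features are independent given the genre, which is the usual simplification in naive Bayes and may differ from what Aito does internally.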

The terms of Bayesian inference map to an Aito query as follows:

EVIDENCE   => PROPOSITION
HYPOTHESIS => Target of an operation

For instance, we can ask Aito to solve the genre prediction problem with the Match API, assuming that we have a game_data table with the fields description and genre:

{
  "from": "game_data",
  "where": {
    "description": {"$match": "Cities: Skylines is a city-building game developed by Colossal Order and published by Paradox Interactive. Players engage in urban planning by controlling zoning, road placement, taxation, public services, and public transportation of an area"}
  },
  "match": "genre"
}
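A query body like this can also be assembled programmatically. The sketch below only builds and serializes the JSON; the helper name `build_match_query` is hypothetical, and how the body is sent to Aito (endpoint and authentication) is deliberately left out because it is not covered in this glossary.

```python
import json


def build_match_query(table, description, target):
    """Build a match query body: evidence in "where", hypothesis in "match"."""
    return {
        "from": table,
        # EVIDENCE => PROPOSITION
        "where": {"description": {"$match": description}},
        # HYPOTHESIS => target of the operation
        "match": target,
    }


query = build_match_query(
    "game_data",
    "Cities: Skylines is a city-building game",
    "genre",
)
body = json.dumps(query, indent=2)
print(body)
```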

## Feature

To analyze the data better, Aito splits fields into features under the hood. How the featurization is done depends on the field type defined in the database schema. For example, the Text type supports an "analyzer" option, which lets you control how a text field is split into features.

Some queries, for example Relate, return the features instead of the actual values of the field.

### Textual feature splitting

• If defined as String, the textual data is kept as is and is not featurized. The whole value counts as a single feature.
• If defined as Text with "analyzer": "Whitespace":
  • The text is split into features by white space.
  • "aito database" -> 2 features: "aito", "database"
• If defined as Text with "analyzer": "English":
  • The text is analyzed into English word stems.
  • "aito database" -> 2 features: "aito", "databas"
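The three cases above can be imitated in a few lines. This is only an illustration: the function names are hypothetical, and `toy_english_features` is a deliberately crude stand-in for a real English stemmer, written just to reproduce the "databas" example rather than to mirror Aito's analyzer.

```python
def string_features(value):
    """String type: the whole value is a single feature."""
    return [value]


def whitespace_features(value):
    """Text type with the Whitespace analyzer: split on white space."""
    return value.split()


def toy_english_features(value):
    """Crude stand-in for an English stemmer, NOT Aito's analyzer.

    Lowercases each word and strips one trailing "e" or "s" from words
    longer than three characters.
    """
    stems = []
    for word in value.lower().split():
        if len(word) > 3 and word[-1] in ("e", "s"):
            word = word[:-1]
        stems.append(word)
    return stems


print(string_features("aito database"))       # ['aito database']
print(whitespace_features("aito database"))   # ['aito', 'database']
print(toy_english_features("aito database"))  # ['aito', 'databas']
```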

## Lift

Lift is a ratio that measures how much a feature enhances or diminishes a response, compared against the average for the population. For example, if a population has an average lung cancer risk of 1%, but the people in that population who smoke have a risk of 50%, then smoking has a lift of 50 (50/1). In other words, smokers in that population have a 50 times higher risk of lung cancer than the average. For a feature A and a response B, the lift is calculated by the following formula, which is equivalent to dividing P(B|A), the response rate given the feature, by P(B), the average response rate:

$lift = \frac{P(A\cap B)}{P(A) * P(B)}$
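The smoking example can be checked against this formula. The share of smokers is not given in the text, so the value of `p_smoker` below is an assumption; it cancels out of the result, so any consistent value would give the same lift.

```python
from fractions import Fraction


def lift(p_a_and_b, p_a, p_b):
    """lift = P(A and B) / (P(A) * P(B))."""
    return p_a_and_b / (p_a * p_b)


p_cancer = Fraction(1, 100)                # P(B): average risk, 1%
p_cancer_given_smoker = Fraction(50, 100)  # P(B|A): risk among smokers, 50%
p_smoker = Fraction(1, 100)                # P(A): assumed, not given in the text

# P(A and B) = P(B|A) * P(A)
p_both = p_cancer_given_smoker * p_smoker

smoking_lift = lift(p_both, p_smoker, p_cancer)
print(smoking_lift)  # 50
```

A lift above 1 means the feature makes the response more likely than average, a lift below 1 means less likely, and a lift of exactly 1 means the two are independent.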

## TF-IDF

Aito uses term frequency-inverse document frequency (tf-idf) for scoring. In short, tf-idf is a numerical statistic that combines term frequency, the number of times a term occurs in a document, with inverse document frequency, a measure of how much information the term provides.

Note: This is different from typo and synonym suggestion. We are planning to add these features to the similarity API soon.
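The tf-idf definition can be sketched as follows. The snippet uses one common variant (raw term count multiplied by the logarithm of the corpus size divided by the document frequency); the exact weighting Aito applies is not stated here, so treat the formula, the `tf_idf` helper, and the tiny corpus as assumptions.

```python
import math


def tf_idf(term, document, corpus):
    """Score `term` in `document` (a list of words) against `corpus`."""
    tf = document.count(term)                     # term frequency
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf


# Hypothetical corpus of three tokenized documents.
corpus = [
    "aito is a predictive database".split(),
    "a database stores data".split(),
    "a game about city building".split(),
]

common = tf_idf("a", corpus[0], corpus)           # occurs in every document
shared = tf_idf("database", corpus[0], corpus)    # occurs in two documents
rare = tf_idf("aito", corpus[0], corpus)          # occurs in one document
print(common, shared, rare)
```

A term that appears in every document scores zero because it carries no information for telling documents apart, while rarer terms score higher.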
