Number support

October 7, 2019

Using and predicting numeric values

Aito can store both integers and floating-point numbers. See the database schema documentation for more information.

Aito also lets you query number ranges as in a normal database, or use these ranges as part of a predictive query. See the documentation on the query language's comparison operations for more information.

Still, continuous numbers may need additional treatment to work properly in Aito's predictive queries. The reason is that Aito works statistically, and very large numbers like 32429122 or decimals like 3.234912 can be statistically extremely rare. If you use such a number as a proposition, Aito is unlikely to find enough samples with that exact value to provide meaningful predictions.

At the same time, Aito cannot assume that, for example, 32429122 and 32429123 behave similarly, because the numbers may be identifiers. Worse, if the numbers are user identifiers, treating 32429122 and 32429123 as similar may lead to privacy issues: the preferences of user 32429122 could end up being used to personalize results for user 32429123.

$numeric

Aito provides a special $numeric proposition to support continuous numbers. The $numeric proposition signals to Aito that a numeric value behaves like a continuous number rather than like a categorical value or an identifier.

The $numeric proposition automatically forms a bin around the number in order to find more samples with similar statistical properties. For example, given the number 3.234912, Aito might find the numeric range 3.1-3.4, which contains 25 samples, and use that range in inference.
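The idea of forming a bin around a value can be illustrated with a small Python sketch. This is not Aito's actual binning algorithm, which the text above does not describe; it is a naive stand-in that widens a symmetric range around the target until enough samples fall inside. The function name widen_until_enough, the step size and the synthetic data are all assumptions made for illustration.

```python
import random

def widen_until_enough(values, target, min_samples, step=0.05):
    """Widen a symmetric range around `target` until it holds `min_samples` values.

    Hypothetical helper for illustration only; not how Aito bins internally.
    Returns (low, high, number_of_samples_in_range).
    """
    if len(values) < min_samples:
        raise ValueError("not enough values to ever reach min_samples")
    half_width = 0.0
    while True:
        low, high = target - half_width, target + half_width
        count = sum(1 for v in values if low <= v <= high)
        if count >= min_samples:
            return low, high, count
        half_width += step

# Synthetic data: 1000 continuous values between 0 and 10 (an assumption).
rng = random.Random(0)
values = [rng.uniform(0, 10) for _ in range(1000)]

# An exact match on a continuous value is very unlikely to exist...
exact_matches = values.count(3.234912)

# ...but a range around it collects enough samples to reason about.
low, high, count = widen_until_enough(values, 3.234912, min_samples=25)
print(f"exact matches: {exact_matches}, range {low:.2f}-{high:.2f} has {count} samples")
```

The point of the sketch is only the contrast between the exact-match count and the range count: the range gives the statistics something to work with, while the exact value does not.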

Here's an example of using $numeric in the predict query:

{
  "from": "products",
  "where": {
    "price": { "$numeric" : 3.5 }
  },
  "predict": "tags",
  "exclusiveness" : false,
  "select": ["feature", "$p", "$why"]
}

Manual binning

The $numeric proposition can only bin a single given value; it cannot bin all the values of a field. This means that it cannot be used for the predicted, matched, recommended or related field.

The $numeric binning is also done at query time, which has performance implications: a $numeric query is slower than a normal value query.

Both issues can be avoided by binning the numbers manually when writing the data. For example, if you have a decimal field called decimalA, you could create a field decimalA_bin. If the decimal has a uniform distribution between 0 and 1000 and you have 10000 samples, you could divide the range into 100 bins of width 10. This would give you the buckets [0, 10), [10, 20), [20, 30) and so forth, so that a value on a boundary belongs to exactly one bucket. With 100 bins, each bin would contain approximately 100 items.

After you have chosen the bins, create a decimalA_bin string field in the schema. When writing the data, in addition to writing 13.435 in the decimalA field, you would write "10-20" in the decimalA_bin field. When querying the data, you would use "decimalA_bin":"10-20" instead of the value "decimalA":12.4664.
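As a minimal sketch of this write-time step, the following Python helper computes the bin label for a value, assuming fixed-width bins of width 10 and the "low-high" label format used above. The helper name bin_label and the choice to put a boundary value such as 10 into the upper bin ("10-20") are assumptions, not part of Aito.

```python
import math

def bin_label(value, width=10):
    """Return a 'low-high' label for the fixed-width bin containing `value`.

    Hypothetical helper: bins are half-open, so 10 falls into '10-20'.
    `width` is expected to be a positive integer.
    """
    low = math.floor(value / width) * width
    return f"{low}-{low + width}"

# Add the binned field next to the raw value before writing the row.
row = {"decimalA": 13.435}
row["decimalA_bin"] = bin_label(row["decimalA"])
print(row)

# At query time, compute the label for the value of interest the same way.
query_where = {"decimalA_bin": bin_label(12.4664)}
print(query_where)
```

Using the same function when writing and when querying keeps the labels consistent, which matters because the binned field is matched as a plain string.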

As a rule of thumb for choosing the bins manually: if you have N items, bin them so that you end up with roughly √N bins, each containing roughly √N items.
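One way to apply this rule when the data is not uniformly distributed is to pick bin edges from the sorted values, so that each bin receives about the same number of items. The Python sketch below does that; the function name sqrt_rule_edges and the equal-frequency approach are assumptions chosen for illustration, and other binning schemes satisfy the same rule.

```python
import math

def sqrt_rule_edges(values):
    """Return edges for roughly sqrt(N) bins holding roughly sqrt(N) items each.

    Hypothetical helper: edges are taken from the sorted data, so the bins
    are equal-frequency rather than equal-width.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    n_bins = max(1, round(math.sqrt(n)))
    # The lower edge of bin i is the value at position i * n / n_bins.
    edges = [ordered[(i * n) // n_bins] for i in range(n_bins)]
    edges.append(ordered[-1])  # upper edge of the last bin
    return edges

# 10000 items lead to 100 bins, which means 101 edges.
edges = sqrt_rule_edges(list(range(10000)))
print(len(edges) - 1, "bins; first edges:", edges[:3])
```

If the data contains many repeated values, neighbouring edges can coincide; in that case the duplicate edges would need to be merged before labelling the bins.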

See the Wikipedia article on data binning to read more about the topic.
