How we set up the data pipeline from dirty CSV to Aito using Docker

#dataengineering

#datacleaning

#pipeline

#docker

Photo by Hugo Marin

Tommi Holmgren

CPO

July 30, 2021 • 1 min read

Most of the customers we work with tend to start their journey with a proof of value/concept (lets just call them all PoC) project. It certainly makes a lot of sense, as one of the most critical tasks of the PoC project is to understand the state of the the available data, and test wether it will be likely to bring good results with the machine learning predictions.

For us, it often means s**tloads of data cleaning and wrangling. We have chose to do that work for our customers as part of the paid PoC projects, simply to ensure rapid progress and minimum wasted time. But boy this tends to take time and frustrate us - as we really can't spend endless hours with data janitoring.

After a couple of projects where we spent too many hours with invoice and accounting data exports from the client's SAP, we realised we need to improve our efficiency. We wanted to:

be able to make on edit, and run the entire pipeline again without hassle,
have something that is easily version controllable, without risking the leakage of customer data to repositories,
move most of work to a database so that we can use a commonly known query language to get datasets that are good for Aito,
get rid of endless jupyter/colab notebooks that are difficult to version control and distribute and,
create something that we can share with our customers so that they can easily replicate the cleaning steps later on, when developing production-grade data pipelines.

We chose to build this on Docker and Postgres. When the container starts, it creates the database and runs the scripts to ingest data from CSV files, clean what is needed, and then runs scripts to upload data in to Aito. While developing, we can start the container at any time, and have the database running and make the queries against the database to test different ideas and versions. Super simple. Lets have a detailed look at how we set it up!

Docker container configuration

First step is here. TODO.

version: "3"

services:
  postgres:
    image: postgres:13
    restart: always
    container_name: aito-customer-data-postgres
    environment:
      # Will create the specified user with
      # superuser power and a database with the same name.
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    volumes:
      - type: volume
        # Volume must be present in the 'volumes'-block in this file
        source: postgres
        target: /var/lib/postgresql
      - type: bind
        source: ./tools/docker-postgres
        target: /docker-entrypoint-initdb.d
      - type: bind
        source: ./customer-data
        target: /customer-data

    ports:
        # host_port:container_port
        - "5432:5432"

volumes:
    postgres:

Second step

CREATE USER aito WITH PASSWORD 'aito';

Third step

CREATE DATABASE customer_data lc_collate 'en_US.UTF-8' lc_ctype 'en_US.UTF-8' encoding 'UTF8' template template0;
GRANT ALL PRIVILEGES ON DATABASE customer_data to aito;

Fourth step

#!/bin/bash
set -e
set -x

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "customer_data" <<-EOSQL
     CREATE EXTENSION IF NOT EXISTS pgcrypto;
EOSQL