Akimbo

The akimbo project provides a dataframe accessor for several backends, enabling analysis of nested, non-tabular data in dataframe workflows. This is much faster and more memory-efficient than iterating over Python dicts and lists, which quickly becomes infeasible for big data.

When you import akimbo, a new .ak accessor appears on your dataframes, allowing fast vectorized processing of “awkward” data (nested structures and variable-length ragged lists) held in columns. For example, with pandas:

# adds accessor for all pandas dataframes and series
import akimbo.pandas
df.ak

Features

Multi-library support

Currently, we support the following dataframe libraries with identical syntax:

  • pandas

  • dask.dataframe

  • polars (eager and lazy, but not GPU)

  • cuDF

  • ray dataset

  • pyspark

  • duckDB

numpy-like API

Use numpy-style indexing to slice and access data deep in nested structures.

Example: choose every second inner element in a list-of-lists

series.ak[:, ::2]
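To make the semantics of that slice concrete, here is the same selection written as a plain-Python loop. This is only a sketch of the result: `rows` is a hypothetical list-of-lists, and the accessor itself works on the columnar data rather than iterating in Python.

```python
# Plain-Python reference for what `series.ak[:, ::2]` selects.
# `rows` is a hypothetical ragged list-of-lists, one inner list per row.
rows = [[1, 2, 3, 4], [], [5, 6, 7]]

# ":" keeps every row; "::2" keeps every second element within each row.
every_second = [inner[::2] for inner in rows]

print(every_second)  # [[1, 3], [], [5, 7]]
```

Empty rows stay in place as empty lists, so the output has one entry per input row.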

Any function, ufunc or aggregation at any level

Manipulate numerics at deeper levels of your nested structures or ragged arrays while maintaining the original layout:

series.ak.abs()  # absolute for all numerical values
series.ak.sum(axis=3)  # sum over deeply nested level
np.sin(series.ak)  # use a ufunc, which applies to any nested numerical data
series.ak + 1  # numpy-like broadcasting into deeper levels
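As a rough mental model for "maintaining the original layout", the sketch below reproduces these operations in plain Python on a small ragged list. The helper `map_leaves` and the data in `rows` are hypothetical and exist only for illustration; the accessor performs the equivalent work in vectorized form without Python-level loops.

```python
import math

# Hypothetical ragged data: one variable-length list of numbers per row.
rows = [[1.0, -2.0, 3.0], [], [-4.0, 5.0]]


def map_leaves(func, data):
    """Apply func to every number while keeping the nesting unchanged."""
    if isinstance(data, list):
        return [map_leaves(func, item) for item in data]
    return func(data)


absolute = map_leaves(abs, rows)              # compare: series.ak.abs()
plus_one = map_leaves(lambda v: v + 1, rows)  # compare: series.ak + 1
sines = map_leaves(math.sin, rows)            # compare: np.sin(series.ak)

# A reduction over the inner axis removes that level: one number per row.
row_sums = [sum(inner) for inner in rows]     # compare: series.ak.sum(axis=1)

print(absolute)  # [[1.0, 2.0, 3.0], [], [4.0, 5.0]]
print(plus_one)  # [[2.0, -1.0, 4.0], [], [-3.0, 6.0]]
print(row_sums)  # [2.0, 0, 1.0]
```

Element-wise operations return data with the same shape as the input, while an aggregation along an axis collapses only that level.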

For power users, the ak.transform and ak.apply methods give close control over which part of the data is affected and how.

CPU/GPU numba support

Pass nested functions to numba for compiled-speed computation over your data when you need an algorithm more complex than can easily be written with the numpy-like API. This can also be used for aggregations in groupby/window operations. If your data is on the GPU, you can use numba-cuda with slight modifications to your original function.

import numba


@numba.njit
def sum_list_of_lists(x):
    # x is a list-of-lists; visit every inner element and add it up
    total = 0
    for x0 in x:
        for x1 in x0:
            total += x1
    return total


series.ak.apply(sum_list_of_lists)

Object Behaviours

Where your struct has a higher-level concept associated with it, that is, where the fields have a logical relationship with each other, you can define a class to encode these behaviours as methods. For instance, you can describe that an array of (x, y, z) is in fact a set of points in 3D space. The methods you define will appear on the .ak accessor or can be used for ufunc and operator overloads.
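The idea can be illustrated with an ordinary Python class for a single (x, y, z) record. The `Point3D` class and its methods below are hypothetical and do not use akimbo's behaviour registration, which is not shown here; they only sketch the kind of method and operator overload a behaviour class would expose over a whole column at once.

```python
import math
from dataclasses import dataclass


@dataclass
class Point3D:
    """Hypothetical record type: three fields that together mean a 3D point."""
    x: float
    y: float
    z: float

    def magnitude(self):
        # A method that only makes sense because x, y, z belong together.
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)

    def __add__(self, other):
        # Operator overload: adding two points adds them field by field.
        return Point3D(self.x + other.x, self.y + other.y, self.z + other.z)


p = Point3D(1.0, 2.0, 2.0)
print(p.magnitude())               # 3.0
print(p + Point3D(1.0, 0.0, 0.0))  # Point3D(x=2.0, y=2.0, z=2.0)
```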

Sub-accessors

As an alternative to the object-oriented behaviours, developers may create accessor namespaces that appear under .ak, similar to the built-in .ak.str (string ops) and .ak.dt (datetime ops) that are already included. Such sub-accessors provide methods that can be mapped over specific data types.

You can apply string and datetime operations to ragged/nested arrays of values, and they will only affect the appropriate parts of the structure without changing the layout.

series.ak.str.upper()  # change all strings to upper case throughout the data
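The "only affect the appropriate parts" rule can be pictured with a plain-Python walk over nested records. This is a sketch for intuition only: `upper_strings` and the sample `rows` are hypothetical, and the accessor applies the operation in vectorized form rather than recursing in Python.

```python
def upper_strings(data):
    """Upper-case every string value; leave other types and the layout alone."""
    if isinstance(data, str):
        return data.upper()
    if isinstance(data, list):
        return [upper_strings(item) for item in data]
    if isinstance(data, dict):
        # Field names are part of the structure, so only the values change.
        return {key: upper_strings(value) for key, value in data.items()}
    return data


# Hypothetical nested rows mixing strings, ragged lists of strings and numbers.
rows = [
    {"name": "ada", "tags": ["x", "yz"], "n": 1},
    {"name": "bob", "tags": [], "n": 2},
]

result = upper_strings(rows)
print(result[0])  # {'name': 'ADA', 'tags': ['X', 'YZ'], 'n': 1}
```

Numeric fields and empty lists pass through untouched, so the output has the same shape as the input.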

One experimental proof-of-concept is akimbo-ip, which provides fast vectorised manipulations of IPv4/6 addresses and networks. By using it through the akimbo system, you can apply these methods to ragged/nested dataframes. We may consider other domain-specific functionality appropriate for nested/variable-length data structures, such as spatial operations on polygons.

API Reference