Akimbo
The akimbo project provides a dataframe accessor for various backends, enabling analysis of nested, non-tabular data in dataframe workflows. This is much faster and more memory-efficient than iterating over Python dicts/lists, which quickly becomes infeasible for big data.
When you import akimbo, a new .ak accessor will appear on your
dataframes, allowing the fast vectorized processing of “awkward” data
(nested structures and variable-length ragged lists) held in columns, e.g.,
for pandas:
# adds accessor for all pandas dataframes and series
import akimbo.pandas
df.ak
Features
Multi-library support
Currently, we support the following dataframe libraries with identical syntax:
pandas
dask.dataframe
polars (eager and lazy, but not GPU)
cuDF
ray dataset
pyspark
duckDB
numpy-like API
For slicing and accessing data deep in nested structures.
Example: choose every second inner element in a list-of-lists
series.ak[:, ::2]
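To make the semantics of that slice concrete, here is a minimal plain-Python sketch of what "every second inner element" means for a ragged list-of-lists. It deliberately uses the slow, loop-based approach that the accessor is meant to replace, and the sample data is made up for illustration.

```python
# Illustrative ragged data: rows have different lengths, one is empty.
list_of_lists = [[1, 2, 3, 4], [], [5, 6, 7]]

# Keep every row (the ":" part) and take every second element
# inside each row (the "::2" part).
every_second = [inner[::2] for inner in list_of_lists]

print(every_second)  # [[1, 3], [], [5, 7]]
```

The outer structure is untouched: there are still three rows, and the empty row stays empty.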
Any function, ufunc or aggregation at any level
For manipulating numerics at deeper levels of your nested structures or ragged arrays while maintaining the original layout:
series.ak.abs() # absolute for all numerical values
series.ak.sum(axis=3) # sum over deeply nested level
np.sin(series.ak) # use a ufunc, which applies to any nested numerical data
series.ak + 1 # numpy-like broadcasting into deeper levels
For power users, the ak.transform and ak.apply methods give close control over
which part of the data is affected and how.
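As a rough mental model of "maintaining the original layout", the following plain-Python sketch applies a function to every number in a ragged structure and sums over the innermost level. The helper name `map_leaves` is hypothetical and is not part of akimbo; the accessor does the equivalent work vectorized rather than with Python recursion.

```python
def map_leaves(data, fn):
    """Apply fn to every non-list value, keeping the nesting unchanged."""
    if isinstance(data, list):
        return [map_leaves(item, fn) for item in data]
    return fn(data)

ragged = [[1, -2, 3], [], [-4, 5]]

absolute = map_leaves(ragged, abs)               # like an elementwise abs
plus_one = map_leaves(ragged, lambda v: v + 1)   # like broadcasting "+ 1"
inner_sums = [sum(row) for row in ragged]        # sum over the inner level

print(absolute)    # [[1, 2, 3], [], [4, 5]]
print(plus_one)    # [[2, -1, 4], [], [-3, 6]]
print(inner_sums)  # [2, 0, 1]
```

Elementwise operations return the same shape they were given, while an aggregation removes exactly one level of nesting.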
CPU/GPU numba support
Pass functions that iterate over your nested data to numba for compiled-speed computation when you need an algorithm more complex than can easily be written with the numpy-like API. This can also be used for aggregations in groupby/window operations. If your data is on the GPU, you can use numba-cuda with slight modifications to your original function.
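import numba

@numba.njit
def sum_list_of_lists(x):
    # accumulate every element of a list-of-lists
    total = 0
    for x0 in x:
        for x1 in x0:
            total += x1
    return total

# run the compiled function over the series' nested data
series.ak.apply(sum_list_of_lists)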
Object Behaviours
Where your struct has a higher-level concept associated with it, that is,
the fields have a logical relationship with each other, you can define a
class to encode these behaviours as methods. For instance, you can
describe that an array of (x, y, z) is in fact a set of points in
3D space. The methods you define will appear on the .ak accessor or
can be used for ufunc and operator overloads.
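The idea of attaching behaviour to a record type can be sketched in plain Python as below. This is only a conceptual illustration: `Point3D`, `magnitude` and the sample rows are hypothetical, and the code does not use akimbo's own mechanism for registering behaviours, which is not shown in this section.

```python
import math
from dataclasses import dataclass

@dataclass
class Point3D:
    """Hypothetical record type giving meaning to the (x, y, z) fields."""
    x: float
    y: float
    z: float

    def magnitude(self) -> float:
        # A method that only makes sense because the fields form a point.
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)

    def __add__(self, other: "Point3D") -> "Point3D":
        # Operator overload: adding two points adds them fieldwise.
        return Point3D(self.x + other.x, self.y + other.y, self.z + other.z)

# A ragged "column": each row holds a variable number of points.
rows = [[Point3D(3.0, 4.0, 0.0)], [], [Point3D(1.0, 2.0, 2.0), Point3D(0.0, 0.0, 1.0)]]

magnitudes = [[p.magnitude() for p in row] for row in rows]
shifted = [[p + Point3D(1.0, 0.0, 0.0) for p in row] for row in rows]

print(magnitudes)  # [[5.0], [], [3.0, 1.0]]
```

With behaviours defined on the data type, the same method and operator would be applied across the whole column at once rather than point by point.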
Sub-accessors
As an alternative to the object-oriented behaviours, developers may create
accessor namespaces that appear under .ak, similar to the built-in
.ak.str (string ops) and .ak.dt (datetime ops).
Such sub-accessors provide methods that can be mapped over specific
data types.
You can apply string and datetime operations to ragged/nested arrays of values, and they will only affect the appropriate parts of the structure without changing the layout.
series.ak.str.upper() # change all strings to upper case throughout the data
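What "only affect the appropriate parts of the structure" means can be pictured with a plain-Python sketch that upper-cases strings wherever they occur and leaves everything else alone. The helper `upper_strings` and the sample records are hypothetical, and this recursion stands in for what the sub-accessor does in a vectorized way.

```python
def upper_strings(data):
    """Upper-case every string value; keep numbers and nesting as they are."""
    if isinstance(data, str):
        return data.upper()
    if isinstance(data, list):
        return [upper_strings(item) for item in data]
    if isinstance(data, dict):
        # Field names are part of the layout, so only values are visited.
        return {key: upper_strings(value) for key, value in data.items()}
    return data

records = [
    {"name": "ada", "tags": ["x", "yz"], "count": 1},
    {"name": "bob", "tags": [], "count": 2},
]

result = upper_strings(records)
print(result[0])  # {'name': 'ADA', 'tags': ['X', 'YZ'], 'count': 1}
```

The integer field and the empty list pass through unchanged, so the result has exactly the same layout as the input.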
One experimental proof-of-concept is akimbo-ip, which provides fast vectorised
manipulations of IPv4/6 addresses and networks; by using it through
the akimbo system, you can apply these methods to ragged/nested dataframes.
We may consider other domain-specific functionality appropriate for
nested/variable-length data structures, such as spatial operations on polygons.