How it works
All of the dataframe libraries we support can use Arrow for internal data storage, at least optionally. Apache Arrow is designed for columnar in-memory representation of tabular data (i.e., dataframes), and also offers much additional functionality, such as interoperability with the Parquet storage format.
While Arrow itself and most dataframe-oriented libraries focus on simple, SQL-like columns with types such as “number” and “text”, Arrow can also store nested records and variable-length lists while keeping compact storage in arrays of values. Since the values are in contiguous arrays, data processing can be done with high-performance vectorised operations.
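To make the idea of compact storage for variable-length lists concrete, here is a minimal sketch in plain Python. It mimics the general offsets-plus-values approach used for list columns; the function names `to_columnar` and `row_sums` are made up for illustration and are not part of Arrow, awkward or akimbo.

```python
# Variable-length lists stored column-wise: one flat values array plus offsets.
# Plain Python lists stand in for the contiguous buffers a real library uses.

def to_columnar(rows):
    """Flatten a list of lists into (offsets, values)."""
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end position of this row
    return offsets, values


def row_sums(offsets, values):
    """Sum each row by slicing the flat values array between offsets."""
    return [sum(values[start:stop]) for start, stop in zip(offsets, offsets[1:])]


rows = [[1, 2, 3], [], [4, 5]]
offsets, values = to_columnar(rows)
print(offsets)                    # [0, 3, 3, 5]
print(values)                     # [1, 2, 3, 4, 5]
print(row_sums(offsets, values))  # [6, 0, 9]
```

All values live in a single flat array regardless of how ragged the rows are, which is what makes vectorised processing possible.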
Awkward Array is a library for vectorised processing of data containing
nested and variable-length structures, with an API familiar to users
of NumPy. It can interoperate with Arrow in-memory data.
The purpose of this library is to bring the efficient processing of Awkward Array to complex data types in dataframes. Since Arrow is a common denominator for many analysis platforms, we are able to offer the same API and workflows across all of them.
Where there is no accessor registration mechanism, akimbo uses
“patching” to attach the .ak accessor to the target dataframe
objects. Patching only happens when you import the akimbo submodule for
a given dataframe library, so other dataframe libraries are neither
imported nor affected by akimbo.
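As a rough illustration of what “patching” means here, the sketch below attaches an accessor to a class after the fact. `Frame`, `Accessor` and `patch` are hypothetical stand-ins, not akimbo's actual implementation; in akimbo the equivalent step is triggered by importing the relevant submodule rather than by calling a function yourself.

```python
class Frame:
    """Hypothetical stand-in for a third-party dataframe class."""

    def __init__(self, data):
        self.data = data


class Accessor:
    """Hypothetical accessor that wraps the object it is attached to."""

    def __init__(self, obj):
        self._obj = obj

    def lengths(self):
        return [len(row) for row in self._obj.data]


def patch(cls, name="ak"):
    """Attach the accessor as a property, so every instance gains `.ak`."""
    setattr(cls, name, property(Accessor))


had_accessor_before = hasattr(Frame, "ak")  # False: nothing attached yet
patch(Frame)
frame = Frame([[1, 2], [3]])
print(had_accessor_before)  # False
print(frame.ak.lengths())   # [2, 1]
```

Only the class passed to `patch` is modified, which mirrors the point that dataframe libraries you do not import are left alone.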
Lazy versus Eager
The dataframe implementations fall into two categories. The “eager” ones
create new dataframes on every operation as soon as the operation is
encountered. For example, df["a"] + 1 in pandas results in new values
being stored in memory immediately. For other libraries, each operation is instead
stored in a pipeline, and only executed when the user has reached the output
they want to see. For instance, PySpark does not even allow the above
syntax, and df.select(col("a") + 1) does not cause any work to be
done by itself. This second type, “lazy” dataframes, allows for more optimization,
since the whole workflow is known by the time compute is requested.
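The difference between the two categories can be sketched with two toy classes. `EagerColumn` and `LazyColumn` are invented for this illustration and do not correspond to any real dataframe library's API.

```python
class EagerColumn:
    """Computes new values as soon as an operation is applied."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        return EagerColumn(v + other for v in self.values)


class LazyColumn:
    """Records operations and runs them only when collect() is called."""

    def __init__(self, values, steps=()):
        self._values = values
        self._steps = tuple(steps)

    def __add__(self, other):
        # Nothing is computed here; the step is only appended to the pipeline.
        return LazyColumn(self._values, self._steps + (lambda v: v + other,))

    def collect(self):
        out = list(self._values)
        for step in self._steps:
            out = [step(v) for v in out]
        return out


eager = EagerColumn([1, 2, 3]) + 1
print(eager.values)       # [2, 3, 4]

lazy = LazyColumn([1, 2, 3]) + 1 + 10
print(len(lazy._steps))   # 2 recorded steps, no work done yet
print(lazy.collect())     # [12, 13, 14]
```

Because the lazy version holds the full list of steps before running anything, a real engine has the chance to reorder, fuse or skip work.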
From akimbo’s point of view, this means that for lazy dataframes, we need
to infer the data type that a given operation will produce before seeing any
of the data. This is possible in most cases, because the data types
are strongly constrained. However, if you write opaque functions (such as
with Numba), you may wish to catch the case where the input is empty,
and explicitly return data of the shape you expect to get with real data.
Overall, this should allow the dataframe library to perform
all the optimizations it can, but it is always good practice to select only
the columns and data partitions you expect to need anyway.
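A minimal sketch of guarding an opaque function against empty input follows. It uses plain Python lists rather than awkward arrays, and `scale_by_max` is a hypothetical function; with real arrays you would return an empty array of the expected type rather than an empty list.

```python
def scale_by_max(rows):
    """Opaque function: divide every value by the largest value overall.

    Assumes the maximum, when there is one, is non-zero.
    """
    flat = [v for row in rows for v in row]
    if not flat:
        # No values to inspect (for example when the function is probed with
        # empty input): return the expected structure instead of letting
        # max() fail on an empty sequence.
        return [[] for _ in rows]
    top = max(flat)
    return [[v / top for v in row] for row in rows]


print(scale_by_max([]))             # []
print(scale_by_max([[1, 2], [4]]))  # [[0.25, 0.5], [1.0]]
```

Without the guard, the call on empty input would raise `ValueError` from `max()`, and a caller that runs the function on empty data to find out what it produces would not get a usable answer.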
Data Round-trips
Whilst the awkward structure of data is compatible with Arrow’s, the metadata
is a little different and there are some edge cases. This means that every
ak operation carries some overhead for converting to awkward and back again.
As a result, akimbo will always be a bit slower at performing tasks that your
dataframe library can already do, and this will be more noticeable for many
operations on small frames than for a few operations on big frames.
In such cases it might be beneficial to use .ak.apply, with or
without Numba, and do many awkward operations in one go per round-trip.
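The cost argument can be illustrated with a toy model that simply counts conversions. Everything here (`to_internal`, `from_internal`, `apply`, the counter) is hypothetical and only models the round-trip pattern; it is not akimbo's `.ak.apply`.

```python
counter = {"conversions": 0}  # counts conversions in either direction


def to_internal(column):
    counter["conversions"] += 1
    return list(column)


def from_internal(data):
    counter["conversions"] += 1
    return tuple(data)


def apply(column, func):
    """One round-trip: convert in, run func, convert back."""
    return from_internal(func(to_internal(column)))


column = (1, 2, 3)

# Three separate operations: three round-trips.
step = apply(column, lambda d: [v + 1 for v in d])
step = apply(step, lambda d: [v * 2 for v in d])
separate = apply(step, lambda d: [v - 3 for v in d])
separate_cost = counter["conversions"]


# The same three operations inside one function: one round-trip.
def combined(d):
    d = [v + 1 for v in d]
    d = [v * 2 for v in d]
    return [v - 3 for v in d]


counter["conversions"] = 0
batched = apply(column, combined)
batched_cost = counter["conversions"]

print(separate, separate_cost)  # (1, 3, 5) 6
print(batched, batched_cost)    # (1, 3, 5) 2
```

The result is identical, but the batched version pays the conversion overhead once instead of once per operation.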