How it works
============

All of the dataframe libraries we cater to use arrow as internal data storage
(at least, optionally). `Apache Arrow`_ is designed for columnar in-memory
representation of tabular data (i.e., dataframes), but has much additional
functionality such as interoperability with the parquet storage format.

.. _Apache Arrow: https://arrow.apache.org/docs/index.html

While arrow itself and most dataframe-oriented libraries consider simple columns
with types like "numer" and "text" - SQL-like - arrow can store nested records and
variable-length lists while keeping compact storage in arrays of values. Since
the values are in contiguous arrays, data processing can be done with
high-performance vectorised operations.

`Awkward Array`_ is a library for vectorised processing of data containing
nested and variable-length structures, with an API familiar to users
of ``numpy``. It can interoperate with arrow in-memory data.

.. _Awkward Array: https://awkward-array.org/doc/main/

So the purpose of this library, is to bring the efficient processing of
awkward to complex data types in dataframes. Since arrow is a common
denominator for many analysis platforms, we are able to offer the same
API and workflows across all of them.

Where there is no accessor registration mechanism, ``akimbo`` uses
"patching" to attach the ``.ak`` accessor to the target dataframe
objects. This always happens by importing the specific submodule for
the given dataframe library, so others will not be imported/affected
by ``akimbo``.

Lazy versus Eager
-----------------

The dataframe implementations fall into two categories. The "eager" ones
create new dataframes on every operation as soon as the operation is
encountered. For example ``df["a"] + 1`` for Pandas results in new values
being stored in memory immediately. For other libraries, each operation is instead
stored in a pipeline, and only executed when the user has reached the output
they want to see. For instance, ``pyspark`` would not even allow the above
syntax, and ``df.select(col("a") + 1)`` would not cause any work to be
done by itself. This second type, "lazy" dataframes, allow for more optimization,
since the whole workflow is known by the tie compute is requested.

From ``akimbo``'s point of view, this means that for lazy dataframes, we need
to infer the data type that a given operation will produce before seeing any
of the data. This is possible _in_most_cases_, because the data types
are strongly constrained. However, if writing opaque functions (such as
with Numba), you may wish to catch the case where the input is empty,
and explicitly return data of the shape you expect to get with real data. In
all, this should allow the dataframe library to successfully perform
all the optimizations it can; but it's always good practice to select only
the columns and data partitions you expect to need anyway.

Data Round-trips
----------------

Whilst the awkward structure of data is compatible with arrow's, the metadata
is a little different and there are some edge cases. This means, that for every
``ak`` operation, there is some overhead for converting to and from. That means,
that ``akimbo`` will always be a bit slower at performing tasks that your
dataframe library can already do, and this will be more noticeable for many
operations on small frames as opposed to few operations on big frames.

It might be beneficial in such cases to use ``.ak.apply`` with or
without Numba, and do many awkward operations in one go per round-trip.