akimbo

Accessor

Backends

akimbo.pandas.PandasAwkwardAccessor(obj[, ...])

Perform awkward operations on pandas data

akimbo.dask.DaskAwkwardAccessor(obj[, ...])

Perform awkward operations on a dask series or frame

akimbo.polars.PolarsAwkwardAccessor(obj[, ...])

Perform awkward operations on a polars series or dataframe

akimbo.cudf.CudfAwkwardAccessor(obj[, ...])

Operations on cuDF dataframes on the GPU.

class akimbo.pandas.PandasAwkwardAccessor(obj, behavior=None, subaccessor=None)

Perform awkward operations on pandas data

Nested structures are handled using arrow as the storage backend. If you use pandas object columns (Python lists, dicts, strings), they will be converted to arrow on any access to a .ak method.

class akimbo.dask.DaskAwkwardAccessor(obj, behavior=None, subaccessor=None)

Perform awkward operations on a dask series or frame

These operations are lazy, as with all dask computations. Because they are implemented as per-partition mapping operations, any action along axis=0 or axis=1 produces results per partition, which you must then combine.

To perform operations that span partitions, we recommend you use the .to_dask_awkward method.

Correct arrow dtypes will be deduced when the input is also arrow, which is now the default for the dask “dataframe.dtype_backend” config option.
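The per-partition behavior described above can be illustrated without dask at all. The following is a minimal pure-Python sketch, not akimbo or dask code: the `partitions` list and the `per_partition_sum` helper are hypothetical stand-ins for a partitioned series and a mapped reduction.

```python
# Conceptual sketch only: plain lists stand in for dask partitions.
partitions = [
    [[1, 2, 3], [], [4, 5]],  # partition 0: three ragged rows
    [[6], [7, 8]],            # partition 1: two ragged rows
]


def per_partition_sum(partition):
    """Reduce a single partition to one number (hypothetical helper)."""
    return sum(sum(row) for row in partition)


# A mapped operation yields one result per partition ...
partial_results = [per_partition_sum(p) for p in partitions]

# ... which the caller must combine to obtain the global answer.
total = sum(partial_results)
print(partial_results, total)
```

The point of the sketch is only that a mapped reduction stops at the partition boundary; the final combining step is left to the caller.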

class akimbo.polars.PolarsAwkwardAccessor(obj, behavior=None, subaccessor=None)

Perform awkward operations on a polars series or dataframe

This accessor supports eager operations only. A lazy version may be added in the future.

class akimbo.cudf.CudfAwkwardAccessor(obj, behavior=None, subaccessor=None)

Operations on cuDF dataframes on the GPU.

Data are kept in GPU memory and use views rather than copies where possible.

Top Level Functions

read_parquet(url[, storage_options, ...])

Read a Parquet dataset with nested data into a Series or DataFrame.

read_json(url[, storage_options, schema, ...])

Read a JSON dataset with nested data into a Series or DataFrame.

read_avro(url[, storage_options, extract, ...])

Read Avro structured data files.

get_parquet_schema(path, *[, ...])

get_json_schema(url[, storage_options, nbytes])

Get JSONSchema representation of the contents of a line-delimited JSON file

get_avro_schema(url[, storage_options])

Fetch the awkward form of the schema defined in the given Avro file.

Extensions

The following properties appear on the .ak accessor for data-type specific functions, mapped onto the structure of the column/frame being acted on. Check the dir() of each (or use tab-completion) to see the operations available.

class akimbo.datetimes.DatetimeAccessor

class akimbo.strings.StringAccessor

String operations on nested/var-length data

decode(arr, encoding: str = 'utf-8')

Decode Series of bytes to Series of strings. Leaves non-bytestrings alone.

Validity of UTF8 is not checked.

encode(arr, encoding: str = 'utf-8')

Encode Series of strings to Series of bytes. Leaves non-strings alone.
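The "leaves other values alone" behavior of decode and encode can be sketched in plain Python. This is an illustration over nested lists, not akimbo's implementation: `decode_nested` and `encode_nested` are hypothetical helpers, and unlike the documented decode, Python's own `bytes.decode` does validate UTF-8.

```python
def decode_nested(data, encoding="utf-8"):
    """Decode bytes leaves to str; leave every other value untouched."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    if isinstance(data, list):
        return [decode_nested(item, encoding) for item in data]
    return data


def encode_nested(data, encoding="utf-8"):
    """Encode str leaves to bytes; leave every other value untouched."""
    if isinstance(data, str):
        return data.encode(encoding)
    if isinstance(data, list):
        return [encode_nested(item, encoding) for item in data]
    return data


# Ragged rows mixing bytestrings with values that must pass through unchanged.
rows = [[b"hello", b"world"], [], [b"ragged", 42, None]]
decoded = decode_nested(rows)
round_trip = encode_nested(decoded)
print(decoded)
```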

static join_el(arr, arr2, sep='')

Run vectorized functions on nested/ragged/complex array

where: None | str | Sequence[str, …]

If None, the kernel is applied throughout the nested structure, wherever matching types are encountered. If where is given, only the selected part of the structure is considered, but the output retains the original shape. Either a single field name or a sequence of field names describing the path to descend into the tree is acceptable.

match_kwargs: None | dict

any extra field identifiers for matching a record as OK to process

Underlying function: concat
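The effect of the where argument described above can be sketched in pure Python, with dicts standing in for records and lists for variable-length data. This is a conceptual illustration under those assumptions, not akimbo's code: `apply_kernel` and `_walk` are hypothetical names, and "correct type" is simplified here to "is a str".

```python
def apply_kernel(data, kernel, where=None):
    """Apply `kernel` to str leaves, optionally only under the field path `where`."""
    path = [where] if isinstance(where, str) else list(where or [])
    return _walk(data, kernel, path)


def _walk(data, kernel, path):
    if isinstance(data, list):
        # Lists never consume a path element; recurse into every item.
        return [_walk(item, kernel, path) for item in data]
    if isinstance(data, dict):
        if path:
            # Descend only into the named field; copy other fields unchanged.
            return {
                key: (_walk(value, kernel, path[1:]) if key == path[0] else value)
                for key, value in data.items()
            }
        return {key: _walk(value, kernel, path) for key, value in data.items()}
    if isinstance(data, str) and not path:
        return kernel(data)
    return data  # non-matching leaves pass through untouched


records = [
    {"name": "ada", "tags": ["x", "y"]},
    {"name": "bob", "tags": []},
]
everywhere = apply_kernel(records, str.upper)
tags_only = apply_kernel(records, str.upper, where="tags")
print(everywhere)
print(tags_only)
```

In both calls the result has the same record and list layout as the input; only the leaves selected by where (or all matching leaves when it is None) are changed.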

static repeat(arr, count)

Run vectorized functions on nested/ragged/complex array

where: None | str | Sequence[str, …]

If None, the kernel is applied throughout the nested structure, wherever matching types are encountered. If where is given, only the selected part of the structure is considered, but the output retains the original shape. Either a single field name or a sequence of field names describing the path to descend into the tree is acceptable.

match_kwargs: None | dict

any extra field identifiers for matching a record as OK to process

Underlying function: repeat

static strptime(strings, /, format, unit, error_is_null=False, *, options=None, memory_pool=None)

Run vectorized functions on nested/ragged/complex array

where: None | str | Sequence[str, …]

If None, the kernel is applied throughout the nested structure, wherever matching types are encountered. If where is given, only the selected part of the structure is considered, but the output retains the original shape. Either a single field name or a sequence of field names describing the path to descend into the tree is acceptable.

match_kwargs: None | dict

any extra field identifiers for matching a record as OK to process

–Kernel documentation follows from the original function–

Parse timestamps.

For each string in strings, parse it as a timestamp. The timestamp unit and the expected string pattern must be given in StrptimeOptions. Null inputs emit null. If a non-null string fails parsing, an error is returned by default.

Parameters:
  • strings (Array-like or scalar-like) – Argument to compute function.

  • format (str) – Pattern for parsing input strings as timestamps, such as “%Y/%m/%d”. Note that the semantics of the format follow the C/C++ strptime, not the Python one. There are differences in behavior, for example in how the “%y” placeholder handles years with fewer than four digits.

  • unit (str) – Timestamp unit of the output. Accepted values are “s”, “ms”, “us”, “ns”.

  • error_is_null (boolean, default False) – Return null on parsing errors if true or raise if false.

  • options (pyarrow.compute.StrptimeOptions, optional) – Alternative way of passing options.

  • memory_pool (pyarrow.MemoryPool, optional) – If not passed, will allocate memory from the default memory pool.
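The null and error handling described above can be sketched with the standard library alone. This is an analogue for illustration, not the kernel itself: `parse_timestamps` is a hypothetical helper, it uses Python's `datetime.strptime` (whose format semantics differ from the C/C++ strptime that the kernel follows), and it ignores the unit parameter.

```python
from datetime import datetime


def parse_timestamps(strings, format, error_is_null=False):
    """Parse each string; nulls stay null, failures raise or become null."""
    out = []
    for value in strings:
        if value is None:
            out.append(None)  # null inputs emit null
            continue
        try:
            out.append(datetime.strptime(value, format))
        except ValueError:
            if not error_is_null:
                raise  # default: a failed parse is an error
            out.append(None)  # error_is_null=True: emit null instead
    return out


parsed = parse_timestamps(
    ["2024/01/31", None, "not a date"], "%Y/%m/%d", error_is_null=True
)
print(parsed)
```

With error_is_null left at its default of False, the same input raises on the third element instead of returning a null.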

The cuDF backend provides GPU-specific variants of these accessors: akimbo.cudf.CudfStringAccessor and akimbo.cudf.CudfDatetimeAccessor.