The world of data frames

Author

Marie-Hélène Burle

Let’s talk about data frames, how they came to the world of programming, how pandas had the monopoly for many years in Python, and how things are changing very quickly at the moment.

Tabular data

Many fields of machine learning, data science, and humanities rely on tabular data where:

  • columns hold variables and are homogeneous (same data type)—you can think of them as vectors,
  • rows contain observations and can be heterogeneous.

Early computer options to manipulate such data were limited to spreadsheets (e.g. Microsoft Excel).

Data frames (also written dataframes or DataFrames) are two-dimensional objects that brought tabular data to programming.
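As a rough illustration of this layout, the sketch below stores a tiny table as a dictionary of columns, using only the Python standard library. The column names and values are made up for the example, and real data frame libraries rely on far more efficient typed arrays.

```python
# Hypothetical toy table: each key is a variable (column),
# and each list holds values of a single type.
table = {
    "species": ["cat", "dog", "owl"],   # strings
    "weight_kg": [4.2, 30.5, 1.1],      # floats
    "legs": [4, 4, 2],                  # integers
}


def column_types(tbl):
    """Return the set of Python types found in each column."""
    return {name: {type(v) for v in values} for name, values in tbl.items()}


def row(tbl, i):
    """Return the i-th observation as a tuple (one value per column)."""
    return tuple(values[i] for values in tbl.values())


# Every column is homogeneous: exactly one type per column.
print(column_types(table))

# A row mixes types: here a string, a float, and an integer.
print(row(table, 1))
```

Storing the data by column rather than by row is what lets each variable be treated as a vector of a single type.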

Early history of data frames

After data frames emerged in S, then R, they were added to Python with the library pandas in 2008:

Timeline: S programming language (1990) → R (2000) → pandas (Python, 2008)

After that, pandas remained the Python data frame library for a long time.

Issues with pandas

Wes McKinney—the author of pandas—himself has complaints about pandas:

  • internals too far from the hardware,
  • no support for memory-mapped datasets,
  • poor performance in database and file ingest / export,
  • lack of proper support for missing data,
  • lack of memory use and RAM management transparency,
  • weak support for categorical data,
  • complex groupby operations awkward and slow,
  • appending data to a DataFrame tedious and costly,
  • limited and non-extensible type metadata,
  • eager evaluation model with no query planning,
  • slow and limited multicore algorithms for large datasets.
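To make one of these complaints concrete, here is a small sketch of the missing-data issue, assuming pandas is installed. With the default NumPy-backed types, an integer column that contains a missing value is silently converted to floating point. The nullable `Int64` extension type, added in later versions of pandas, avoids this conversion.

```python
import pandas as pd

# An integer column with one missing value.
default = pd.Series([1, 2, None])

# The default NumPy-backed integer type has no missing-value marker,
# so pandas falls back to float64 and stores NaN.
print(default.dtype)

# The nullable extension type keeps integers and stores pd.NA instead.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
```

The other items in the list (memory mapping, query planning, multicore algorithms) are architectural and do not reduce to a short snippet like this one.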

A rich new field

Over the past few years, there has been an explosion of faster alternatives.

Parallel computing

The Python global interpreter lock (GIL) gets in the way of multi-threading, but several libraries allow the use of Python on multiple cores, notably Ray, Dask, and Apache Spark.

Fugue provides a unified interface for distributed computing that works with all three libraries.

For data frames on multiple cores, Dask and Spark have APIs for pandas, and Modin provides a drop-in replacement for pandas in all three libraries.

Accelerators

RAPIDS brings data frames to GPUs with the cuDF library, and integration with pandas is easy.

Lazy out-of-core

Vaex exists as an alternative to pandas.
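The idea behind lazy, out-of-core processing can be sketched with the standard library alone: rows are streamed from disk and aggregated one at a time, so the whole dataset never has to fit in memory. This is only an illustration of the principle, not the Vaex API; the file and the `value` column name are hypothetical.

```python
import csv
import os
import tempfile


def write_sample_csv(path, n):
    """Write a small CSV with a single hypothetical 'value' column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        for i in range(n):
            writer.writerow([i])


def streamed_mean(path, column):
    """Compute a mean by reading one row at a time from disk."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        # DictReader is lazy: it yields rows as the file is read.
        for record in csv.DictReader(f):
            total += float(record[column])
            count += 1
    return total / count if count else float("nan")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    write_sample_csv(path, 1000)
    mean_value = streamed_mean(path, "value")

print(mean_value)
```

Only two running numbers are kept in memory, whatever the size of the file. Dedicated libraries pair this streaming approach with much faster storage formats and vectorized computation.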

SQL

Structured query language (SQL) handles relational databases, but the distinction between SQL and data frame software is getting increasingly blurry with most libraries now able to handle both.
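As a small illustration of how close the two worlds are, the sketch below runs a grouped aggregation in SQL, the same kind of operation that a data frame library exposes as a group-by. It uses `sqlite3` from the Python standard library as a stand-in so that it runs without extra dependencies; the table and column names are made up.

```python
import sqlite3

# In-memory database with a hypothetical table of observations.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE animals (species TEXT, weight_kg REAL)")
con.executemany(
    "INSERT INTO animals VALUES (?, ?)",
    [("cat", 4.0), ("cat", 5.0), ("dog", 30.0)],
)

# Mean weight per species: the SQL counterpart of a data frame group-by.
rows = con.execute(
    "SELECT species, AVG(weight_kg) FROM animals "
    "GROUP BY species ORDER BY species"
).fetchall()
con.close()

print(rows)
```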

DuckDB is a very fast and popular option with good integration with pandas.

Many additional options exist, such as dbt and the Snowflake Snowpark Python API, although integration with pandas is not always as good.

Polars

Polars makes use of Apache Arrow, a columnar in-memory format that is emerging as the new memory standard.

Most libraries are developing an integration with Polars, which is settling it nicely into the Python ecosystem.

Best data frame strategy

For maximum efficiency, the best strategy currently seems to be:

No matter the scenario, Polars is better than pandas and you should use it instead.