The world of data frames

Author

Marie-Hélène Burle

Let’s talk about data frames: how they came to the world of programming, how Pandas held a monopoly in Python for many years, and how things are changing very quickly at the moment.

Tabular data

Many fields of machine learning, data science, and humanities rely on tabular data where

  • columns hold variables and are homogeneous (same data type),
  • rows contain observations and can be heterogeneous.

Early computer options to manipulate such data were limited to spreadsheets (e.g. Microsoft Excel).

Dataframes (data frames or DataFrames) are two-dimensional objects that brought tabular data to programming.

Early history of dataframes

After data frames emerged in S, then R, they were added to Python with the library Pandas in 2008.

Timeline: S programming language (1990), R (2000), Pandas for Python (2008).

The world was simple … but slow, with high memory usage, and it remained that way for a long time.

Issues with Pandas

While Pandas was the Python data frame library for years and is now at the core of many other libraries, Wes McKinney (the creator of Pandas) himself has voiced complaints about it, mostly:

  • the internals are too far from the hardware,
  • no support for memory-mapped datasets,
  • poor performance in database and file ingest / export,
  • lack of proper support for missing data,
  • lack of memory use and RAM management transparency,
  • weak support for categorical data,
  • groupby operations are complex, awkward, and slow,
  • appending data to a DataFrame is tedious and very costly,
  • limited, non-extensible type metadata,
  • eager evaluation model with no query planning,
  • slow and limited multicore algorithms for large datasets.

A rich new field

After years of Pandas as the Python dataframe library, there is currently an exuberant explosion of faster alternatives.

Parallel computing

The Python global interpreter lock (GIL) gets in the way of multi-threading.

Libraries such as Ray, Dask, and Apache Spark allow the use of multiple cores and bring dataframes to clusters.

Dask and Spark have Pandas-like APIs, and Modin makes this even more trivial by providing a drop-in replacement for Pandas on top of Dask, Spark, and Ray.
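As a rough sketch of what this looks like (assuming Modin and one of its execution engines, e.g. Ray or Dask, are installed; the file and column names are made up), existing Pandas code can often be parallelized by changing a single import:

import modin.pandas as pd   # drop-in replacement for "import pandas as pd"

# the familiar Pandas API, now backed by a distributed engine
df = pd.read_csv("large_file.csv")
df.groupby("some_column").mean()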

fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray.

Accelerators

RAPIDS brings dataframes to the GPU with the cuDF library.

Integration with pandas is easy.
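As a minimal sketch (assuming a machine with an NVIDIA GPU and RAPIDS installed; the toy data is made up), moving a Pandas DataFrame to the GPU and back looks like this:

import pandas as pd
import cudf

df_pd = pd.DataFrame({"a": [1, 1, 2], "b": [0.5, 1.5, 2.5]})

df_gpu = cudf.from_pandas(df_pd)    # copy the data to the GPU
res = df_gpu.groupby("a").sum()     # same Pandas-like API, run on the GPU
res.to_pandas()                     # bring the result back as a Pandas DataFrame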

Lazy out-of-core

Vaex is a lazy, out-of-core alternative to Pandas.

SQL

Structured query language (SQL) handles relational databases, but the distinction between SQL and dataframe software is getting increasingly blurry with most libraries now able to handle both.

DuckDB is a very fast and popular option with good integration with pandas.
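For instance (a small sketch with made-up data), DuckDB can run SQL directly against a Pandas DataFrame that exists in your Python session and return the result as a new DataFrame:

import duckdb
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})

# DuckDB finds the DataFrame "df" by its variable name in the current scope
duckdb.sql("SELECT category, SUM(value) AS total FROM df GROUP BY category").df()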

Many additional options exist, such as dbt and the Snowflake Snowpark Python API, although their integration with Pandas is not always as good.

Polars

And then came Polars.

Apache Arrow has become the new standard for in-memory columnar data, and the library making the most efficient use of it is Polars.

In addition, most libraries are developing an integration with Polars, lodging it nicely in the Python ecosystem.
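For instance, moving data between Pandas and Polars is a one-liner in each direction, which makes it easy to adopt Polars incrementally (a minimal sketch; pyarrow needs to be installed for these conversions):

import pandas as pd
import polars as pl

df_pd = pd.DataFrame({"x": [1, 2, 3]})

df_pl = pl.from_pandas(df_pd)    # Pandas -> Polars
df_pd_back = df_pl.to_pandas()   # Polars -> Pandas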

For maximum dataframe efficiency, Polars currently seems to be the best strategy.

Comparison with Pandas

Overview

                          Pandas             Polars
Available for             Python             Rust, Python, R, NodeJS
Written in                Cython             Rust
Multithreading            Some operations    Yes (GIL released)
Index                     Rows are indexed   Integer positions are used
Evaluation                Eager only         Lazy and eager
Query optimizer           No                 Yes
Out-of-core               No                 Yes
SIMD vectorization        Yes                Yes
Data in memory            With NumPy arrays  With Apache Arrow arrays
Memory efficiency         Poor               Excellent
Handling of missing data  Inconsistent       Consistent, promotes type stability

Performance

You can run these examples on our training cluster in either of two ways:

Method 1

Launching JupyterHub.

Method 2

Logging on to the cluster through SSH and running the following at the command line:

# load the ipykernel module
module load ipykernel/2023b
# activate the shared Python virtual environment
source /project/def-sponsor00/shared/scientificpython-env/bin/activate
# request an interactive allocation, then start IPython
salloc --time=2:00:0 --mem-per-cpu=3600
ipython

Let’s go back to the FizzBuzz problem.

The best method with Pandas used masks. Let’s see how Polars fares in comparison.

First, let’s load the packages we will need:

import pandas as pd
import numpy as np
import polars as pl

And let’s make sure that the code works.

With Pandas:

df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"

df_pd
      number response
0          1        1
1          2        2
2          3     Fizz
3          4        4
4          5     Buzz
...      ...      ...
9995    9996     Fizz
9996    9997     9997
9997    9998     9998
9998    9999     Fizz
9999   10000     Buzz

[10000 rows x 2 columns]

With Polars:

size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("number"))
    .alias("response")
)
shape: (10_000, 2)
┌────────┬──────────┐
│ number ┆ response │
│ ---    ┆ ---      │
│ i64    ┆ str      │
╞════════╪══════════╡
│ 1      ┆ 1        │
│ 2      ┆ 2        │
│ 3      ┆ Fizz     │
│ 4      ┆ 4        │
│ 5      ┆ Buzz     │
│ …      ┆ …        │
│ 9996   ┆ Fizz     │
│ 9997   ┆ 9997     │
│ 9998   ┆ 9998     │
│ 9999   ┆ Fizz     │
│ 10000  ┆ Buzz     │
└────────┴──────────┘

Now, let’s time them.

Pandas:

%%timeit

df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
4.75 ms ± 9.76 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Polars:

%%timeit

size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("number"))
    .alias("response")
)
518 μs ± 580 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That’s a speedup of almost 10 (the longer the series, the larger this speedup will be).

Polars: 1, Pandas: 0 🙂

For a second example, let’s go back to the jeopardy example with a large file and compare the timing of Pandas and Polars.

Pandas:

%%timeit

df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
887 ms ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars:

%%timeit

df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
446 ms ± 89.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s a speedup of 2.

But it gets even better: Polars supports lazy evaluation.

Lazy evaluation is not yet implemented when reading files from the cloud (Polars is a very new tool, but its functionality is expanding very fast). This means that we cannot test the benefit of lazy evaluation in our example using the CSV file at its current location (https://github.com/pola-rs/polars/issues/13115).

I downloaded it to our training cluster, however, so that we can run the test.

First, let’s make sure that the code works.

Pandas:

df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)

Polars:

df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
(349, 7)

And now for the timing.

Pandas:

%%timeit

df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
331 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars:

%%timeit

df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
13.1 ms ± 175 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That’s a speedup of 25 (the larger the file, the larger this speedup will be). This is because pl.scan_csv doesn’t read the file. Instead, it creates a lazy query. When the query runs, only the part of the file that is actually necessary gets read. This potentially saves a lot of time for very large files, and it even makes it possible to work with files too large to fit in memory.

Lazy evaluation also allows the query optimizer to combine operations where possible, very much the way compiled languages work.
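If you are curious, you can ask Polars to show the optimized plan of a lazy query with the explain method (a small sketch reusing the query above):

query = (
    pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
    .filter(pl.col("Category") == "HISTORY")
)
print(query.explain())   # show the optimized plan; the filter is pushed down into the CSV scan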

To evaluate the lazy query and get a result, we use the collect method.

Note that Polars also has a pl.read_csv function if you want to use eager evaluation.

Polars: 2, Pandas: 0 🙂