DataFrames in Python

Author

Marie-Hélène Burle

In programming languages, tabular data is handled in objects called DataFrames (or data frames or Data Frames). They were first implemented in the statistical language S, then in R, then they made it to Python.

In this section, we will have a briefly look at the tools available in Python to work with DataFrames.

pandas

As the first (and only) implementation of DataFrames in Python, pandas was the de facto DataFrame library for a very long time. Free and open-source, it is built on top of the array library NumPy and was directly inspired by R Data Frames.

pandas is still widely used and you will come across it everywhere. Looking back on its design, even its creator Wes McKinney saw flaws in its implementation.

Since July 1, 2024 there has been a new and much faster library called Polars. For this reason, we decided not to teach pandas in this course but to focus instead on the newer and better tool.

Polars

Polars is a free and open-source new framework for DataFrames in Rust, R, JS, Ruby, and Python. It uses Apache Arrow columnar memory format which is the new standard for efficiency. It allows for lazy evaluation (which allows it to work with more data than can fit in memory), multi-threaded queries, SIMD vectorization, automatic parallelization, and better support for missing data.

We are only covering a succinct introduction to Polars here, but we will cover more in the section on plotting with seaborn later in this course. You can also have a look at our introductory course on Polars for more details, our course on using Polars on GPU, and our two webinars on Polars (here and here).

DataFrame exploration

Let’s use the la_riots dataset, an open-source dataset on fatalities during the civil unrest in Los Angeles in April and May 1992, provided by the plotting library Vega-Altair. The dataset is hosted online as a CSV file.

You can read in a CSV file (local or from the Internet) with polars.read_csv:

import polars as pl

file_path = "https://cdn.jsdelivr.net/npm/vega-datasets/data/la-riots.csv"

df = pl.read_csv(file_path)

Let’s have a look at the first 5 rows of data:

df.head()

shape: (5, 11)

first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	i64	str	str	str	str	str	str	f64	f64
"Cesar A."	"Aguilar"	18	"Male"	"Latino"	"1992-04-30"	"2009 W. 6th St."	"Westlake"	"Officer-involved shooting"	-118.273976	34.059281
"George"	"Alvarez"	42	"Male"	"Latino"	"1992-05-01"	"Main & College streets"	"Chinatown"	"Not riot-related"	-118.234098	34.06269
"Wilson"	"Alvarez"	40	"Male"	"Latino"	"1992-05-23"	"3100 Rosecrans Ave."	"Hawthorne"	"Homicide"	-118.326816	33.901662
"Brian E."	"Andrew"	30	"Male"	"Black"	"1992-04-30"	"Rosecrans & Chester avenues"	"Compton"	"Officer-involved shooting"	-118.21539	33.903457
"Vivian"	"Austin"	87	"Female"	"Black"	"1992-05-03"	"1600 W. 60th St."	"Harvard Park"	"Death"	-118.304741	33.985667

The list of columns (variable names) can be accessed with the columns attribute:

print(df.columns)

['first_name', 'last_name', 'age', 'gender', 'race', 'death_date', 'address', 'neighborhood', 'type', 'longitude', 'latitude']

An overview of the structure of the data can be accessed with the method glimpse:

df.glimpse()

Rows: 63
Columns: 11
$ first_name   <str> 'Cesar A.', 'George', 'Wilson', 'Brian E.', 'Vivian', 'Franklin', 'Carol', 'Patrick', 'Hector', 'Jerel L.'
$ last_name    <str> 'Aguilar', 'Alvarez', 'Alvarez', 'Andrew', 'Austin', 'Benavidez', 'Benson', 'Bettan', 'Castro', 'Channell'
$ age          <i64> 18, 42, 40, 30, 87, 27, 42, 30, 49, 26
$ gender       <str> 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'
$ race         <str> 'Latino', 'Latino', 'Latino', 'Black', 'Black', 'Latino', 'Black', 'White', 'Latino', 'Black'
$ death_date   <str> '1992-04-30', '1992-05-01', '1992-05-23', '1992-04-30', '1992-05-03', '1992-04-30', '1992-05-02', '1992-04-30', '1992-04-30', '1992-04-30'
$ address      <str> '2009 W. 6th St.', 'Main & College streets', '3100 Rosecrans Ave.', 'Rosecrans & Chester avenues', '1600 W. 60th St.', '4404 S. Western Ave.', 'Harbor Freeway near Slauson Avenue', '2740 W. Olympic Blvd.', 'Vermont & Leeward avenues', 'Santa Monica Boulevard & Seward Street'
$ neighborhood <str> 'Westlake', 'Chinatown', 'Hawthorne', 'Compton', 'Harvard Park', 'Vermont Square', 'South Park', 'Koreatown', 'Koreatown', 'Hollywood'
$ type         <str> 'Officer-involved shooting', 'Not riot-related', 'Homicide', 'Officer-involved shooting', 'Death', 'Officer-involved shooting', 'Death', 'Homicide', 'Homicide', 'Death'
$ longitude    <f64> -118.2739756, -118.2340982, -118.326816, -118.2153903, -118.304741, -118.3088215, -118.2805037, -118.293181, -118.291654, -118.3323783
$ latitude     <f64> 34.0592814, 34.0626901, 33.901662, 33.9034569, 33.985667, 34.0034731, 33.98916756, 34.052068, 34.0587022, 34.09129756

And summary statistics with the method describe:

df.describe()

shape: (9, 12)

statistic	first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	str	f64	str	str	str	str	str	str	f64	f64
"count"	"63"	"63"	62.0	"63"	"63"	"63"	"63"	"63"	"63"	63.0	63.0
"null_count"	"0"	"0"	1.0	"0"	"0"	"0"	"0"	"0"	"0"	0.0	0.0
"mean"	null	null	32.370968	null	null	null	null	null	null	-118.27991	34.026713
"std"	null	null	14.253253	null	null	null	null	null	null	0.105198	0.098471
"min"	"Aaron"	"Aguilar"	15.0	"Female"	"Asian"	"1992-04-29"	"1005 S. Fresno St."	"Altadena"	"Death"	-118.471745	33.789857
"25%"	null	null	21.0	null	null	null	null	null	null	-118.309822	33.97418
"50%"	null	null	31.0	null	null	null	null	null	null	-118.291495	34.005485
"75%"	null	null	38.0	null	null	null	null	null	null	-118.253197	34.070238
"max"	"Wilson"	"Williams"	87.0	"Male"	"White"	"1993-11-24"	"near North Los Robles Avenue &…	"Westlake"	"Officer-involved shooting"	-117.730647	34.287098

Subsetting

Let’s read in another online dataset from Vega-Altair into a DataFrame:

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv")

df

shape: (803, 3)

Entity	Year	Deaths
str	i64	i64
"All natural disasters"	1900	1267360
"All natural disasters"	1901	200018
"All natural disasters"	1902	46037
"All natural disasters"	1903	6506
"All natural disasters"	1905	22758
…	…	…
"Wildfire"	2013	35
"Wildfire"	2014	16
"Wildfire"	2015	67
"Wildfire"	2016	39
"Wildfire"	2017	75

We can subset it by rows:

df_sub_row = df.filter(pl.col("Year") == 2001)

df_sub_row

shape: (9, 3)

Entity	Year	Deaths
str	i64	i64
"All natural disasters"	2001	39493
"Drought"	2001	99
"Earthquake"	2001	21348
"Epidemic"	2001	8515
"Extreme temperature"	2001	1787
"Extreme weather"	2001	1911
"Flood"	2001	5014
"Landslide"	2001	786
"Wildfire"	2001	33

Or by columns:

df_sub_col = df.select(
    pl.col("Entity"),
    pl.col("Year")
    )

df_sub_col

shape: (803, 2)

Entity	Year
str	i64
"All natural disasters"	1900
"All natural disasters"	1901
"All natural disasters"	1902
"All natural disasters"	1903
"All natural disasters"	1905
…	…
"Wildfire"	2013
"Wildfire"	2014
"Wildfire"	2015
"Wildfire"	2016
"Wildfire"	2017

Transformations

Selected columns can be modified:

df_col_mod = df.select(
    pl.col("Entity"),
    pl.col("Year"),
    (pl.col("Deaths") / 1000).alias("Kilodeaths")
)

df_col_mod

shape: (803, 3)

Entity	Year	Kilodeaths
str	i64	f64
"All natural disasters"	1900	1267.36
"All natural disasters"	1901	200.018
"All natural disasters"	1902	46.037
"All natural disasters"	1903	6.506
"All natural disasters"	1905	22.758
…	…	…
"Wildfire"	2013	0.035
"Wildfire"	2014	0.016
"Wildfire"	2015	0.067
"Wildfire"	2016	0.039
"Wildfire"	2017	0.075

The transformed columns can be created alongside all the columns of the original DataFrame:

df_mod = df.with_columns((pl.col("Deaths") / 1000).alias("Kilodeaths"))

df_mod

shape: (803, 4)

Entity	Year	Deaths	Kilodeaths
str	i64	i64	f64
"All natural disasters"	1900	1267360	1267.36
"All natural disasters"	1901	200018	200.018
"All natural disasters"	1902	46037	46.037
"All natural disasters"	1903	6506	6.506
"All natural disasters"	1905	22758	22.758
…	…	…	…
"Wildfire"	2013	35	0.035
"Wildfire"	2014	16	0.016
"Wildfire"	2015	67	0.067
"Wildfire"	2016	39	0.039
"Wildfire"	2017	75	0.075

Group by operations

If you want to perform operations on rows sharing the same value for some variable, you group those rows with polars.DataFrame.group_by.

For instance, if we want to get the total number of deaths for each category of disaster, we can do:

totals = df.group_by(
    pl.col("Entity")
).agg(
    pl.col("Deaths").sum()
)

totals

shape: (11, 2)

Entity	Deaths
str	i64
"Extreme weather"	1396601
"Volcanic activity"	96366
"Drought"	11731294
"Extreme temperature"	182604
"Wildfire"	3925
…	…
"Landslide"	63068
"Flood"	6954992
"Mass movement (dry)"	5030
"Epidemic"	9596463
"All natural disasters"	32607156

Notice that the rows became out of order. Not to worry about order makes the code more efficient and does not affect future subsetting of our DataFrame. If you want to maintain the order however, you can use the maintain_order parameter (but this slows down the operation):

totals = df.group_by(
    pl.col("Entity"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

totals

shape: (11, 2)

Entity	Deaths
str	i64
"All natural disasters"	32607156
"Drought"	11731294
"Earthquake"	2576801
"Epidemic"	9596463
"Extreme temperature"	182604
…	…
"Flood"	6954992
"Landslide"	63068
"Mass movement (dry)"	5030
"Volcanic activity"	96366
"Wildfire"	3925

Your turn:

Create a new DataFrame, ordered by year, that shows the total number of deaths for each year:

shape: (117, 2)

Year	Deaths
i64	i64
1900	1267360
1901	200018
1902	46037
1903	6506
1905	22758
…	…
2013	22225
2014	20882
2015	23893
2016	10201
2017	2087

Combining contexts

select, with_columns, filter, and group_by are called contexts in the Polars terminology (the data transformations performed in these contexts are called expressions).

Contexts can be combined. For instance, we can create a new DataFrame with the number of deaths for each decade:

decade_totals = df.filter(
    pl.col("Entity") == "All natural disasters"
).with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade")
).group_by(
    pl.col("Decade"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

decade_totals

shape: (12, 2)

Decade	Deaths
i64	i64
1900	4497847
1910	3326492
1920	8726293
1930	4701024
1940	3871695
…	…
1970	986867
1980	796782
1990	527613
2000	839986
2010	454950

Or one with the number of deaths for that decade for each type of disaster:

decade_totals_by_type = df.with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade"),
).group_by([pl.col("Decade"), pl.col("Entity")],
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

decade_totals_by_type

shape: (122, 3)

Decade	Entity	Deaths
i64	str	i64
1900	"All natural disasters"	4497847
1910	"All natural disasters"	3326492
1920	"All natural disasters"	8726293
1930	"All natural disasters"	4701024
1940	"All natural disasters"	3871695
…	…	…
1970	"Wildfire"	5
1980	"Wildfire"	396
1990	"Wildfire"	859
2000	"Wildfire"	629
2010	"Wildfire"	429

Lazy evaluation

When it comes to high-performance computing, one of the strengths of Polars is that it supports lazy evaluation. Lazy evaluation instantly returns a future that can be used without waiting for the result of the computation. Moreover, when you run queries on a LazyFrame, Polars creates a graph and runs optimizations on it, very much the way compiled languages work.

If you want to speedup your code, use lazy execution whenever possible and try to use the lazy API from the start, when reading a file.

In the previous examples, we used polars.read_csv to read our data. This returns a Polars DataFrame:

url = "https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv"

df = pl.read_csv(url)
type(df)

polars.dataframe.frame.DataFrame

Instead, you can use polars.scan_csv to create a LazyFrame:

df_lazy = pl.scan_csv(url)
type(df_lazy)

polars.lazyframe.frame.LazyFrame

If you already have a DataFrame, you can create a LazyFrame from it with the polars.DataFrame.lazy method:

df_lazy = df.lazy()

To get results from a LazyFrame, you use polars.LazyFrame.collect.

This won’t work because a LazyFrame has no attribute shape:

df_lazy.filter(pl.col("Year") == 2001).shape

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[40], line 1
----> 1 df_lazy.filter(pl.col("Year") == 2001).shape

AttributeError: 'LazyFrame' object has no attribute 'shape'

You need to collect the result first:

df_lazy.filter(pl.col("Year") == 2001).collect().shape

(9, 3)

collect turns your LazyFrame into a DataFrame, but it only does so on the subset needed for your query:

type(df_lazy.filter(pl.col("Year") == 2001).collect())

polars.dataframe.frame.DataFrame

This allows you to work with data too big to fit in memory!