DataFrames in Python

Author

Marie-Hélène Burle

In programming languages, tabular data is handled in objects called DataFrames (or data frames or Data Frames). They were first implemented in the statistical language S, then in R, then they made it to Python.

In this section, we will have a briefly look at the tools available in Python to work with DataFrames.

pandas

As the first (and only) implementation of DataFrames in Python, pandas was the de facto DataFrame library for a very long time. Free and open-source, it is built on top of the array library NumPy and was directly inspired by R Data Frames.

pandas is still widely used and you will come across it everywhere. Looking back on its design, even its creator Wes McKinney saw flaws in its implementation.

Since July 1, 2024 there has been a new and much faster library called Polars. For this reason, we decided not to teach pandas in this course but to focus instead on the newer and better tool.

Polars

Polars is a free and open-source new framework for DataFrames in Rust, R, JS, Ruby, and Python. It uses Apache Arrow columnar memory format which is the new standard for efficiency. It allows for lazy evaluation (which allows it to work with more data than can fit in memory), multi-threaded queries, SIMD vectorization, automatic parallelization, and better support for missing data.

We are only covering a succinct introduction to Polars here, but we will cover more in the section on plotting with seaborn later in this course. You can also have a look at our introductory course on Polars for more details, our course on using Polars on GPU, and our two webinars on Polars (here and here).

DataFrame exploration

Let’s use the la_riots dataset, an open-source dataset on fatalities during the civil unrest in Los Angeles in April and May 1992, provided by the plotting library Vega-Altair. The dataset is hosted online as a CSV file.

You can read in a CSV file (local or from the Internet) with polars.read_csv:

import polars as pl

file_path = "https://cdn.jsdelivr.net/npm/vega-datasets/data/la-riots.csv"

df = pl.read_csv(file_path)

Let’s have a look at the first 5 rows of data:

df.head()
shape: (5, 11)
first_name last_name age gender race death_date address neighborhood type longitude latitude
str str i64 str str str str str str f64 f64
"Cesar A." "Aguilar" 18 "Male" "Latino" "1992-04-30" "2009 W. 6th St." "Westlake" "Officer-involved shooting" -118.273976 34.059281
"George" "Alvarez" 42 "Male" "Latino" "1992-05-01" "Main & College streets" "Chinatown" "Not riot-related" -118.234098 34.06269
"Wilson" "Alvarez" 40 "Male" "Latino" "1992-05-23" "3100 Rosecrans Ave." "Hawthorne" "Homicide" -118.326816 33.901662
"Brian E." "Andrew" 30 "Male" "Black" "1992-04-30" "Rosecrans & Chester avenues" "Compton" "Officer-involved shooting" -118.21539 33.903457
"Vivian" "Austin" 87 "Female" "Black" "1992-05-03" "1600 W. 60th St." "Harvard Park" "Death" -118.304741 33.985667

The list of columns (variable names) can be accessed with the columns attribute:

print(df.columns)
['first_name', 'last_name', 'age', 'gender', 'race', 'death_date', 'address', 'neighborhood', 'type', 'longitude', 'latitude']

An overview of the structure of the data can be accessed with the method glimpse:

df.glimpse()
Rows: 63
Columns: 11
$ first_name   <str> 'Cesar A.', 'George', 'Wilson', 'Brian E.', 'Vivian', 'Franklin', 'Carol', 'Patrick', 'Hector', 'Jerel L.'
$ last_name    <str> 'Aguilar', 'Alvarez', 'Alvarez', 'Andrew', 'Austin', 'Benavidez', 'Benson', 'Bettan', 'Castro', 'Channell'
$ age          <i64> 18, 42, 40, 30, 87, 27, 42, 30, 49, 26
$ gender       <str> 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'
$ race         <str> 'Latino', 'Latino', 'Latino', 'Black', 'Black', 'Latino', 'Black', 'White', 'Latino', 'Black'
$ death_date   <str> '1992-04-30', '1992-05-01', '1992-05-23', '1992-04-30', '1992-05-03', '1992-04-30', '1992-05-02', '1992-04-30', '1992-04-30', '1992-04-30'
$ address      <str> '2009 W. 6th St.', 'Main & College streets', '3100 Rosecrans Ave.', 'Rosecrans & Chester avenues', '1600 W. 60th St.', '4404 S. Western Ave.', 'Harbor Freeway near Slauson Avenue', '2740 W. Olympic Blvd.', 'Vermont & Leeward avenues', 'Santa Monica Boulevard & Seward Street'
$ neighborhood <str> 'Westlake', 'Chinatown', 'Hawthorne', 'Compton', 'Harvard Park', 'Vermont Square', 'South Park', 'Koreatown', 'Koreatown', 'Hollywood'
$ type         <str> 'Officer-involved shooting', 'Not riot-related', 'Homicide', 'Officer-involved shooting', 'Death', 'Officer-involved shooting', 'Death', 'Homicide', 'Homicide', 'Death'
$ longitude    <f64> -118.2739756, -118.2340982, -118.326816, -118.2153903, -118.304741, -118.3088215, -118.2805037, -118.293181, -118.291654, -118.3323783
$ latitude     <f64> 34.0592814, 34.0626901, 33.901662, 33.9034569, 33.985667, 34.0034731, 33.98916756, 34.052068, 34.0587022, 34.09129756

And summary statistics with the method describe:

df.describe()
shape: (9, 12)
statistic first_name last_name age gender race death_date address neighborhood type longitude latitude
str str str f64 str str str str str str f64 f64
"count" "63" "63" 62.0 "63" "63" "63" "63" "63" "63" 63.0 63.0
"null_count" "0" "0" 1.0 "0" "0" "0" "0" "0" "0" 0.0 0.0
"mean" null null 32.370968 null null null null null null -118.27991 34.026713
"std" null null 14.253253 null null null null null null 0.105198 0.098471
"min" "Aaron" "Aguilar" 15.0 "Female" "Asian" "1992-04-29" "1005 S. Fresno St." "Altadena" "Death" -118.471745 33.789857
"25%" null null 21.0 null null null null null null -118.309822 33.97418
"50%" null null 31.0 null null null null null null -118.291495 34.005485
"75%" null null 38.0 null null null null null null -118.253197 34.070238
"max" "Wilson" "Williams" 87.0 "Male" "White" "1993-11-24" "near North Los Robles Avenue &… "Westlake" "Officer-involved shooting" -117.730647 34.287098

Subsetting

Let’s read in another online dataset from Vega-Altair into a DataFrame:

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv")

df
shape: (803, 3)
Entity Year Deaths
str i64 i64
"All natural disasters" 1900 1267360
"All natural disasters" 1901 200018
"All natural disasters" 1902 46037
"All natural disasters" 1903 6506
"All natural disasters" 1905 22758
"Wildfire" 2013 35
"Wildfire" 2014 16
"Wildfire" 2015 67
"Wildfire" 2016 39
"Wildfire" 2017 75

We can subset it by rows:

df_sub_row = df.filter(pl.col("Year") == 2001)

df_sub_row
shape: (9, 3)
Entity Year Deaths
str i64 i64
"All natural disasters" 2001 39493
"Drought" 2001 99
"Earthquake" 2001 21348
"Epidemic" 2001 8515
"Extreme temperature" 2001 1787
"Extreme weather" 2001 1911
"Flood" 2001 5014
"Landslide" 2001 786
"Wildfire" 2001 33

Or by columns:

df_sub_col = df.select(
    pl.col("Entity"),
    pl.col("Year")
    )

df_sub_col
shape: (803, 2)
Entity Year
str i64
"All natural disasters" 1900
"All natural disasters" 1901
"All natural disasters" 1902
"All natural disasters" 1903
"All natural disasters" 1905
"Wildfire" 2013
"Wildfire" 2014
"Wildfire" 2015
"Wildfire" 2016
"Wildfire" 2017

Transformations

Selected columns can be modified:

df_col_mod = df.select(
    pl.col("Entity"),
    pl.col("Year"),
    (pl.col("Deaths") / 1000).alias("Kilodeaths")
)

df_col_mod
shape: (803, 3)
Entity Year Kilodeaths
str i64 f64
"All natural disasters" 1900 1267.36
"All natural disasters" 1901 200.018
"All natural disasters" 1902 46.037
"All natural disasters" 1903 6.506
"All natural disasters" 1905 22.758
"Wildfire" 2013 0.035
"Wildfire" 2014 0.016
"Wildfire" 2015 0.067
"Wildfire" 2016 0.039
"Wildfire" 2017 0.075

The transformed columns can be created alongside all the columns of the original DataFrame:

df_mod = df.with_columns((pl.col("Deaths") / 1000).alias("Kilodeaths"))

df_mod
shape: (803, 4)
Entity Year Deaths Kilodeaths
str i64 i64 f64
"All natural disasters" 1900 1267360 1267.36
"All natural disasters" 1901 200018 200.018
"All natural disasters" 1902 46037 46.037
"All natural disasters" 1903 6506 6.506
"All natural disasters" 1905 22758 22.758
"Wildfire" 2013 35 0.035
"Wildfire" 2014 16 0.016
"Wildfire" 2015 67 0.067
"Wildfire" 2016 39 0.039
"Wildfire" 2017 75 0.075

Group by operations

If you want to perform operations on rows sharing the same value for some variable, you group those rows with polars.DataFrame.group_by.

For instance, if we want to get the total number of deaths for each category of disaster, we can do:

totals = df.group_by(
    pl.col("Entity")
).agg(
    pl.col("Deaths").sum()
)

totals
shape: (11, 2)
Entity Deaths
str i64
"Extreme weather" 1396601
"Volcanic activity" 96366
"Drought" 11731294
"Extreme temperature" 182604
"Wildfire" 3925
"Landslide" 63068
"Flood" 6954992
"Mass movement (dry)" 5030
"Epidemic" 9596463
"All natural disasters" 32607156

Notice that the rows became out of order. Not to worry about order makes the code more efficient and does not affect future subsetting of our DataFrame. If you want to maintain the order however, you can use the maintain_order parameter (but this slows down the operation):

totals = df.group_by(
    pl.col("Entity"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

totals
shape: (11, 2)
Entity Deaths
str i64
"All natural disasters" 32607156
"Drought" 11731294
"Earthquake" 2576801
"Epidemic" 9596463
"Extreme temperature" 182604
"Flood" 6954992
"Landslide" 63068
"Mass movement (dry)" 5030
"Volcanic activity" 96366
"Wildfire" 3925

Your turn:

Create a new DataFrame, ordered by year, that shows the total number of deaths for each year:

shape: (117, 2)
Year Deaths
i64 i64
1900 1267360
1901 200018
1902 46037
1903 6506
1905 22758
2013 22225
2014 20882
2015 23893
2016 10201
2017 2087

Combining contexts

select, with_columns, filter, and group_by are called contexts in the Polars terminology (the data transformations performed in these contexts are called expressions).

Contexts can be combined. For instance, we can create a new DataFrame with the number of deaths for each decade:

decade_totals = df.filter(
    pl.col("Entity") == "All natural disasters"
).with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade")
).group_by(
    pl.col("Decade"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

decade_totals
shape: (12, 2)
Decade Deaths
i64 i64
1900 4497847
1910 3326492
1920 8726293
1930 4701024
1940 3871695
1970 986867
1980 796782
1990 527613
2000 839986
2010 454950

Or one with the number of deaths for that decade for each type of disaster:

decade_totals_by_type = df.with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade"),
).group_by([pl.col("Decade"), pl.col("Entity")],
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

decade_totals_by_type
shape: (122, 3)
Decade Entity Deaths
i64 str i64
1900 "All natural disasters" 4497847
1910 "All natural disasters" 3326492
1920 "All natural disasters" 8726293
1930 "All natural disasters" 4701024
1940 "All natural disasters" 3871695
1970 "Wildfire" 5
1980 "Wildfire" 396
1990 "Wildfire" 859
2000 "Wildfire" 629
2010 "Wildfire" 429

Lazy evaluation

When it comes to high-performance computing, one of the strengths of Polars is that it supports lazy evaluation. Lazy evaluation instantly returns a future that can be used without waiting for the result of the computation. Moreover, when you run queries on a LazyFrame, Polars creates a graph and runs optimizations on it, very much the way compiled languages work.

If you want to speedup your code, use lazy execution whenever possible and try to use the lazy API from the start, when reading a file.

In the previous examples, we used polars.read_csv to read our data. This returns a Polars DataFrame:

url = "https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv"

df = pl.read_csv(url)
type(df)
polars.dataframe.frame.DataFrame

Instead, you can use polars.scan_csv to create a LazyFrame:

df_lazy = pl.scan_csv(url)
type(df_lazy)
polars.lazyframe.frame.LazyFrame

If you already have a DataFrame, you can create a LazyFrame from it with the polars.DataFrame.lazy method:

df_lazy = df.lazy()

To get results from a LazyFrame, you use polars.LazyFrame.collect.

This won’t work because a LazyFrame has no attribute shape:

df_lazy.filter(pl.col("Year") == 2001).shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[40], line 1
----> 1 df_lazy.filter(pl.col("Year") == 2001).shape

AttributeError: 'LazyFrame' object has no attribute 'shape'

You need to collect the result first:

df_lazy.filter(pl.col("Year") == 2001).collect().shape
(9, 3)

collect turns your LazyFrame into a DataFrame, but it only does so on the subset needed for your query:

type(df_lazy.filter(pl.col("Year") == 2001).collect())
polars.dataframe.frame.DataFrame

This allows you to work with data too big to fit in memory!