DataFrames in Python
In programming languages, tabular data is handled by objects called DataFrames (or data frames). They were first implemented in the statistical language S, then in R, and later made their way to Python.
In this section, we take a brief look at the tools available in Python to work with DataFrames.
pandas
As the first implementation of DataFrames in Python, pandas was the de facto DataFrame library for a very long time. Free and open-source, it is built on top of the array library NumPy and was directly inspired by R data frames.
pandas is still widely used and you will come across it everywhere. Looking back on its design, however, even its creator Wes McKinney acknowledged flaws in its implementation.
Meanwhile, a much faster library called Polars has emerged, reaching its stable 1.0 release on July 1, 2024. For this reason, we decided not to teach pandas in this course but to focus instead on the newer and better tool.
Polars
Polars is a free and open-source DataFrame library written in Rust, with bindings for Python, R, JS, and Ruby. It uses the Apache Arrow columnar memory format, an emerging standard for efficient in-memory analytics. It supports lazy evaluation (which lets it work with more data than can fit in memory), multi-threaded queries, SIMD vectorization, automatic parallelization, and better handling of missing data.
We only cover a succinct introduction to Polars here, but we will see more in the section on plotting with seaborn later in this course. You can also have a look at our introductory course on Polars for more details. Finally, we will offer a course on using Polars on GPU later this term.
DataFrame exploration
Let's use the la_riots dataset, an open-source dataset on fatalities during the civil unrest in Los Angeles in April and May 1992, provided by vega-datasets, the example data collection used by the plotting library Vega-Altair. The dataset is hosted online as a CSV file.
You can read in a CSV file (local or from the Internet) with polars.read_csv:
import polars as pl
file_path = "https://cdn.jsdelivr.net/npm/vega-datasets/data/la-riots.csv"
df = pl.read_csv(file_path)
Let’s have a look at the first 5 rows of data:
print(df.head())
shape: (5, 11)
┌────────────┬───────────┬─────┬────────┬───┬──────────────┬─────────────┬─────────────┬───────────┐
│ first_name ┆ last_name ┆ age ┆ gender ┆ … ┆ neighborhood ┆ type ┆ longitude ┆ latitude │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ ┆ str ┆ str ┆ f64 ┆ f64 │
╞════════════╪═══════════╪═════╪════════╪═══╪══════════════╪═════════════╪═════════════╪═══════════╡
│ Cesar A. ┆ Aguilar ┆ 18 ┆ Male ┆ … ┆ Westlake ┆ Officer-inv ┆ -118.273976 ┆ 34.059281 │
│ ┆ ┆ ┆ ┆ ┆ ┆ olved ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ shooting ┆ ┆ │
│ George ┆ Alvarez ┆ 42 ┆ Male ┆ … ┆ Chinatown ┆ Not riot-re ┆ -118.234098 ┆ 34.06269 │
│ ┆ ┆ ┆ ┆ ┆ ┆ lated ┆ ┆ │
│ Wilson ┆ Alvarez ┆ 40 ┆ Male ┆ … ┆ Hawthorne ┆ Homicide ┆ -118.326816 ┆ 33.901662 │
│ Brian E. ┆ Andrew ┆ 30 ┆ Male ┆ … ┆ Compton ┆ Officer-inv ┆ -118.21539 ┆ 33.903457 │
│ ┆ ┆ ┆ ┆ ┆ ┆ olved ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ shooting ┆ ┆ │
│ Vivian ┆ Austin ┆ 87 ┆ Female ┆ … ┆ Harvard Park ┆ Death ┆ -118.304741 ┆ 33.985667 │
└────────────┴───────────┴─────┴────────┴───┴──────────────┴─────────────┴─────────────┴───────────┘
The list of columns (variable names) can be accessed with the columns attribute:
print(df.columns)
['first_name', 'last_name', 'age', 'gender', 'race', 'death_date', 'address', 'neighborhood', 'type', 'longitude', 'latitude']
An overview of the structure of the data can be accessed with the method glimpse:
df.glimpse()
Rows: 63
Columns: 11
$ first_name <str> 'Cesar A.', 'George', 'Wilson', 'Brian E.', 'Vivian', 'Franklin', 'Carol', 'Patrick', 'Hector', 'Jerel L.'
$ last_name <str> 'Aguilar', 'Alvarez', 'Alvarez', 'Andrew', 'Austin', 'Benavidez', 'Benson', 'Bettan', 'Castro', 'Channell'
$ age <i64> 18, 42, 40, 30, 87, 27, 42, 30, 49, 26
$ gender <str> 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'
$ race <str> 'Latino', 'Latino', 'Latino', 'Black', 'Black', 'Latino', 'Black', 'White', 'Latino', 'Black'
$ death_date <str> '1992-04-30', '1992-05-01', '1992-05-23', '1992-04-30', '1992-05-03', '1992-04-30', '1992-05-02', '1992-04-30', '1992-04-30', '1992-04-30'
$ address <str> '2009 W. 6th St.', 'Main & College streets', '3100 Rosecrans Ave.', 'Rosecrans & Chester avenues', '1600 W. 60th St.', '4404 S. Western Ave.', 'Harbor Freeway near Slauson Avenue', '2740 W. Olympic Blvd.', 'Vermont & Leeward avenues', 'Santa Monica Boulevard & Seward Street'
$ neighborhood <str> 'Westlake', 'Chinatown', 'Hawthorne', 'Compton', 'Harvard Park', 'Vermont Square', 'South Park', 'Koreatown', 'Koreatown', 'Hollywood'
$ type <str> 'Officer-involved shooting', 'Not riot-related', 'Homicide', 'Officer-involved shooting', 'Death', 'Officer-involved shooting', 'Death', 'Homicide', 'Homicide', 'Death'
$ longitude <f64> -118.2739756, -118.2340982, -118.326816, -118.2153903, -118.304741, -118.3088215, -118.2805037, -118.293181, -118.291654, -118.3323783
$ latitude <f64> 34.0592814, 34.0626901, 33.901662, 33.9034569, 33.985667, 34.0034731, 33.98916756, 34.052068, 34.0587022, 34.09129756
And summary statistics with the method describe:
print(df.describe())
shape: (9, 12)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ first_nam ┆ last_name ┆ age ┆ … ┆ neighborh ┆ type ┆ longitude ┆ latitude │
│ --- ┆ e ┆ --- ┆ --- ┆ ┆ ood ┆ --- ┆ --- ┆ --- │
│ str ┆ --- ┆ str ┆ f64 ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │
│ ┆ str ┆ ┆ ┆ ┆ str ┆ ┆ ┆ │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count ┆ 63 ┆ 63 ┆ 62.0 ┆ … ┆ 63 ┆ 63 ┆ 63.0 ┆ 63.0 │
│ null_coun ┆ 0 ┆ 0 ┆ 1.0 ┆ … ┆ 0 ┆ 0 ┆ 0.0 ┆ 0.0 │
│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ mean ┆ null ┆ null ┆ 32.370968 ┆ … ┆ null ┆ null ┆ -118.2799 ┆ 34.02671 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 1 ┆ 3 │
│ std ┆ null ┆ null ┆ 14.253253 ┆ … ┆ null ┆ null ┆ 0.105198 ┆ 0.098471 │
│ min ┆ Aaron ┆ Aguilar ┆ 15.0 ┆ … ┆ Altadena ┆ Death ┆ -118.4717 ┆ 33.78985 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 45 ┆ 7 │
│ 25% ┆ null ┆ null ┆ 21.0 ┆ … ┆ null ┆ null ┆ -118.3098 ┆ 33.97418 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 22 ┆ │
│ 50% ┆ null ┆ null ┆ 31.0 ┆ … ┆ null ┆ null ┆ -118.2914 ┆ 34.00548 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 95 ┆ 5 │
│ 75% ┆ null ┆ null ┆ 38.0 ┆ … ┆ null ┆ null ┆ -118.2531 ┆ 34.07023 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 97 ┆ 8 │
│ max ┆ Wilson ┆ Williams ┆ 87.0 ┆ … ┆ Westlake ┆ Officer-i ┆ -117.7306 ┆ 34.28709 │
│ ┆ ┆ ┆ ┆ ┆ ┆ nvolved ┆ 47 ┆ 8 │
│ ┆ ┆ ┆ ┆ ┆ ┆ shooting ┆ ┆ │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
Subsetting
Let’s read in another online dataset from Vega-Altair into a DataFrame:
df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv")
print(df)
shape: (803, 3)
┌───────────────────────┬──────┬─────────┐
│ Entity ┆ Year ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═══════════════════════╪══════╪═════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 │
│ All natural disasters ┆ 1901 ┆ 200018 │
│ All natural disasters ┆ 1902 ┆ 46037 │
│ All natural disasters ┆ 1903 ┆ 6506 │
│ All natural disasters ┆ 1905 ┆ 22758 │
│ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 35 │
│ Wildfire ┆ 2014 ┆ 16 │
│ Wildfire ┆ 2015 ┆ 67 │
│ Wildfire ┆ 2016 ┆ 39 │
│ Wildfire ┆ 2017 ┆ 75 │
└───────────────────────┴──────┴─────────┘
We can subset it by rows:
df_sub_row = df.filter(pl.col("Year") == 2001)
print(df_sub_row)
shape: (9, 3)
┌───────────────────────┬──────┬────────┐
│ Entity ┆ Year ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═══════════════════════╪══════╪════════╡
│ All natural disasters ┆ 2001 ┆ 39493 │
│ Drought ┆ 2001 ┆ 99 │
│ Earthquake ┆ 2001 ┆ 21348 │
│ Epidemic ┆ 2001 ┆ 8515 │
│ Extreme temperature ┆ 2001 ┆ 1787 │
│ Extreme weather ┆ 2001 ┆ 1911 │
│ Flood ┆ 2001 ┆ 5014 │
│ Landslide ┆ 2001 ┆ 786 │
│ Wildfire ┆ 2001 ┆ 33 │
└───────────────────────┴──────┴────────┘
Or by columns:
df_sub_col = df.select(
pl.col("Entity"),
pl.col("Year")
)
print(df_sub_col)
shape: (803, 2)
┌───────────────────────┬──────┐
│ Entity ┆ Year │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════════════════╪══════╡
│ All natural disasters ┆ 1900 │
│ All natural disasters ┆ 1901 │
│ All natural disasters ┆ 1902 │
│ All natural disasters ┆ 1903 │
│ All natural disasters ┆ 1905 │
│ … ┆ … │
│ Wildfire ┆ 2013 │
│ Wildfire ┆ 2014 │
│ Wildfire ┆ 2015 │
│ Wildfire ┆ 2016 │
│ Wildfire ┆ 2017 │
└───────────────────────┴──────┘
Transformations
Selected columns can be modified:
df_col_mod = df.select(
pl.col("Entity"),
pl.col("Year"),
(pl.col("Deaths") / 1000).alias("Kilodeaths")
)
print(df_col_mod)
shape: (803, 3)
┌───────────────────────┬──────┬────────────┐
│ Entity ┆ Year ┆ Kilodeaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═══════════════════════╪══════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267.36 │
│ All natural disasters ┆ 1901 ┆ 200.018 │
│ All natural disasters ┆ 1902 ┆ 46.037 │
│ All natural disasters ┆ 1903 ┆ 6.506 │
│ All natural disasters ┆ 1905 ┆ 22.758 │
│ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 0.035 │
│ Wildfire ┆ 2014 ┆ 0.016 │
│ Wildfire ┆ 2015 ┆ 0.067 │
│ Wildfire ┆ 2016 ┆ 0.039 │
│ Wildfire ┆ 2017 ┆ 0.075 │
└───────────────────────┴──────┴────────────┘
The transformed columns can be created alongside all the columns of the original DataFrame:
df_mod = df.with_columns((pl.col("Deaths") / 1000).alias("Kilodeaths"))
print(df_mod)
shape: (803, 4)
┌───────────────────────┬──────┬─────────┬────────────┐
│ Entity ┆ Year ┆ Deaths ┆ Kilodeaths │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════╪══════╪═════════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 ┆ 1267.36 │
│ All natural disasters ┆ 1901 ┆ 200018 ┆ 200.018 │
│ All natural disasters ┆ 1902 ┆ 46037 ┆ 46.037 │
│ All natural disasters ┆ 1903 ┆ 6506 ┆ 6.506 │
│ All natural disasters ┆ 1905 ┆ 22758 ┆ 22.758 │
│ … ┆ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 35 ┆ 0.035 │
│ Wildfire ┆ 2014 ┆ 16 ┆ 0.016 │
│ Wildfire ┆ 2015 ┆ 67 ┆ 0.067 │
│ Wildfire ┆ 2016 ┆ 39 ┆ 0.039 │
│ Wildfire ┆ 2017 ┆ 75 ┆ 0.075 │
└───────────────────────┴──────┴─────────┴────────────┘
Group by operations
If you want to perform operations on rows sharing the same value for some variable, you group those rows with polars.DataFrame.group_by.
For instance, if we want to get the total number of deaths for each category of disaster, we can do:
totals = df.group_by(
pl.col("Entity")
).agg(
pl.col("Deaths").sum()
)
print(totals)
shape: (11, 2)
┌───────────────────────┬──────────┐
│ Entity ┆ Deaths │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════════════════╪══════════╡
│ Mass movement (dry) ┆ 5030 │
│ Flood ┆ 6954992 │
│ Landslide ┆ 63068 │
│ Earthquake ┆ 2576801 │
│ Epidemic ┆ 9596463 │
│ … ┆ … │
│ Volcanic activity ┆ 96366 │
│ Drought ┆ 11731294 │
│ Extreme weather ┆ 1396601 │
│ All natural disasters ┆ 32607156 │
│ Wildfire ┆ 3925 │
└───────────────────────┴──────────┘
Notice that the rows came back in a different order. Not having to maintain order makes the computation more efficient, and it does not affect later subsetting of our DataFrame. If you do want to maintain the order of first appearance, however, you can use the maintain_order parameter (at the cost of slowing down the operation):
totals = df.group_by(
pl.col("Entity"),
maintain_order=True
).agg(
pl.col("Deaths").sum()
)
print(totals)
shape: (11, 2)
┌───────────────────────┬──────────┐
│ Entity ┆ Deaths │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════════════════╪══════════╡
│ All natural disasters ┆ 32607156 │
│ Drought ┆ 11731294 │
│ Earthquake ┆ 2576801 │
│ Epidemic ┆ 9596463 │
│ Extreme temperature ┆ 182604 │
│ … ┆ … │
│ Flood ┆ 6954992 │
│ Landslide ┆ 63068 │
│ Mass movement (dry) ┆ 5030 │
│ Volcanic activity ┆ 96366 │
│ Wildfire ┆ 3925 │
└───────────────────────┴──────────┘
Your turn:
Create a new DataFrame, ordered by year, that shows the total number of deaths for each year:
shape: (117, 2)
┌──────┬─────────┐
│ Year ┆ Deaths │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪═════════╡
│ 1900 ┆ 1267360 │
│ 1901 ┆ 200018 │
│ 1902 ┆ 46037 │
│ 1903 ┆ 6506 │
│ 1905 ┆ 22758 │
│ … ┆ … │
│ 2013 ┆ 22225 │
│ 2014 ┆ 20882 │
│ 2015 ┆ 23893 │
│ 2016 ┆ 10201 │
│ 2017 ┆ 2087 │
└──────┴─────────┘
Combining contexts
select, with_columns, filter, and group_by are called contexts in Polars terminology (the data transformations performed within these contexts are called expressions).
Contexts can be combined. For instance, we can create a new DataFrame with the number of deaths for each decade:
decade_totals = df.filter(
pl.col("Entity") == "All natural disasters"
).with_columns(
(pl.col("Year") // 10 * 10).alias("Decade")
).group_by(
pl.col("Decade"),
maintain_order=True
).agg(
pl.col("Deaths").sum()
)
print(decade_totals)
shape: (12, 2)
┌────────┬─────────┐
│ Decade ┆ Deaths │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪═════════╡
│ 1900 ┆ 4497847 │
│ 1910 ┆ 3326492 │
│ 1920 ┆ 8726293 │
│ 1930 ┆ 4701024 │
│ 1940 ┆ 3871695 │
│ … ┆ … │
│ 1970 ┆ 986867 │
│ 1980 ┆ 796782 │
│ 1990 ┆ 527613 │
│ 2000 ┆ 839986 │
│ 2010 ┆ 454950 │
└────────┴─────────┘
Or one with the number of deaths for that decade for each type of disaster:
decade_totals_by_type = df.with_columns(
(pl.col("Year") // 10 * 10).alias("Decade"),
).group_by([pl.col("Decade"), pl.col("Entity")],
maintain_order=True
).agg(
pl.col("Deaths").sum()
)
print(decade_totals_by_type)
shape: (122, 3)
┌────────┬───────────────────────┬─────────┐
│ Decade ┆ Entity ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞════════╪═══════════════════════╪═════════╡
│ 1900 ┆ All natural disasters ┆ 4497847 │
│ 1910 ┆ All natural disasters ┆ 3326492 │
│ 1920 ┆ All natural disasters ┆ 8726293 │
│ 1930 ┆ All natural disasters ┆ 4701024 │
│ 1940 ┆ All natural disasters ┆ 3871695 │
│ … ┆ … ┆ … │
│ 1970 ┆ Wildfire ┆ 5 │
│ 1980 ┆ Wildfire ┆ 396 │
│ 1990 ┆ Wildfire ┆ 859 │
│ 2000 ┆ Wildfire ┆ 629 │
│ 2010 ┆ Wildfire ┆ 429 │
└────────┴───────────────────────┴─────────┘
Lazy evaluation
When it comes to high-performance computing, one of the strengths of Polars is its support for lazy evaluation. Instead of executing each operation immediately, a lazy query only builds up a description of the computation; when you run queries on a LazyFrame, Polars assembles them into a graph and optimizes it before executing anything, very much the way compilers optimize code.
If you want to speed up your code, use lazy execution whenever possible and try to use the lazy API from the start, when reading a file.
In the previous examples, we used polars.read_csv to read our data. This returns a Polars DataFrame:
url = "https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv"
df = pl.read_csv(url)
type(df)
polars.dataframe.frame.DataFrame
Instead, you can use polars.scan_csv to create a LazyFrame:
df_lazy = pl.scan_csv(url)
type(df_lazy)
polars.lazyframe.frame.LazyFrame
If you already have a DataFrame, you can create a LazyFrame from it with the polars.DataFrame.lazy method:
df_lazy = df.lazy()
To get results from a LazyFrame, you use polars.LazyFrame.collect.
This won’t work because a LazyFrame has no attribute shape:
df_lazy.filter(pl.col("Year") == 2001).shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[19], line 1
----> 1 df_lazy.filter(pl.col("Year") == 2001).shape
AttributeError: 'LazyFrame' object has no attribute 'shape'
You need to collect the result first:
df_lazy.filter(pl.col("Year") == 2001).collect().shape
(9, 3)
collect turns your LazyFrame into a DataFrame, materializing only the data your query actually needs:
type(df_lazy.filter(pl.col("Year") == 2001).collect())
polars.dataframe.frame.DataFrame
This allows you to work with data too big to fit in memory!