Getting started
Here are a few notes to get you started with Polars.
Installation
Personal computer
python -m venv ~/env # Create virtual env
source ~/env/bin/activate # Activate virtual env
pip install --upgrade pip # Update pip
pip install polars # Install Polars
Alliance clusters
Wheels are available for Polars (always prefer wheels when possible):
python -m venv ~/env # Create virtual env
source ~/env/bin/activate # Activate virtual env
pip install --upgrade pip --no-index # Update pip from wheel
pip install polars --no-index # Install Polars from wheel
Syntax
Overall, the syntax feels very similar to R’s dplyr from the tidyverse.
In particular, extracting data is not done by indexing, but with action verbs:
import polars as pl
df = pl.DataFrame(
    {
        "species": ["A", "B", "C"],
        "number": [87, 13, 4],
        "category": ["a", "b", "c"],
    }
)
df
shape: (3, 3)
┌─────────┬────────┬──────────┐
│ species ┆ number ┆ category │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════════╪════════╪══════════╡
│ A ┆ 87 ┆ a │
│ B ┆ 13 ┆ b │
│ C ┆ 4 ┆ c │
└─────────┴────────┴──────────┘
df.filter(pl.col("number") > 20).select("category")
shape: (1, 1)
┌──────────┐
│ category │
│ --- │
│ str │
╞══════════╡
│ a │
└──────────┘
Performance tips
Use lazy execution where possible
We already saw that you can read files lazily with pl.scan_csv instead of pl.read_csv.
Another option is to call the lazy method on an existing DataFrame.
Example:
df = pl.DataFrame({"foo": ["a", "b", "c"], "bar": [0, 1, 2]}).lazy()
The query is only executed, and its results returned, when you call the collect method.
Data file format
Apache Parquet is a good file format for storing large datasets. It is a columnar format (values are stored together by column rather than by row, as in CSV files), which allows better compression.
Migrating from Pandas
Read the migration guide: it will help you write Polars code rather than “literally translated” Pandas code that runs, but doesn’t make use of Polars’ strengths. The differences in style mostly come from the fact that Polars runs in parallel.