Data structures

Author

Marie-Hélène Burle

Polars provides two fundamental data structures: Series and DataFrames.

Series

In Polars, Series are one-dimensional and homogeneous (all elements have the same data type).

Other frameworks and languages have an equivalent structure, sometimes under another name: pandas also calls it a Series, while R calls it an (atomic) vector.

import polars as pl

s1 = pl.Series(range(5))

s1
shape: (5,)
i64
0
1
2
3
4

Data types

Polars infers the data type from the data. For Python integers and floats, the defaults are Int64 and Float64, but you can specify another type with the dtype argument:

s2 = pl.Series(range(5), dtype=pl.Int32)

s2
shape: (5,)
i32
0
1
2
3
4

Named Series

Series can be named:

s3 = pl.Series("Name", ["Bob", "Luc", "Lucy"])

s3
shape: (3,)
Name
str
"Bob"
"Luc"
"Lucy"

DataFrames

DataFrames are two-dimensional and composed of named Series of equal length. This means that a DataFrame can be heterogeneous across columns, while each column holds homogeneous data.

They can be created from:

  • lists of Series:
df1 = pl.DataFrame([s3, pl.Series("Colour", ["Red", "Green", "Blue"])])

df1
shape: (3, 2)
Name Colour
str str
"Bob" "Red"
"Luc" "Green"
"Lucy" "Blue"
  • dictionaries:
from datetime import date

df2 = pl.DataFrame(
    {
        "Date": [
            date(2024, 10, 1),
            date(2024, 10, 2),
            date(2024, 10, 3),
            date(2024, 10, 6),
        ],
        "Rain": [2.1, 0.5, 0.0, 1.8],
        "Cloud cover": [1, 1, 0, 2],
    }
)

df2
shape: (4, 3)
Date        Rain  Cloud cover
date        f64   i64
2024-10-01  2.1   1
2024-10-02  0.5   1
2024-10-03  0.0   0
2024-10-06  1.8   2
  • NumPy ndarrays:
import numpy as np

df3 = pl.DataFrame(np.array([(1, 2), (3, 4)]))

df3
shape: (2, 2)
column_0 column_1
i64 i64
1 2
3 4

NumPy ndarrays are row-major by default, and Polars follows the same convention here: the first axis of the array indexes rows, so (1, 2) fills in the first row. To fill in the DataFrame by column instead, use the orient parameter:

df4 = pl.DataFrame(np.array([(1, 2), (3, 4)]), orient="col")

df4
shape: (2, 2)
column_0 column_1
i64 i64
1 3
2 4

To specify column names, you can use the schema parameter:

df5 = pl.DataFrame(np.array([(1, 2), (3, 4)]), schema=["Var1", "Var2"])

df5
shape: (2, 2)
Var1 Var2
i64 i64
1 2
3 4

Data types

The schema parameter also accepts a dictionary mapping column names to data types, which lets you specify types different from the defaults:

df6 = pl.DataFrame(
    {
        "Rain": [2.1, 0.5, 0.0, 1.8],
        "Cloud cover": [1, 1, 0, 2],
    },
    schema={"Rain": pl.Float32, "Cloud cover": pl.Int32}
)

df6
shape: (4, 2)
Rain  Cloud cover
f32   i32
2.1   1
0.5   1
0.0   0
1.8   2