Data structures

Author

Marie-Hélène Burle

Polars provides two fundamental data structures: series and data frames.

Series

In Polars, series are one-dimensional and homogeneous (all elements have the same data type).

Comparable structures in other frameworks or languages include the Series in pandas and the vector in R.

import polars as pl

s1 = pl.Series(range(5))
print(s1)

shape: (5,)
Series: '' [i64]
[
    0
    1
    2
    3
    4
]

Data types

Polars infers data types from the data. By default, Python integers become Int64 and Python floats become Float64. To use another type, specify it with the dtype parameter when creating the series:

s2 = pl.Series(range(5), dtype=pl.Int32)
print(s2)

shape: (5,)
Series: '' [i32]
[
    0
    1
    2
    3
    4
]

Named series

Series can be named:

s3 = pl.Series("Name", ["Bob", "Luc", "Lucy"])
print(s3)

shape: (3,)
Series: 'Name' [str]
[
    "Bob"
    "Luc"
    "Lucy"
]

Data frames

Data frames are two-dimensional and composed of named series of equal length. This means that a data frame can hold different data types in different columns, while the data within each column is homogeneous.

They can be created from:

lists of series:

df1 = pl.DataFrame([s3, pl.Series("Colour", ["Red", "Green", "Blue"])])
print(df1)

shape: (3, 2)
┌──────┬────────┐
│ Name ┆ Colour │
│ ---  ┆ ---    │
│ str  ┆ str    │
╞══════╪════════╡
│ Bob  ┆ Red    │
│ Luc  ┆ Green  │
│ Lucy ┆ Blue   │
└──────┴────────┘

dictionaries:

from datetime import date

df2 = pl.DataFrame(
    {
        "Date": [
            date(2024, 10, 1),
            date(2024, 10, 2),
            date(2024, 10, 3),
            date(2024, 10, 6)
        ],
        "Rain": [2.1, 0.5, 0.0, 1.8],
        "Cloud cover": [1, 1, 0, 2]
    }
)
print(df2)

shape: (4, 3)
┌────────────┬──────┬─────────────┐
│ Date       ┆ Rain ┆ Cloud cover │
│ ---        ┆ ---  ┆ ---         │
│ date       ┆ f64  ┆ i64         │
╞════════════╪══════╪═════════════╡
│ 2024-10-01 ┆ 2.1  ┆ 1           │
│ 2024-10-02 ┆ 0.5  ┆ 1           │
│ 2024-10-03 ┆ 0.0  ┆ 0           │
│ 2024-10-06 ┆ 1.8  ┆ 2           │
└────────────┴──────┴─────────────┘

NumPy ndarrays:

import numpy as np

df3 = pl.DataFrame(np.array([(1, 2), (3, 4)]))
print(df3)

shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 2        │
│ 3        ┆ 4        │
└──────────┴──────────┘

By default, Polars treats the first axis of a two-dimensional NumPy array as rows, so each inner sequence of the array fills one row of the data frame. To have each inner sequence fill a column instead, use the orient parameter:

df4 = pl.DataFrame(np.array([(1, 2), (3, 4)]), orient="col")
print(df4)

shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 3        │
│ 2        ┆ 4        │
└──────────┴──────────┘

To specify column names, you can use the schema parameter:

df5 = pl.DataFrame(np.array([(1, 2), (3, 4)]), schema=["Var1", "Var2"])
print(df5)

shape: (2, 2)
┌──────┬──────┐
│ Var1 ┆ Var2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 2    │
│ 3    ┆ 4    │
└──────┴──────┘

Data types

To specify data types different from the defaults, you also use the schema parameter, this time passing a dictionary that maps each column name to its data type:

df6 = pl.DataFrame(
    {
        "Rain": [2.1, 0.5, 0.0, 1.8],
        "Cloud cover": [1, 1, 0, 2],
    },
    schema={"Rain": pl.Float32, "Cloud cover": pl.Int32}
)
print(df6)

shape: (4, 2)
┌──────┬─────────────┐
│ Rain ┆ Cloud cover │
│ ---  ┆ ---         │
│ f32  ┆ i32         │
╞══════╪═════════════╡
│ 2.1  ┆ 1           │
│ 0.5  ┆ 1           │
│ 0.0  ┆ 0           │
│ 1.8  ┆ 2           │
└──────┴─────────────┘