Data structures

Author

Marie-Hélène Burle

Polars provides two fundamental data structures: Series and DataFrames.

Series

In Polars, Series are one-dimensional and homogeneous (all elements have the same data type).

The closest equivalents in other tools are the atomic vector in R and the Series in pandas.

import polars as pl

s1 = pl.Series(range(5))
print(s1)

shape: (5,)
Series: '' [i64]
[
    0
    1
    2
    3
    4
]

Data types

Polars infers data types from the data. By default, Python integers become Int64 and Python floats become Float64. To use another type, pass it with the dtype parameter:

s2 = pl.Series(range(5), dtype=pl.Int32)
print(s2)

shape: (5,)
Series: '' [i32]
[
    0
    1
    2
    3
    4
]

Named Series

Series can be named:

s3 = pl.Series("Name", ["Bob", "Luc", "Lucy"])
print(s3)

shape: (3,)
Series: 'Name' [str]
[
    "Bob"
    "Luc"
    "Lucy"
]

DataFrames

DataFrames are two-dimensional and composed of named Series of equal length. This means that DataFrames are heterogeneous (columns can have different data types), but each column holds homogeneous data.

They can be created from:

lists of Series:

df1 = pl.DataFrame([s3, pl.Series("Colour", ["Red", "Green", "Blue"])])
print(df1)

shape: (3, 2)
┌──────┬────────┐
│ Name ┆ Colour │
│ ---  ┆ ---    │
│ str  ┆ str    │
╞══════╪════════╡
│ Bob  ┆ Red    │
│ Luc  ┆ Green  │
│ Lucy ┆ Blue   │
└──────┴────────┘

dictionaries:

from datetime import date

df2 = pl.DataFrame(
    {
        "Date": [
            date(2024, 10, 1),
            date(2024, 10, 2),
            date(2024, 10, 3),
            date(2024, 10, 6)
        ],
        "Rain": [2.1, 0.5, 0.0, 1.8],
        "Cloud cover": [1, 1, 0, 2]
    }
)
print(df2)

shape: (4, 3)
┌────────────┬──────┬─────────────┐
│ Date       ┆ Rain ┆ Cloud cover │
│ ---        ┆ ---  ┆ ---         │
│ date       ┆ f64  ┆ i64         │
╞════════════╪══════╪═════════════╡
│ 2024-10-01 ┆ 2.1  ┆ 1           │
│ 2024-10-02 ┆ 0.5  ┆ 1           │
│ 2024-10-03 ┆ 0.0  ┆ 0           │
│ 2024-10-06 ┆ 1.8  ┆ 2           │
└────────────┴──────┴─────────────┘

NumPy ndarrays:

import numpy as np

df3 = pl.DataFrame(np.array([(1, 2), (3, 4)]))
print(df3)

shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 2        │
│ 3        ┆ 4        │
└──────────┴──────────┘

By default, Polars treats the first axis of a two-dimensional NumPy array as rows, so each inner sequence of the array fills one row. To fill the DataFrame by column instead, use the orient parameter:

df4 = pl.DataFrame(np.array([(1, 2), (3, 4)]), orient="col")
print(df4)

shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 3        │
│ 2        ┆ 4        │
└──────────┴──────────┘

To specify column names, you can use the schema parameter:

df5 = pl.DataFrame(np.array([(1, 2), (3, 4)]), schema=["Var1", "Var2"])
print(df5)

shape: (2, 2)
┌──────┬──────┐
│ Var1 ┆ Var2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 2    │
│ 3    ┆ 4    │
└──────┴──────┘

Data types

To specify data types different from the defaults, you also use the schema parameter, this time passing a dictionary that maps column names to types:

df6 = pl.DataFrame(
    {
        "Rain": [2.1, 0.5, 0.0, 1.8],
        "Cloud cover": [1, 1, 0, 2],
    },
    schema={"Rain": pl.Float32, "Cloud cover": pl.Int32}
)
print(df6)

shape: (4, 2)
┌──────┬─────────────┐
│ Rain ┆ Cloud cover │
│ ---  ┆ ---         │
│ f32  ┆ i32         │
╞══════╪═════════════╡
│ 2.1  ┆ 1           │
│ 0.5  ┆ 1           │
│ 0.0  ┆ 0           │
│ 1.8  ┆ 2           │
└──────┴─────────────┘