DataFrames inspection

Author

Marie-Hélène Burle

Once we have a DataFrame, it is important to quickly get some basic information about it. In this section, we will see how to do so.

Let’s use the la_riots dataset, an open-source dataset on fatalities during the civil unrest in Los Angeles in April and May 1992, provided by the plotting library Vega-Altair. The dataset is hosted online as a CSV file.

You can read in a CSV file (local or from the Internet) with polars.read_csv:

import polars as pl

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/la-riots.csv")
print(df)
shape: (63, 11)
┌─────────────┬───────────┬─────┬────────┬───┬─────────────┬─────────────┬─────────────┬───────────┐
│ first_name  ┆ last_name ┆ age ┆ gender ┆ … ┆ neighborhoo ┆ type        ┆ longitude   ┆ latitude  │
│ ---         ┆ ---       ┆ --- ┆ ---    ┆   ┆ d           ┆ ---         ┆ ---         ┆ ---       │
│ str         ┆ str       ┆ i64 ┆ str    ┆   ┆ ---         ┆ str         ┆ f64         ┆ f64       │
│             ┆           ┆     ┆        ┆   ┆ str         ┆             ┆             ┆           │
╞═════════════╪═══════════╪═════╪════════╪═══╪═════════════╪═════════════╪═════════════╪═══════════╡
│ Cesar A.    ┆ Aguilar   ┆ 18  ┆ Male   ┆ … ┆ Westlake    ┆ Officer-inv ┆ -118.273976 ┆ 34.059281 │
│             ┆           ┆     ┆        ┆   ┆             ┆ olved       ┆             ┆           │
│             ┆           ┆     ┆        ┆   ┆             ┆ shooting    ┆             ┆           │
│ George      ┆ Alvarez   ┆ 42  ┆ Male   ┆ … ┆ Chinatown   ┆ Not riot-re ┆ -118.234098 ┆ 34.06269  │
│             ┆           ┆     ┆        ┆   ┆             ┆ lated       ┆             ┆           │
│ Wilson      ┆ Alvarez   ┆ 40  ┆ Male   ┆ … ┆ Hawthorne   ┆ Homicide    ┆ -118.326816 ┆ 33.901662 │
│ Brian E.    ┆ Andrew    ┆ 30  ┆ Male   ┆ … ┆ Compton     ┆ Officer-inv ┆ -118.21539  ┆ 33.903457 │
│             ┆           ┆     ┆        ┆   ┆             ┆ olved       ┆             ┆           │
│             ┆           ┆     ┆        ┆   ┆             ┆ shooting    ┆             ┆           │
│ Vivian      ┆ Austin    ┆ 87  ┆ Female ┆ … ┆ Harvard     ┆ Death       ┆ -118.304741 ┆ 33.985667 │
│             ┆           ┆     ┆        ┆   ┆ Park        ┆             ┆             ┆           │
│ …           ┆ …         ┆ …   ┆ …      ┆ … ┆ …           ┆ …           ┆ …           ┆ …         │
│ Fredrick    ┆ Ward      ┆ 20  ┆ Male   ┆ … ┆ Pacoima     ┆ Homicide    ┆ -118.412778 ┆ 34.287098 │
│ Louis A.    ┆ Watson    ┆ 18  ┆ Male   ┆ … ┆ Vermont     ┆ Homicide    ┆ -118.291557 ┆ 34.005244 │
│             ┆           ┆     ┆        ┆   ┆ Square      ┆             ┆             ┆           │
│ Elbert O.   ┆ Wilkins   ┆ 33  ┆ Male   ┆ … ┆ Gramercy    ┆ Homicide    ┆ -118.310004 ┆ 33.952767 │
│             ┆           ┆     ┆        ┆   ┆ Park        ┆             ┆             ┆           │
│ John H.     ┆ Willers   ┆ 37  ┆ Male   ┆ … ┆ Mission     ┆ Homicide    ┆ -118.46777  ┆ 34.263184 │
│             ┆           ┆     ┆        ┆   ┆ Hills       ┆             ┆             ┆           │
│ Willie      ┆ Williams  ┆ 29  ┆ Male   ┆ … ┆ Chesterfiel ┆ Death       ┆ -118.308952 ┆ 33.982363 │
│ Bernard     ┆           ┆     ┆        ┆   ┆ d Square    ┆             ┆             ┆           │
└─────────────┴───────────┴─────┴────────┴───┴─────────────┴─────────────┴─────────────┴───────────┘

Printing a few rows

Print first rows (5 by default):

print(df.head())
shape: (5, 11)
┌────────────┬───────────┬─────┬────────┬───┬──────────────┬─────────────┬─────────────┬───────────┐
│ first_name ┆ last_name ┆ age ┆ gender ┆ … ┆ neighborhood ┆ type        ┆ longitude   ┆ latitude  │
│ ---        ┆ ---       ┆ --- ┆ ---    ┆   ┆ ---          ┆ ---         ┆ ---         ┆ ---       │
│ str        ┆ str       ┆ i64 ┆ str    ┆   ┆ str          ┆ str         ┆ f64         ┆ f64       │
╞════════════╪═══════════╪═════╪════════╪═══╪══════════════╪═════════════╪═════════════╪═══════════╡
│ Cesar A.   ┆ Aguilar   ┆ 18  ┆ Male   ┆ … ┆ Westlake     ┆ Officer-inv ┆ -118.273976 ┆ 34.059281 │
│            ┆           ┆     ┆        ┆   ┆              ┆ olved       ┆             ┆           │
│            ┆           ┆     ┆        ┆   ┆              ┆ shooting    ┆             ┆           │
│ George     ┆ Alvarez   ┆ 42  ┆ Male   ┆ … ┆ Chinatown    ┆ Not riot-re ┆ -118.234098 ┆ 34.06269  │
│            ┆           ┆     ┆        ┆   ┆              ┆ lated       ┆             ┆           │
│ Wilson     ┆ Alvarez   ┆ 40  ┆ Male   ┆ … ┆ Hawthorne    ┆ Homicide    ┆ -118.326816 ┆ 33.901662 │
│ Brian E.   ┆ Andrew    ┆ 30  ┆ Male   ┆ … ┆ Compton      ┆ Officer-inv ┆ -118.21539  ┆ 33.903457 │
│            ┆           ┆     ┆        ┆   ┆              ┆ olved       ┆             ┆           │
│            ┆           ┆     ┆        ┆   ┆              ┆ shooting    ┆             ┆           │
│ Vivian     ┆ Austin    ┆ 87  ┆ Female ┆ … ┆ Harvard Park ┆ Death       ┆ -118.304741 ┆ 33.985667 │
└────────────┴───────────┴─────┴────────┴───┴──────────────┴─────────────┴─────────────┴───────────┘
print(df.head(2))
shape: (2, 11)
┌────────────┬───────────┬─────┬────────┬───┬──────────────┬─────────────┬─────────────┬───────────┐
│ first_name ┆ last_name ┆ age ┆ gender ┆ … ┆ neighborhood ┆ type        ┆ longitude   ┆ latitude  │
│ ---        ┆ ---       ┆ --- ┆ ---    ┆   ┆ ---          ┆ ---         ┆ ---         ┆ ---       │
│ str        ┆ str       ┆ i64 ┆ str    ┆   ┆ str          ┆ str         ┆ f64         ┆ f64       │
╞════════════╪═══════════╪═════╪════════╪═══╪══════════════╪═════════════╪═════════════╪═══════════╡
│ Cesar A.   ┆ Aguilar   ┆ 18  ┆ Male   ┆ … ┆ Westlake     ┆ Officer-inv ┆ -118.273976 ┆ 34.059281 │
│            ┆           ┆     ┆        ┆   ┆              ┆ olved       ┆             ┆           │
│            ┆           ┆     ┆        ┆   ┆              ┆ shooting    ┆             ┆           │
│ George     ┆ Alvarez   ┆ 42  ┆ Male   ┆ … ┆ Chinatown    ┆ Not riot-re ┆ -118.234098 ┆ 34.06269  │
│            ┆           ┆     ┆        ┆   ┆              ┆ lated       ┆             ┆           │
└────────────┴───────────┴─────┴────────┴───┴──────────────┴─────────────┴─────────────┴───────────┘

Print last rows (5 by default):

print(df.tail(2))
shape: (2, 11)
┌───────────────┬───────────┬─────┬────────┬───┬──────────────┬──────────┬─────────────┬───────────┐
│ first_name    ┆ last_name ┆ age ┆ gender ┆ … ┆ neighborhood ┆ type     ┆ longitude   ┆ latitude  │
│ ---           ┆ ---       ┆ --- ┆ ---    ┆   ┆ ---          ┆ ---      ┆ ---         ┆ ---       │
│ str           ┆ str       ┆ i64 ┆ str    ┆   ┆ str          ┆ str      ┆ f64         ┆ f64       │
╞═══════════════╪═══════════╪═════╪════════╪═══╪══════════════╪══════════╪═════════════╪═══════════╡
│ John H.       ┆ Willers   ┆ 37  ┆ Male   ┆ … ┆ Mission      ┆ Homicide ┆ -118.46777  ┆ 34.263184 │
│               ┆           ┆     ┆        ┆   ┆ Hills        ┆          ┆             ┆           │
│ Willie        ┆ Williams  ┆ 29  ┆ Male   ┆ … ┆ Chesterfield ┆ Death    ┆ -118.308952 ┆ 33.982363 │
│ Bernard       ┆           ┆     ┆        ┆   ┆ Square       ┆          ┆             ┆           │
└───────────────┴───────────┴─────┴────────┴───┴──────────────┴──────────┴─────────────┴───────────┘

Print random rows (this is very useful as the head and tail of your DataFrame may not be representative of your data):

print(df.sample(4))
shape: (4, 11)
┌────────────┬───────────┬─────┬────────┬───┬─────────────────┬──────────┬─────────────┬───────────┐
│ first_name ┆ last_name ┆ age ┆ gender ┆ … ┆ neighborhood    ┆ type     ┆ longitude   ┆ latitude  │
│ ---        ┆ ---       ┆ --- ┆ ---    ┆   ┆ ---             ┆ ---      ┆ ---         ┆ ---       │
│ str        ┆ str       ┆ i64 ┆ str    ┆   ┆ str             ┆ str      ┆ f64         ┆ f64       │
╞════════════╪═══════════╪═════╪════════╪═══╪═════════════════╪══════════╪═════════════╪═══════════╡
│ Darnell R. ┆ Mallory   ┆ 18  ┆ Male   ┆ … ┆ Hollywood       ┆ Death    ┆ -118.334138 ┆ 34.090972 │
│ Paul D.    ┆ Horace    ┆ 38  ┆ Male   ┆ … ┆ Central-Alameda ┆ Homicide ┆ -118.247414 ┆ 34.021735 │
│ Louis A.   ┆ Watson    ┆ 18  ┆ Male   ┆ … ┆ Vermont Square  ┆ Homicide ┆ -118.291557 ┆ 34.005244 │
│ Ira F.     ┆ McCurry   ┆ 45  ┆ Male   ┆ … ┆ Green Meadows   ┆ Homicide ┆ -118.265163 ┆ 33.943756 │
└────────────┴───────────┴─────┴────────┴───┴─────────────────┴──────────┴─────────────┴───────────┘

Structure

Overview of the DataFrame and its structure:

df.glimpse()
Rows: 63
Columns: 11
$ first_name   <str> 'Cesar A.', 'George', 'Wilson', 'Brian E.', 'Vivian', 'Franklin', 'Carol', 'Patrick', 'Hector', 'Jerel L.'
$ last_name    <str> 'Aguilar', 'Alvarez', 'Alvarez', 'Andrew', 'Austin', 'Benavidez', 'Benson', 'Bettan', 'Castro', 'Channell'
$ age          <i64> 18, 42, 40, 30, 87, 27, 42, 30, 49, 26
$ gender       <str> 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'
$ race         <str> 'Latino', 'Latino', 'Latino', 'Black', 'Black', 'Latino', 'Black', 'White', 'Latino', 'Black'
$ death_date   <str> '1992-04-30', '1992-05-01', '1992-05-23', '1992-04-30', '1992-05-03', '1992-04-30', '1992-05-02', '1992-04-30', '1992-04-30', '1992-04-30'
$ address      <str> '2009 W. 6th St.', 'Main & College streets', '3100 Rosecrans Ave.', 'Rosecrans & Chester avenues', '1600 W. 60th St.', '4404 S. Western Ave.', 'Harbor Freeway near Slauson Avenue', '2740 W. Olympic Blvd.', 'Vermont & Leeward avenues', 'Santa Monica Boulevard & Seward Street'
$ neighborhood <str> 'Westlake', 'Chinatown', 'Hawthorne', 'Compton', 'Harvard Park', 'Vermont Square', 'South Park', 'Koreatown', 'Koreatown', 'Hollywood'
$ type         <str> 'Officer-involved shooting', 'Not riot-related', 'Homicide', 'Officer-involved shooting', 'Death', 'Officer-involved shooting', 'Death', 'Homicide', 'Homicide', 'Death'
$ longitude    <f64> -118.2739756, -118.2340982, -118.326816, -118.2153903, -118.304741, -118.3088215, -118.2805037, -118.293181, -118.291654, -118.3323783
$ latitude     <f64> 34.0592814, 34.0626901, 33.901662, 33.9034569, 33.985667, 34.0034731, 33.98916756, 34.052068, 34.0587022, 34.09129756

This is similar to the str() function in R.

The list of columns (variable names) can be accessed with the columns attribute:

print(df.columns)
['first_name', 'last_name', 'age', 'gender', 'race', 'death_date', 'address', 'neighborhood', 'type', 'longitude', 'latitude']

To print a list of the data types of each variable, use the dtypes attribute:

print(df.dtypes)
[String, String, Int64, String, String, String, String, String, String, Float64, Float64]

Note that printing a Polars DataFrame already shows this information (along with the shape).

The schema of a Polars DataFrame maps the names of the variables (columns) to their data types:

df.schema
Schema([('first_name', String),
        ('last_name', String),
        ('age', Int64),
        ('gender', String),
        ('race', String),
        ('death_date', String),
        ('address', String),
        ('neighborhood', String),
        ('type', String),
        ('longitude', Float64),
        ('latitude', Float64)])

Summary statistics

The describe method returns summary statistics for each variable: count, number of missing values, mean, standard deviation, min, max, and quartiles. Depending on your data, some of these statistics may not be meaningful, but the output always gives you one useful piece of information: the number of missing values for each variable. Here, the statistics are also informative for the age variable:

print(df.describe())
shape: (9, 12)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ first_nam ┆ last_name ┆ age       ┆ … ┆ neighborh ┆ type      ┆ longitude ┆ latitude │
│ ---       ┆ e         ┆ ---       ┆ ---       ┆   ┆ ood       ┆ ---       ┆ ---       ┆ ---      │
│ str       ┆ ---       ┆ str       ┆ f64       ┆   ┆ ---       ┆ str       ┆ f64       ┆ f64      │
│           ┆ str       ┆           ┆           ┆   ┆ str       ┆           ┆           ┆          │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count     ┆ 63        ┆ 63        ┆ 62.0      ┆ … ┆ 63        ┆ 63        ┆ 63.0      ┆ 63.0     │
│ null_coun ┆ 0         ┆ 0         ┆ 1.0       ┆ … ┆ 0         ┆ 0         ┆ 0.0       ┆ 0.0      │
│ t         ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ mean      ┆ null      ┆ null      ┆ 32.370968 ┆ … ┆ null      ┆ null      ┆ -118.2799 ┆ 34.02671 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆ 1         ┆ 3        │
│ std       ┆ null      ┆ null      ┆ 14.253253 ┆ … ┆ null      ┆ null      ┆ 0.105198  ┆ 0.098471 │
│ min       ┆ Aaron     ┆ Aguilar   ┆ 15.0      ┆ … ┆ Altadena  ┆ Death     ┆ -118.4717 ┆ 33.78985 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆ 45        ┆ 7        │
│ 25%       ┆ null      ┆ null      ┆ 21.0      ┆ … ┆ null      ┆ null      ┆ -118.3098 ┆ 33.97418 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆ 22        ┆          │
│ 50%       ┆ null      ┆ null      ┆ 31.0      ┆ … ┆ null      ┆ null      ┆ -118.2914 ┆ 34.00548 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆ 95        ┆ 5        │
│ 75%       ┆ null      ┆ null      ┆ 38.0      ┆ … ┆ null      ┆ null      ┆ -118.2531 ┆ 34.07023 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆ 97        ┆ 8        │
│ max       ┆ Wilson    ┆ Williams  ┆ 87.0      ┆ … ┆ Westlake  ┆ Officer-i ┆ -117.7306 ┆ 34.28709 │
│           ┆           ┆           ┆           ┆   ┆           ┆ nvolved   ┆ 47        ┆ 8        │
│           ┆           ┆           ┆           ┆   ┆           ┆ shooting  ┆           ┆          │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘