DataFrames inspection

Author

Marie-Hélène Burle

Once we have a DataFrame, it is important to quickly get some basic information about it. In this section, we will see how to do so.

Let’s use the la_riots dataset, an open-source dataset on fatalities during the civil unrest in Los Angeles in April and May 1992, provided by the plotting library Vega-Altair. The dataset is hosted online as a CSV file.

You can read in a CSV file (local or from the Internet) with polars.read_csv:

import polars as pl

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/la-riots.csv")

df

shape: (63, 11)

first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	i64	str	str	str	str	str	str	f64	f64
"Cesar A."	"Aguilar"	18	"Male"	"Latino"	"1992-04-30"	"2009 W. 6th St."	"Westlake"	"Officer-involved shooting"	-118.273976	34.059281
"George"	"Alvarez"	42	"Male"	"Latino"	"1992-05-01"	"Main & College streets"	"Chinatown"	"Not riot-related"	-118.234098	34.06269
"Wilson"	"Alvarez"	40	"Male"	"Latino"	"1992-05-23"	"3100 Rosecrans Ave."	"Hawthorne"	"Homicide"	-118.326816	33.901662
"Brian E."	"Andrew"	30	"Male"	"Black"	"1992-04-30"	"Rosecrans & Chester avenues"	"Compton"	"Officer-involved shooting"	-118.21539	33.903457
"Vivian"	"Austin"	87	"Female"	"Black"	"1992-05-03"	"1600 W. 60th St."	"Harvard Park"	"Death"	-118.304741	33.985667
…	…	…	…	…	…	…	…	…	…	…
"Fredrick"	"Ward"	20	"Male"	"Black"	"1992-05-02"	"11932 Cometa Ave."	"Pacoima"	"Homicide"	-118.412778	34.287098
"Louis A."	"Watson"	18	"Male"	"Black"	"1992-04-29"	"4365 S. Vermont Ave."	"Vermont Square"	"Homicide"	-118.291557	34.005244
"Elbert O."	"Wilkins"	33	"Male"	"Black"	"1992-04-30"	"Western Avenue & 92nd Street"	"Gramercy Park"	"Homicide"	-118.310004	33.952767
"John H."	"Willers"	37	"Male"	"White"	"1992-04-29"	"10621 Sepulveda Blvd."	"Mission Hills"	"Homicide"	-118.46777	34.263184
"Willie Bernard"	"Williams"	29	"Male"	"Black"	"1992-04-29"	"Gage & Western avenues"	"Chesterfield Square"	"Death"	-118.308952	33.982363

Printing a few rows

Print first rows (5 by default):

df.head()

shape: (5, 11)

first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	i64	str	str	str	str	str	str	f64	f64
"Cesar A."	"Aguilar"	18	"Male"	"Latino"	"1992-04-30"	"2009 W. 6th St."	"Westlake"	"Officer-involved shooting"	-118.273976	34.059281
"George"	"Alvarez"	42	"Male"	"Latino"	"1992-05-01"	"Main & College streets"	"Chinatown"	"Not riot-related"	-118.234098	34.06269
"Wilson"	"Alvarez"	40	"Male"	"Latino"	"1992-05-23"	"3100 Rosecrans Ave."	"Hawthorne"	"Homicide"	-118.326816	33.901662
"Brian E."	"Andrew"	30	"Male"	"Black"	"1992-04-30"	"Rosecrans & Chester avenues"	"Compton"	"Officer-involved shooting"	-118.21539	33.903457
"Vivian"	"Austin"	87	"Female"	"Black"	"1992-05-03"	"1600 W. 60th St."	"Harvard Park"	"Death"	-118.304741	33.985667

df.head(2)

shape: (2, 11)

first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	i64	str	str	str	str	str	str	f64	f64
"Cesar A."	"Aguilar"	18	"Male"	"Latino"	"1992-04-30"	"2009 W. 6th St."	"Westlake"	"Officer-involved shooting"	-118.273976	34.059281
"George"	"Alvarez"	42	"Male"	"Latino"	"1992-05-01"	"Main & College streets"	"Chinatown"	"Not riot-related"	-118.234098	34.06269

Print last rows (5 by default):

df.tail(2)

shape: (2, 11)

first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	i64	str	str	str	str	str	str	f64	f64
"John H."	"Willers"	37	"Male"	"White"	"1992-04-29"	"10621 Sepulveda Blvd."	"Mission Hills"	"Homicide"	-118.46777	34.263184
"Willie Bernard"	"Williams"	29	"Male"	"Black"	"1992-04-29"	"Gage & Western avenues"	"Chesterfield Square"	"Death"	-118.308952	33.982363

Print random rows (this is very useful as the head and tail of your DataFrame may not be representative of your data):

df.sample(4)

shape: (4, 11)

first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	i64	str	str	str	str	str	str	f64	f64
"Eduardo C."	"Vela"	33	"Male"	"Latino"	"1992-04-29"	"5100 block of West Slauson Ave…	"Ladera Heights"	"Homicide"	-118.36872	33.987604
"Wilson"	"Alvarez"	40	"Male"	"Latino"	"1992-05-23"	"3100 Rosecrans Ave."	"Hawthorne"	"Homicide"	-118.326816	33.901662
"Harry"	"Doller"	56	"Male"	"White"	"1992-05-01"	"3500 block of Winslow Drive"	"Silver Lake"	"Not riot-related"	-118.278763	34.087788
"William"	"Ross"	33	"Male"	"White"	"1992-05-01"	"2882 W. 9th St."	"Koreatown"	"Homicide"	-118.291274	34.05569

Structure

Overview of the DataFrame and its structure:

df.glimpse()

Rows: 63
Columns: 11
$ first_name   <str> 'Cesar A.', 'George', 'Wilson', 'Brian E.', 'Vivian', 'Franklin', 'Carol', 'Patrick', 'Hector', 'Jerel L.'
$ last_name    <str> 'Aguilar', 'Alvarez', 'Alvarez', 'Andrew', 'Austin', 'Benavidez', 'Benson', 'Bettan', 'Castro', 'Channell'
$ age          <i64> 18, 42, 40, 30, 87, 27, 42, 30, 49, 26
$ gender       <str> 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'
$ race         <str> 'Latino', 'Latino', 'Latino', 'Black', 'Black', 'Latino', 'Black', 'White', 'Latino', 'Black'
$ death_date   <str> '1992-04-30', '1992-05-01', '1992-05-23', '1992-04-30', '1992-05-03', '1992-04-30', '1992-05-02', '1992-04-30', '1992-04-30', '1992-04-30'
$ address      <str> '2009 W. 6th St.', 'Main & College streets', '3100 Rosecrans Ave.', 'Rosecrans & Chester avenues', '1600 W. 60th St.', '4404 S. Western Ave.', 'Harbor Freeway near Slauson Avenue', '2740 W. Olympic Blvd.', 'Vermont & Leeward avenues', 'Santa Monica Boulevard & Seward Street'
$ neighborhood <str> 'Westlake', 'Chinatown', 'Hawthorne', 'Compton', 'Harvard Park', 'Vermont Square', 'South Park', 'Koreatown', 'Koreatown', 'Hollywood'
$ type         <str> 'Officer-involved shooting', 'Not riot-related', 'Homicide', 'Officer-involved shooting', 'Death', 'Officer-involved shooting', 'Death', 'Homicide', 'Homicide', 'Death'
$ longitude    <f64> -118.2739756, -118.2340982, -118.326816, -118.2153903, -118.304741, -118.3088215, -118.2805037, -118.293181, -118.291654, -118.3323783
$ latitude     <f64> 34.0592814, 34.0626901, 33.901662, 33.9034569, 33.985667, 34.0034731, 33.98916756, 34.052068, 34.0587022, 34.09129756

This is similar to the str() function in R.

The list of columns (variable names) can be accessed with the columns attribute:

df.columns

['first_name',
 'last_name',
 'age',
 'gender',
 'race',
 'death_date',
 'address',
 'neighborhood',
 'type',
 'longitude',
 'latitude']

To get the list of data types (one per variable), use the dtypes attribute:

df.dtypes

[String,
 String,
 Int64,
 String,
 String,
 String,
 String,
 String,
 String,
 Float64,
 Float64]

Note that printing a Polars DataFrame already shows this information (along with the shape).

The schema of a Polars DataFrame maps the names of the variables (columns) to their data types:

df.schema

Schema([('first_name', String),
        ('last_name', String),
        ('age', Int64),
        ('gender', String),
        ('race', String),
        ('death_date', String),
        ('address', String),
        ('neighborhood', String),
        ('type', String),
        ('longitude', Float64),
        ('latitude', Float64)])

Summary statistics

The describe method returns summary statistics (count, number of missing values, mean, standard deviation, min, max, and quartiles) for each variable. Depending on your data, some of these statistics may not be meaningful, but the output always gives you one useful piece of information: the number of missing values for each variable. Here, the statistics are also informative for the age variable:

df.describe()

shape: (9, 12)

statistic	first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
str	str	str	f64	str	str	str	str	str	str	f64	f64
"count"	"63"	"63"	62.0	"63"	"63"	"63"	"63"	"63"	"63"	63.0	63.0
"null_count"	"0"	"0"	1.0	"0"	"0"	"0"	"0"	"0"	"0"	0.0	0.0
"mean"	null	null	32.370968	null	null	null	null	null	null	-118.27991	34.026713
"std"	null	null	14.253253	null	null	null	null	null	null	0.105198	0.098471
"min"	"Aaron"	"Aguilar"	15.0	"Female"	"Asian"	"1992-04-29"	"1005 S. Fresno St."	"Altadena"	"Death"	-118.471745	33.789857
"25%"	null	null	21.0	null	null	null	null	null	null	-118.309822	33.97418
"50%"	null	null	31.0	null	null	null	null	null	null	-118.291495	34.005485
"75%"	null	null	38.0	null	null	null	null	null	null	-118.253197	34.070238
"max"	"Wilson"	"Williams"	87.0	"Male"	"White"	"1993-11-24"	"near North Los Robles Avenue &…	"Westlake"	"Officer-involved shooting"	-117.730647	34.287098