DataFrames inspection

Author

Marie-Hélène Burle

Once we have a DataFrame, the first step is usually to get some basic information about it: its size, its columns and their data types, and a sample of its rows. This section shows how to do so.

Let’s use the la_riots dataset, an open-source dataset on fatalities during the civil unrest in Los Angeles in April and May 1992, provided by the plotting library Vega-Altair. The dataset is hosted online as a CSV file.

You can read in a CSV file (local or from the Internet) with polars.read_csv:

import polars as pl

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/la-riots.csv")

df
shape: (63, 11)
first_name last_name age gender race death_date address neighborhood type longitude latitude
str str i64 str str str str str str f64 f64
"Cesar A." "Aguilar" 18 "Male" "Latino" "1992-04-30" "2009 W. 6th St." "Westlake" "Officer-involved shooting" -118.273976 34.059281
"George" "Alvarez" 42 "Male" "Latino" "1992-05-01" "Main & College streets" "Chinatown" "Not riot-related" -118.234098 34.06269
"Wilson" "Alvarez" 40 "Male" "Latino" "1992-05-23" "3100 Rosecrans Ave." "Hawthorne" "Homicide" -118.326816 33.901662
"Brian E." "Andrew" 30 "Male" "Black" "1992-04-30" "Rosecrans & Chester avenues" "Compton" "Officer-involved shooting" -118.21539 33.903457
"Vivian" "Austin" 87 "Female" "Black" "1992-05-03" "1600 W. 60th St." "Harvard Park" "Death" -118.304741 33.985667
"Fredrick" "Ward" 20 "Male" "Black" "1992-05-02" "11932 Cometa Ave." "Pacoima" "Homicide" -118.412778 34.287098
"Louis A." "Watson" 18 "Male" "Black" "1992-04-29" "4365 S. Vermont Ave." "Vermont Square" "Homicide" -118.291557 34.005244
"Elbert O." "Wilkins" 33 "Male" "Black" "1992-04-30" "Western Avenue & 92nd Street" "Gramercy Park" "Homicide" -118.310004 33.952767
"John H." "Willers" 37 "Male" "White" "1992-04-29" "10621 Sepulveda Blvd." "Mission Hills" "Homicide" -118.46777 34.263184
"Willie Bernard" "Williams" 29 "Male" "Black" "1992-04-29" "Gage & Western avenues" "Chesterfield Square" "Death" -118.308952 33.982363


Printing a few rows

Print first rows (5 by default):

df.head()
shape: (5, 11)
first_name last_name age gender race death_date address neighborhood type longitude latitude
str str i64 str str str str str str f64 f64
"Cesar A." "Aguilar" 18 "Male" "Latino" "1992-04-30" "2009 W. 6th St." "Westlake" "Officer-involved shooting" -118.273976 34.059281
"George" "Alvarez" 42 "Male" "Latino" "1992-05-01" "Main & College streets" "Chinatown" "Not riot-related" -118.234098 34.06269
"Wilson" "Alvarez" 40 "Male" "Latino" "1992-05-23" "3100 Rosecrans Ave." "Hawthorne" "Homicide" -118.326816 33.901662
"Brian E." "Andrew" 30 "Male" "Black" "1992-04-30" "Rosecrans & Chester avenues" "Compton" "Officer-involved shooting" -118.21539 33.903457
"Vivian" "Austin" 87 "Female" "Black" "1992-05-03" "1600 W. 60th St." "Harvard Park" "Death" -118.304741 33.985667


df.head(2)
shape: (2, 11)
first_name last_name age gender race death_date address neighborhood type longitude latitude
str str i64 str str str str str str f64 f64
"Cesar A." "Aguilar" 18 "Male" "Latino" "1992-04-30" "2009 W. 6th St." "Westlake" "Officer-involved shooting" -118.273976 34.059281
"George" "Alvarez" 42 "Male" "Latino" "1992-05-01" "Main & College streets" "Chinatown" "Not riot-related" -118.234098 34.06269


Print last rows (5 by default):

df.tail(2)
shape: (2, 11)
first_name last_name age gender race death_date address neighborhood type longitude latitude
str str i64 str str str str str str f64 f64
"John H." "Willers" 37 "Male" "White" "1992-04-29" "10621 Sepulveda Blvd." "Mission Hills" "Homicide" -118.46777 34.263184
"Willie Bernard" "Williams" 29 "Male" "Black" "1992-04-29" "Gage & Western avenues" "Chesterfield Square" "Death" -118.308952 33.982363


Print random rows (this is very useful as the head and tail of your DataFrame may not be representative of your data):

df.sample(4)
shape: (4, 11)
first_name last_name age gender race death_date address neighborhood type longitude latitude
str str i64 str str str str str str f64 f64
"Eduardo C." "Vela" 33 "Male" "Latino" "1992-04-29" "5100 block of West Slauson Ave… "Ladera Heights" "Homicide" -118.36872 33.987604
"Wilson" "Alvarez" 40 "Male" "Latino" "1992-05-23" "3100 Rosecrans Ave." "Hawthorne" "Homicide" -118.326816 33.901662
"Harry" "Doller" 56 "Male" "White" "1992-05-01" "3500 block of Winslow Drive" "Silver Lake" "Not riot-related" -118.278763 34.087788
"William" "Ross" 33 "Male" "White" "1992-05-01" "2882 W. 9th St." "Koreatown" "Homicide" -118.291274 34.05569

Structure

Overview of the DataFrame and its structure:

df.glimpse()
Rows: 63
Columns: 11
$ first_name   <str> 'Cesar A.', 'George', 'Wilson', 'Brian E.', 'Vivian', 'Franklin', 'Carol', 'Patrick', 'Hector', 'Jerel L.'
$ last_name    <str> 'Aguilar', 'Alvarez', 'Alvarez', 'Andrew', 'Austin', 'Benavidez', 'Benson', 'Bettan', 'Castro', 'Channell'
$ age          <i64> 18, 42, 40, 30, 87, 27, 42, 30, 49, 26
$ gender       <str> 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'
$ race         <str> 'Latino', 'Latino', 'Latino', 'Black', 'Black', 'Latino', 'Black', 'White', 'Latino', 'Black'
$ death_date   <str> '1992-04-30', '1992-05-01', '1992-05-23', '1992-04-30', '1992-05-03', '1992-04-30', '1992-05-02', '1992-04-30', '1992-04-30', '1992-04-30'
$ address      <str> '2009 W. 6th St.', 'Main & College streets', '3100 Rosecrans Ave.', 'Rosecrans & Chester avenues', '1600 W. 60th St.', '4404 S. Western Ave.', 'Harbor Freeway near Slauson Avenue', '2740 W. Olympic Blvd.', 'Vermont & Leeward avenues', 'Santa Monica Boulevard & Seward Street'
$ neighborhood <str> 'Westlake', 'Chinatown', 'Hawthorne', 'Compton', 'Harvard Park', 'Vermont Square', 'South Park', 'Koreatown', 'Koreatown', 'Hollywood'
$ type         <str> 'Officer-involved shooting', 'Not riot-related', 'Homicide', 'Officer-involved shooting', 'Death', 'Officer-involved shooting', 'Death', 'Homicide', 'Homicide', 'Death'
$ longitude    <f64> -118.2739756, -118.2340982, -118.326816, -118.2153903, -118.304741, -118.3088215, -118.2805037, -118.293181, -118.291654, -118.3323783
$ latitude     <f64> 34.0592814, 34.0626901, 33.901662, 33.9034569, 33.985667, 34.0034731, 33.98916756, 34.052068, 34.0587022, 34.09129756

This is similar to the str() function in R.

The list of columns (variable names) can be accessed with the columns attribute:

df.columns
['first_name',
 'last_name',
 'age',
 'gender',
 'race',
 'death_date',
 'address',
 'neighborhood',
 'type',
 'longitude',
 'latitude']

The data types of the variables, listed in the same order as the columns, can be accessed with the dtypes attribute:

df.dtypes
[String,
 String,
 Int64,
 String,
 String,
 String,
 String,
 String,
 String,
 Float64,
 Float64]

Note that printing a Polars DataFrame already shows this information: the data types appear under the column names and the shape is on the first line.

The schema of a Polars DataFrame maps the name of each variable (column) to its data type:

df.schema
Schema([('first_name', String),
        ('last_name', String),
        ('age', Int64),
        ('gender', String),
        ('race', String),
        ('death_date', String),
        ('address', String),
        ('neighborhood', String),
        ('type', String),
        ('longitude', Float64),
        ('latitude', Float64)])

Summary statistics

The describe method returns summary statistics for each variable: count, number of missing values, mean, standard deviation, min, max, and quartiles. Depending on your data, some of these statistics may not be meaningful, but the null_count row always gives you one useful piece of information: the number of missing values for each variable. Here, the statistics are also informative for the age variable:

df.describe()
shape: (9, 12)
statistic first_name last_name age gender race death_date address neighborhood type longitude latitude
str str str f64 str str str str str str f64 f64
"count" "63" "63" 62.0 "63" "63" "63" "63" "63" "63" 63.0 63.0
"null_count" "0" "0" 1.0 "0" "0" "0" "0" "0" "0" 0.0 0.0
"mean" null null 32.370968 null null null null null null -118.27991 34.026713
"std" null null 14.253253 null null null null null null 0.105198 0.098471
"min" "Aaron" "Aguilar" 15.0 "Female" "Asian" "1992-04-29" "1005 S. Fresno St." "Altadena" "Death" -118.471745 33.789857
"25%" null null 21.0 null null null null null null -118.309822 33.97418
"50%" null null 31.0 null null null null null null -118.291495 34.005485
"75%" null null 38.0 null null null null null null -118.253197 34.070238
"max" "Wilson" "Williams" 87.0 "Male" "White" "1993-11-24" "near North Los Robles Avenue &… "Westlake" "Officer-involved shooting" -117.730647 34.287098