Compiling the metadata
In this section, we process some of the metadata associated with the NABirds dataset by creating a Polars DataFrame that collects all the information we will need while processing the images and training our model. Along the way, this gives us a good overview of the dataset.
Polars is a modern, extremely fast DataFrame package; if you care about performance, use it instead of pandas whenever you can.
Metadata files
In addition to images, the dataset comes with a number of text files. To understand the dataset, of course, the place to start is by reading … the README.
It is worth reading in full; here is a summary of the key points:
Each image is associated with a UUID.
We won’t need all the information provided with this dataset for this course, but what we need is contained in the following files:
| Name | Content |
|---|---|
| bounding_boxes.txt | List of UUIDs and their corresponding bounding boxes (one bounding box per image, just around the bird) |
| classes.txt | List of class ids and corresponding class names |
| image_class_labels.txt | List of UUIDs and their corresponding class ids |
| images.txt | List of UUIDs and their corresponding file names |
| photographers.txt | List of UUIDs and their corresponding photographers |
| sizes.txt | List of UUIDs and their corresponding width and height |
| train_test_split.txt | List of UUIDs and 1 or 0 depending on whether the image is for training or validation respectively (the dataset comes with a suggested split) |
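To get a feel for how these files are laid out, here is a quick sketch (not part of the pipeline) that prints the first line of each of them; the path is the one used on our training cluster and is also assigned to base_dir further below:
import os

nabirds_dir = '/project/def-sponsor00/nabirds'  # same path we assign to base_dir below

for name in [
    'bounding_boxes.txt',
    'classes.txt',
    'image_class_labels.txt',
    'images.txt',
    'photographers.txt',
    'sizes.txt',
    'train_test_split.txt',
]:
    with open(os.path.join(nabirds_dir, name)) as f:
        # Print the file name and its first line to see the layout of each file
        print(f'{name}: {f.readline().strip()}')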
The README has one request:
Please be considerate and display the photographer’s name when displaying their image.
We will make sure to follow it.
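For instance, when we display images later on, a minimal sketch of how to honour this request could look like the following (using matplotlib and Pillow; the image path and photographer name are placeholders here and will come from the metadata DataFrame we build below):
from PIL import Image
import matplotlib.pyplot as plt

img_path = '<path-to-an-image>'        # placeholder: taken from the metadata in practice
photographer = '<photographer-name>'   # placeholder: taken from the metadata in practice

img = Image.open(img_path)
plt.imshow(img)
plt.title(f'Photo: {photographer}')    # credit the photographer alongside the image
plt.axis('off')
plt.show()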
Clean problem files
Two of the files are problematic because they are jagged: the number of elements per line is inconsistent.
Let’s create cleaning functions that write cleaned up copies of these files.
To clean the classes.txt file:
def clean_classes_file(input_filepath, output_filepath):
    """
    Remove commas, remove parentheses, and replace all spaces on each line with
    underscores, except the first space and the space between the species name
    and the subcategory info.
    Args:
        input_filepath (str): the path of the input file.
        output_filepath (str): the path of the output file.
    """
    with open(input_filepath, 'r') as infile, \
         open(output_filepath, 'w') as outfile:
        for line in infile:
            # Remove commas and closing parentheses
            cleaned_line = line.replace(',', '').replace(')', '')
            # Strip newline characters
            cleaned_line = cleaned_line.strip()
            # Split line into two parts based on the first space
            parts = cleaned_line.split(' ', 1)
            # Replace spaces in the second part with underscores, then turn the
            # opening parenthesis back into the space separating species and subcategory
            part2_cleaned = parts[1].replace(' ', '_').replace('_(', ' ')
            final_line = f'{parts[0]} {part2_cleaned}\n'
            outfile.write(final_line)
To clean the photographers.txt file:
def clean_photographer_file(input_filepath, output_filepath):
    """
    Remove commas, remove quotes, and replace all spaces except the first space
    on each line with underscores.
    Args:
        input_filepath (str): the path of the input file.
        output_filepath (str): the path of the output file.
    """
    with open(input_filepath, 'r') as infile, \
         open(output_filepath, 'w') as outfile:
        for line in infile:
            # Remove quotes and commas
            cleaned_line = line.replace('"', '').replace(',', '')
            # Strip newline characters
            cleaned_line = cleaned_line.strip()
            # Split line into two parts based on the first space
            parts = cleaned_line.split(' ', 1)
            # Replace spaces in the second part with underscores
            part2_cleaned = parts[1].replace(' ', '_')
            final_line = f'{parts[0]} {part2_cleaned}\n'
            outfile.write(final_line)
Then we can apply the functions to our files:
base_dir = '<path-of-the-nabirds-dir>'
To be replaced by the actual path; on our training cluster, base_dir is /project/def-sponsor00/nabirds:
base_dir = '/project/def-sponsor00/nabirds'
You will not be able to run the following chunk on the training cluster because I did not give you write access to the dataset. This is on purpose, to avoid everyone trying to write to the same files at the same time.
I already created the cleaned files.
import os
clean_photographer_file(
os.path.join(base_dir, 'photographers.txt'),
os.path.join(base_dir, 'photographers_cleaned.txt')
)
clean_classes_file(
os.path.join(base_dir, 'classes.txt'),
os.path.join(base_dir, 'classes_cleaned.txt')
)
Create variables
For convenience, let’s create variables with the path of the various files we need:
bb_file = os.path.join(base_dir, 'bounding_boxes.txt')
class_id_to_name_file = os.path.join(base_dir, 'classes_cleaned.txt')
class_id_file = os.path.join(base_dir, 'image_class_labels.txt')
path_file = os.path.join(base_dir, 'images.txt')
photographer_file = os.path.join(base_dir, 'photographers_cleaned.txt')
size_file = os.path.join(base_dir, 'sizes.txt')
train_test_split_file = os.path.join(base_dir, 'train_test_split.txt')
Create a DataFrame
Now it’s time to put all the data together in one DataFrame.
First, we create a series of DataFrames from each text file:
import polars as pl
bb = pl.read_csv(
bb_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'bb_x', 'bb_y', 'bb_width', 'bb_height']
)
class_id = pl.read_csv(
class_id_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'class_id']
)
The class_id_to_name_file is also fairly complicated: some bird species names are followed by additional information (e.g. “Adult” or “Immature”) in parentheses. If we want to train a model that identifies bird species (rather than classes that include the subcategories), we need to separate the two (see also the section below for more explanation on this).
An easy way to quickly check what that additional information looks like is to run the following in the terminal (not in Python):
rg "\(" /project/def-sponsor00/nabirds/classes.txt | fzf
In order to split the species name from the additional information, we need to create 3 columns instead of 2 for this file. The problem is that many rows only have 2 elements (the additional info is often absent).
The shell command above lets us quickly find the first occurrence of the additional info: it appears at line 295.
By default, Polars scans the first 100 rows to infer the schema of the DataFrame. We need to increase this value to at least 295 to make sure it detects the 3rd column while reading in the file:
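If you prefer to stay in Python, a quick sketch that finds the first line containing a parenthesis is:
with open('/project/def-sponsor00/nabirds/classes.txt') as f:
    for line_number, line in enumerate(f, start=1):
        if '(' in line:
            # Print the first class name that carries additional info in parentheses
            print(line_number, line.strip())
            break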
class_id_to_name = pl.read_csv(
class_id_to_name_file,
separator=' ',
has_header=False,
infer_schema_length=296,
new_columns=['class_id', 'species', 'subcategory']
)
path = pl.read_csv(
path_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'path']
)
photographer = pl.read_csv(
photographer_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'photographer']
)
size = pl.read_csv(
size_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'width', 'height']
)
train_test_split = pl.read_csv(
train_test_split_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'is_training_img']
)
We can use polars.read_csv even though we have text files because our files are space-separated value files: they work like CSV files, except that we have to set the separator argument to a space (' ').
Then we can combine the two DataFrames dealing with classes, so that the bird identifications become directly associated with the image UUIDs:
classes_metadata = (
class_id.join(class_id_to_name, on='class_id')
)
Finally, we combine all the DataFrames:
initial_metadata = (
bb.join(classes_metadata, on='UUID')
.join(path, on='UUID')
.join(photographer, on='UUID')
.join(size, on='UUID')
.join(train_test_split, on='UUID')
)
Format strings
We now have all those underscores in the species, subcategory, and photographer columns. We needed them to read the files into DataFrames properly, but now we want to format those strings back into readable text:
formatted_metadata = initial_metadata.with_columns(
pl.col('species').str.replace_all(r'_', ' ').alias('species_name'),
pl.col('subcategory').str.replace_all(r'_', ' '),
pl.col('photographer').str.replace_all(r'_', ' ')
).drop('species')
We also rename species to species_name because we will add a species_id in the section below.
Create species mapping
We could use class_id as our training labels. The problem is that not all the labels are disjoint. For instance:
- class_id 175 is the species Red-shouldered Hawk with the subcategory None
- class_id 359 is the species Red-shouldered Hawk with the subcategory Adult
- class_id 658 is the species Red-shouldered Hawk with the subcategory Immature
And a Red-shouldered Hawk is either an immature or an adult, so class_id 175 overlaps with the other two. This is not good for training our model.
This emphasizes that you need to know your data very well before you start training a model with it. It is important to spend the time to explore it at length, otherwise the training will fail or perform poorly and you will not understand why.
There is an additional file in this dataset called hierarchy.txt that gives the hierarchy of the various classes. In the example above, it shows that both 658 and 359 fall under the category 175, so we could use that hierarchy to solve the problem.
In this exercise though, I decided that I didn’t want to train a model that identifies birds at the level of these categories (Red-shouldered Hawk adult vs Red-shouldered Hawk immature), but at the level of the species (i.e. simply Red-shouldered Hawk).
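If you wanted to go that route, a minimal sketch could look like this (assuming hierarchy.txt simply lists a class id and its parent class id on each line; check the file to confirm before relying on this):
hierarchy = pl.read_csv(
    os.path.join(base_dir, 'hierarchy.txt'),
    separator=' ',
    has_header=False,
    new_columns=['class_id', 'parent_class_id']
)
# Look up the parents of the two subcategory classes from the example above
print(hierarchy.filter(pl.col('class_id').is_in([359, 658])))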
How you approach this depends on the model you want to create and what exactly you want it to be able to do.
If we use species as the labels, we need to create a mapping for them: the loss cannot be computed on strings, so the model needs numeric labels corresponding to the output neurons of the final layer. In other words, we need to associate each species with an integer, which can be done directly in a Polars DataFrame with a dense ranking.
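As a quick toy illustration of what a dense rank does (made-up rows, not part of our actual pipeline):
toy = pl.DataFrame({'species_name': ['Ovenbird', 'Oak Titmouse', 'Ovenbird', 'Eared Grebe']})
# Identical strings get the same rank; ranks are consecutive integers starting at 1
print(toy.with_columns(pl.col('species_name').rank('dense').alias('species_id')))
Applied to our metadata: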
metadata = formatted_metadata.with_columns(
pl.col('species_name').rank('dense').alias('species_id')
)
Sanity checks
Let’s see what our DataFrame looks like:
metadata
shape: (48_562, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ 0000139e-21dc ┆ 83 ┆ 59 ┆ 128 ┆ … ┆ 341 ┆ 0 ┆ Oak Titmouse ┆ 260 │
│ -4d0c-bfe1-4c ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ae3c… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0000d9fc-4e02 ┆ 328 ┆ 88 ┆ 163 ┆ … ┆ 427 ┆ 0 ┆ Ovenbird ┆ 264 │
│ -4c06-a0af-a5 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5cfb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 00019306-9d83 ┆ 174 ┆ 367 ┆ 219 ┆ … ┆ 1024 ┆ 0 ┆ Savannah ┆ 322 │
│ -4334-b255-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Sparrow ┆ │
│ 4774… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0001afd4-99a1 ┆ 307 ┆ 179 ┆ 492 ┆ … ┆ 680 ┆ 1 ┆ Eared Grebe ┆ 145 │
│ -4a67-b940-d4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1941… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000332b8-997c ┆ 395 ┆ 139 ┆ 262 ┆ … ┆ 682 ┆ 0 ┆ Eastern ┆ 149 │
│ -4540-9647-2f ┆ ┆ ┆ ┆ ┆ ┆ ┆ Phoebe ┆ │
│ 0a84… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ fff86e8b-795f ┆ 344 ┆ 163 ┆ 291 ┆ … ┆ 819 ┆ 1 ┆ Canyon Towhee ┆ 101 │
│ -400a-91e8-56 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5bbb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fff926d7-ccad ┆ 330 ┆ 180 ┆ 339 ┆ … ┆ 956 ┆ 1 ┆ Rough-legged ┆ 310 │
│ -4788-839e-97 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hawk ┆ │
│ af2d… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffa33ef-a765 ┆ 184 ┆ 94 ┆ 258 ┆ … ┆ 800 ┆ 1 ┆ Swallow-taile ┆ 345 │
│ -408d-8d66-6e ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Kite ┆ │
│ fc7f… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ffff0d87-bc84 ┆ 102 ┆ 210 ┆ 461 ┆ … ┆ 1024 ┆ 0 ┆ Broad-billed ┆ 77 │
│ -4ef2-a47e-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hummingbird ┆ │
│ bfa4… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffff3a5-2a75 ┆ 281 ┆ 164 ┆ 524 ┆ … ┆ 683 ┆ 0 ┆ Black-throate ┆ 57 │
│ -47d0-887f-03 ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Gray ┆ │
│ 871e… ┆ ┆ ┆ ┆ ┆ ┆ ┆ Warbler ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
And then let’s explore a number of characteristics:
print(metadata.columns)
print(metadata.row(0))
print(metadata.row(-1))
['UUID', 'bb_x', 'bb_y', 'bb_width', 'bb_height', 'class_id', 'subcategory', 'path', 'photographer', 'width', 'height', 'is_training_img', 'species_name', 'species_id']
('0000139e-21dc-4d0c-bfe1-4cae3c85c829', 83, 59, 128, 228, 817, None, '0817/0000139e21dc4d0cbfe14cae3c85c829.jpg', 'Ruth Cantwell', 296, 341, 0, 'Oak Titmouse', 260)
('fffff3a5-2a75-47d0-887f-03871e3f9a37', 281, 164, 524, 279, 880, None, '0880/fffff3a52a7547d0887f03871e3f9a37.jpg', 'Dominic Sherony', 1024, 683, 0, 'Black-throated Gray Warbler', 57)
metadata.head()
shape: (5, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ 0000139e-21dc ┆ 83 ┆ 59 ┆ 128 ┆ … ┆ 341 ┆ 0 ┆ Oak Titmouse ┆ 260 │
│ -4d0c-bfe1-4c ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ae3c… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0000d9fc-4e02 ┆ 328 ┆ 88 ┆ 163 ┆ … ┆ 427 ┆ 0 ┆ Ovenbird ┆ 264 │
│ -4c06-a0af-a5 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5cfb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 00019306-9d83 ┆ 174 ┆ 367 ┆ 219 ┆ … ┆ 1024 ┆ 0 ┆ Savannah ┆ 322 │
│ -4334-b255-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Sparrow ┆ │
│ 4774… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0001afd4-99a1 ┆ 307 ┆ 179 ┆ 492 ┆ … ┆ 680 ┆ 1 ┆ Eared Grebe ┆ 145 │
│ -4a67-b940-d4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1941… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000332b8-997c ┆ 395 ┆ 139 ┆ 262 ┆ … ┆ 682 ┆ 0 ┆ Eastern ┆ 149 │
│ -4540-9647-2f ┆ ┆ ┆ ┆ ┆ ┆ ┆ Phoebe ┆ │
│ 0a84… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
metadata.tail()
shape: (5, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ fff86e8b-795f ┆ 344 ┆ 163 ┆ 291 ┆ … ┆ 819 ┆ 1 ┆ Canyon Towhee ┆ 101 │
│ -400a-91e8-56 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5bbb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fff926d7-ccad ┆ 330 ┆ 180 ┆ 339 ┆ … ┆ 956 ┆ 1 ┆ Rough-legged ┆ 310 │
│ -4788-839e-97 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hawk ┆ │
│ af2d… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffa33ef-a765 ┆ 184 ┆ 94 ┆ 258 ┆ … ┆ 800 ┆ 1 ┆ Swallow-taile ┆ 345 │
│ -408d-8d66-6e ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Kite ┆ │
│ fc7f… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ffff0d87-bc84 ┆ 102 ┆ 210 ┆ 461 ┆ … ┆ 1024 ┆ 0 ┆ Broad-billed ┆ 77 │
│ -4ef2-a47e-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hummingbird ┆ │
│ bfa4… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffff3a5-2a75 ┆ 281 ┆ 164 ┆ 524 ┆ … ┆ 683 ┆ 0 ┆ Black-throate ┆ 57 │
│ -47d0-887f-03 ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Gray ┆ │
│ 871e… ┆ ┆ ┆ ┆ ┆ ┆ ┆ Warbler ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
# Polars uses its own random number generator, so we pass a seed to sample() for reproducibility
metadata.sample(seed=123)
shape: (1, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ b20cc001-80f0 ┆ 382 ┆ 236 ┆ 308 ┆ … ┆ 723 ┆ 0 ┆ Red-winged ┆ 300 │
│ -4280-9cd5-b9 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Blackbird ┆ │
│ b569… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
print(metadata.schema)
print(metadata.shape)
Schema({'UUID': String, 'bb_x': Int64, 'bb_y': Int64, 'bb_width': Int64, 'bb_height': Int64, 'class_id': Int64, 'subcategory': String, 'path': String, 'photographer': String, 'width': Int64, 'height': Int64, 'is_training_img': Int64, 'species_name': String, 'species_id': UInt32})
(48562, 14)
print(metadata.glimpse())
Rows: 48562
Columns: 14
$ UUID <str> '0000139e-21dc-4d0c-bfe1-4cae3c85c829', '0000d9fc-4e02-4c06-a0af-a55cfb16b12b', '00019306-9d83-4334-b255-a447742edce3', '0001afd4-99a1-4a67-b940-d419413e23b3', '000332b8-997c-4540-9647-2f0a8495aecf', '000343bd-5215-49ba-ab9c-7c97a70ac1a5', '0004ff8d-0cc8-47ee-94ba-43352a8b9eb4', '0007181f-a727-4481-ad89-591200c61b9d', '00071e20-8156-4bd8-b5ca-6445c2560ee5', '0007acfc-c0e6-4393-9ab6-02215a82ef63'
$ bb_x <i64> 83, 328, 174, 307, 395, 120, 417, 47, 260, 193
$ bb_y <i64> 59, 88, 367, 179, 139, 210, 109, 194, 146, 291
$ bb_width <i64> 128, 163, 219, 492, 262, 587, 221, 819, 578, 526
$ bb_height <i64> 228, 298, 378, 224, 390, 357, 467, 573, 516, 145
$ class_id <i64> 817, 860, 900, 645, 929, 652, 951, 900, 988, 400
$ subcategory <str> null, null, null, 'Nonbreeding/juvenile', null, 'Immature', null, null, 'Female/Immature Male', 'Adult'
$ path <str> '0817/0000139e21dc4d0cbfe14cae3c85c829.jpg', '0860/0000d9fc4e024c06a0afa55cfb16b12b.jpg', '0900/000193069d834334b255a447742edce3.jpg', '0645/0001afd499a14a67b940d419413e23b3.jpg', '0929/000332b8997c454096472f0a8495aecf.jpg', '0652/000343bd521549baab9c7c97a70ac1a5.jpg', '0951/0004ff8d0cc847ee94ba43352a8b9eb4.jpg', '0900/0007181fa7274481ad89591200c61b9d.jpg', '0988/00071e2081564bd8b5ca6445c2560ee5.jpg', '0400/0007acfcc0e643939ab602215a82ef63.jpg'
$ photographer <str> 'Ruth Cantwell', 'Christopher L. Wood Chris Wood', 'Ryan Schain', 'Laura Erickson', 'Dan Irizarry', 'Ken Schneider', 'Velma Knowles', 'Matt Tillett', 'Terry Gray', 'Cory Gregory'
$ width <i64> 296, 640, 730, 1024, 1024, 1024, 1024, 1024, 1024, 1024
$ height <i64> 341, 427, 1024, 680, 682, 768, 683, 819, 768, 681
$ is_training_img <i64> 0, 0, 0, 1, 0, 0, 0, 1, 1, 0
$ species_name <str> 'Oak Titmouse', 'Ovenbird', 'Savannah Sparrow', 'Eared Grebe', 'Eastern Phoebe', 'Yellow-crowned Night-Heron', 'Florida Scrub-Jay', 'Savannah Sparrow', 'Yellow-headed Blackbird', 'Herring Gull'
$ species_id <u32> 260, 264, 322, 145, 149, 401, 158, 322, 402, 195
None
metadata.describe()
shape: (9, 15)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ UUID ┆ bb_x ┆ bb_y ┆ … ┆ height ┆ is_traini ┆ species_n ┆ species_ │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ ng_img ┆ ame ┆ id │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ --- ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ str ┆ f64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count ┆ 48562 ┆ 48562.0 ┆ 48562.0 ┆ … ┆ 48562.0 ┆ 48562.0 ┆ 48562 ┆ 48562.0 │
│ null_coun ┆ 0 ┆ 0.0 ┆ 0.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0 ┆ 0.0 │
│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ mean ┆ null ┆ 221.68531 ┆ 158.53412 ┆ … ┆ 712.33555 ┆ 0.492752 ┆ null ┆ 204.6341 │
│ ┆ ┆ ┆ 1 ┆ ┆ ┆ ┆ ┆ 17 │
│ std ┆ null ┆ 133.05486 ┆ 80.976264 ┆ … ┆ 152.49441 ┆ 0.499953 ┆ null ┆ 116.4767 │
│ ┆ ┆ 4 ┆ ┆ ┆ ┆ ┆ ┆ 52 │
│ min ┆ 0000139e- ┆ 0.0 ┆ 0.0 ┆ … ┆ 98.0 ┆ 0.0 ┆ Abert's ┆ 1.0 │
│ ┆ 21dc-4d0c ┆ ┆ ┆ ┆ ┆ ┆ Towhee ┆ │
│ ┆ -bfe1-4ca ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ e3c… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 25% ┆ null ┆ 115.0 ┆ 99.0 ┆ … ┆ 639.0 ┆ 0.0 ┆ null ┆ 107.0 │
│ 50% ┆ null ┆ 205.0 ┆ 149.0 ┆ … ┆ 683.0 ┆ 0.0 ┆ null ┆ 204.0 │
│ 75% ┆ null ┆ 315.0 ┆ 208.0 ┆ … ┆ 780.0 ┆ 1.0 ┆ null ┆ 303.0 │
│ max ┆ fffff3a5- ┆ 837.0 ┆ 799.0 ┆ … ┆ 1024.0 ┆ 1.0 ┆ Yellow-th ┆ 405.0 │
│ ┆ 2a75-47d0 ┆ ┆ ┆ ┆ ┆ ┆ roated ┆ │
│ ┆ -887f-038 ┆ ┆ ┆ ┆ ┆ ┆ Warbler ┆ │
│ ┆ 71e… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
Learn about the data
Now that we have the metadata organized, let’s get to know our data:
print(f"There are {len(metadata)} images in the dataset.")There are 48562 images in the dataset.
metadata_train = metadata.filter(pl.col('is_training_img') == 1)
print(f"""
There are:
- {len(metadata_train)} images in the training set,
- {len(metadata) - len(metadata_train)} in the validation set.
""")
There are:
- 23929 images in the training set,
- 24633 in the validation set.
class_id = metadata.unique(pl.col('class_id'))
species = metadata.unique(pl.col('species_id'))
print(f"There are {len(class_id)} different classes and {len(species)} different species in the dataset.")There are 555 different classes and 405 different species in the dataset.
train_class_id_group_length = metadata_train.group_by(pl.col('class_id')).len()
print(f"""
The number of images per class in the training set varies from {train_class_id_group_length.select(pl.min('len')).item()} to {train_class_id_group_length.select(pl.max('len')).item()},
with an average of {round(train_class_id_group_length.select(pl.mean('len')).item())} images per class.
""")
The number of images per class in the training set varies from 4 to 60,
with an average of 43 images per class.
train_species_group_length = metadata_train.group_by(pl.col('species_id')).len()
print(f"""
The number of images per species in the training set varies from {train_species_group_length.select(pl.min('len')).item()} to {train_species_group_length.select(pl.max('len')).item()},
with an average of {round(train_species_group_length.select(pl.mean('len')).item())} images per species.
""")
The number of images per species in the training set varies from 6 to 221,
with an average of 59 images per species.
subcategory = metadata.unique(pl.col("subcategory"))
example_list = subcategory.get_column('subcategory').drop_nulls().head(10).to_list()
example_list_cleaned = [x.replace('_', ' ') for x in example_list]
print(f"""
There are {len(subcategory)} species subcategories, such as:
- {'\n- '.join(example_list_cleaned)}
- etc.
""")
There are 61 species subcategories, such as:
- Nonbreeding Adult
- Female/Immature Male
- Adult
- Nonbreeding/juvenile
- Adult Subadult
- Female/Nonbreeding male
- Breeding Audubon's
- Immature/Juvenile
- Winter/juvenile Myrtle
- Breeding Adult
- etc.
print(f"""
The images widths vary from {metadata.select(pl.min('width')).item()} to {metadata.select(pl.max('width')).item()}, with a mean of {round(metadata.select(pl.mean('width')).item())}
while the heights vary from {metadata.select(pl.min('height')).item()} to {metadata.select(pl.max('height')).item()} with a mean of {round(metadata.select(pl.mean('height')).item())}.
""")
The images widths vary from 90 to 1024, with a mean of 899
while the heights vary from 98 to 1024 with a mean of 712.
Summary metadata
Let’s summarize the info we gathered from the metadata:
| Category | Value |
|---|---|
| Images | 48_562 |
| Training images | 23_929 |
| Validation images | 24_633 |
| Classes (species with their subcategories) | 555 |
| Species | 405 |
| Average number of images per class in the training set | 43 |
| Average number of images per species in the training set | 59 |
| Images min width (px) | 90 |
| Images max width (px) | 1024 |
| Images mean width (px) | 899 |
| Images min height (px) | 98 |
| Images max height (px) | 1024 |
| Images mean height (px) | 712 |
Save DataFrame to Parquet
To make it easier to retrieve information from the metadata later on, we can save the DataFrame to file.
Parquet is an open-source, columnar, and extremely efficient binary file format for tabular data. Unlike CSV or JSON files, the data is compressed, which makes it efficient in terms of storage space. It is also excellent for query performance. Always prefer it over text-based formats.
metadata.write_parquet('metadata.parquet')
Our metadata is ready. We can now start working with the pictures.
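For reference, whenever we need this metadata again (in this notebook or in another script), reading it back is a one-liner, and we can also scan it lazily if we only need part of it:
metadata = pl.read_parquet('metadata.parquet')
# Or build a lazy query that only reads what is needed when .collect() is called
lazy_metadata = pl.scan_parquet('metadata.parquet')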