def clean_classes_file(input_filepath, output_filepath):
    """
    Remove commas, remove parentheses, and replace all spaces except the first space
    on each line and the space between species name and subcategory info with underscores.

    Args:
        input_filepath (str): the path of the input file.
        output_filepath (str): the path of the output file.
    """
    with open(input_filepath, 'r') as infile, \
         open(output_filepath, 'w') as outfile:
        for line in infile:
            # Remove commas and closing parentheses
            cleaned_line = line.replace(',', '').replace(')', '')
            # Strip newline characters
            cleaned_line = cleaned_line.strip()
            # Split line into two parts based on the first space
            parts = cleaned_line.split(' ', 1)
            # Replace spaces in the second part with underscores, then turn
            # the '_(' before the subcategory info back into a single space
            part2_cleaned = parts[1].replace(' ', '_').replace('_(', ' ')
            final_line = f'{parts[0]} {part2_cleaned}\n'
            outfile.write(final_line)

Compiling the metadata
In this section, we process some of the metadata associated with the NABirds dataset by building a Polars DataFrame that collects all the information we will need while processing the images and training our model. Along the way, we also get a first look at the dataset.
Polars is a modern and very fast DataFrame package. Consider using it instead of pandas whenever performance matters.
Metadata files
In addition to images, the dataset comes with a number of text files. To understand the dataset, of course, the place to start is by reading … the README.
Here is the main point to take away from it:
Each image is associated with a UUID.
We won’t need all the information provided with this dataset for this course, but what we need is contained in the following files:
| Name | Content |
|---|---|
| bounding_boxes.txt | List of UUIDs and their corresponding bounding boxes (one bounding box per image, just around the bird) |
| classes.txt | List of class ids and corresponding class names |
| image_class_labels.txt | List of UUIDs and their corresponding class ids |
| images.txt | List of UUIDs and their corresponding file names |
| photographers.txt | List of UUIDs and their corresponding photographers |
| train_test_split.txt | List of UUIDs and 1 or 0 depending on whether the image is for training or validation respectively (the dataset comes with a suggested split) |
The README has one request:
Please be considerate and display the photographer’s name when displaying their image.
We will make sure to follow it.
Clean problem files
Two of the files are problematic because they are jagged: the number of elements per line is inconsistent.
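To make "jagged" concrete, here is a minimal sketch on made-up lines (the identifiers and names below are hypothetical and only imitate a space-separated layout; they are not copied from the dataset). Splitting on every space gives a different number of fields per line, while splitting on the first space only always gives two parts:

```python
# Hypothetical lines (made-up identifiers and names, not taken from the dataset)
sample_lines = [
    'uuid-0001 "Jane Doe"',
    'uuid-0002 "John A. Smith, J. Smith"',
    'uuid-0003 Alex',
]

# Splitting on every space: the number of fields varies from line to line
field_counts = [len(line.split(' ')) for line in sample_lines]
print(field_counts)

# Splitting on the first space only: always an identifier and "the rest"
two_parts = [line.split(' ', 1) for line in sample_lines]
print([len(parts) for parts in two_parts])
```

A reader that expects a fixed number of space-separated columns cannot parse the first form reliably, which is why the rest of each line needs to be turned into a single field.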
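Let’s create cleaning functions that write cleaned-up copies of these files. clean_classes_file, defined above, handles classes.txt; the following function handles photographers.txt: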
def clean_photographer_file(input_filepath, output_filepath):
    """
    Remove commas, remove quotes, and replace all spaces except the first space
    on each line with underscores.

    Args:
        input_filepath (str): the path of the input file.
        output_filepath (str): the path of the output file.
    """
    with open(input_filepath, 'r') as infile, \
         open(output_filepath, 'w') as outfile:
        for line in infile:
            # Remove quotes and commas
            cleaned_line = line.replace('"', '').replace(',', '')
            # Strip newline characters
            cleaned_line = cleaned_line.strip()
            # Split line into two parts based on the first space
            parts = cleaned_line.split(' ', 1)
            # Replace spaces in the second part with underscores
            part2_cleaned = parts[1].replace(' ', '_')
            final_line = f'{parts[0]} {part2_cleaned}\n'
            outfile.write(final_line)

Then we can apply the functions to our files:
base_dir = '<path-of-the-nabirds-dir>'

Replace this placeholder with the actual path. In our training cluster, base_dir is /project/def-sponsor00/nabirds:

base_dir = '/project/def-sponsor00/nabirds'

You will not be able to run the following chunk in the training cluster because I did not give you write access to the dataset. This is on purpose, to avoid everyone trying to write to the same files at the same time.
I already created the cleaned files.
import os

clean_photographer_file(
    os.path.join(base_dir, 'photographers.txt'),
    os.path.join(base_dir, 'photographers_cleaned.txt')
)
clean_classes_file(
    os.path.join(base_dir, 'classes.txt'),
    os.path.join(base_dir, 'classes_cleaned.txt')
)

Create variables
For convenience, let’s create variables with the paths of the various files we need:
bb_file = os.path.join(base_dir, 'bounding_boxes.txt')
class_id_to_name_file = os.path.join(base_dir, 'classes_cleaned.txt')
class_id_file = os.path.join(base_dir, 'image_class_labels.txt')
path_file = os.path.join(base_dir, 'images.txt')
photographer_file = os.path.join(base_dir, 'photographers_cleaned.txt')
train_test_split_file = os.path.join(base_dir, 'train_test_split.txt')

Create a metadata DataFrame
Now it’s time to put all the data together in one DataFrame.
First, we create a series of DataFrames from each text file:
import polars as pl

bb = pl.read_csv(
    bb_file,
    separator=' ',
    has_header=False,
    new_columns=['UUID', 'bb_x', 'bb_y', 'bb_width', 'bb_height']
)

class_id = pl.read_csv(
    class_id_file,
    separator=' ',
    has_header=False,
    new_columns=['UUID', 'class_id']
)

The class_id_to_name_file is also fairly complicated: some bird species names are followed by additional information (e.g. “Adult” or “Immature”) in parentheses. If we want to train a model to identify bird species (rather than classes including the subcategories), we need to separate the two.
An easy way to quickly check what that additional information looks like is to run the following in the terminal (not in Python):

rg "\(" /project/def-sponsor00/nabirds/classes.txt | fzf

To split the species name from the additional information, we need to create 3 columns instead of 2 for this file. The problem is that many rows only have 2 elements (the additional info is often absent).
The shell command above lets us quickly find the first occurrence of the additional info: it appears at line 295.
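If you would rather stay in Python, the same lookup can be sketched with the standard library. The helper name first_line_with and the sample content below are hypothetical and only serve as an illustration:

```python
import io


def first_line_with(handle, needle):
    """Return the 1-based number of the first line containing `needle`, or None."""
    for line_number, line in enumerate(handle, start=1):
        if needle in line:
            return line_number
    return None


# Hypothetical stand-in for classes.txt (made-up content)
fake_classes = io.StringIO(
    "1 Birds\n"
    "2 Some Species\n"
    "3 Other Species (Adult)\n"
    "4 Other Species (Immature)\n"
)
first_hit = first_line_with(fake_classes, "(")
print(first_hit)
```

With the real file, you would pass the handle returned by open() on classes.txt instead of the in-memory buffer.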
Polars scans the first 100 rows by default to infer the schema of the DataFrame. We need to increase this value to at least 295 to make sure that it detects the 3rd column when reading in the file:
class_id_to_name = pl.read_csv(
    class_id_to_name_file,
    separator=' ',
    has_header=False,
    infer_schema_length=296,
    new_columns=['class_id', 'species', 'subcategory']
)

path = pl.read_csv(
    path_file,
    separator=' ',
    has_header=False,
    new_columns=['UUID', 'path']
)

photographer = pl.read_csv(
    photographer_file,
    separator=' ',
    has_header=False,
    new_columns=['UUID', 'photographer']
)

train_test_split = pl.read_csv(
    train_test_split_file,
    separator=' ',
    has_header=False,
    new_columns=['UUID', 'is_training_img']
)

We can use polars.read_csv even though we have text files because our files are space-separated value files. They behave like CSV files, except that we have to set the separator argument to a single space (' ').
Then we can combine the two DataFrames dealing with classes so that the bird identifications are directly associated with the image UUIDs:

classes_metadata = (
    class_id.join(class_id_to_name, on='class_id')
)

Finally, we combine all the DataFrames:

metadata = (
    bb.join(classes_metadata, on='UUID')
    .join(path, on='UUID')
    .join(photographer, on='UUID')
    .join(train_test_split, on='UUID')
)

Sanity checks
Let’s see what our DataFrame looks like:
metadata

shape: (48_562, 11)
┌─────────────┬──────┬──────┬──────────┬───┬─────────────┬─────────────┬─────────────┬─────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ subcategory ┆ path ┆ photographe ┆ is_training │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ r ┆ _img │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ str ┆ str ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ i64 │
╞═════════════╪══════╪══════╪══════════╪═══╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 0000139e-21 ┆ 83 ┆ 59 ┆ 128 ┆ … ┆ null ┆ 0817/000013 ┆ Ruth_Cantwe ┆ 0 │
│ dc-4d0c-bfe ┆ ┆ ┆ ┆ ┆ ┆ 9e21dc4d0cb ┆ ll ┆ │
│ 1-4cae3c… ┆ ┆ ┆ ┆ ┆ ┆ fe14cae3… ┆ ┆ │
│ 0000d9fc-4e ┆ 328 ┆ 88 ┆ 163 ┆ … ┆ null ┆ 0860/0000d9 ┆ Christopher ┆ 0 │
│ 02-4c06-a0a ┆ ┆ ┆ ┆ ┆ ┆ fc4e024c06a ┆ _L._Wood_Ch ┆ │
│ f-a55cfb… ┆ ┆ ┆ ┆ ┆ ┆ 0afa55cf… ┆ ris_Wood ┆ │
│ 00019306-9d ┆ 174 ┆ 367 ┆ 219 ┆ … ┆ null ┆ 0900/000193 ┆ Ryan_Schain ┆ 0 │
│ 83-4334-b25 ┆ ┆ ┆ ┆ ┆ ┆ 069d834334b ┆ ┆ │
│ 5-a44774… ┆ ┆ ┆ ┆ ┆ ┆ 255a4477… ┆ ┆ │
│ 0001afd4-99 ┆ 307 ┆ 179 ┆ 492 ┆ … ┆ Nonbreeding ┆ 0645/0001af ┆ Laura_Erick ┆ 1 │
│ a1-4a67-b94 ┆ ┆ ┆ ┆ ┆ /juvenile ┆ d499a14a67b ┆ son ┆ │
│ 0-d41941… ┆ ┆ ┆ ┆ ┆ ┆ 940d4194… ┆ ┆ │
│ 000332b8-99 ┆ 395 ┆ 139 ┆ 262 ┆ … ┆ null ┆ 0929/000332 ┆ Dan_Irizarr ┆ 0 │
│ 7c-4540-964 ┆ ┆ ┆ ┆ ┆ ┆ b8997c45409 ┆ y ┆ │
│ 7-2f0a84… ┆ ┆ ┆ ┆ ┆ ┆ 6472f0a8… ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ fff86e8b-79 ┆ 344 ┆ 163 ┆ 291 ┆ … ┆ null ┆ 0891/fff86e ┆ Nancy_Landr ┆ 1 │
│ 5f-400a-91e ┆ ┆ ┆ ┆ ┆ ┆ 8b795f400a9 ┆ y ┆ │
│ 8-565bbb… ┆ ┆ ┆ ┆ ┆ ┆ 1e8565bb… ┆ ┆ │
│ fff926d7-cc ┆ 330 ┆ 180 ┆ 339 ┆ … ┆ Light_morph ┆ 0660/fff926 ┆ Ruth_Sulliv ┆ 1 │
│ ad-4788-839 ┆ ┆ ┆ ┆ ┆ ┆ d7ccad47888 ┆ an ┆ │
│ e-97af2d… ┆ ┆ ┆ ┆ ┆ ┆ 39e97af2… ┆ ┆ │
│ fffa33ef-a7 ┆ 184 ┆ 94 ┆ 258 ┆ … ┆ null ┆ 0492/fffa33 ┆ Gerry_Dewag ┆ 1 │
│ 65-408d-8d6 ┆ ┆ ┆ ┆ ┆ ┆ efa765408d8 ┆ he ┆ │
│ 6-6efc7f… ┆ ┆ ┆ ┆ ┆ ┆ d666efc7… ┆ ┆ │
│ ffff0d87-bc ┆ 102 ┆ 210 ┆ 461 ┆ … ┆ Adult_Male ┆ 0372/ffff0d ┆ Muriel_Nedd ┆ 0 │
│ 84-4ef2-a47 ┆ ┆ ┆ ┆ ┆ ┆ 87bc844ef2a ┆ ermeyer ┆ │
│ e-a4bfa4… ┆ ┆ ┆ ┆ ┆ ┆ 47ea4bfa… ┆ ┆ │
│ fffff3a5-2a ┆ 281 ┆ 164 ┆ 524 ┆ … ┆ null ┆ 0880/fffff3 ┆ Dominic_She ┆ 0 │
│ 75-47d0-887 ┆ ┆ ┆ ┆ ┆ ┆ a52a7547d08 ┆ rony ┆ │
│ f-03871e… ┆ ┆ ┆ ┆ ┆ ┆ 87f03871… ┆ ┆ │
└─────────────┴──────┴──────┴──────────┴───┴─────────────┴─────────────┴─────────────┴─────────────┘
And then let’s explore a number of characteristics:
print(metadata.columns)
print(metadata.row(0))
print(metadata.row(-1))

['UUID', 'bb_x', 'bb_y', 'bb_width', 'bb_height', 'class_id', 'species', 'subcategory', 'path', 'photographer', 'is_training_img']
('0000139e-21dc-4d0c-bfe1-4cae3c85c829', 83, 59, 128, 228, 817, 'Oak_Titmouse', None, '0817/0000139e21dc4d0cbfe14cae3c85c829.jpg', 'Ruth_Cantwell', 0)
('fffff3a5-2a75-47d0-887f-03871e3f9a37', 281, 164, 524, 279, 880, 'Black-throated_Gray_Warbler', None, '0880/fffff3a52a7547d0887f03871e3f9a37.jpg', 'Dominic_Sherony', 0)
metadata.head()

shape: (5, 11)
┌─────────────┬──────┬──────┬──────────┬───┬─────────────┬─────────────┬─────────────┬─────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ subcategory ┆ path ┆ photographe ┆ is_training │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ r ┆ _img │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ str ┆ str ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ i64 │
╞═════════════╪══════╪══════╪══════════╪═══╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 0000139e-21 ┆ 83 ┆ 59 ┆ 128 ┆ … ┆ null ┆ 0817/000013 ┆ Ruth_Cantwe ┆ 0 │
│ dc-4d0c-bfe ┆ ┆ ┆ ┆ ┆ ┆ 9e21dc4d0cb ┆ ll ┆ │
│ 1-4cae3c… ┆ ┆ ┆ ┆ ┆ ┆ fe14cae3… ┆ ┆ │
│ 0000d9fc-4e ┆ 328 ┆ 88 ┆ 163 ┆ … ┆ null ┆ 0860/0000d9 ┆ Christopher ┆ 0 │
│ 02-4c06-a0a ┆ ┆ ┆ ┆ ┆ ┆ fc4e024c06a ┆ _L._Wood_Ch ┆ │
│ f-a55cfb… ┆ ┆ ┆ ┆ ┆ ┆ 0afa55cf… ┆ ris_Wood ┆ │
│ 00019306-9d ┆ 174 ┆ 367 ┆ 219 ┆ … ┆ null ┆ 0900/000193 ┆ Ryan_Schain ┆ 0 │
│ 83-4334-b25 ┆ ┆ ┆ ┆ ┆ ┆ 069d834334b ┆ ┆ │
│ 5-a44774… ┆ ┆ ┆ ┆ ┆ ┆ 255a4477… ┆ ┆ │
│ 0001afd4-99 ┆ 307 ┆ 179 ┆ 492 ┆ … ┆ Nonbreeding ┆ 0645/0001af ┆ Laura_Erick ┆ 1 │
│ a1-4a67-b94 ┆ ┆ ┆ ┆ ┆ /juvenile ┆ d499a14a67b ┆ son ┆ │
│ 0-d41941… ┆ ┆ ┆ ┆ ┆ ┆ 940d4194… ┆ ┆ │
│ 000332b8-99 ┆ 395 ┆ 139 ┆ 262 ┆ … ┆ null ┆ 0929/000332 ┆ Dan_Irizarr ┆ 0 │
│ 7c-4540-964 ┆ ┆ ┆ ┆ ┆ ┆ b8997c45409 ┆ y ┆ │
│ 7-2f0a84… ┆ ┆ ┆ ┆ ┆ ┆ 6472f0a8… ┆ ┆ │
└─────────────┴──────┴──────┴──────────┴───┴─────────────┴─────────────┴─────────────┴─────────────┘
metadata.tail()

shape: (5, 11)
┌─────────────┬──────┬──────┬──────────┬───┬─────────────┬─────────────┬─────────────┬─────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ subcategory ┆ path ┆ photographe ┆ is_training │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ r ┆ _img │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ str ┆ str ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ i64 │
╞═════════════╪══════╪══════╪══════════╪═══╪═════════════╪═════════════╪═════════════╪═════════════╡
│ fff86e8b-79 ┆ 344 ┆ 163 ┆ 291 ┆ … ┆ null ┆ 0891/fff86e ┆ Nancy_Landr ┆ 1 │
│ 5f-400a-91e ┆ ┆ ┆ ┆ ┆ ┆ 8b795f400a9 ┆ y ┆ │
│ 8-565bbb… ┆ ┆ ┆ ┆ ┆ ┆ 1e8565bb… ┆ ┆ │
│ fff926d7-cc ┆ 330 ┆ 180 ┆ 339 ┆ … ┆ Light_morph ┆ 0660/fff926 ┆ Ruth_Sulliv ┆ 1 │
│ ad-4788-839 ┆ ┆ ┆ ┆ ┆ ┆ d7ccad47888 ┆ an ┆ │
│ e-97af2d… ┆ ┆ ┆ ┆ ┆ ┆ 39e97af2… ┆ ┆ │
│ fffa33ef-a7 ┆ 184 ┆ 94 ┆ 258 ┆ … ┆ null ┆ 0492/fffa33 ┆ Gerry_Dewag ┆ 1 │
│ 65-408d-8d6 ┆ ┆ ┆ ┆ ┆ ┆ efa765408d8 ┆ he ┆ │
│ 6-6efc7f… ┆ ┆ ┆ ┆ ┆ ┆ d666efc7… ┆ ┆ │
│ ffff0d87-bc ┆ 102 ┆ 210 ┆ 461 ┆ … ┆ Adult_Male ┆ 0372/ffff0d ┆ Muriel_Nedd ┆ 0 │
│ 84-4ef2-a47 ┆ ┆ ┆ ┆ ┆ ┆ 87bc844ef2a ┆ ermeyer ┆ │
│ e-a4bfa4… ┆ ┆ ┆ ┆ ┆ ┆ 47ea4bfa… ┆ ┆ │
│ fffff3a5-2a ┆ 281 ┆ 164 ┆ 524 ┆ … ┆ null ┆ 0880/fffff3 ┆ Dominic_She ┆ 0 │
│ 75-47d0-887 ┆ ┆ ┆ ┆ ┆ ┆ a52a7547d08 ┆ rony ┆ │
│ f-03871e… ┆ ┆ ┆ ┆ ┆ ┆ 87f03871… ┆ ┆ │
└─────────────┴──────┴──────┴──────────┴───┴─────────────┴─────────────┴─────────────┴─────────────┘
import random
random.seed(123)
metadata.sample()

shape: (1, 11)
┌─────────────┬──────┬──────┬──────────┬───┬─────────────┬─────────────┬─────────────┬─────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ subcategory ┆ path ┆ photographe ┆ is_training │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ r ┆ _img │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ str ┆ str ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ i64 │
╞═════════════╪══════╪══════╪══════════╪═══╪═════════════╪═════════════╪═════════════╪═════════════╡
│ b20cc001-80 ┆ 382 ┆ 236 ┆ 308 ┆ … ┆ Male ┆ 0780/b20cc0 ┆ Alex_Burdo ┆ 0 │
│ f0-4280-9cd ┆ ┆ ┆ ┆ ┆ ┆ 0180f042809 ┆ ┆ │
│ 5-b9b569… ┆ ┆ ┆ ┆ ┆ ┆ cd5b9b56… ┆ ┆ │
└─────────────┴──────┴──────┴──────────┴───┴─────────────┴─────────────┴─────────────┴─────────────┘
print(metadata.schema)
print(metadata.shape)

Schema({'UUID': String, 'bb_x': Int64, 'bb_y': Int64, 'bb_width': Int64, 'bb_height': Int64, 'class_id': Int64, 'species': String, 'subcategory': String, 'path': String, 'photographer': String, 'is_training_img': Int64})
(48562, 11)
print(metadata.glimpse())

Rows: 48562
Columns: 11
$ UUID <str> '0000139e-21dc-4d0c-bfe1-4cae3c85c829', '0000d9fc-4e02-4c06-a0af-a55cfb16b12b', '00019306-9d83-4334-b255-a447742edce3', '0001afd4-99a1-4a67-b940-d419413e23b3', '000332b8-997c-4540-9647-2f0a8495aecf', '000343bd-5215-49ba-ab9c-7c97a70ac1a5', '0004ff8d-0cc8-47ee-94ba-43352a8b9eb4', '0007181f-a727-4481-ad89-591200c61b9d', '00071e20-8156-4bd8-b5ca-6445c2560ee5', '0007acfc-c0e6-4393-9ab6-02215a82ef63'
$ bb_x <i64> 83, 328, 174, 307, 395, 120, 417, 47, 260, 193
$ bb_y <i64> 59, 88, 367, 179, 139, 210, 109, 194, 146, 291
$ bb_width <i64> 128, 163, 219, 492, 262, 587, 221, 819, 578, 526
$ bb_height <i64> 228, 298, 378, 224, 390, 357, 467, 573, 516, 145
$ class_id <i64> 817, 860, 900, 645, 929, 652, 951, 900, 988, 400
$ species <str> 'Oak_Titmouse', 'Ovenbird', 'Savannah_Sparrow', 'Eared_Grebe', 'Eastern_Phoebe', 'Yellow-crowned_Night-Heron', 'Florida_Scrub-Jay', 'Savannah_Sparrow', 'Yellow-headed_Blackbird', 'Herring_Gull'
$ subcategory <str> null, null, null, 'Nonbreeding/juvenile', null, 'Immature', null, null, 'Female/Immature_Male', 'Adult'
$ path <str> '0817/0000139e21dc4d0cbfe14cae3c85c829.jpg', '0860/0000d9fc4e024c06a0afa55cfb16b12b.jpg', '0900/000193069d834334b255a447742edce3.jpg', '0645/0001afd499a14a67b940d419413e23b3.jpg', '0929/000332b8997c454096472f0a8495aecf.jpg', '0652/000343bd521549baab9c7c97a70ac1a5.jpg', '0951/0004ff8d0cc847ee94ba43352a8b9eb4.jpg', '0900/0007181fa7274481ad89591200c61b9d.jpg', '0988/00071e2081564bd8b5ca6445c2560ee5.jpg', '0400/0007acfcc0e643939ab602215a82ef63.jpg'
$ photographer <str> 'Ruth_Cantwell', 'Christopher_L._Wood_Chris_Wood', 'Ryan_Schain', 'Laura_Erickson', 'Dan_Irizarry', 'Ken_Schneider', 'Velma_Knowles', 'Matt_Tillett', 'Terry_Gray', 'Cory_Gregory'
$ is_training_img <i64> 0, 0, 0, 1, 0, 0, 0, 1, 1, 0
None
metadata.describe()

shape: (9, 12)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ UUID ┆ bb_x ┆ bb_y ┆ … ┆ subcatego ┆ path ┆ photograp ┆ is_train │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ ry ┆ --- ┆ her ┆ ing_img │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ --- ┆ str ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ str ┆ f64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count ┆ 48562 ┆ 48562.0 ┆ 48562.0 ┆ … ┆ 23589 ┆ 48562 ┆ 48562 ┆ 48562.0 │
│ null_coun ┆ 0 ┆ 0.0 ┆ 0.0 ┆ … ┆ 24973 ┆ 0 ┆ 0 ┆ 0.0 │
│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ mean ┆ null ┆ 221.68531 ┆ 158.53412 ┆ … ┆ null ┆ null ┆ null ┆ 0.492752 │
│ ┆ ┆ ┆ 1 ┆ ┆ ┆ ┆ ┆ │
│ std ┆ null ┆ 133.05486 ┆ 80.976264 ┆ … ┆ null ┆ null ┆ null ┆ 0.499953 │
│ ┆ ┆ 4 ┆ ┆ ┆ ┆ ┆ ┆ │
│ min ┆ 0000139e- ┆ 0.0 ┆ 0.0 ┆ … ┆ Adult ┆ 0295/01f5 ┆ A._Walton ┆ 0.0 │
│ ┆ 21dc-4d0c ┆ ┆ ┆ ┆ ┆ 3d6bf5e44 ┆ ┆ │
│ ┆ -bfe1-4ca ┆ ┆ ┆ ┆ ┆ 9438d2bb7 ┆ ┆ │
│ ┆ e3c… ┆ ┆ ┆ ┆ ┆ 9e0… ┆ ┆ │
│ 25% ┆ null ┆ 115.0 ┆ 99.0 ┆ … ┆ null ┆ null ┆ null ┆ 0.0 │
│ 50% ┆ null ┆ 205.0 ┆ 149.0 ┆ … ┆ null ┆ null ┆ null ┆ 0.0 │
│ 75% ┆ null ┆ 315.0 ┆ 208.0 ┆ … ┆ null ┆ null ┆ null ┆ 1.0 │
│ max ┆ fffff3a5- ┆ 837.0 ┆ 799.0 ┆ … ┆ Yellow-sh ┆ 1010/ff41 ┆ www.burly ┆ 1.0 │
│ ┆ 2a75-47d0 ┆ ┆ ┆ ┆ afted ┆ 92ee5a164 ┆ bird.com ┆ │
│ ┆ -887f-038 ┆ ┆ ┆ ┆ ┆ de684149d ┆ ┆ │
│ ┆ 71e… ┆ ┆ ┆ ┆ ┆ 926… ┆ ┆ │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
Learn about the data
Now that we have the metadata organized, let’s get to know our data:
print(f"There are {len(metadata)} images in the dataset.")

There are 48562 images in the dataset.
metadata_train = metadata.filter(pl.col('is_training_img') == 1)
print(f"""
There are:
- {len(metadata_train)} images in the training set,
- {len(metadata) - len(metadata_train)} in the validation set.
""")
There are:
- 23929 images in the training set,
- 24633 in the validation set.
class_id = metadata.unique(pl.col('class_id'))
species = metadata.unique(pl.col('species'))
print(f"There are {len(class_id)} different classes and {len(species)} different species in the dataset.")

There are 555 different classes and 405 different species in the dataset.
train_class_id_group_length = metadata_train.group_by(pl.col('class_id')).len()
print(f"""
The number of images per class in the training set varies from {train_class_id_group_length.min().select(pl.col('len')).item()} to {train_class_id_group_length.max().select(pl.col('len')).item()},
with an average of {round(train_class_id_group_length.mean().select(pl.col('len')).item())} images per class.
""")
The number of images per class in the training set varies from 4 to 60,
with an average of 43 images per class.
train_species_group_length = metadata_train.group_by(pl.col('species')).len()
print(f"""
The number of images per species in the training set varies from {train_species_group_length.min().select(pl.col('len')).item()} to {train_species_group_length.max().select(pl.col('len')).item()},
with an average of {round(train_species_group_length.mean().select(pl.col('len')).item())} images per species.
""")
The number of images per species in the training set varies from 6 to 221,
with an average of 59 images per species.
subcategory = metadata.unique(pl.col("subcategory"))
example_list = subcategory.get_column('subcategory').drop_nulls().head(10).to_list()
example_list_cleaned = [x.replace('_', ' ') for x in example_list]
# Build the bullet list first: backslashes inside f-string expressions
# are only allowed from Python 3.12 onwards
examples = '\n- '.join(example_list_cleaned)
print(f"""
There are {len(subcategory)} species subcategories, such as:
- {examples}
- etc.
""")
There are 61 species subcategories, such as:
- Female/Immature male
- Immature/Juvenile
- Breeding Myrtle
- Nonbreeding Adult
- Light morph adult
- Winter/juvenile Myrtle
- Breeding adult
- Adult
- Blue morph
- Oregon
- etc.
Summary metadata
Let’s summarize the info we gathered from the metadata:
| Category | Number |
|---|---|
| Images | 48_562 |
| Training images | 23_929 |
| Validation images | 24_633 |
| Classes (species with their subcategories) | 555 |
| Species | 405 |
| Average number of images per class in the training set | 43 |
| Average number of images per species in the training set | 59 |
Save DataFrame to Parquet
To make it easier to retrieve information from the metadata later on, we can save the DataFrame to file.
Parquet is an open-source, columnar, and very efficient binary file format for tabular data. Unlike CSV or JSON files, the data is compressed, which saves storage space, and the column types are stored with the data. It is also excellent for query performance. For tabular data that you will read back programmatically, it is usually a better choice than text-based formats.
metadata.write_parquet('metadata.parquet')

Our metadata is ready. We can now start working with the pictures.