Compiling the metadata
In this section, we process some of the metadata associated with the NABirds dataset by creating a Polars DataFrame that collects all the information we will need while processing the images and training our model. Along the way, this gives us a good overview of the dataset.
Polars is a modern, extremely fast DataFrame package; if you care about performance, use it instead of pandas whenever you can.
Metadata files
In addition to images, the dataset comes with a number of text files. To understand the dataset, of course, the place to start is by reading … the README.
It is worth reading in full; here is a summary of the key points:
Each image is associated with a UUID.
We won’t need all the information provided with this dataset for this course, but what we need is contained in the following files:
| Name | Content |
|---|---|
| bounding_boxes.txt | List of UUIDs and their corresponding bounding boxes (one bounding box per image, just around the bird) |
| classes.txt | List of class ids and corresponding class names |
| image_class_labels.txt | List of UUIDs and their corresponding class ids |
| images.txt | List of UUIDs and their corresponding file names |
| photographers.txt | List of UUIDs and their corresponding photographers |
| sizes.txt | List of UUIDs and their corresponding width and height |
| train_test_split.txt | List of UUIDs and 1 or 0 depending on whether the image is for training or validation respectively (the dataset comes with a suggested split) |
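To get a feel for how these files are laid out, here is a quick sketch (not part of the pipeline) that prints the first line of each of them; the path is the one used on our training cluster and is also assigned to base_dir further below:
import os

nabirds_dir = '/project/def-sponsor00/nabirds'  # same path we assign to base_dir below

for name in [
    'bounding_boxes.txt',
    'classes.txt',
    'image_class_labels.txt',
    'images.txt',
    'photographers.txt',
    'sizes.txt',
    'train_test_split.txt',
]:
    with open(os.path.join(nabirds_dir, name)) as f:
        # Print the file name and its first line to see the layout of each file
        print(f'{name}: {f.readline().strip()}')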
The README has one request:
Please be considerate and display the photographer’s name when displaying their image.
We will make sure to follow it.
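For instance, when we display images later on, a minimal sketch of how to honour this request could look like the following (using matplotlib and Pillow; the image path and photographer name are placeholders here and will come from the metadata DataFrame we build below):
from PIL import Image
import matplotlib.pyplot as plt

img_path = '<path-to-an-image>'        # placeholder: taken from the metadata in practice
photographer = '<photographer-name>'   # placeholder: taken from the metadata in practice

img = Image.open(img_path)
plt.imshow(img)
plt.title(f'Photo: {photographer}')    # credit the photographer alongside the image
plt.axis('off')
plt.show()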
Clean problem files
Two of the files are problematic because they are jagged: the number of elements per line is inconsistent.
Let’s create cleaning functions that write cleaned up copies of these files.
To clean the classes.txt file:
def clean_classes_file(input_filepath, output_filepath):
    """
    Remove commas, remove parentheses, and replace all spaces on each line with
    underscores, except the first space and the space between the species name
    and the subcategory info.
    Args:
        input_filepath (str): the path of the input file.
        output_filepath (str): the path of the output file.
    """
    with open(input_filepath, 'r') as infile, \
         open(output_filepath, 'w') as outfile:
        for line in infile:
            # Remove commas and closing parentheses
            cleaned_line = line.replace(',', '').replace(')', '')
            # Strip newline characters
            cleaned_line = cleaned_line.strip()
            # Split line into two parts based on the first space
            parts = cleaned_line.split(' ', 1)
            # Replace spaces in the second part with underscores, then turn the
            # opening parenthesis back into the space separating species and subcategory
            part2_cleaned = parts[1].replace(' ', '_').replace('_(', ' ')
            final_line = f'{parts[0]} {part2_cleaned}\n'
            outfile.write(final_line)
To clean the photographers.txt file:
def clean_photographer_file(input_filepath, output_filepath):
    """
    Remove commas, remove quotes, and replace all spaces except the first space
    on each line with underscores.
    Args:
        input_filepath (str): the path of the input file.
        output_filepath (str): the path of the output file.
    """
    with open(input_filepath, 'r') as infile, \
         open(output_filepath, 'w') as outfile:
        for line in infile:
            # Remove quotes and commas
            cleaned_line = line.replace('"', '').replace(',', '')
            # Strip newline characters
            cleaned_line = cleaned_line.strip()
            # Split line into two parts based on the first space
            parts = cleaned_line.split(' ', 1)
            # Replace spaces in the second part with underscores
            part2_cleaned = parts[1].replace(' ', '_')
            final_line = f'{parts[0]} {part2_cleaned}\n'
            outfile.write(final_line)
Then we can apply the functions to our files:
base_dir = '<path-of-the-nabirds-dir>'
To be replaced by the actual path; on our training cluster, base_dir is /project/def-sponsor00/nabirds:
base_dir = '/project/def-sponsor00/nabirds'
You will not be able to run the following chunk on the training cluster because I did not give you write access to the dataset. This is on purpose, to avoid everyone trying to write to the same files at the same time.
I already created the cleaned files.
import os
clean_photographer_file(
os.path.join(base_dir, 'photographers.txt'),
os.path.join(base_dir, 'photographers_cleaned.txt')
)
clean_classes_file(
os.path.join(base_dir, 'classes.txt'),
os.path.join(base_dir, 'classes_cleaned.txt')
)
Create variables
For convenience, let’s create variables with the path of the various files we need:
bb_file = os.path.join(base_dir, 'bounding_boxes.txt')
class_id_to_name_file = os.path.join(base_dir, 'classes_cleaned.txt')
class_id_file = os.path.join(base_dir, 'image_class_labels.txt')
path_file = os.path.join(base_dir, 'images.txt')
photographer_file = os.path.join(base_dir, 'photographers_cleaned.txt')
size_file = os.path.join(base_dir, 'sizes.txt')
train_test_split_file = os.path.join(base_dir, 'train_test_split.txt')
Create a DataFrame
Now it’s time to put all the data together in one DataFrame.
First, we create a series of DataFrames from each text file:
import polars as pl
bb = pl.read_csv(
bb_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'bb_x', 'bb_y', 'bb_width', 'bb_height']
)
class_id = pl.read_csv(
class_id_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'class_id']
)
The class_id_to_name_file is also fairly complicated: some bird species names are followed by additional information (e.g. “Adult” or “Immature”) in parentheses. If we want to train a model that identifies bird species (rather than classes that include the subcategories), we need to separate the two (see also the section below for more explanation on this).
An easy way to quickly check what that additional information looks like is to run the following in the terminal (not in Python):
rg "\(" /project/def-sponsor00/nabirds/classes.txt | fzf
In order to split the species name from the additional information, we need to create 3 columns instead of 2 for this file. The problem is that many rows only have 2 elements (the additional info is often absent).
The shell command above lets us quickly find the first occurrence of the additional info: it appears at line 295.
By default, Polars scans the first 100 rows to infer the schema of the DataFrame. We need to increase this value to at least 295 to make sure it detects the 3rd column while reading in the file:
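If you prefer to stay in Python, a quick sketch that finds the first line containing a parenthesis is:
with open('/project/def-sponsor00/nabirds/classes.txt') as f:
    for line_number, line in enumerate(f, start=1):
        if '(' in line:
            # Print the first class name that carries additional info in parentheses
            print(line_number, line.strip())
            break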
class_id_to_name = pl.read_csv(
class_id_to_name_file,
separator=' ',
has_header=False,
infer_schema_length=296,
new_columns=['class_id', 'species', 'subcategory']
)
path = pl.read_csv(
path_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'path']
)
photographer = pl.read_csv(
photographer_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'photographer']
)
size = pl.read_csv(
size_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'width', 'height']
)
train_test_split = pl.read_csv(
train_test_split_file,
separator=' ',
has_header=False,
new_columns=['UUID', 'is_training_img']
)
We can use polars.read_csv even though we have text files because our files are space-separated value files: they work like CSV files, except that we have to set the separator argument to a space (' ').
Then we can combine the two DataFrames dealing with classes, so that the bird identifications become directly associated with the image UUIDs:
classes_metadata = (
class_id.join(class_id_to_name, on='class_id')
)
Finally, we combine all the DataFrames:
initial_metadata = (
bb.join(classes_metadata, on='UUID')
.join(path, on='UUID')
.join(photographer, on='UUID')
.join(size, on='UUID')
.join(train_test_split, on='UUID')
)
Format strings
We now have all those underscores in the species, subcategory, and photographer columns. We needed them to read the files into DataFrames properly, but now we want to format those strings back into readable text:
formatted_metadata = initial_metadata.with_columns(
pl.col('species').str.replace_all(r'_', ' ').alias('species_name'),
pl.col('subcategory').str.replace_all(r'_', ' '),
pl.col('photographer').str.replace_all(r'_', ' ')
).drop('species')
We also rename species to species_name because we will add a species_id in the section below.
Create species mapping
We could use class_id as our training labels. The problem is that not all the labels are disjoint. For instance:
- class_id 175 is the species Red-shouldered Hawk with the subcategory None
- class_id 359 is the species Red-shouldered Hawk with the subcategory Adult
- class_id 658 is the species Red-shouldered Hawk with the subcategory Immature
And a Red-shouldered Hawk is either an immature or an adult, so class_id 175 overlaps with the other two. This is not good for training our model.
This emphasizes that you need to know your data very well before you start training a model with it. It is important to spend the time to explore it at length, otherwise the training will fail or perform poorly and you will not understand why.
There is an additional file in this dataset called hierarchy.txt that gives the hierarchy of the various classes. In the example above, it shows that both 658 and 359 fall under the category 175, so we could use that hierarchy to solve the problem.
In this exercise though, I decided that I didn’t want to train a model that identifies birds at the level of these categories (Red-shouldered Hawk adult vs Red-shouldered Hawk immature), but at the level of the species (i.e. simply Red-shouldered Hawk).
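If you wanted to go that route, a minimal sketch could look like this (assuming hierarchy.txt simply lists a class id and its parent class id on each line; check the file to confirm before relying on this):
hierarchy = pl.read_csv(
    os.path.join(base_dir, 'hierarchy.txt'),
    separator=' ',
    has_header=False,
    new_columns=['class_id', 'parent_class_id']
)
# Look up the parents of the two subcategory classes from the example above
print(hierarchy.filter(pl.col('class_id').is_in([359, 658])))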
How you approach this depends on the model you want to create and what exactly you want it to be able to do.
If we use species as the labels, we need to create a mapping for them: the loss cannot be computed on strings, so the model needs numeric labels corresponding to the output neurons of the final layer. In other words, we need to associate each species with an integer, which can be done directly in a Polars DataFrame with a dense ranking.
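As a quick toy illustration of what a dense rank does (made-up rows, not part of our actual pipeline):
toy = pl.DataFrame({'species_name': ['Ovenbird', 'Oak Titmouse', 'Ovenbird', 'Eared Grebe']})
# Identical strings get the same rank; ranks are consecutive integers starting at 1
print(toy.with_columns(pl.col('species_name').rank('dense').alias('species_id')))
Applied to our metadata: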
metadata = formatted_metadata.with_columns(
pl.col('species_name').rank('dense').alias('species_id')
)
Sanity checks
Let’s see what our DataFrame looks like:
metadata
shape: (48_562, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ 0000139e-21dc ┆ 83 ┆ 59 ┆ 128 ┆ … ┆ 341 ┆ 0 ┆ Oak Titmouse ┆ 260 │
│ -4d0c-bfe1-4c ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ae3c… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0000d9fc-4e02 ┆ 328 ┆ 88 ┆ 163 ┆ … ┆ 427 ┆ 0 ┆ Ovenbird ┆ 264 │
│ -4c06-a0af-a5 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5cfb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 00019306-9d83 ┆ 174 ┆ 367 ┆ 219 ┆ … ┆ 1024 ┆ 0 ┆ Savannah ┆ 322 │
│ -4334-b255-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Sparrow ┆ │
│ 4774… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0001afd4-99a1 ┆ 307 ┆ 179 ┆ 492 ┆ … ┆ 680 ┆ 1 ┆ Eared Grebe ┆ 145 │
│ -4a67-b940-d4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1941… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000332b8-997c ┆ 395 ┆ 139 ┆ 262 ┆ … ┆ 682 ┆ 0 ┆ Eastern ┆ 149 │
│ -4540-9647-2f ┆ ┆ ┆ ┆ ┆ ┆ ┆ Phoebe ┆ │
│ 0a84… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ fff86e8b-795f ┆ 344 ┆ 163 ┆ 291 ┆ … ┆ 819 ┆ 1 ┆ Canyon Towhee ┆ 101 │
│ -400a-91e8-56 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5bbb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fff926d7-ccad ┆ 330 ┆ 180 ┆ 339 ┆ … ┆ 956 ┆ 1 ┆ Rough-legged ┆ 310 │
│ -4788-839e-97 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hawk ┆ │
│ af2d… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffa33ef-a765 ┆ 184 ┆ 94 ┆ 258 ┆ … ┆ 800 ┆ 1 ┆ Swallow-taile ┆ 345 │
│ -408d-8d66-6e ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Kite ┆ │
│ fc7f… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ffff0d87-bc84 ┆ 102 ┆ 210 ┆ 461 ┆ … ┆ 1024 ┆ 0 ┆ Broad-billed ┆ 77 │
│ -4ef2-a47e-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hummingbird ┆ │
│ bfa4… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffff3a5-2a75 ┆ 281 ┆ 164 ┆ 524 ┆ … ┆ 683 ┆ 0 ┆ Black-throate ┆ 57 │
│ -47d0-887f-03 ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Gray ┆ │
│ 871e… ┆ ┆ ┆ ┆ ┆ ┆ ┆ Warbler ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
And then let’s explore a number of characteristics:
print(metadata.columns)
print(metadata.row(0))
print(metadata.row(-1))
['UUID', 'bb_x', 'bb_y', 'bb_width', 'bb_height', 'class_id', 'subcategory', 'path', 'photographer', 'width', 'height', 'is_training_img', 'species_name', 'species_id']
('0000139e-21dc-4d0c-bfe1-4cae3c85c829', 83, 59, 128, 228, 817, None, '0817/0000139e21dc4d0cbfe14cae3c85c829.jpg', 'Ruth Cantwell', 296, 341, 0, 'Oak Titmouse', 260)
('fffff3a5-2a75-47d0-887f-03871e3f9a37', 281, 164, 524, 279, 880, None, '0880/fffff3a52a7547d0887f03871e3f9a37.jpg', 'Dominic Sherony', 1024, 683, 0, 'Black-throated Gray Warbler', 57)
metadata.head()
shape: (5, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ 0000139e-21dc ┆ 83 ┆ 59 ┆ 128 ┆ … ┆ 341 ┆ 0 ┆ Oak Titmouse ┆ 260 │
│ -4d0c-bfe1-4c ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ae3c… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0000d9fc-4e02 ┆ 328 ┆ 88 ┆ 163 ┆ … ┆ 427 ┆ 0 ┆ Ovenbird ┆ 264 │
│ -4c06-a0af-a5 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5cfb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 00019306-9d83 ┆ 174 ┆ 367 ┆ 219 ┆ … ┆ 1024 ┆ 0 ┆ Savannah ┆ 322 │
│ -4334-b255-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Sparrow ┆ │
│ 4774… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 0001afd4-99a1 ┆ 307 ┆ 179 ┆ 492 ┆ … ┆ 680 ┆ 1 ┆ Eared Grebe ┆ 145 │
│ -4a67-b940-d4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1941… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000332b8-997c ┆ 395 ┆ 139 ┆ 262 ┆ … ┆ 682 ┆ 0 ┆ Eastern ┆ 149 │
│ -4540-9647-2f ┆ ┆ ┆ ┆ ┆ ┆ ┆ Phoebe ┆ │
│ 0a84… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
metadata.tail()
shape: (5, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ fff86e8b-795f ┆ 344 ┆ 163 ┆ 291 ┆ … ┆ 819 ┆ 1 ┆ Canyon Towhee ┆ 101 │
│ -400a-91e8-56 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5bbb… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fff926d7-ccad ┆ 330 ┆ 180 ┆ 339 ┆ … ┆ 956 ┆ 1 ┆ Rough-legged ┆ 310 │
│ -4788-839e-97 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hawk ┆ │
│ af2d… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffa33ef-a765 ┆ 184 ┆ 94 ┆ 258 ┆ … ┆ 800 ┆ 1 ┆ Swallow-taile ┆ 345 │
│ -408d-8d66-6e ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Kite ┆ │
│ fc7f… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ffff0d87-bc84 ┆ 102 ┆ 210 ┆ 461 ┆ … ┆ 1024 ┆ 0 ┆ Broad-billed ┆ 77 │
│ -4ef2-a47e-a4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Hummingbird ┆ │
│ bfa4… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ fffff3a5-2a75 ┆ 281 ┆ 164 ┆ 524 ┆ … ┆ 683 ┆ 0 ┆ Black-throate ┆ 57 │
│ -47d0-887f-03 ┆ ┆ ┆ ┆ ┆ ┆ ┆ d Gray ┆ │
│ 871e… ┆ ┆ ┆ ┆ ┆ ┆ ┆ Warbler ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
# Polars uses its own random number generator, so we pass a seed to sample() for reproducibility
metadata.sample(seed=123)
shape: (1, 14)
┌───────────────┬──────┬──────┬──────────┬───┬────────┬───────────────┬───────────────┬────────────┐
│ UUID ┆ bb_x ┆ bb_y ┆ bb_width ┆ … ┆ height ┆ is_training_i ┆ species_name ┆ species_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ mg ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ --- ┆ str ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ │
╞═══════════════╪══════╪══════╪══════════╪═══╪════════╪═══════════════╪═══════════════╪════════════╡
│ b20cc001-80f0 ┆ 382 ┆ 236 ┆ 308 ┆ … ┆ 723 ┆ 0 ┆ Red-winged ┆ 300 │
│ -4280-9cd5-b9 ┆ ┆ ┆ ┆ ┆ ┆ ┆ Blackbird ┆ │
│ b569… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└───────────────┴──────┴──────┴──────────┴───┴────────┴───────────────┴───────────────┴────────────┘
print(metadata.schema)
print(metadata.shape)
Schema({'UUID': String, 'bb_x': Int64, 'bb_y': Int64, 'bb_width': Int64, 'bb_height': Int64, 'class_id': Int64, 'subcategory': String, 'path': String, 'photographer': String, 'width': Int64, 'height': Int64, 'is_training_img': Int64, 'species_name': String, 'species_id': UInt32})
(48562, 14)
print(metadata.glimpse())
Rows: 48562
Columns: 14
$ UUID <str> '0000139e-21dc-4d0c-bfe1-4cae3c85c829', '0000d9fc-4e02-4c06-a0af-a55cfb16b12b', '00019306-9d83-4334-b255-a447742edce3', '0001afd4-99a1-4a67-b940-d419413e23b3', '000332b8-997c-4540-9647-2f0a8495aecf', '000343bd-5215-49ba-ab9c-7c97a70ac1a5', '0004ff8d-0cc8-47ee-94ba-43352a8b9eb4', '0007181f-a727-4481-ad89-591200c61b9d', '00071e20-8156-4bd8-b5ca-6445c2560ee5', '0007acfc-c0e6-4393-9ab6-02215a82ef63'
$ bb_x <i64> 83, 328, 174, 307, 395, 120, 417, 47, 260, 193
$ bb_y <i64> 59, 88, 367, 179, 139, 210, 109, 194, 146, 291
$ bb_width <i64> 128, 163, 219, 492, 262, 587, 221, 819, 578, 526
$ bb_height <i64> 228, 298, 378, 224, 390, 357, 467, 573, 516, 145
$ class_id <i64> 817, 860, 900, 645, 929, 652, 951, 900, 988, 400
$ subcategory <str> null, null, null, 'Nonbreeding/juvenile', null, 'Immature', null, null, 'Female/Immature Male', 'Adult'
$ path <str> '0817/0000139e21dc4d0cbfe14cae3c85c829.jpg', '0860/0000d9fc4e024c06a0afa55cfb16b12b.jpg', '0900/000193069d834334b255a447742edce3.jpg', '0645/0001afd499a14a67b940d419413e23b3.jpg', '0929/000332b8997c454096472f0a8495aecf.jpg', '0652/000343bd521549baab9c7c97a70ac1a5.jpg', '0951/0004ff8d0cc847ee94ba43352a8b9eb4.jpg', '0900/0007181fa7274481ad89591200c61b9d.jpg', '0988/00071e2081564bd8b5ca6445c2560ee5.jpg', '0400/0007acfcc0e643939ab602215a82ef63.jpg'
$ photographer <str> 'Ruth Cantwell', 'Christopher L. Wood Chris Wood', 'Ryan Schain', 'Laura Erickson', 'Dan Irizarry', 'Ken Schneider', 'Velma Knowles', 'Matt Tillett', 'Terry Gray', 'Cory Gregory'
$ width <i64> 296, 640, 730, 1024, 1024, 1024, 1024, 1024, 1024, 1024
$ height <i64> 341, 427, 1024, 680, 682, 768, 683, 819, 768, 681
$ is_training_img <i64> 0, 0, 0, 1, 0, 0, 0, 1, 1, 0
$ species_name <str> 'Oak Titmouse', 'Ovenbird', 'Savannah Sparrow', 'Eared Grebe', 'Eastern Phoebe', 'Yellow-crowned Night-Heron', 'Florida Scrub-Jay', 'Savannah Sparrow', 'Yellow-headed Blackbird', 'Herring Gull'
$ species_id <u32> 260, 264, 322, 145, 149, 401, 158, 322, 402, 195
None
metadata.describe()
shape: (9, 15)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ UUID ┆ bb_x ┆ bb_y ┆ … ┆ height ┆ is_traini ┆ species_n ┆ species_ │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ ng_img ┆ ame ┆ id │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ --- ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ str ┆ f64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count ┆ 48562 ┆ 48562.0 ┆ 48562.0 ┆ … ┆ 48562.0 ┆ 48562.0 ┆ 48562 ┆ 48562.0 │
│ null_coun ┆ 0 ┆ 0.0 ┆ 0.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0 ┆ 0.0 │
│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ mean ┆ null ┆ 221.68531 ┆ 158.53412 ┆ … ┆ 712.33555 ┆ 0.492752 ┆ null ┆ 204.6341 │
│ ┆ ┆ ┆ 1 ┆ ┆ ┆ ┆ ┆ 17 │
│ std ┆ null ┆ 133.05486 ┆ 80.976264 ┆ … ┆ 152.49441 ┆ 0.499953 ┆ null ┆ 116.4767 │
│ ┆ ┆ 4 ┆ ┆ ┆ ┆ ┆ ┆ 52 │
│ min ┆ 0000139e- ┆ 0.0 ┆ 0.0 ┆ … ┆ 98.0 ┆ 0.0 ┆ Abert's ┆ 1.0 │
│ ┆ 21dc-4d0c ┆ ┆ ┆ ┆ ┆ ┆ Towhee ┆ │
│ ┆ -bfe1-4ca ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ e3c… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 25% ┆ null ┆ 115.0 ┆ 99.0 ┆ … ┆ 639.0 ┆ 0.0 ┆ null ┆ 107.0 │
│ 50% ┆ null ┆ 205.0 ┆ 149.0 ┆ … ┆ 683.0 ┆ 0.0 ┆ null ┆ 204.0 │
│ 75% ┆ null ┆ 315.0 ┆ 208.0 ┆ … ┆ 780.0 ┆ 1.0 ┆ null ┆ 303.0 │
│ max ┆ fffff3a5- ┆ 837.0 ┆ 799.0 ┆ … ┆ 1024.0 ┆ 1.0 ┆ Yellow-th ┆ 405.0 │
│ ┆ 2a75-47d0 ┆ ┆ ┆ ┆ ┆ ┆ roated ┆ │
│ ┆ -887f-038 ┆ ┆ ┆ ┆ ┆ ┆ Warbler ┆ │
│ ┆ 71e… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
Learn about the data
Now that we have the metadata organized, let’s get to know our data:
print(f"There are {len(metadata)} images in the dataset.")There are 48562 images in the dataset.
metadata_train = metadata.filter(pl.col('is_training_img') == 1)
print(f"""
There are:
- {len(metadata_train)} images in the training set,
- {len(metadata) - len(metadata_train)} in the validation set.
""")
There are:
- 23929 images in the training set,
- 24633 in the validation set.
class_id = metadata.unique(pl.col('class_id'))
species = metadata.unique(pl.col('species_id'))
print(f"There are {len(class_id)} different classes and {len(species)} different species in the dataset.")There are 555 different classes and 405 different species in the dataset.
train_class_id_group_length = metadata_train.group_by(pl.col('class_id')).len()
print(f"""
The number of images per class in the training set varies from {train_class_id_group_length.select(pl.min('len')).item()} to {train_class_id_group_length.select(pl.max('len')).item()},
with an average of {round(train_class_id_group_length.select(pl.mean('len')).item())} images per class.
""")
The number of images per class in the training set varies from 4 to 60,
with an average of 43 images per class.
train_species_group_length = metadata_train.group_by(pl.col('species_id')).len()
print(f"""
The number of images per species in the training set varies from {train_species_group_length.select(pl.min('len')).item()} to {train_species_group_length.select(pl.max('len')).item()},
with an average of {round(train_species_group_length.select(pl.mean('len')).item())} images per species.
""")
The number of images per species in the training set varies from 6 to 221,
with an average of 59 images per species.
subcategory = metadata.unique(pl.col("subcategory"))
example_list = subcategory.get_column('subcategory').drop_nulls().head(10).to_list()
example_list_cleaned = [x.replace('_', ' ') for x in example_list]
print(f"""
There are {len(subcategory)} species subcategories, such as:
- {'\n- '.join(example_list_cleaned)}
- etc.
""")
There are 61 species subcategories, such as:
- Nonbreeding Adult
- Female/Immature Male
- Adult
- Nonbreeding/juvenile
- Adult Subadult
- Female/Nonbreeding male
- Breeding Audubon's
- Immature/Juvenile
- Winter/juvenile Myrtle
- Breeding Adult
- etc.
print(f"""
The images widths vary from {metadata.select(pl.min('width')).item()} to {metadata.select(pl.max('width')).item()}, with a mean of {round(metadata.select(pl.mean('width')).item())}
while the heights vary from {metadata.select(pl.min('height')).item()} to {metadata.select(pl.max('height')).item()} with a mean of {round(metadata.select(pl.mean('height')).item())}.
""")
The images widths vary from 90 to 1024, with a mean of 899
while the heights vary from 98 to 1024 with a mean of 712.
Summary metadata
Let’s summarize the info we gathered from the metadata:
| Category | Value |
|---|---|
| Images | 48_562 |
| Training images | 23_929 |
| Validation images | 24_633 |
| Classes (species with their subcategories) | 555 |
| Species | 405 |
| Average number of images per class in the training set | 43 |
| Average number of images per species in the training set | 59 |
| Images min width (px) | 90 |
| Images max width (px) | 1024 |
| Images mean width (px) | 899 |
| Images min height (px) | 98 |
| Images max height (px) | 1024 |
| Images mean height (px) | 712 |
Save DataFrame to Parquet
To make it easier to retrieve information from the metadata later on, we can save the DataFrame to file.
Parquet is an open-source, columnar, and extremely efficient binary file format for tabular data. Unlike CSV or JSON files, the data is compressed, which makes it efficient in terms of storage space. It is also excellent for query performance. Always prefer it over text-based formats.
metadata.write_parquet('metadata.parquet')
Our metadata is ready. We can now start working with the pictures.
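For reference, whenever we need this metadata again (in this notebook or in another script), reading it back is a one-liner, and we can also scan it lazily if we only need part of it:
metadata = pl.read_parquet('metadata.parquet')
# Or build a lazy query that only reads what is needed when .collect() is called
lazy_metadata = pl.scan_parquet('metadata.parquet')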