
How I Turned ChatGPT into an SQL-Like Translator for Image and Video Datasets

VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

Unlike tabular data, datasets for computer vision tasks are unstructured — think gobs of pixels, heaps of labels, bags of tags, and some sometimes-structured metadata. Nevertheless, those of us working in computer vision still need to be able to efficiently sift through those troves of data in order to understand the dataset, prepare train and test sets, find patterns, identify edge cases, and evaluate model performance.

When I need to understand visual data (which is basically all the time), I use the open source library FiftyOne, which defines a powerful Python syntax for querying computer vision data. It’s kind of like SQL for the unstructured world of computer vision data, and it allows me to programmatically filter, sort, and semantically slice datasets consisting of images, videos, and even 3D point clouds.

A few months back, with the ChatGPT craze in full swing, I saw an example application on OpenAI’s website of translating natural language input into SQL queries. While this application was quite basic, and computer vision data is much more complex, it got me thinking: could something similar be done for datasets of images and videos? In other words:

Can we harness the versatility of large language models (LLMs) to translate natural language queries into filtered views of unstructured computer vision datasets?

The answer? Yes we can!

Combining the general linguistic and inference capabilities of LLMs with FiftyOne’s query language, our team at Voxel51 built VoxelGPT: an open source AI Assistant which lets you comprehensively query your computer vision datasets without writing a single line of code!

We did it using langchain, tiktoken, and fiftyone.

You can try it for free at gpt.fiftyone.ai!

This post will walk you through the prompt engineering, software engineering, and copious trial and error that went into building a domain-specific LLM application.

Many times throughout the process, we hit a wall, and feared that the task was infeasible. If you are trying and struggling to build an LLM-powered application, I hope this post gives you the inspiration you need to break through your own walls!

The post is structured as follows:

The Query Language

VoxelGPT using natural language to query an image dataset. Image courtesy of the author.

Before we dive into how we used an LLM to generate queries, it’s worth taking a moment to describe the query language we want the model to translate. Here’s the bare bones info you will need. For a more thorough overview, see the FiftyOne User Guide. If you’re already familiar with the language, feel free to skip to the next section.

  • Dataset: A Dataset is the computer vision analog of a Table in SQL, or a DataFrame in pandas. It contains all of the information related to a collection of media files.
  • Sample: A Sample is analogous to a row in a data table. Samples are the atomic elements of a Dataset. Each sample has a filepath, which points to a media file, and stores all other info associated with that piece of data.
  • Field: A Field is analogous to a column in a data table in that it defines a property of samples (for example, image width, height, and filepath). However, fields are flexible in that they can contain other fields (see Label below).
  • Label: A Label is a Field that stores semantic ground truth or prediction info. For instance, object detections are stored in Detections label fields, and classifications are stored in Classification label fields.
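To make these definitions concrete, here is a minimal sketch of how a Dataset, its Samples, and Label fields fit together. The dataset name, field names, and file path below are hypothetical, chosen just for illustration:

import fiftyone as fo

# Create an empty Dataset (the name "example" is arbitrary)
dataset = fo.Dataset("example")

# Each Sample points to a media file via its filepath
sample = fo.Sample(filepath="/path/to/image.jpg")

# Label fields store semantic info: here, a classification and some detections
sample["weather"] = fo.Classification(label="sunny")
sample["ground_truth"] = fo.Detections(
    detections=[fo.Detection(label="dog", bounding_box=[0.1, 0.1, 0.4, 0.4])]
)

dataset.add_sample(sample)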

As in SQL or pandas, you can filter your data with query operations. For example, you may want to query a data table for:

All rows with an entry in column “A” greater than 0.5
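In pandas, that query would be something like df[df["A"] > 0.5]. The FiftyOne analog, assuming a numeric sample field named “A”, is a one-liner:

from fiftyone import ViewField as F

# Keep only samples whose "A" field exceeds 0.5 (the field name is hypothetical)
view = dataset.match(F("A") > 0.5)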

However, whereas data tables lend themselves to numeric slicing and filtering, unstructured computer vision data lends itself to semantic slicing, such as:

Retrieving the high confidence “dog” predictions with large bounding boxes in images with at least 3 non-“dog” ground truth detections

Semantically slicing unstructured data requires far more flexibility.

To achieve this level of flexibility, FiftyOne defines a set of ViewStage methods, which encapsulate allowed query operations like filtering, matching, selecting, sorting, and excluding. These methods are loosely analogous to SELECT, WHERE, and ORDER BY in SQL, but because the space of queries is much larger, there are far more methods, and each of these methods has many use cases. See this Views Cheat Sheet for a comprehensive discussion.
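As a rough sketch of those analogies, a few stages chained together might look like the following. The “predictions” and “uniqueness” fields are assumptions here (the latter would typically be computed separately, e.g. by the FiftyOne Brain):

view = (
    dataset
    .exists("predictions")                # roughly: WHERE predictions IS NOT NULL
    .sort_by("uniqueness", reverse=True)  # roughly: ORDER BY uniqueness DESC
    .limit(25)                            # roughly: LIMIT 25
)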

You can obtain an arbitrary subset of a Dataset, called a DatasetView, by composing multiple ViewStage operations in sequence.

Here’s what the query language looks like in action: given a dataset named my_dataset, if we wanted to get all “dog” detections in the “predictions” label field of 10 random images that contain a “cat” (in their “ground_truth” label field), we could use the following query:

from fiftyone import ViewField as F

view = my_dataset.match(
    F("ground_truth.detections.label").contains(["cat"])
).filter_labels(
    "predictions",
    F("label") == "dog"
).take(10)

match() restricts to images with a “cat” in their “ground_truth” detections field, then filter_labels() reaches inside the “predictions” field and retrieves only the “dog” detections, and finally take() randomly selects a subset.
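To connect this back to the earlier semantic slicing example, here is a sketch of how the “high confidence dog predictions with large bounding boxes” query might be expressed. The confidence and area thresholds, as well as the field names, are assumptions:

from fiftyone import ViewField as F

# Bounding boxes are [top-left-x, top-left-y, width, height] in relative
# coordinates, so width * height approximates the fraction of the image covered
box_area = F("bounding_box")[2] * F("bounding_box")[3]

view = my_dataset.match(
    # at least 3 ground truth detections that are not "dog"
    F("ground_truth.detections").filter(F("label") != "dog").length() >= 3
).filter_labels(
    "predictions",
    (F("label") == "dog") & (F("confidence") > 0.9) & (box_area > 0.3),
)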

Defining the Task

When we started this journey, we didn’t know what was possible with LLMs. Back in December 2022, I had played around with ChatGPT and even wrote an article Tunnel Vision in Computer Vision: Can ChatGPT See? probing the chatbot’s understanding of computer vision concepts. At the time, it was impressive that a general purpose language model could write mostly correct Python code for basic image processing tasks, but the thought of using GPT-generated code (without human verification) in computer vision applications seemed far-fetched.

After seeing the text-to-SQL example, and reading about GPT-4 reportedly passing the bar exam (among others) with flying colors, I decided to give GPT-for-CV another try. I asked GPT-4 to generate the Python code for a few simple queries and was pleasantly surprised by its responses. The accuracy of GPT-4’s out-of-the-box results left much to be desired, which was understandable given that I had not fed in any information about the FiftyOne library. But the prospect of making this work was tantalizing.

Require Uniformity

As we experimented with prompting techniques (we’ll get to that later), it became immediately apparent that if we wanted to incorporate this into larger pipelines, we’d need to clearly define the task.

Left to its own devices, the GPT model’s responses varied substantially in their format from prompt to prompt.

Here are just two examples, which we will analyze shortly:

Prompt 1:

Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.

Here is your first natural language query: “Images that only contain dogs”

Give me the FiftyOne code.

Response 1:


Sure, here’s the FiftyOne code that generates a ViewStage for images that only contain dogs:

import fiftyone as fo
import fiftyone.zoo as foz

# Load the dataset
dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    label_types=["detections"],
    classes=["Dog"],
)

# Create a view stage that filters out images without any dog detections
view_stage = fo.ViewStage("detections", fo.FilterField("detections", "label", "Dog"))

# Apply the view stage to the dataset
dataset = dataset.add_stage(view_stage)

# View the dataset
session = fo.launch_app(dataset)

This code loads the Open Images v6 validation dataset with only the “detections” label type for dogs. It then creates a view stage that filters out any images that do not have any dog detections. Finally, the view stage is applied to the dataset and launched in the FiftyOne App.

Prompt 2:

Your task is to convert input natural language queries into Python code to generate ViewStages for the machine learning library FiftyOne.

Here is your first natural language query: “most unique images with a false positive detection”

Give me the FiftyOne code.

Response 2:

Sure, here’s the FiftyOne code to generate ViewStages for the natural language query “most unique images with a false positive detection” using the FiftyOne library in Python:

import fiftyone as fo
import fiftyone.zoo as foz

# Load your dataset
dataset = fo.load_dataset("your_dataset_name")

# Define your false positive detection model
model = foz.load_zoo_model("your_model_name")

# Define your pipeline
pipeline = [
    {"$set": {"predictions": {"$predict": model}}},
    {"$match": {"predictions.mistakes.false_positive": True}},
    {"$group": {"_id": "$filepath"}},
    {"$count": "num_images"},
    {"$sort": {"num_images": -1}},
]

# Run your pipeline and display the results in a view stage
view = dataset.aggregate(pipeline).limit(10).display()

In this code, we first load the FiftyOne library and any necessary dependencies. We then load your dataset using fo.load_dataset().

Next, we define your false positive detection model using foz.load_zoo_model(). This function loads a pre-trained model from the FiftyOne Zoo Model Zoo, which includes a wide range of state-of-the-art models for computer vision.

After that, we define our pipeline using the $set, $match, $group, $count, and $sort aggregation stages. The pipeline first adds a field to each sample in your dataset that contains the model predictions using $set. We then filter the samples to only include those with a false positive detection using $match.

Next, we group the samples by their file paths using $group, and count the number of images in each group using $count. Finally, we sort the groups in descending order of the number of images using $sort.

Finally, we run the pipeline on your dataset and limit the results to the top 10 images with the most false positive detections using limit(). We then display the results in a view stage using display().
