High-Quality Dataset Research: The New Frontier in AI
This project digs deep into the relationship between data quality, model capabilities, and compute efficiency [1]. Our aim is simple yet transformative: to show that by improving the dataset itself, you can make your models smarter, cut costs, and, counterintuitively, outperform those who rely on scaling alone.
Synthetic Data: Amplifying Vision-Language Models
Want to see data quality in action? Explore our captioning-data-engine, where we’re investigating how synthetic image captions can supercharge Vision-Language Models (VLMs). Inspired by initiatives like LAION-POP [2], we’re not just building bigger datasets; we’re carefully refining and curating them. The result? Large, open datasets that blend quality and diversity, enabling smarter models and more efficient training [3].
Why Synthetic Data?
High-quality data doesn’t always have to come from human annotators. By using state-of-the-art VLMs like CogVLM2-llama3-chat-19B and InternVL2-8B, we create rich, context-sensitive captions that breathe life into image datasets. With synthetic captions:
We cut down on expensive human annotation cycles.
We raise the baseline of dataset quality, ensuring that every training run begins with better inputs than the last.
We achieve cleaner, richer, and more diverse data that models can learn from with fewer samples and less computational grind.
How Do We Do It?
Fast, Efficient Inference: Using lmdeploy, we rapidly generate high-quality captions, even for large datasets (see the captioning sketch after this list).
Built-In Quality Checks: CLIP-based classification helps ensure generated captions align with each image (see the alignment-scoring sketch below). However, CLIP isn’t perfect: it can inherit biases and limitations from its own training data, so regular audits and supplementary checks remain vital to preserving data integrity.
Shared Resources: We’ve released 30,000 synthetic image-caption pairs for you to explore, query, and adapt. They’re stored in a Qdrant vector database, giving you flexible control and powerful search capabilities over your data (see the retrieval sketch below).
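The snippet below is a minimal sketch of how batch captioning with lmdeploy might look. The checkpoint, prompt, image paths, and generation settings are illustrative assumptions rather than the exact configuration used in captioning-data-engine, and the pipeline signature can vary slightly across lmdeploy versions.

```python
# Hedged sketch: batch caption generation with lmdeploy's VLM pipeline.
# Model name, prompt wording, and file paths are illustrative placeholders.
from lmdeploy import GenerationConfig, pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")  # or a CogVLM2 checkpoint

prompt = "Describe this image in rich, specific detail."
image_paths = ["images/000001.jpg", "images/000002.jpg"]  # placeholder paths

# lmdeploy accepts (text, image) tuples and batches the requests internally.
requests = [(prompt, load_image(path)) for path in image_paths]
responses = pipe(requests, gen_config=GenerationConfig(max_new_tokens=256))

for path, response in zip(image_paths, responses):
    print(path, "->", response.text)
```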
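For the quality check, a caption-image alignment score can be computed with CLIP through Hugging Face transformers. This is a hedged sketch: the checkpoint and the 0.25 rejection threshold are assumptions, and CLIP’s 77-token text limit means very long synthetic captions are truncated before scoring.

```python
# Hedged sketch: score caption-image alignment with CLIP cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_image_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Route low-similarity pairs for manual review (threshold is illustrative).
if caption_image_similarity("images/000001.jpg", "a dog on a beach") < 0.25:
    print("caption likely misaligned; route for manual review")
```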
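Finally, here is a rough sketch of querying the released pairs in Qdrant. It assumes the stored vectors are CLIP embeddings and that a Qdrant instance is reachable locally; the URL, collection name, and payload field are placeholders, not the published schema.

```python
# Hedged sketch: semantic search over stored caption vectors in Qdrant.
import torch
from qdrant_client import QdrantClient
from transformers import CLIPModel, CLIPProcessor

# Embed a free-text query with CLIP so it lives in the same space as the
# (assumed) stored vectors.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(text=["a golden retriever playing in snow"],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    query_vec = model.get_text_features(**inputs)[0]
query_vec = (query_vec / query_vec.norm()).tolist()

# Connect and retrieve the closest caption records (placeholders throughout).
client = QdrantClient(url="http://localhost:6333")
hits = client.search(collection_name="synthetic_captions",
                     query_vector=query_vec, limit=5)
for hit in hits:
    print(f"{hit.score:.3f}", (hit.payload or {}).get("caption"))
```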
✅ The Result
Our synthetic captions consistently outperform original captions in alignment and detail, helping VLMs see the world more clearly. This, in turn, makes your models smarter, more efficient and better prepared to handle complex tasks without relying on brute-force scaling.
Hardware Resources
All computations and experiments were conducted on a single A100 SXM4 80GB GPU.
💻 Enhancements
To further improve data quality and address CLIP’s limitations, we’ll sample a subset of captions and verify their accuracy using more advanced models like ChatGPT and Claude. This approach uncovers subtle errors, strengthens validation, and elevates overall caption quality.
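As a sketch of what such a spot-check could look like, the snippet below asks a multimodal model to judge a sampled caption against its image through the OpenAI API. The model name, prompt wording, and JSON schema are assumptions, and a Claude version via the anthropic SDK would follow the same pattern.

```python
# Hedged sketch: LLM spot-check of a sampled image-caption pair.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def audit_caption(image_url: str, caption: str) -> dict:
    """Ask a stronger multimodal model whether the caption fits the image."""
    prompt = (
        "Does the caption accurately describe the image? "
        'Reply with JSON: {"accurate": true/false, "issues": "..."}.\n'
        f"Caption: {caption}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Audit one sampled pair (placeholder URL and caption).
print(audit_caption("https://example.com/000001.jpg", "a dog on a beach"))
```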
We’re also exploring data pruning to remove redundant, low-quality, or less informative samples, ensuring that each training run starts with only the most valuable data.
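A pruning pass along these lines might combine the CLIP alignment scores with simple near-duplicate removal over caption embeddings. The record fields and thresholds below are illustrative assumptions, not the project’s actual pruning criteria.

```python
# Hedged sketch: prune weakly aligned and near-duplicate samples.
import numpy as np

def prune(records, min_alignment=0.25, dedup_threshold=0.95):
    """records: dicts with a 'clip_score' float and a unit-norm 'embedding'."""
    # 1) Drop samples whose CLIP image-text similarity is too low.
    kept = [r for r in records if r["clip_score"] >= min_alignment]

    # 2) Greedy near-duplicate removal on caption embeddings (cosine
    #    similarity); O(n^2), fine for an illustration but not for huge sets.
    result, embeddings = [], []
    for record in kept:
        emb = np.asarray(record["embedding"], dtype=np.float32)
        if embeddings and np.max(np.stack(embeddings) @ emb) >= dedup_threshold:
            continue  # too similar to an already-kept sample
        result.append(record)
        embeddings.append(emb)
    return result
```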
References and Additional Reading
[1] The Platonic Representation Hypothesis (arXiv)
Summary: Proposes a theoretical perspective on how large language models learn and represent concepts, bridging abstract theory and practical model behaviour.
[2] LAION-POP (LAION Blog)
Summary: Discusses LAION’s efforts in curating and sharing massive open datasets for advancing vision-language tasks, emphasising quality and openness.
[3] FineWeb: Decanting the Web for the Finest Text Data at Scale (Hugging Face)
Summary: Outlines Hugging Face’s methodology for filtering and refining large-scale web text corpora, demonstrating how more curated data can enhance model performance and efficiency.


