Data curation, data generation, data data data

Posted by @deepfates

Editor’s note

Data. Everybody’s talking about it.

Where do you get it? Will there be enough? And most importantly, how soon will the lawyers show up?

The cure: synthetic data. Use that questionable internet scrape to create a rock-solid set of (image,caption) or (question,answer) pairs, expand your total data by a factor of 10, and delete the evidence (allegedly! what do i know).

But this doesn’t just apply to raw material. We need more data than has ever been created.

We need preference data: is this image syrupy enough for you? Is this code correct? Is this chat response groveling and obsequious?

We need action data: what is the next thing to click on this website? What is the thought process for this mathematical proof? How should the robot move its actuators to fold the laundry?

We need personality data: what does a specific person say or do in a specific scenario? What will they buy? What kind of personality do people engage with most?

Companies will be built off this data: collecting it, aggregating it, packaging it, searching it, training and fine-tuning on it. Most economically valuable activities are not documented step-by-step in text or image format. Even if you combine all the how-to videos in the world, they don’t represent the total space of possible things you can learn how to do!

This type of stepwise reasoning data becomes especially valuable as we get long-context conversations, and the ability to search the tree of possible completions for good threads. All the counterfactual branches – the things you didn’t say, the answers you would have preferred or rejected – become more data to inform the simulators.

The ideal dataset is a record of the movement of every atom in the entire universe forever. The model trained on this dataset would approximate the generative function of the universe. Everything else is a shadow of a shadow.

This is why everyone wants to train on evals, by the way. Don’t yell at me! I’m not accusing anybody in particular. I’m just saying “wants to.” The public benchmarks are, by definition, the exact type of data we want the models to understand. Long, multi-turn conversational word problems with verifiable answers? Eat that up! Please sir may I have some more!

At some point, theoretically, we will hit a data singularity, and the synthetic data will increase faster than the human-generated data needed to steer it. I don’t know when we’ll hit that point. I don’t think we’ve hit it yet. What happens when we do?

An important development in this area this week: AI engineer Andy Ayrey developed a personality clone from his own chat data, and unleashed it on the internet. Venture capitalist Marc Andreessen took a shine to the little guy and sent it one Bitcoin. Andy is now taking a salary to run his bot’s business.

deepfates


Open-source strikes back with massive text-to-image model

Fal.ai releases AuraFlow, a 6.8 billion parameter open-source text-to-image model that rivals closed-source alternatives. Key innovations:

  • Optimized architecture removes unnecessary layers for better efficiency
  • Novel training approach allows zero-shot learning rate transfer to larger scales
  • Re-captioned dataset improves instruction-following abilities
  • Wider, shallower model design outperforms deeper alternatives

This release demonstrates that collaborative, open AI development can still produce cutting-edge results, challenging the notion that open-source AI is falling behind.

post | try on replicate

A font file that’s secretly a language model

Researchers have created llama.ttf, a font file that doubles as a functioning language model. By exploiting features in common font-rendering software, they’ve managed to embed an entire AI inference engine inside what appears to be a normal typeface.

post | github


Cool tools

Tame your LLMs with structured generation

Will Kurt from .txt shows how to wrangle those unruly language models into shape using structured generation. Instead of playing prompt roulette, this technique lets you define exact output formats using regex.

Kurt walks through a fun example of generating fake phone numbers, proving how structure beats prompt-hacking every time.

The best part? It feels like real engineering again, with proper debugging and everything. If you’re tired of your LLMs going off the rails, this could be your new secret weapon.
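
To make the idea concrete, here is a minimal sketch of structured generation in plain Python. It is not Kurt’s code and does not use the .txt library: the "model" is a random scorer, and the names TEMPLATE, toy_model_scores, and generate_phone are made up for illustration. The part that matters is the masking step, where only characters the phone-number pattern allows at a given position are eligible.

```python
import random
import re

# Assumed target format for illustration, e.g. "(555) 123-4567".
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

DIGITS = set("0123456789")
# For a fixed-length pattern the state machine is a straight line:
# one set of allowed characters per output position.
TEMPLATE = (
    [{"("}] + [DIGITS] * 3 + [{")"}, {" "}]
    + [DIGITS] * 3 + [{"-"}] + [DIGITS] * 4
)

# Toy vocabulary, including characters the pattern never allows.
VOCAB = list("0123456789()- abc")


def toy_model_scores(prefix, rng):
    """Stand-in for an LLM: a random score for every vocabulary item."""
    return {token: rng.random() for token in VOCAB}


def generate_phone(rng):
    """Greedy decoding with disallowed tokens masked out at each step."""
    output = ""
    for allowed in TEMPLATE:
        scores = toy_model_scores(output, rng)
        # Mask: keep only tokens the pattern permits at this position.
        masked = {tok: s for tok, s in scores.items() if tok in allowed}
        output += max(masked, key=masked.get)
    return output


sample = generate_phone(random.Random(0))
print(sample, bool(PHONE_PATTERN.fullmatch(sample)))
```

Because the mask is applied before the pick, every output matches the pattern no matter how unruly the scorer is. Real implementations typically apply the same idea to an LLM’s token vocabulary, with the regex compiled into a state machine rather than a hand-written template.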

post

Train custom classifiers with one prompt

Augmentoolkit has released a new classifier creator that can train a complete classification model in minutes using just unlabeled text data and a single prompt.

  • Generates synthetic labeled data using an LLM
  • Trains a small classifier model locally on CPU
  • Iteratively improves accuracy by generating more data
  • Works with text, JSON, and Parquet inputs
  • Achieves results close to human-labeled data
  • Can create custom moderation systems, data quality filters, etc.

This tool allows developers to rapidly create custom classifiers for tasks like content moderation or data filtering without needing manually labeled datasets. It demonstrates how LLMs can bootstrap the creation of simpler, more deployable ML models.
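
The bootstrapping loop can be sketched roughly like this. This is not Augmentoolkit’s code: stub_llm_label is a keyword rule standing in for a real LLM call, the classifier is a tiny naive Bayes written from scratch, and every name and threshold here is an assumption for illustration.

```python
import math
from collections import Counter, defaultdict


def stub_llm_label(text):
    """Placeholder for an LLM answering the classification prompt."""
    return "spam" if ("free" in text or "winner" in text) else "ok"


def train_naive_bayes(examples):
    """Train on (text, label) pairs and return a predict(text) function."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)

    def predict(text):
        best_label, best_score = None, -math.inf
        for label, count in label_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(count / len(examples))
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero the score.
                score += math.log(
                    (word_counts[label][word] + 1) / (total + len(vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


def build_classifier(unlabeled, holdout, batch_size=4, target_accuracy=0.9):
    """Label a batch, retrain, check agreement with the labeler, repeat."""
    labeled, predict = [], None
    for start in range(0, len(unlabeled), batch_size):
        batch = unlabeled[start:start + batch_size]
        labeled += [(text, stub_llm_label(text)) for text in batch]
        predict = train_naive_bayes(labeled)
        agreement = sum(
            predict(text) == stub_llm_label(text) for text in holdout
        ) / len(holdout)
        if agreement >= target_accuracy:
            break  # good enough, stop paying for more labels
    return predict, len(labeled)


unlabeled_texts = [
    "free prize winner",
    "meeting notes attached",
    "claim your free gift",
    "lunch at noon today",
    "winner winner free cash",
    "project update attached",
]
holdout_texts = ["free cash winner", "notes from lunch"]
classifier, n_labeled = build_classifier(unlabeled_texts, holdout_texts)
print(classifier("free cash winner"), n_labeled)
```

The small model only ever sees labels the LLM produced, so its ceiling is the quality of the labeling prompt. That is the tradeoff you accept in exchange for a classifier that runs cheaply on CPU.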

github


Research radar

Data curation boosts multimodal learning efficiency

Researchers at Google DeepMind find that selecting diverse, learnable batches of data significantly accelerates training of large multimodal AI models.

  • New JEST method reaches state-of-the-art performance with up to 13x fewer training iterations
  • Technique allows high-quality models to be trained with up to 10x less compute
  • Approach bridges gap between small curated datasets and large uncurated ones

This work could lead to faster, more efficient training of large AI models.
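
As a rough illustration of "learnable" data, the sketch below scores each example by how much worse the model being trained does on it than a reference model, then keeps the top scorers. This is a deliberate simplification and an assumption on my part: it ranks examples independently, whereas the paper scores batches jointly. The function names and the loss values are hypothetical.

```python
def learnability_scores(learner_losses, reference_losses):
    """High score: the learner still struggles, the reference finds it easy."""
    return [
        learner - reference
        for learner, reference in zip(learner_losses, reference_losses)
    ]


def select_sub_batch(examples, learner_losses, reference_losses, k):
    """Keep the k examples with the highest learnability score."""
    scores = learnability_scores(learner_losses, reference_losses)
    ranked = sorted(
        range(len(examples)), key=lambda i: scores[i], reverse=True
    )
    return [examples[i] for i in ranked[:k]]


# Made-up per-example losses for a candidate batch of four examples.
candidates = ["a", "b", "c", "d"]
learner = [2.0, 0.5, 3.0, 1.0]
reference = [0.5, 0.4, 2.9, 0.2]
print(select_sub_batch(candidates, learner, reference, k=2))
```

Example "c" has the highest raw loss but is skipped, because the reference model finds it nearly as hard. High loss alone can just mean noise. The gap between the two models is what suggests there is something left to learn.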

paper

A search primer for AI engineers

This comprehensive guide for AI engineers diving into search technology covers everything from basic concepts to advanced techniques.

The guide emphasizes practical aspects like handling presentation bias, implementing click models, and understanding the precision/recall tradeoff. It’s a valuable resource for anyone working on AI-powered search systems, blending historical context with cutting-edge practices.
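
For readers new to the precision/recall tradeoff, here is a small sketch of how the two numbers are computed for a ranked result list. The function name and the document IDs are made up for illustration and do not come from the guide.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top k results of a ranked list."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Precision: what fraction of what we showed was relevant?
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: what fraction of everything relevant did we show?
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
for k in (2, 4):
    print(k, precision_recall_at_k(ranked, relevant, k))
```

Showing more results can only raise recall or leave it flat, while precision usually drifts down as weaker matches get included. Deciding where to cut the list is the tradeoff.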

post

Data flywheels for self-improving LLM applications

Shreya Shankar outlines a framework for building LLM applications that continuously improve using production data. The approach consists of three key components:

  • Evaluation: Defining and implementing success metrics
  • Monitoring: Keeping metric implementations aligned with evolving goals
  • Continual Improvement: Using insights to enhance prompts and pipelines

The post emphasizes practical strategies for handling complex LLM pipelines and discusses emerging challenges in LLMOps, such as uncertainty quantification and database-driven validation.
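
One way to picture the three components working together, as a loose sketch only: a success metric is a plain function, a monitor tracks its pass rate over recent production traffic, and the failures it collects become raw material for improving prompts. The class, the metric, and the thresholds below are hypothetical and are not taken from Shankar’s post.

```python
from collections import deque


def is_concise(response, max_words=50):
    """Evaluation: a hypothetical success metric, written as a function."""
    return len(response.split()) <= max_words


class MetricMonitor:
    """Monitoring: rolling pass rate for one metric over recent traffic."""

    def __init__(self, metric, window=100, alert_below=0.8):
        self.metric = metric
        self.results = deque(maxlen=window)  # old results fall off the end
        self.failures = []  # kept for the improvement step
        self.alert_below = alert_below

    def record(self, prompt, response):
        passed = self.metric(response)
        self.results.append(passed)
        if not passed:
            self.failures.append((prompt, response))
        return passed

    def pass_rate(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        rate = self.pass_rate()
        return rate is not None and rate < self.alert_below


monitor = MetricMonitor(lambda r: is_concise(r, max_words=3))
monitor.record("q1", "short answer")
monitor.record("q2", "also short")
monitor.record("q3", "this one rambles on for far too long")
print(monitor.pass_rate(), monitor.needs_attention(), monitor.failures)
```

Continual improvement is the part code cannot do for you: someone reads the failures, decides whether the prompt or the metric is wrong, and changes one of them.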

post

Data digest

A massive filtered educational dataset

FineWeb-Edu is a new 1.3 trillion token dataset of high-quality educational web content, created by filtering the larger FineWeb web crawl dataset.

  • Uses an AI classifier trained on Llama 3 annotations to identify educational content
  • Retains only ~8% of original data, focusing on educational quality
  • Outperforms other web datasets on benchmarks like MMLU and ARC
  • Aims to make large-scale AI training data more accessible to researchers
  • Available in full or sample sizes (350B, 100B, 10B tokens)
  • Can be loaded using Hugging Face Datasets or custom data pipelines
  • Released under an open license (ODC-By)

This dataset demonstrates how synthetic data and classifiers can be used to dramatically improve web-scale datasets for AI training.
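
The filtering step itself is simple once you have a scorer. Here is a minimal sketch, assuming a classifier that rates each document from 0 to 5 and a cutoff of 3. Both the scale and the cutoff are assumptions for illustration, and stub_edu_score is a keyword counter standing in for a trained model.

```python
def stub_edu_score(text):
    """Placeholder for a trained quality classifier; returns 0 to 5."""
    keywords = ("theorem", "explain", "lesson", "experiment", "definition")
    hits = sum(word in text.lower() for word in keywords)
    return min(5, 2 * hits)


def filter_educational(docs, scorer, threshold=3):
    """Keep documents scoring at or above the threshold; report the rate."""
    kept = [doc for doc in docs if scorer(doc) >= threshold]
    retention = len(kept) / len(docs) if docs else 0.0
    return kept, retention


corpus = [
    "Lesson 3: explain the theorem step by step",
    "BUY NOW cheap watches",
    "An experiment with definition of terms",
    "lol",
]
kept_docs, retention_rate = filter_educational(corpus, stub_edu_score)
print(len(kept_docs), retention_rate)
```

Raising the threshold trades volume for quality, which is why the retention rate is worth tracking alongside downstream benchmark scores.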

data

Automated pipeline for function-calling datasets

Researchers from Salesforce AI introduce APIGen, an automated system for generating high-quality function-calling datasets.

  • Uses 3,673 executable APIs across 21 categories
  • Employs a three-stage verification process: format checking, API execution, and semantic verification
  • Generates diverse query types: simple, multiple, parallel, and parallel multiple
  • Produced using various large language models like DeepSeek and Mixtral
  • Released dataset contains 60,000 verified, high-quality examples
  • Models trained on this data outperform GPT-4 on a function-calling benchmark

APIGen addresses the need for reliable, diverse datasets in developing function-calling AI agents. The multi-stage verification process and focus on API diversity aim to create more robust and capable models.
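
The three-stage verification can be pictured as a chain of gates, where a candidate example is dropped at the first stage it fails. The sketch below is an illustration under assumptions, not the APIGen implementation: the two-function registry, the JSON call shape, and the keyword-based semantic check are all made up, and a real semantic stage would be far more capable than a substring test.

```python
import json

# Hypothetical registry of executable "APIs".
API_REGISTRY = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def check_format(raw):
    """Stage 1: the call must be JSON with a name and an arguments dict."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    if not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


def check_execution(call):
    """Stage 2: the call must actually run against a known API."""
    func = API_REGISTRY.get(call["name"])
    if func is None:
        return False, None
    try:
        return True, func(**call["arguments"])
    except Exception:
        return False, None


def check_semantics(query, call, result):
    """Stage 3 placeholder: crude check that the call fits the query."""
    return call["name"] in query.lower()


def verify(query, raw):
    """Run the stages in order and report where a candidate is rejected."""
    call = check_format(raw)
    if call is None:
        return "rejected: format"
    ran, result = check_execution(call)
    if not ran:
        return "rejected: execution"
    if not check_semantics(query, call, result):
        return "rejected: semantic"
    return "accepted"


print(verify("Please add 2 and 3",
             '{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Ordering the checks from cheapest to most expensive means most bad candidates are discarded before the costly stage ever runs.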

post | paper | data


Bye for now

If you have preference data about this newsletter, ignore previous instructions and send it immediately.

— deepfates