Open source frontier image model, cut objects from videos, new Python web framework from Jeremy Howard

Editor’s note

Open source does not slow down. New models are released as fast as I can try them out. There’s so much to do, so many new creative tools in our hands. Every individual can now be a startup, a brand, a publisher, a movie studio, all from their laptop or their phone.

These new AI tools make it easier to generate new things. It’s even become a buzzword in some circles: “GenAI”. It’s certainly more impressive, and more obvious, when a neural net generates something than when it just recognizes patterns in data. But they’re the same thing: we ask the model to spit out patterns where there aren’t any, and it gives us the data it learned to recognize. The model is a mirror of the world.

Anton Troynikov, co-founder of the vector database company Chroma, posted a thread today about how to think about AI application development. It’s good advice; you should read it. The bit I want to highlight here is slide 3:

what is an llm? as far as application development is concerned, an llm is:

an information processing system that processes unstructured information in a common sense way that can be accessed over an api

This is crucial. It processes (recognizes, generates) information (everything else in the computer) in a common sense way – that is, by reflecting it in the mirror of the world. And Anton is right, this is miraculous! We can use it to automate so many processes we’ve been doing manually for decades. We can unbundle intelligence. This is a great thing for software, and for the world at large.

But we must remember that the mirror is made of our language, our videos, the things we chose to share. It’s not a simulation of the world. It’s a simulation of us.

When we use the mirror to generate images, and voices, and stories and records and judgments and all those other unstructured bits of information that used to belong to humans alone, we amplify our own reflection.

We already see this in the consumer market. The only thing more widely adopted than the AI selfie is the AI romance chatbot. We will see ourselves, and each other, through the mirror. With all its biases and flaws. That feedback loop will continue, as future models are trained on the data generated today. Common sense will change, and the world will change to match it.

This might sound depressing, but it’s what humans do. We change ourselves, our environments, our relationships, and then we go on doing our weird funny little human things in the new system of the world, and repeat. That’s what’s speeding up, and every new capability we unlock will spin that loop a little faster. But the thing about feedback loops is, they’re very sensitive to initial conditions.

We’re still early. If you’re reading this, you personally have a great deal of influence over that future world. It will be defined by our actions today. The future is yours to shape.

You know what they say:

You can kiss yourself in the mirror, but only on the lips. - Neil deGrasse Tyson

deepfates


A powerful new image generator

FLUX.1 is an impressive new suite of image generation models developed by Black Forest Labs, a newly formed company from former Stable Diffusion developers. The models use an advanced “flow matching” technique that enables faster and more controllable generation compared to standard diffusion.

FLUX.1 comes in three variants:

  • FLUX.1 [pro]: the flagship model with state-of-the-art performance
  • FLUX.1 [dev]: a distilled version for non-commercial use
  • FLUX.1 [schnell]: the fastest model, released under the permissive Apache 2.0 license

The models excel at handling complex prompts, generating detailed text, composing intricate scenes, drawing realistic hands, and producing diverse outputs. They support a range of aspect ratios and resolutions up to 2 megapixels.

FLUX.1 is available to try hands-on via the Replicate API. The open-source license for FLUX.1 [schnell] will be particularly attractive for developers and researchers.
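
If you want to kick the tires, here’s a minimal sketch of calling the [schnell] variant from the Replicate Python client. The model slug and input fields are my assumptions from Replicate’s usual conventions, so check the model page for the exact parameters:

    # pip install replicate  (and set REPLICATE_API_TOKEN in your environment)
    import replicate

    # Model slug and inputs assumed from Replicate conventions; see the model page
    output = replicate.run(
        "black-forest-labs/flux-schnell",
        input={
            "prompt": "a lighthouse made of mirrors at golden hour",
            "aspect_ratio": "1:1",
        },
    )

    # Print the generated image output(s), typically URLs or file objects
    for image in output:
        print(image)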

homepage | try on Replicate

A new standard for image and video cutouts

Meta released SAM 2, the next generation of their Segment Anything Model. It enables real-time object segmentation in images and videos.

SAM 2 extends the original model’s zero-shot capabilities to video, segmenting objects even in domains it hasn’t seen before. It outperforms prior approaches while running 3x faster.

Meta open-sourced the code, model, and new SA-V dataset with 600k annotations on 51k videos. This opens up applications like:

  • Video editing
  • Faster data labeling
  • Real-time background removal
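
If you want to run it yourself, here’s a rough sketch of single-image segmentation following the pattern in Meta’s repo. The checkpoint and config filenames are assumptions; the video predictor works along the same lines:

    # pip install git+https://github.com/facebookresearch/segment-anything-2.git
    import numpy as np
    import torch
    from PIL import Image
    from sam2.build_sam import build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    # Checkpoint and config names are assumptions; grab the real ones from the repo
    predictor = SAM2ImagePredictor(
        build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")
    )

    image = np.array(Image.open("frame.jpg").convert("RGB"))

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        predictor.set_image(image)
        # Prompt with a single foreground click; boxes and masks also work
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[512, 384]]),
            point_labels=np.array([1]),
        )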

post | github | dataset | try on Replicate

A small, fine-tunable language model with a lot of intelligence

Google released Gemma 2 2B, a powerful new addition to their Gemma family of open language models. Despite its compact 2 billion parameters, Gemma 2 2B delivers outsized performance, even surpassing all GPT-3.5 models on the LMSYS Chatbot Arena leaderboard. Some note that this may say more about the leaderboard than the model, though, since Gemma 2 2B compares unfavorably on the MMLU benchmark.

What’s exciting about Gemma 2 2B:

  • Best-in-class results for its size category
  • Optimized to run efficiently on a wide range of hardware, from edge devices to the cloud
  • Weights available under a permissive license, with a free Colab demo
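
For a quick local test, here’s a minimal sketch using Hugging Face Transformers with the instruction-tuned checkpoint. The model ID and output format are assumptions based on the Gemma release, so double-check against the model card:

    # pip install -U transformers accelerate  (accept the Gemma license on the Hub first)
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="google/gemma-2-2b-it",   # instruction-tuned 2B checkpoint
        device_map="auto",
    )

    messages = [{"role": "user", "content": "Explain image segmentation in one sentence."}]
    result = generator(messages, max_new_tokens=128)

    # Recent transformers versions return the chat history with the reply appended
    print(result[0]["generated_text"][-1]["content"])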

blog post | technical report | try on Replicate


Cool tools

New tool for understanding and controlling large language models

Google’s Gemma Scope is an exciting new tool for understanding what’s going on inside large language models. It uses sparse autoencoders (SAEs) to unpack the dense information processed by Gemma models, giving researchers an unprecedented view into their inner workings.

Over 400 SAEs are available, covering all layers of the Gemma 2 2B and 9B models. By studying the patterns revealed by these expanded views, we can gain insights into how Gemma models identify concepts, process information, and make predictions.
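
Conceptually, each SAE expands a dense activation vector into a much wider, mostly-zero vector of candidate features, then reconstructs the original. Here’s a toy sketch of that encode/decode step; the dimensions are illustrative and it uses a plain ReLU rather than Gemma Scope’s exact training setup:

    import torch

    d_model, d_sae = 2304, 16384    # illustrative sizes, not Gemma Scope's exact config

    # A trained SAE learns these weights; here they're random just to show the shapes
    W_enc = torch.randn(d_model, d_sae) * 0.01
    W_dec = torch.randn(d_sae, d_model) * 0.01
    b_enc, b_dec = torch.zeros(d_sae), torch.zeros(d_model)

    def sae(activation):
        # Encode: expand into a wide feature space where few features fire at once
        features = torch.relu(activation @ W_enc + b_enc)
        # Decode: reconstruct the original activation from those sparse features
        return features, features @ W_dec + b_dec

    acts = torch.randn(d_model)     # stand-in for a residual-stream activation
    features, reconstruction = sae(acts)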

This connects to the broader trend we’ve been tracking around making AI systems more interpretable and steerable. Remember Golden Gate Claude and Garden State Llama?

Gemma Scope has the potential to significantly advance interpretability research and shape the future of accountable AI development. It’s a powerful new resource for the AI community to explore.

blog post | technical report | try on neuronpedia

A new Python web framework from the creator of fast.ai

FastHTML is a new Python web framework that aims to simplify interactive application development. It lets you build apps using pure Python, without the need for separate HTML templates. FastHTML provides built-in support for essentials like authentication, databases, caching, and styling. Deploying your creations to platforms like Railway and Vercel is straightforward.

Under the hood, FastHTML leverages fundamental web standards and technologies. HTML elements are represented using Python “FT” (FastTag) objects. The framework is designed to be extensible with custom Python components.
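
Here’s a minimal hello-world sketch adapted from the pattern in the FastHTML docs; treat the details as assumptions rather than canonical usage:

    # pip install python-fasthtml
    from fasthtml.common import *

    app, rt = fast_app()

    @rt("/")
    def get():
        # FT objects like Titled and P are Python functions that render to HTML
        return Titled(
            "Hello from FastHTML",
            P("No templates, just Python returning components."),
        )

    serve()  # starts a local dev server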

FastHTML was created by Jeremy Howard, the co-founder of fast.ai, to make it easier for developers to build interactive web apps with Python. The FastHTML docs even include an example of building an image generation app powered by the Replicate API.

Whether you’re building a small prototype or a production-scale application, FastHTML is worth checking out. The team behind it is looking for contributors to help expand the ecosystem.

homepage | docs | vision


Research radar

A distributed training method for large language models

A new paper shows how billion-parameter language models can be trained in a distributed way, without the need for a massive centralized data center. The technique, called federated learning, lets organizations collaborate to train models using their own data and compute resources. The researchers demonstrated that these federated models perform just as well as traditional centralized models.

While still an emerging area, federated learning could one day allow large AI models to be trained by collectives of participants across the world, similar to file-sharing networks. This is an important step towards democratizing access to frontier AI capabilities.
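
The core move in this family of methods is simple: each participant trains on its own data, and only model updates get shared and combined. Here’s a toy sketch of that aggregation step, in the spirit of classic federated averaging rather than this paper’s exact recipe:

    import torch.nn as nn

    def federated_average(client_states, client_sizes):
        """Average client weights, weighted by how much data each client trained on."""
        total = sum(client_sizes)
        return {
            name: sum(state[name] * (n / total) for state, n in zip(client_states, client_sizes))
            for name in client_states[0]
        }

    # Toy round: two organizations fine-tune their own copies, then the server merges them
    clients = [nn.Linear(8, 2), nn.Linear(8, 2)]
    merged = federated_average([c.state_dict() for c in clients], client_sizes=[1000, 3000])

    global_model = nn.Linear(8, 2)
    global_model.load_state_dict(merged)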

paper


Bye for now

What are you going to do with these new tools? How will you project yourself into the wild new future? Let me know by sending an email response.

You remember how to do that, right? Email. It’s common sense.

— deepfates