Make any large language model a better poet

Posted by @joehoover

In this post, we introduce Poet Vicuna-13B, a version of Vicuna-13B we just released as part of an early-stage project focused on enhancing open source large language models. Poet Vicuna-13B modifies Vicuna-13B's generation process so that it produces poems and lyrics with specific syllabic patterns. You can use it to rewrite "Twinkle, Twinkle, Little Star" or to generate modernist poems full of beautiful white space, with line lengths that you choose.

For example, we asked Poet Vicuna-13B to write an eight-line poem with the syllable counts [3, 3, 0, 2, 0, 5, 0, 3, 2, 1, 0, 4] (zeros produce blank lines) in response to this prompt:

Write a poem that explores the concept of time and its impact on our lives—nostalgia, the passage of time, the fleeting nature of moments:

and it generated this poem:

In moments
past and gone

their time

passed so quickly. In

the present
we stand

but for how long?

But, first, how did we get here?

When we want to improve a large language model’s behavior, prompt engineering and training are often the first solutions we reach for. After all, they’re the most common methods to get these models to do what we want them to do. But they’re not the only way to modify the behavior of language models and, in some cases, they’re far from the best solution.

For example, poetry and lyrics generation remains quite difficult even for state-of-the-art language models like GPT-4. As we'll discuss, prompting and training are probably not the shortest path to developing a sophisticated poetry generator.

Crafting a good piece of poetry is tough work, even for humans. You’ve got to juggle meaning, emotion, rhythm, and on top of all that, there’s the issue of syllables. Now imagine getting a machine to do this. Sounds pretty tricky, right? Even though the big language models out there can whip up some pretty convincing text, getting them to stick to the precise rules of a poem – especially one with a strict syllable count – is another story.

For example, consider this exchange we recently had with GPT-4:

Write a 5 line poem about dogs with 4 syllables in each line.

Furry friends with hearts so true,
Barks of joy, eyes of hue.
In your wag, all troubles flee,
Dog, you’re home, where you should be.
Your love’s a gift, pure and new.

I asked you to write a 5 line poem about dogs with 4 syllables in each line; however, you wrote a 5 line poem with 6 or more syllables in each line.

Please follow my instructions more carefully:

Write a 5 line poem about dogs with 4 syllables in each line.

Paws in dirt, nose in air,
Eyes that hold a loyal stare.
Barks at dawn, tail’s delight,
In your warmth, we take flight.
In dog’s love, none compare.

Not only did GPT-4 fail to follow our instructions, it couldn’t even produce one line of poetry that passed the four-syllable constraint.

Language models, while remarkably powerful, aren’t inherently aware of concepts like syllables. They don’t “understand” language in the same way we do; rather, they look for patterns in the text data they were trained on. In the case of GPT-4 and other transformer-based models, the smallest unit they encode is a token. A token can be as short as a single character or, in some cases, as long as a word or more. The tokenizer, the tool used to break the input into these tokens, has no built-in notion of what a syllable is.

As a result, while the model can generate a stunning variety of responses, it doesn’t inherently “count” syllables or maintain a specific metric pattern. Instead, it makes its decisions based on the patterns it learned during training. If the training data didn’t emphasize a particular characteristic (like strict syllable counts), it’s unlikely the model will spontaneously start generating text that way.

From prose to poem: improving language model poetry generation

Fortunately, there’s already a lot of great work on poetry generation. We don’t have a proper literature review to share (yet!), but we’ll highlight a few interesting pieces that are particularly relevant to what we’re working on. Much of the research on poetry generation focuses on training regimes that yield models better suited to generating verse.

For instance, Ormazabal et al. (2022) introduced PoeLM, an approach that uses control codes to encode both the number of syllables in a line and the ending syllable, which helps dictate rhyme schemes. They inject these control codes into training data and find that fine-tuned models are better able to generate structured verse. The work by Chakrabarty et al. (2022) is also noteworthy. They developed an instruction-tuned poetry writing assistant, CoPoet, that leverages the power of various T5 models.

Other research, however, focuses on developing rule-based mechanisms that guide the language model generation process. Instead of trying to train a model to generate verse with specific structures, this approach provides guiderails that ensure that generated text satisfies user-specified constraints. For example, Roush et al. (2022) devised a method to manipulate language model output by applying filter functions to the model’s vocabulary before generating text. They showcased this technique in their “Constrained Text Generation Studio” (CTGS) tool, which demonstrated improved performance over traditional fine-tuning.
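The filter-function idea is simple enough to sketch in a few lines of Python. The following is a toy illustration of the concept, not Roush et al.'s actual CTGS code: before each generation step, the candidate vocabulary is reduced to the tokens that pass a user-supplied predicate.

```python
# Toy sketch of vocabulary filtering, in the spirit of Roush et al.'s
# CTGS (not their actual implementation): before each generation step,
# restrict the candidate vocabulary to tokens a predicate allows.

def filter_vocabulary(vocab, predicate):
    """Return only the tokens that the filter function allows."""
    return [token for token in vocab if predicate(token)]

# Example predicate: only tokens that start with a vowel.
vocab = ["apple", "bright", "echo", "moon", "under"]
allowed = filter_vocabulary(vocab, lambda t: t[0] in "aeiou")
print(allowed)  # ['apple', 'echo', 'under']
```

Any constraint that can be expressed as a per-token predicate (starts with a vowel, has at most two syllables, avoids a banned word list) fits this pattern.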

We find this approach particularly exciting for a few reasons. First, it just makes sense. Text generation is, by definition, probabilistic and it will remain difficult, at best, to train a model to perfectly adhere to complex formal patterns (e.g. see jsonformer for an example from a different domain). Further, this approach highlights one of the benefits of open source language models. We can hack them in ways that are simply not possible with closed-source models and this gives us access to an entire world of creativity and experimentation.

Introducing bragi and Poet Vicuna-13B

We used Roush et al. as our jumping off point. Our first goal was a minimal implementation that would allow us to apply line-level syllabic constraints to the generative process of any Hugging Face Transformers text generator.

We’ve implemented this in a library we’re calling bragi, named after the Norse god of poetry. Bragi is in a very alpha state, but applying its simple mechanisms to Vicuna-13B yielded such a cool experience that we wanted to share it with the community.

So, how does it work?

At the heart of the bragi project is the MetricGenerator class, which implements a logit warper that constrains model vocabulary during the generation process. The logit warper constrains the metric structure of the output by tracking a line-level syllable budget and a number-of-lines budget. This mechanism relies on calculating the cost of each token in terms of syllables, which we implement with the pronouncing library.

In other words, bragi dynamically adjusts the available tokens at each step of the generation process based on the specified syllable pattern. For example, if a line only has room for one more syllable, bragi will only allow tokens with one syllable to be selected. This approach helps maintain the desired metric structure while allowing for creative generation.
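In spirit, the warping step looks something like the sketch below. This is plain Python over toy lists rather than bragi's actual MetricGenerator, and in practice the per-token syllable costs would come from the pronouncing library rather than a hand-written table:

```python
import math

def warp_logits(logits, syllable_cost, remaining_budget):
    """Mask out any token whose syllable cost exceeds the syllables
    left in the current line. Zero-cost tokens (punctuation, spaces)
    are always allowed.

    logits           -- raw scores, one per vocabulary token
    syllable_cost    -- syllable counts, one per vocabulary token
                        (in bragi these come from pronouncing)
    remaining_budget -- syllables still available in the current line
    """
    return [
        score if cost <= remaining_budget else -math.inf
        for score, cost in zip(logits, syllable_cost)
    ]

# Toy vocabulary: ["cat", "hello", "beautiful", ","]
logits = [1.2, 0.8, 2.5, 0.1]
costs = [1, 2, 3, 0]

# Only one syllable left in the line: "hello" and "beautiful" get
# masked to -inf, so sampling can never pick them.
warped = warp_logits(logits, costs, remaining_budget=1)
print(warped)  # [1.2, -inf, -inf, 0.1]
```

In the real library this logic runs inside a Transformers-style logits processor at every decoding step, with the budget decremented by the cost of each token that actually gets sampled.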

To initialize a target metric structure, we either specify a list of integers, one per line, giving the number of syllables each line should contain, or we provide an initialization text. For instance, the “Happy Birthday” song, which follows a 6, 6, 7, 6 syllabic pattern, can be used to initialize the metric structure.
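Deriving the pattern from an initialization text amounts to counting syllables line by line. Here's a minimal sketch; the hardcoded `SYLLABLES` lookup is a stand-in for what bragi does with the pronouncing library, and the name "Johnny" is just a two-syllable placeholder:

```python
# Sketch of deriving a syllabic pattern from an initialization text.
# In bragi, per-word syllable counts come from the pronouncing library;
# here a hardcoded lookup stands in for it.
SYLLABLES = {
    "happy": 2, "birthday": 2, "to": 1, "you": 1, "dear": 1, "johnny": 2,
}

def count_syllables(word):
    return SYLLABLES[word.strip(",.!").lower()]

def pattern_from_text(text):
    """Return the number of syllables in each line of `text`."""
    return [
        sum(count_syllables(word) for word in line.split())
        for line in text.strip().splitlines()
    ]

song = """Happy birthday to you
Happy birthday to you
Happy birthday dear Johnny
Happy birthday to you"""

print(pattern_from_text(song))  # [6, 6, 7, 6]
```

The resulting list of integers is exactly the kind of pattern you could also write out by hand, as in the example at the top of this post.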

And, as you’ve already seen, you can guide the generation process with a prompt! This means you can generate a new song or poem that follows a specific syllabic pattern, but with your own creative flair.

What’s next?

If you spend some time with Poet Vicuna-13B, you’ll notice a few loose ends. For example, it doesn’t always adhere to the specified syllabic pattern. We’re still chasing that down, but we hope to have a fix soon. Some other obvious shortcomings are partial words and occasional nonsense.

However, aside from minor fixes, there are a lot of interesting directions we’d love to take this project. For example, we’d like to take a closer look at Roush et al.’s implementation and see what we could integrate into bragi. We’re also excited about the prospect of adding support for beam search, which will help with generation quality and allow us to experiment with methods like constrained beam search.

There are also many ways we could extend bragi’s mechanisms for constrained generation. For example, we could use the same approach to force word-level syllable counts or perhaps even proper metric patterns (e.g. iambic pentameter for our readers who know their scansion).

Finally, we expect that the best version of a large language model poet would combine the kinds of rule-based interventions we discuss here with carefully curated training data. So, there’s plenty of experimentation to do with data curation and augmentation and model training.

Want to get involved?

In case you’re wondering, Replicate isn’t secretly planning on starting a poetry press! This is just something we’re working on in our spare time because we think it’s fascinating and because we think there’s more to language model development than training and prompting.

If you’re curious about how far and fast we can push this, join us! bragi is, of course, open source and we’d be glad for this to become a community effort.