MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer

Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 52 seconds. The predict time for this model varies significantly based on the inputs.


Cog implementation of facebookresearch/MAGNeT

Model Details

Organization developing the model: The FAIR team of Meta AI.

Model date: MAGNeT was trained between November 2023 and January 2024.

Model version: This is version 1 of the model.

Model type: MAGNeT consists of an EnCodec model for audio tokenization and a non-autoregressive, transformer-based model for music modeling. The model comes in two sizes, 300M and 1.5B parameters, and in two variants: one trained for text-to-music generation and one trained for text-to-sound generation.
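
To make "non-autoregressive" concrete, here is a toy sketch of iterative masked decoding over discrete tokens. It is not MAGNeT's implementation: `toy_predict` is a hypothetical stand-in for the transformer that returns random tokens and confidences, the integer tokens merely stand in for EnCodec codes, and the cosine schedule and confidence-based selection are assumptions chosen for illustration. The point is only that many positions are filled in parallel at each step, so the number of model calls is fixed by `steps` rather than by the sequence length.

```python
import math
import random

MASK = -1  # placeholder id for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for the transformer.

    Returns one (token, confidence) pair per position. A real model would
    condition on the text prompt and on already-decoded tokens; this one
    is purely random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_decode(seq_len, vocab_size, steps, seed=0):
    """Fill a fully masked sequence in a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        preds = toy_predict(tokens, vocab_size, rng)
        # Commit the most confident predictions among the masked positions.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # 50 tokens decoded with 10 model calls; token-by-token decoding would need 50.
    print(masked_decode(seq_len=50, vocab_size=1024, steps=10))
```

With `steps=10` and `seq_len=50`, the loop calls the stand-in model at most 10 times, whereas an autoregressive decoder would need one call per token.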

Paper or resources for more information: More information can be found in the paper Masked Audio Generation using a Single Non-Autoregressive Transformer.

Citation details: See our paper

License: Code is released under MIT, model weights are released under CC-BY-NC 4.0.

Where to send questions or comments about the model: Questions and comments about MAGNeT can be sent via the GitHub repository of the project, or by opening an issue.

Intended Use

Primary intended use: The primary use of MAGNeT is research on AI-based music generation, including:

  • Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science
  • Text-guided music generation that lets machine learning amateurs explore the current abilities of generative AI models

Primary intended users: The primary intended users of the model are researchers in audio, machine learning, and artificial intelligence, as well as amateurs seeking to better understand those models.

Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

Citation: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. Masked Audio Generation using a Single Non-Autoregressive Transformer.