
Getting Started with Meta's Audiocraft

Audiocraft is Meta's open-source toolkit for generating audio and music from text prompts. It includes MusicGen (for music) and AudioGen (for sound effects). Let me walk through getting it running.

Installation

First, make sure you have a GPU. The models run on CUDA and need significant VRAM. I'm using an RTX 4090 with 24GB, which handles the large model comfortably. If you have less, you'll need to use smaller models or CPU mode (which is painfully slow).

# Create a clean environment
conda create -n audiocraft python=3.10
conda activate audiocraft

# Install PyTorch with CUDA support (adjust CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Clone and install Audiocraft
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .

Basic Usage: MusicGen

Let's start with the Python API. This gives you more control than the CLI.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the model (options: small, medium, large, melody)
model = MusicGen.get_pretrained('facebook/musicgen-large')

# Set generation parameters
model.set_generation_params(
    use_sampling=True,
    top_k=250,
    duration=30  # seconds
)

# Generate from text prompt
prompt = "A lo-fi hip hop beat with jazz piano and soft drums"
wav = model.generate([prompt])

# Save the output - audio_write takes a stem name and appends .wav itself
audio_write('output', wav[0].cpu(), model.sample_rate, strategy="loudness")

The facebook/musicgen-large model is about 15GB and needs 16GB+ VRAM. If you're running out of memory, try musicgen-medium (7GB) or musicgen-small (3GB).

Continuation: Extending Audio

One of MusicGen's coolest features is continuation - you can feed it existing audio and it will extend it in the same style. Great for looping beats or extending intros.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(duration=30)

# Load your existing audio
wav, sr = torchaudio.load('existing_track.wav')

# Generate continuation - the sample rate is a required positional
# argument, and the text prompt goes in `descriptions`
prompt = "continue the melody with a variation"
output = model.generate_continuation(
    wav[0:1],  # first channel only, shape [1, time]
    sr,
    descriptions=[prompt]
)

audio_write('extended', output[0].cpu(), model.sample_rate)
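A single generate_continuation call is still bounded by the duration you set, so building a genuinely long track means repeatedly feeding the tail back in as the next prompt. The scheduling arithmetic can be sketched in plain Python; the 30-second window and 10-second overlap below are tunable assumptions, not Audiocraft constants:

```python
def extension_schedule(total_s, window_s=30, prompt_s=10):
    """Plan sliding-window continuation passes. The first window_s seconds
    come from a plain generate(); each later pass re-feeds the last
    prompt_s seconds and yields window_s - prompt_s seconds of new audio.
    Returns (prompt_start, segment_end) time pairs in seconds."""
    passes = []
    have = window_s
    while have < total_s:
        start = have - prompt_s
        have += window_s - prompt_s
        passes.append((start, min(have, total_s)))
    return passes
```

For a 60-second target this yields two continuation passes after the initial generation, each reusing a 10-second tail.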

Melody Mode: Guiding Generation

If you have a melody in mind, you can guide MusicGen with a reference audio. This is the "melody" model variant.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torchaudio

# Use the melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Your reference melody (can be a simple humming recording)
melody, sr = torchaudio.load('my_melody.wav')

# Generate music following that melody (chroma conditioning)
prompt = "upbeat electronic dance track with the given melody"
output = model.generate_with_chroma(
    [prompt],
    melody_wavs=melody[None],  # add a batch dimension
    melody_sample_rate=sr
)

audio_write('guided_music', output[0].cpu(), model.sample_rate)

Common Problems & Solutions

Issue #487: CUDA Out of Memory with Large Model
github.com/facebookresearch/audiocraft/issues/487

Problem: "RuntimeError: CUDA out of memory. Tried to allocate 2.5GB" when loading the large model on a 16GB GPU.

What I Tried: Reducing batch size to 1, clearing cache with torch.cuda.empty_cache() - neither helped.

Actual Fix: The model loads in full precision by default. Convert the language model to half precision (FP16). (Gradient checkpointing, often suggested for this, only saves memory during training - it does nothing for inference.)

import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-large')

# The transformer LM lives on model.lm; halving it roughly halves its
# weight memory. The compression model is small enough to stay in fp32.
model.lm = model.lm.half()

# Now generate
output = model.generate([prompt])
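For intuition on why FP16 helps: the large checkpoint's language model is roughly 3.3B parameters (an assumption based on the model card), and weight memory is just parameters times bytes per parameter:

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return round(params_billion * bytes_per_param, 1)

fp32_gb = weight_gb(3.3, 4)  # full precision: 13.2 GB
fp16_gb = weight_gb(3.3, 2)  # half precision: 6.6 GB
```

Halving the LM alone brings the weights comfortably under a 16GB card's budget, leaving room for activations and the compression model.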

Issue #312: "AttributeError: 'NoneType' object has no attribute 'shape'"
github.com/facebookresearch/audiocraft/issues/312

Problem: Getting NoneType errors when generating with continuation on certain audio formats.

What I Tried: Converting to different sample rates, normalizing audio - didn't fix it.

Actual Fix: The continuation feature requires mono audio at exactly 32kHz sample rate. The error occurs when you pass stereo or wrong sample rate:

import torchaudio

# Load and convert to proper format
wav, sr = torchaudio.load('input.wav')

# Resample to 32kHz if needed
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)
    sr = 32000

# Convert to mono if stereo
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)

# Now continuation will work (pass the sample rate explicitly)
output = model.generate_continuation(wav, sr, descriptions=[prompt])

Issue #456: Generation Quality Drops After Several Runs
github.com/facebookresearch/audiocraft/issues/456

Problem: First few generations sound great, but after generating 10+ tracks in a loop, the quality degrades - lots of artifacts and static.

What I Tried: Restarting the kernel fixed it temporarily, but that's not practical for batch generation.

Actual Fix: The model's KV cache accumulates across generations and isn't being cleared. Manually reset it between generations:

import torch

for i, prompt in enumerate(prompts):
    output = model.generate([prompt])

    # Clear CUDA cache between generations
    torch.cuda.empty_cache()

    # Reset the LM's internal streaming cache (the transformer lives on
    # model.lm; the hasattr guard keeps this safe across versions)
    if hasattr(model.lm, 'reset_streaming'):
        model.lm.reset_streaming()

    audio_write(f'output_{i}', output[0].cpu(), model.sample_rate)

Issue #521: Slow Generation on CPU - "Takes 2 Hours for 30 Seconds"
github.com/facebookresearch/audiocraft/issues/521

Problem: Running on CPU (Mac M1/Intel) is impossibly slow - hours for short clips.

What I Tried: Reducing model size to small, decreasing duration - still too slow.

Actual Fix: CPU generation isn't practical. Use a cloud GPU or optimize with smaller models and shorter duration:

# Use smallest model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Reduce top_k (faster but less diverse)
model.set_generation_params(
    top_k=50,  # Default is 250
    duration=10  # Shorter duration
)

# Or use greedy decoding (fastest, least diverse). Note that each call
# to set_generation_params replaces the previous settings entirely.
model.set_generation_params(
    use_sampling=False,
    duration=10
)

For production use, consider RunPod or Lambda Labs GPU instances - about $0.50/hour for an RTX 4000 Ada.
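To sanity-check that hourly rate against a real batch, the arithmetic is simple. The 1.5x realtime factor below is an assumed generation speed, not a measured number; substitute your own:

```python
def batch_cost_usd(n_clips, clip_s=30, realtime_factor=1.5, usd_per_hour=0.50):
    """Estimate cloud-GPU cost: generation time ~ clip length x realtime factor."""
    gpu_hours = n_clips * clip_s * realtime_factor / 3600
    return round(gpu_hours * usd_per_hour, 2)
```

Under these assumptions, even a 200-clip batch of 30-second tracks costs on the order of a dollar.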

Production Tips

Batch Generation

Generate multiple variations at once by passing a list of prompts. This is more efficient than looping:

prompts = [
    "jazz piano with walking bass",
    "jazz piano with walking bass, upbeat tempo",
    "jazz piano with walking bass, minor key",
    "jazz piano with walking bass, swing rhythm"
]

# Generates all in one batch - much faster
outputs = model.generate(prompts)

for i, output in enumerate(outputs):
    audio_write(f'variation_{i}', output.cpu(), model.sample_rate)
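To build that kind of batch programmatically, a small helper that expands one base prompt with a list of modifiers keeps things tidy. prompt_grid is a hypothetical helper, not part of Audiocraft:

```python
def prompt_grid(base, modifiers):
    """Return the base prompt plus one variation per modifier,
    suitable for passing straight to model.generate()."""
    return [base] + [f"{base}, {m}" for m in modifiers]
```

For example, prompt_grid("jazz piano with walking bass", ["upbeat tempo", "minor key"]) reproduces the style of list above.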

Model Selection Guide

| Model | VRAM | Quality | Speed |
|---|---|---|---|
| musicgen-small | 6GB | Good | Fast |
| musicgen-medium | 12GB | Better | Medium |
| musicgen-large | 16GB | Best | Slow |
| musicgen-melody | 16GB | Best | Slow |
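Based on the table above, a small lookup can pick the largest checkpoint that fits your card. The helper and its VRAM thresholds simply mirror the table; this is not an Audiocraft API:

```python
MODEL_VRAM_GB = {
    'facebook/musicgen-small': 6,
    'facebook/musicgen-medium': 12,
    'facebook/musicgen-large': 16,
}

def pick_model(available_vram_gb):
    """Return the largest MusicGen checkpoint that fits in the given VRAM."""
    fitting = [(v, name) for name, v in MODEL_VRAM_GB.items()
               if v <= available_vram_gb]
    if not fitting:
        raise ValueError("Not enough VRAM for any model; consider a cloud GPU")
    return max(fitting)[1]
```

So a 12GB card gets musicgen-medium, while the 24GB RTX 4090 from the installation section gets musicgen-large.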

Licensing Considerations

Audiocraft's code is MIT-licensed, but the model weights are released separately under CC-BY-NC 4.0, which rules out commercial use. This applies to both the MusicGen and AudioGen checkpoints - check each model card before using outputs in commercial projects.

AudioGen: Sound Effects

AudioGen works similarly but generates sound effects instead of music:

from audiocraft.models import AudioGen

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

prompt = "footsteps on gravel with birds chirping in background"
wav = model.generate([prompt])

audio_write('sfx', wav[0].cpu(), model.sample_rate)
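When generating many effects in a loop, deriving the output stem from the prompt keeps files organized (recall that audio_write() takes a stem without extension). sfx_filename is a hypothetical helper, not part of Audiocraft:

```python
import re

def sfx_filename(prompt, max_words=5):
    """Turn a text prompt into a filesystem-safe stem for audio_write()."""
    words = re.sub(r'[^a-z0-9 ]', '', prompt.lower()).split()[:max_words]
    return '_'.join(words)
```

For the prompt above this produces the stem footsteps_on_gravel_with_birds, which audio_write() turns into footsteps_on_gravel_with_birds.wav.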

Recommended Reading

- Bark: AI Voice Generation - text-to-speech with emotional expression
- GPT-SoVITS: Voice Cloning - clone voices from short samples
- Whisper: Speech Recognition - transcribe audio with high accuracy
- Embedchain: RAG Tutorial - build AI apps with your data