Getting Started with Meta's Audiocraft
Audiocraft is Meta's open-source toolkit for generating audio and music from text prompts. It includes MusicGen (for music) and AudioGen (for sound effects). Let me walk through getting it running.
Quick Overview
- MusicGen: generates music from text descriptions
- AudioGen: generates sound effects and environmental audio
- Requirements: GPU with 16GB+ VRAM recommended
- License: MIT for the code, CC-BY-NC 4.0 for the model weights
Installation
First, make sure you have a GPU. The models run on CUDA and need significant VRAM. I'm using an RTX 4090 with 24GB, which handles the large model comfortably. If you have less, you'll need to use smaller models or CPU mode (which is painfully slow).
```shell
# Create a clean environment
conda create -n audiocraft python=3.10
conda activate audiocraft

# Install PyTorch with CUDA support (adjust the CUDA version to match your driver)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Clone and install Audiocraft
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .
```
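Before downloading multi-gigabyte checkpoints, it's worth confirming that PyTorch actually sees your GPU. A small sanity-check helper (hypothetical name `cuda_status`) that degrades gracefully when torch or CUDA is missing:

```python
def cuda_status():
    """Report whether generation will run on GPU, CPU, or not at all."""
    try:
        import torch
    except ImportError:
        return "torch is not installed; rerun the pip install step"
    if torch.cuda.is_available():
        return f"CUDA OK: {torch.cuda.get_device_name(0)}"
    return "CUDA not available; generation will fall back to (very slow) CPU"

print(cuda_status())
```

If this prints the CPU fallback message on a machine with an NVIDIA GPU, the usual culprit is a CPU-only PyTorch wheel; reinstall with the `--index-url` flag shown above.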
Basic Usage: MusicGen
Let's start with the Python API. This gives you more control than the CLI.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the model (options: small, medium, large, melody)
model = MusicGen.get_pretrained('facebook/musicgen-large')

# Set generation parameters
model.set_generation_params(
    use_sampling=True,
    top_k=250,
    duration=30  # seconds
)

# Generate from text prompt
prompt = "A lo-fi hip hop beat with jazz piano and soft drums"
wav = model.generate([prompt])

# Save the output (audio_write appends the .wav extension itself,
# so pass a stem rather than 'output.wav')
audio_write('output', wav[0].cpu(), model.sample_rate)
```
The facebook/musicgen-large model is about 15GB and needs 16GB+ VRAM. If you're running out of memory, try musicgen-medium (7GB) or musicgen-small (3GB).
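If you'd rather not discover the right size by trial and OOM, you can encode the rough VRAM thresholds quoted above in a tiny picker (a hypothetical helper, `pick_musicgen_model`; the cutoffs are approximations, not official requirements):

```python
def pick_musicgen_model(free_vram_gb):
    """Return the largest MusicGen checkpoint likely to fit in the
    given free VRAM, using the approximate sizes quoted above."""
    if free_vram_gb >= 16:
        return "facebook/musicgen-large"
    if free_vram_gb >= 12:
        return "facebook/musicgen-medium"
    return "facebook/musicgen-small"

# e.g. on a 24GB RTX 4090: pick_musicgen_model(24)
```

On an actual machine you could feed it `torch.cuda.mem_get_info()[0] / 1e9` to pick a model at runtime.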
Continuation: Extending Audio
One of MusicGen's coolest features is continuation - you can feed it existing audio and it will extend it in the same style. Great for looping beats or extending intros.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(duration=30)

# Load your existing audio
wav, sr = torchaudio.load('existing_track.wav')

# Generate a continuation; the text description is optional guidance
prompt = "continue the melody with a variation"
output = model.generate_continuation(
    wav[0:1],               # first channel only (mono)
    prompt_sample_rate=sr,  # so the model can resample internally
    descriptions=[prompt]
)

audio_write('extended', output[0].cpu(), model.sample_rate)
```
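The prompt audio is included in the generated output, so a long input leaves little room for new material within the `duration` you set. A common trick is to feed only the tail of the track; a minimal sketch (the helper name `tail_slice` is mine) computes where to slice:

```python
def tail_slice(num_samples, sample_rate, seconds):
    """Start index that keeps only the last `seconds` of audio.
    Returns 0 if the clip is already shorter than the requested tail."""
    return max(0, num_samples - int(seconds * sample_rate))

# Keep only the last 5 seconds as the continuation prompt:
# start = tail_slice(wav.shape[1], sr, 5.0)
# wav = wav[:, start:]
```

Five to ten seconds of prompt is usually enough for the model to pick up tempo and timbre.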
Melody Mode: Guiding Generation
If you have a melody in mind, you can guide MusicGen with a reference audio. This is the "melody" model variant.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torchaudio

# Use the melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Your reference melody (can be a simple humming recording)
melody, sr = torchaudio.load('my_melody.wav')

# Generate music following that melody
prompt = "upbeat electronic dance track with the given melody"
output = model.generate_with_chroma(
    descriptions=[prompt],
    melody_wavs=melody[None],  # add a batch dimension
    melody_sample_rate=sr
)

audio_write('guided_music', output[0].cpu(), model.sample_rate)
```
Common Problems & Solutions
Problem: "RuntimeError: CUDA out of memory. Tried to allocate 2.5GB" when loading the large model on a 16GB GPU.
What I Tried: Reducing batch size to 1, clearing cache with torch.cuda.empty_cache() - neither helped.
Actual Fix: The model loads in full precision by default. Casting the transformer language model to half precision (FP16) roughly halves its memory footprint:
```python
import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-large')

# Cast the LM to half precision before generating
# (the small compression model can stay in FP32)
model.lm = model.lm.half()

# Now generate (prompt as before)
output = model.generate([prompt])
```
Problem: Getting NoneType errors when generating with continuation on certain audio formats.
What I Tried: Converting to different sample rates, normalizing audio - didn't fix it.
Actual Fix: The continuation feature requires mono audio at exactly 32kHz sample rate. The error occurs when you pass stereo or wrong sample rate:
```python
import torchaudio

# Load and convert to the expected format
wav, sr = torchaudio.load('input.wav')

# Resample to 32kHz if needed
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)
    sr = 32000

# Convert to mono if stereo
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)

# Now continuation will work
output = model.generate_continuation(wav, prompt_sample_rate=sr, descriptions=[prompt])
```
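When you're processing files in bulk, it helps to report every format problem up front rather than failing on the first one. A small validator (hypothetical helper, `continuation_input_problems`; 32 kHz is MusicGen's operating rate as noted above):

```python
def continuation_input_problems(n_channels, sample_rate, target_sr=32000):
    """Return a list of human-readable fixes needed before continuation.
    An empty list means the audio is already in the expected format."""
    problems = []
    if n_channels != 1:
        problems.append(f"audio has {n_channels} channels; mix down to mono")
    if sample_rate != target_sr:
        problems.append(f"sample rate is {sample_rate} Hz; resample to {target_sr} Hz")
    return problems

# e.g. a stereo 44.1kHz file reports two problems; a mono 32kHz file reports none
```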
Problem: First few generations sound great, but after generating 10+ tracks in a loop, the quality degrades - lots of artifacts and static.
What I Tried: Restarting the kernel fixed it temporarily, but that's not practical for batch generation.
Actual Fix: The model's KV cache accumulates across generations and isn't being cleared. Manually reset it between generations:
```python
import torch

for i, prompt in enumerate(prompts):
    output = model.generate([prompt])

    # Clear the CUDA allocator cache between generations
    torch.cuda.empty_cache()

    # Reset the model's internal cache if your version exposes it
    if hasattr(model.lm, 'reset_cache'):
        model.lm.reset_cache()

    audio_write(f'output_{i}', output[0].cpu(), model.sample_rate)
```
Problem: Running on CPU (Mac M1/Intel) is impossibly slow - hours for short clips.
What I Tried: Reducing model size to small, decreasing duration - still too slow.
Actual Fix: CPU generation isn't practical. Use a cloud GPU or optimize with smaller models and shorter duration:
```python
# Use the smallest model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Option 1: reduce top_k (faster but less diverse)
model.set_generation_params(
    top_k=50,    # default is 250
    duration=10  # shorter duration
)

# Option 2: greedy decoding (fastest, least diverse);
# note this overrides the params set above
model.set_generation_params(
    use_sampling=False,
    duration=10
)
```
For production use, consider RunPod or Lambda Labs GPU instances - about $0.50/hour for an RTX 4000 Ada.
Production Tips
Batch Generation
Generate multiple variations at once by passing a list of prompts. This is more efficient than looping:
```python
prompts = [
    "jazz piano with walking bass",
    "jazz piano with walking bass, upbeat tempo",
    "jazz piano with walking bass, minor key",
    "jazz piano with walking bass, swing rhythm"
]

# Generates all in one batch - much faster than looping
outputs = model.generate(prompts)

for i, output in enumerate(outputs):
    audio_write(f'variation_{i}', output.cpu(), model.sample_rate)
```
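Batching trades VRAM for speed: too many prompts in one call will OOM just like a model that's too large. A simple chunking helper (hypothetical name `batched`) keeps each call within a fixed batch size:

```python
def batched(items, batch_size):
    """Split a list of prompts into fixed-size batches; the last
    batch may be smaller. Each batch becomes one model.generate call."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# for batch in batched(prompts, 4):
#     outputs = model.generate(batch)
```

On a 24GB card a batch size of 4-8 with the large model is a reasonable starting point; tune it down if you hit OOM.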
Model Selection Guide
| Model | VRAM | Quality | Speed |
|---|---|---|---|
| musicgen-small | 6GB | Good | Fast |
| musicgen-medium | 12GB | Better | Medium |
| musicgen-large | 16GB | Best | Slow |
| musicgen-melody | 16GB | Best | Slow |
Cost Estimate
For batch generation on cloud GPUs:
- RunPod RTX 4000 Ada: ~$0.44/hr → ~100 tracks per hour
- Lambda Labs A100: ~$1.49/hr → ~200 tracks per hour
- Local RTX 4090: electricity cost ~$0.10/hr
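Dividing the hourly rate by throughput gives a rough per-track cost, which is the number that matters for batch jobs (throughput figures above are estimates, so treat the results as ballpark):

```python
def cost_per_track(hourly_rate_usd, tracks_per_hour):
    """Approximate cost of one generated track on a cloud GPU."""
    return hourly_rate_usd / tracks_per_hour

# Using the estimates above:
# RunPod RTX 4000 Ada: 0.44 / 100  -> about $0.0044 per track
# Lambda Labs A100:    1.49 / 200  -> about $0.0075 per track
```

At well under a cent per track, the cloud GPUs are interchangeable on cost; pick based on queue time and VRAM instead.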
Licensing Considerations
The Audiocraft code is MIT-licensed, but the pretrained MusicGen and AudioGen model weights are released under CC-BY-NC 4.0, which prohibits commercial use. Check the model card before using generated audio in commercial projects.
AudioGen: Sound Effects
AudioGen works similarly but generates sound effects instead of music:
```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

prompt = "footsteps on gravel with birds chirping in background"
wav = model.generate([prompt])
audio_write('sfx', wav[0].cpu(), model.sample_rate)
```