Getting Started with Meta's Audiocraft
Audiocraft is Meta's open-source toolkit for generating audio and music from text prompts. It includes MusicGen (for music) and AudioGen (for sound effects). Let me walk through getting it running.
Quick Overview
- MusicGen: generates music from text descriptions
- AudioGen: generates sound effects and environmental audio
- Requirements: GPU with 16GB+ VRAM recommended
- License: MIT for the code, CC-BY-NC 4.0 for the model weights
Installation
First, make sure you have a GPU. The models run on CUDA and need significant VRAM. I'm using an RTX 4090 with 24GB, which handles the large model comfortably. If you have less, you'll need to use smaller models or CPU mode (which is painfully slow).
```shell
# Create a clean environment
conda create -n audiocraft python=3.10
conda activate audiocraft

# Install PyTorch with CUDA support (adjust the CUDA version to match your driver)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Clone and install Audiocraft
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .
```
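Before downloading multi-gigabyte checkpoints, it's worth confirming that PyTorch actually sees your GPU. A small sanity-check helper (hypothetical name `cuda_status`) that degrades gracefully when torch or CUDA is missing:

```python
def cuda_status():
    """Report whether generation will run on GPU, CPU, or not at all."""
    try:
        import torch
    except ImportError:
        return "torch is not installed; rerun the pip install step"
    if torch.cuda.is_available():
        return f"CUDA OK: {torch.cuda.get_device_name(0)}"
    return "CUDA not available; generation will fall back to (very slow) CPU"

print(cuda_status())
```

If this prints the CPU fallback message on a machine with an NVIDIA GPU, the usual culprit is a CPU-only PyTorch wheel; reinstall with the `--index-url` flag shown above.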
Basic Usage: MusicGen
Let's start with the Python API. This gives you more control than the CLI.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the model (options: small, medium, large, melody)
model = MusicGen.get_pretrained('facebook/musicgen-large')

# Set generation parameters
model.set_generation_params(
    use_sampling=True,
    top_k=250,
    duration=30  # seconds
)

# Generate from text prompt
prompt = "A lo-fi hip hop beat with jazz piano and soft drums"
wav = model.generate([prompt])

# Save the output (audio_write appends the .wav extension itself,
# so pass a stem rather than 'output.wav')
audio_write('output', wav[0].cpu(), model.sample_rate)
```
The facebook/musicgen-large model is about 15GB and needs 16GB+ VRAM. If you're running out of memory, try musicgen-medium (7GB) or musicgen-small (3GB).
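If you'd rather not discover the right size by trial and OOM, you can encode the rough VRAM thresholds quoted above in a tiny picker (a hypothetical helper, `pick_musicgen_model`; the cutoffs are approximations, not official requirements):

```python
def pick_musicgen_model(free_vram_gb):
    """Return the largest MusicGen checkpoint likely to fit in the
    given free VRAM, using the approximate sizes quoted above."""
    if free_vram_gb >= 16:
        return "facebook/musicgen-large"
    if free_vram_gb >= 12:
        return "facebook/musicgen-medium"
    return "facebook/musicgen-small"

# e.g. on a 24GB RTX 4090: pick_musicgen_model(24)
```

On an actual machine you could feed it `torch.cuda.mem_get_info()[0] / 1e9` to pick a model at runtime.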
Continuation: Extending Audio
One of MusicGen's coolest features is continuation - you can feed it existing audio and it will extend it in the same style. Great for looping beats or extending intros.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(duration=30)

# Load your existing audio
wav, sr = torchaudio.load('existing_track.wav')

# Generate a continuation; the text description is optional guidance
prompt = "continue the melody with a variation"
output = model.generate_continuation(
    wav[0:1],               # first channel only (mono)
    prompt_sample_rate=sr,  # so the model can resample internally
    descriptions=[prompt]
)

audio_write('extended', output[0].cpu(), model.sample_rate)
```
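The prompt audio is included in the generated output, so a long input leaves little room for new material within the `duration` you set. A common trick is to feed only the tail of the track; a minimal sketch (the helper name `tail_slice` is mine) computes where to slice:

```python
def tail_slice(num_samples, sample_rate, seconds):
    """Start index that keeps only the last `seconds` of audio.
    Returns 0 if the clip is already shorter than the requested tail."""
    return max(0, num_samples - int(seconds * sample_rate))

# Keep only the last 5 seconds as the continuation prompt:
# start = tail_slice(wav.shape[1], sr, 5.0)
# wav = wav[:, start:]
```

Five to ten seconds of prompt is usually enough for the model to pick up tempo and timbre.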
Melody Mode: Guiding Generation
If you have a melody in mind, you can guide MusicGen with a reference audio. This is the "melody" model variant.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torchaudio

# Use the melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Your reference melody (can be a simple humming recording)
melody, sr = torchaudio.load('my_melody.wav')

# Generate music following that melody
prompt = "upbeat electronic dance track with the given melody"
output = model.generate_with_chroma(
    descriptions=[prompt],
    melody_wavs=melody[None],  # add a batch dimension
    melody_sample_rate=sr
)

audio_write('guided_music', output[0].cpu(), model.sample_rate)
```
Common Problems & Solutions
Problem: "RuntimeError: CUDA out of memory. Tried to allocate 2.5GB" when loading the large model on a 16GB GPU.
What I Tried: Reducing batch size to 1, clearing cache with torch.cuda.empty_cache() - neither helped.
Actual Fix: The model loads in full precision by default. Casting the transformer language model to half precision (FP16) roughly halves its memory footprint:
```python
import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-large')

# Cast the LM to half precision before generating
# (the small compression model can stay in FP32)
model.lm = model.lm.half()

# Now generate (prompt as before)
output = model.generate([prompt])
```
Problem: Getting NoneType errors when generating with continuation on certain audio formats.
What I Tried: Converting to different sample rates, normalizing audio - didn't fix it.
Actual Fix: The continuation feature requires mono audio at exactly 32kHz sample rate. The error occurs when you pass stereo or wrong sample rate:
```python
import torchaudio

# Load and convert to the expected format
wav, sr = torchaudio.load('input.wav')

# Resample to 32kHz if needed
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)
    sr = 32000

# Convert to mono if stereo
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)

# Now continuation will work
output = model.generate_continuation(wav, prompt_sample_rate=sr, descriptions=[prompt])
```
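When you're processing files in bulk, it helps to report every format problem up front rather than failing on the first one. A small validator (hypothetical helper, `continuation_input_problems`; 32 kHz is MusicGen's operating rate as noted above):

```python
def continuation_input_problems(n_channels, sample_rate, target_sr=32000):
    """Return a list of human-readable fixes needed before continuation.
    An empty list means the audio is already in the expected format."""
    problems = []
    if n_channels != 1:
        problems.append(f"audio has {n_channels} channels; mix down to mono")
    if sample_rate != target_sr:
        problems.append(f"sample rate is {sample_rate} Hz; resample to {target_sr} Hz")
    return problems

# e.g. a stereo 44.1kHz file reports two problems; a mono 32kHz file reports none
```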
Problem: First few generations sound great, but after generating 10+ tracks in a loop, the quality degrades - lots of artifacts and static.
What I Tried: Restarting the kernel fixed it temporarily, but that's not practical for batch generation.
Actual Fix: The model's KV cache accumulates across generations and isn't being cleared. Manually reset it between generations:
```python
import torch

for i, prompt in enumerate(prompts):
    output = model.generate([prompt])

    # Clear the CUDA allocator cache between generations
    torch.cuda.empty_cache()

    # Reset the model's internal cache if your version exposes it
    if hasattr(model.lm, 'reset_cache'):
        model.lm.reset_cache()

    audio_write(f'output_{i}', output[0].cpu(), model.sample_rate)
```
Problem: Running on CPU (Mac M1/Intel) is impossibly slow - hours for short clips.
What I Tried: Reducing model size to small, decreasing duration - still too slow.
Actual Fix: CPU generation isn't practical. Use a cloud GPU or optimize with smaller models and shorter duration:
```python
# Use the smallest model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Option 1: reduce top_k (faster but less diverse)
model.set_generation_params(
    top_k=50,    # default is 250
    duration=10  # shorter duration
)

# Option 2: greedy decoding (fastest, least diverse);
# note this overrides the params set above
model.set_generation_params(
    use_sampling=False,
    duration=10
)
```
For production use, consider RunPod or Lambda Labs GPU instances - about $0.50/hour for an RTX 4000 Ada.
Production Tips
Batch Generation
Generate multiple variations at once by passing a list of prompts. This is more efficient than looping:
```python
prompts = [
    "jazz piano with walking bass",
    "jazz piano with walking bass, upbeat tempo",
    "jazz piano with walking bass, minor key",
    "jazz piano with walking bass, swing rhythm"
]

# Generates all in one batch - much faster than looping
outputs = model.generate(prompts)

for i, output in enumerate(outputs):
    audio_write(f'variation_{i}', output.cpu(), model.sample_rate)
```
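Batching trades VRAM for speed: too many prompts in one call will OOM just like a model that's too large. A simple chunking helper (hypothetical name `batched`) keeps each call within a fixed batch size:

```python
def batched(items, batch_size):
    """Split a list of prompts into fixed-size batches; the last
    batch may be smaller. Each batch becomes one model.generate call."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# for batch in batched(prompts, 4):
#     outputs = model.generate(batch)
```

On a 24GB card a batch size of 4-8 with the large model is a reasonable starting point; tune it down if you hit OOM.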
Model Selection Guide
| Model | VRAM | Quality | Speed |
|---|---|---|---|
| musicgen-small | 6GB | Good | Fast |
| musicgen-medium | 12GB | Better | Medium |
| musicgen-large | 16GB | Best | Slow |
| musicgen-melody | 16GB | Best | Slow |
Cost Estimate
For batch generation on cloud GPUs:
- RunPod RTX 4000 Ada: ~$0.44/hr → ~100 tracks per hour
- Lambda Labs A100: ~$1.49/hr → ~200 tracks per hour
- Local RTX 4090: electricity cost ~$0.10/hr
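Dividing the hourly rate by throughput gives a rough per-track cost, which is the number that matters for batch jobs (throughput figures above are estimates, so treat the results as ballpark):

```python
def cost_per_track(hourly_rate_usd, tracks_per_hour):
    """Approximate cost of one generated track on a cloud GPU."""
    return hourly_rate_usd / tracks_per_hour

# Using the estimates above:
# RunPod RTX 4000 Ada: 0.44 / 100  -> about $0.0044 per track
# Lambda Labs A100:    1.49 / 200  -> about $0.0075 per track
```

At well under a cent per track, the cloud GPUs are interchangeable on cost; pick based on queue time and VRAM instead.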
Licensing Considerations
The Audiocraft code is MIT-licensed, but the pretrained MusicGen and AudioGen model weights are released under CC-BY-NC 4.0, which prohibits commercial use. Check the model card before using generated audio in commercial projects.
AudioGen: Sound Effects
AudioGen works similarly but generates sound effects instead of music:
```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

prompt = "footsteps on gravel with birds chirping in background"
wav = model.generate([prompt])
audio_write('sfx', wav[0].cpu(), model.sample_rate)
```