BabyAGI: Built an AI Agent That Actually Does Stuff

Tried AutoGPT? Too complicated. BabyAGI actually works. Creates tasks, prioritizes them, executes them. Here's how I got it running.

How I ended up here

Kept seeing tweets about "autonomous AI agents" that do work for you. AutoGPT was the big one - sounded amazing. Give it a goal, it figures out what to do, does it, keeps going until it's done.

Tried AutoGPT. What a nightmare. Kept getting stuck in loops, burning through API credits, sometimes doing nothing for hours. Gave up after $50 in OpenAI bills.

Then someone pointed me to BabyAGI. Simpler approach - creates tasks, prioritizes them, executes them one by one. Actually manageable.

What BabyAGI actually does

You give it an objective. It breaks that down into tasks, figures out which to do first, executes that task, stores the result, then uses that result to plan the next one. Keeps going until the task list is empty or you stop it.

Not magic - it loops through: create tasks → prioritize → execute → repeat. But it actually works.

So what is BabyAGI

BabyAGI is a Python script by Yohei Nakajima. It's simpler than AutoGPT but does the same core thing - autonomous task execution using LLMs.

The loop:

1. You give it an objective
   ↓
2. Creates list of tasks to achieve objective
   ↓
3. Prioritizes tasks (what's most important)
   ↓
4. Executes first task
   ↓
5. Stores result in memory
   ↓
6. Goes back to step 2 with new context
   ↓
Repeat until objective is complete
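
The loop above can be sketched in a few lines of Python. This is a simplified illustration, not the actual babyagi.py code: the function names (run_agent, fake_create, fake_prioritize, fake_execute) are made up, the "memory" is a plain list instead of a vector database, and the LLM calls are replaced with stubs so it runs without an API key.

```python
from collections import deque
from typing import Callable, Deque, List


def run_agent(
    objective: str,
    first_task: str,
    create_tasks: Callable[[str, str, str], List[str]],
    prioritize: Callable[[str, List[str]], List[str]],
    execute: Callable[[str, str, List[str]], str],
    max_loops: int = 5,
) -> List[str]:
    """Run the create -> prioritize -> execute loop and return all results."""
    tasks: Deque[str] = deque([first_task])
    memory: List[str] = []  # stand-in for the vector store

    for _ in range(max_loops):
        if not tasks:
            break  # nothing left to do
        task = tasks.popleft()                     # take the top task
        result = execute(objective, task, memory)  # run it
        memory.append(result)                      # store the result
        new_tasks = create_tasks(objective, task, result)
        # Merge new tasks into the pending list, skipping exact duplicates
        pending = list(tasks) + [t for t in new_tasks if t not in tasks]
        tasks = deque(prioritize(objective, pending))
    return memory


# Stub callables (hypothetical) so the sketch runs offline.
def fake_create(objective: str, task: str, result: str) -> List[str]:
    return ["write outline"] if task == "research" else []


def fake_prioritize(objective: str, tasks: List[str]) -> List[str]:
    return sorted(tasks)


def fake_execute(objective: str, task: str, memory: List[str]) -> str:
    return f"done: {task}"


results = run_agent("write a post", "research",
                    fake_create, fake_prioritize, fake_execute)
print(results)
```

In a real run, each of the three callables would be a prompt sent to the model, which is why every pass through the loop costs money.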

What makes it different from AutoGPT:

  • Simpler architecture: Less stuff that can break
  • Sequential execution: Does one thing at a time, not parallel chaos
  • Better memory: Uses vector database to remember what it learned
  • More predictable: You can actually follow what it's doing
  • Python script: Easy to modify and extend

It's still experimental. Sometimes goes off track, sometimes wastes API calls on useless tasks. But when it works, it's pretty cool.

Setting it up

Prerequisites

You'll need:

  • Python 3.9 or higher
  • OpenAI API key (GPT-4 recommended, GPT-3.5 works but worse)
  • Pinecone API key (for vector memory - free tier works)
  • Some patience and API budget

Clone and install

git clone https://github.com/yoheinakajima/babyagi.git
cd babyagi
pip install -r requirements.txt

Configure environment

Create a .env file:

OPENAI_API_KEY=sk-your-openai-key
OPENAI_API_MODEL=gpt-4  # or gpt-3.5-turbo to save money
TABLE_NAME=task-store
PINECONE_API_KEY=your-pinecone-key
PINECONE_ENVIRONMENT=us-east1-gcp  # use the environment shown in your Pinecone console
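
A missing key usually shows up as a confusing error halfway into the first loop, so it can help to check the settings up front. A minimal sketch, assuming the values from .env are already loaded into the process environment; REQUIRED_KEYS and missing_settings are hypothetical names, not part of BabyAGI.

```python
import os
from typing import List, Mapping

# Settings from the .env file above that the script cannot run without
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "TABLE_NAME",
]


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or blank in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
```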

Set up Pinecone

BabyAGI needs a vector database for memory. Pinecone is easiest:

# Go to pinecone.io, sign up (free tier)
# Create an index called "task-store"
# Copy your API key and environment into .env

On model choice: GPT-4 is way better but costs roughly 15-30x more per token than GPT-3.5. Start with GPT-3.5 to test, upgrade to GPT-4 for real work.

Running BabyAGI

First run

Edit the babyagi.py file to set your objective:

# At the top of babyagi.py
OBJECTIVE = "Solve world hunger"  # Placeholder; replace with something small and concrete
YOUR_FIRST_TASK = "Develop a task list"

Then run it:

python babyagi.py

Watch it go. It'll start generating tasks, prioritizing them, executing them. Each loop costs API money so keep an eye on it.

What I used it for first

Started simple - "Write a blog post about productivity":

OBJECTIVE = "Write a comprehensive blog post about productivity tips for remote workers"
YOUR_FIRST_TASK = "Research current productivity trends"

What it actually did:

Loop 1:
- Created tasks: research, outline, write, edit, publish
- Prioritized research as first task
- Executed: searched for productivity articles, summarized findings

Loop 2:
- Created new tasks based on research
- Prioritized creating outline
- Executed: generated blog post outline

Loop 3:
- Prioritized writing first section
- Executed: wrote introduction

...and so on

Took about 30 minutes and ~$15 in API costs. Result was actually decent - not amazing, but usable.

Stuff I've actually used it for

Market research

Told it "Research competitors for my SaaS idea". It spent 4 hours searching, analyzing, compiling. Got a 20-page report with competitive analysis, feature gaps, pricing strategies. Was actually useful.

Content creation

"Create a week's worth of Twitter content about AI". Generated threads, researched hashtags, scheduled posts. Saved me maybe 5 hours of work.

Code research

"Find the best Python libraries for web scraping". It searched, compared, tested examples, gave me a ranked list with pros/cons. Faster than doing it myself.

Learning projects

"Create a study plan for learning machine learning". Built a curriculum, found resources, created exercises. Pretty solid roadmap.

Making it actually useful

Limit iterations

Otherwise it'll run forever (and cost infinite money):

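# Added near the top of babyagi.py (my own setting, not built in)
MAX_LOOPS = 25  # Stop after 25 iterations
# The main loop has to count iterations and stop at this limit;
# by default it runs unlimited, which is dangerous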
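
The constant alone does nothing until the main loop checks it. A minimal sketch of that check, assuming the body of the loop is wrapped in a function that returns False when there are no tasks left; run_with_limit and step are hypothetical names.

```python
from typing import Callable

MAX_LOOPS = 25  # same limit as above


def run_with_limit(step: Callable[[], bool], max_loops: int = MAX_LOOPS) -> int:
    """Call step() until it returns False or the loop limit is reached.

    Returns the number of iterations that actually ran.
    """
    loops = 0
    while loops < max_loops:
        loops += 1
        if not step():  # step() reports False when the task list is empty
            break
    return loops


# Demo with a stand-in step that never finishes on its own
print(run_with_limit(lambda: True))  # prints 25
```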

Custom tools

BabyAGI can use external tools. I added web search:

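# Custom execution function. search_api stands in for whatever search
# client you use (serpapi or similar); it is not part of BabyAGI.
def web_search(query, search_api):
    results = search_api.search(query)
    return str(results)

# BabyAGI won't call this on its own: call it from the execution step
# when a task needs a search, and pass the results into the prompt.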
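
One way to hook a tool into the execution step is a small dispatcher that routes tasks by prefix. This is a sketch under assumptions, not how BabyAGI itself handles tools: the "tool: argument" task format, the TOOLS table, and the fake_search and fake_llm stand-ins are all invented for illustration.

```python
from typing import Callable, Dict


def fake_search(query: str) -> str:
    """Placeholder for a real search client."""
    return f"[search results for: {query}]"


def fake_llm(task: str) -> str:
    """Placeholder for the normal LLM execution call."""
    return f"[LLM answer for: {task}]"


# Map a task prefix to the function that handles it
TOOLS: Dict[str, Callable[[str], str]] = {"search": fake_search}


def execute_task(task: str) -> str:
    """Route 'tool: argument' tasks to a tool, everything else to the LLM."""
    prefix, sep, rest = task.partition(":")
    tool = TOOLS.get(prefix.strip().lower())
    if sep and tool is not None:
        return tool(rest.strip())
    return fake_llm(task)


print(execute_task("search: python scraping libraries"))
print(execute_task("Write an outline"))
```

For this to work, the task creation prompt also has to tell the model that the "search:" prefix exists, otherwise it will never produce tasks in that format.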

Better prompting

The objective matters. Vague = bad results:

# Too vague
OBJECTIVE = "Make me money"

# Better
OBJECTIVE = "Research affiliate marketing programs for tech blogs and create a strategy for my programming blog"

# Even better with constraints
OBJECTIVE = """Research affiliate marketing programs for tech blogs.
Focus on programming tools and developer products.
Budget: $0 (no paid tools).
Deliverable: 3-page strategy with specific programs and commission rates."""

Cost monitoring

Add spending alerts:

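# Track API costs (rough estimate; check current OpenAI pricing)
import tiktoken

def estimate_cost(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = len(enc.encode(text))
    if model == "gpt-4":
        # $0.03 per 1k prompt tokens; completions are billed higher,
        # so this underestimates a bit
        return tokens * 0.00003
    return tokens * 0.000002     # GPT-3.5 pricing, $0.002 per 1k tokens

# Before the main loop
MAX_BUDGET = 10.0  # dollars
total_cost = 0.0

# Inside the main loop, after each API call
# (prompt and response are the text sent and received)
total_cost += estimate_cost(prompt + response)
if total_cost > MAX_BUDGET:
    print("Budget exceeded!")
    break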
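
Prompt and completion tokens are billed at different rates, so a tracker that keeps them separate gets closer to the real bill. A minimal sketch: the CostTracker class is hypothetical, the prices in PRICES are assumptions that will go stale, and the token counts are assumed to come from the usage numbers the API reports for each call.

```python
from dataclasses import dataclass

# Dollars per 1,000 tokens as (prompt, completion). Assumed values.
PRICES = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5-turbo": (0.002, 0.002),
}


@dataclass
class CostTracker:
    model: str
    budget: float       # dollars you are willing to spend on this run
    spent: float = 0.0  # running total in dollars

    def add(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Record one API call and return the running total."""
        prompt_rate, completion_rate = PRICES[self.model]
        self.spent += prompt_tokens / 1000 * prompt_rate
        self.spent += completion_tokens / 1000 * completion_rate
        return self.spent

    def over_budget(self) -> bool:
        return self.spent > self.budget


tracker = CostTracker(model="gpt-4", budget=0.10)
tracker.add(prompt_tokens=1000, completion_tokens=1000)
print(f"${tracker.spent:.2f} spent, over budget: {tracker.over_budget()}")
```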

Things that went wrong

Infinite loops

Kept doing the same task over and over.

# Fix: Added loop counter and MAX_LOOPS limit
# Also made task completion more strict
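
Another guard that pairs well with a loop limit is dropping tasks the agent has already seen. This is a related technique rather than the exact fix described above, sketched with hypothetical helper names (normalize, drop_repeats). It only catches near-identical wording, not tasks that are rephrased.

```python
import re
from typing import Iterable, List, Set


def normalize(task: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical tasks match."""
    return re.sub(r"[^a-z0-9]+", " ", task.lower()).strip()


def drop_repeats(new_tasks: Iterable[str], seen: Set[str]) -> List[str]:
    """Return only tasks not seen before, and record them in seen."""
    fresh = []
    for task in new_tasks:
        key = normalize(task)
        if key and key not in seen:
            seen.add(key)
            fresh.append(task)
    return fresh


seen = {normalize("Research trends")}
print(drop_repeats(["research trends.", "Write outline", "write  outline!"], seen))
```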

Hallucinations

Made up facts, claimed to do things it didn't.

# Fix: Added verification step
# Task must show proof of completion
# Switched to GPT-4 (less hallucination)

API costs exploded

One run cost $80 before I noticed.

# Fix: Strict MAX_LOOPS, cost tracking
# Started with GPT-3.5 for testing
# Only use GPT-4 for final runs

Lost context

Forgot what it was doing, went off track.

# Fix: Improved memory storage in Pinecone
# Added objective reminder in each loop
# Better task prioritization prompt
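
A minimal sketch of what an objective reminder could look like: the objective goes at the top of every execution prompt, followed by only the last few results so the context stays small. The function name and the prompt wording are made up for illustration.

```python
from typing import List


def build_execution_prompt(
    objective: str,
    task: str,
    recent_results: List[str],
    max_context: int = 3,
) -> str:
    """Put the objective first so every call is anchored to it."""
    # Keep only the most recent results to limit prompt size
    context = recent_results[-max_context:] if max_context > 0 else []
    lines = [f"Overall objective: {objective}", "Recent results:"]
    lines += [f"- {r}" for r in context] or ["- (none yet)"]
    lines.append(f"Current task: {task}")
    lines.append("Only do work that moves the objective forward.")
    return "\n".join(lines)


print(build_execution_prompt("Write a post", "Draft intro", ["a", "b", "c", "d"]))
```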

Tasks too vague

"Research" task returned nothing useful.

# Fix: Better initial objective
# More specific task descriptions
# Added constraints and deliverables

BabyAGI vs AutoGPT vs doing it yourself

              BabyAGI             AutoGPT            Manual
Complexity    Moderate            High               Low
Reliability   ★★★☆☆               ★★☆☆☆              ★★★★★
Setup time    30 min              1-2 hours          0 min
API costs     $10-50/run          $20-100/run        $0
Predictable   Mostly              Not really         Yes
Can edit      Yes (Python)        Yes (Python)       N/A
Best for      Research & content  Complex workflows  Everything

BabyAGI is more predictable than AutoGPT but still experimental. I use it for research and ideation, not for anything critical.

Is it actually useful?

Sort of? It's definitely not the "AI does everything for you" that the hype suggests. But for certain tasks - research, content drafts, brainstorming - it can save real time.

The key is having realistic expectations. It's not going to build your startup or solve complex problems autonomously. But it can do the boring research phase, generate first drafts, compile information.

I treat it like a research assistant that needs very specific instructions. Give it a clear objective, constraints, and check on it periodically. It'll surprise you sometimes.

Main thing is watching the costs. GPT-4 adds up fast. Start with GPT-3.5, limit the loops, monitor spending.

Links: github.com/yoheinakajima/babyagi | Original thread: Twitter thread