AI Agent Engineering

Chapter 2

Conversation

The code we wrote in Chapter 1 has no memory: every call to the LLM is a fresh start. If you tell the model your name and then, in a second call, ask "What did I just tell you?", it has no idea.

Some products go further and remember things about you across entirely separate sessions: your name, your job, the project you mentioned last week. That continuity is what makes them feel like a real assistant rather than a fancy autocomplete.

In this chapter we will fold our current code into a small CLI tool that holds a conversation across turns by accumulating messages in a list.

The model is stateless

In Chapter 1 we briefly mentioned that the model has no memory. Each call to messages.create is a function from a list of messages to a reply, and the model forgets everything the moment the call returns [1]. To hold a conversation, the client therefore has to re-send the whole history on every turn. That is, on its face, wasteful, and we will pay for it in tokens. But it is also what makes the API trivial to reason about, because the model has no hidden state.
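To make the "function from a list of messages to a reply" framing concrete, here is a minimal sketch that uses a stand-in for the model. fake_llm is hypothetical: it is not the Anthropic client, and it only looks for a name in the messages it is handed. The point it illustrates is that anything missing from the list is invisible to the call.

```python
def fake_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for the model: a pure function of its input."""
    # Look for a name only in the messages passed to THIS call.
    for message in messages:
        text = message["content"]
        if message["role"] == "user" and text.startswith("My name is "):
            name = text.removeprefix("My name is ").rstrip(".")
            return f"Your name is {name}."
    return "I don't know your name."


# Two independent calls: the second one knows nothing about the first.
first = fake_llm([{"role": "user", "content": "My name is Olga."}])
second = fake_llm([{"role": "user", "content": "What is my name?"}])

# One call that re-sends the history: now the earlier turn is visible.
history = [
    {"role": "user", "content": "My name is Olga."},
    {"role": "assistant", "content": first},
    {"role": "user", "content": "What is my name?"},
]
third = fake_llm(history)

print(first)
print(second)
print(third)
```

The second call fails for the same reason the real model would: the earlier turn was never part of its input. The third call succeeds only because we carried the history forward ourselves.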

The accumulation pattern

The pattern behind a multi-turn conversation looks like this:

1. Start with an empty list.
2. User says something — append a {"role": "user", "content": ...} entry.
3. Send the whole list to the model, get a reply back.
4. Append a {"role": "assistant", "content": reply} entry.
5. Go to step 2.

Some important concepts to notice here:

  • The model does not remember producing its previous responses. It sees them as part of the input, in the same way it sees user messages.

  • There is no concept of a "session ID." The conversation is the list.

  • Roles strictly alternate in our list: a user message is always followed by an assistant message. The Anthropic API is lenient here and combines consecutive same-role messages for you [2]. Other providers vary, with some rejecting that shape outright. Because our pattern alternates by construction, we will not have to worry about this until we start receiving tool calls in Chapter 7.
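If you ever need to send a list to a provider that rejects consecutive same-role messages, one option is to normalize the list yourself before the call. The sketch below is an illustration only: merge_consecutive is a hypothetical helper, it assumes every content value is a plain string (as in this chapter), and joining with a blank line is an arbitrary choice.

```python
def merge_consecutive(messages: list[dict]) -> list[dict]:
    """Return a new list in which adjacent same-role messages are joined."""
    merged: list[dict] = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Same role as the previous entry: fold the text into it.
            merged[-1]["content"] += "\n\n" + message["content"]
        else:
            # Copy so the caller's dicts are never mutated.
            merged.append(dict(message))
    return merged


raw = [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
    {"role": "assistant", "content": "Yes."},
    {"role": "user", "content": "Good."},
]
normalized = merge_consecutive(raw)
print([m["role"] for m in normalized])
```

Our chat loop never produces such a list, so this helper is not needed in this chapter's code.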

Building the CLI

Let's come back to main.py and wrap our existing call to the model into an llm function like this:

def llm(messages):
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

Two small improvements before we move on. A docstring — a one-line summary at the top of the function, following PEP 257 [3] — explains what the function does to anyone reading it, and tools like Sphinx [4] can turn those into browsable HTML docs. And type annotations, while not enforced at runtime, document expectations at the call site and let static checkers like mypy catch a class of bugs before you run the code [5]. Both together yield the following result:

def llm(messages: list[dict]) -> str:
    """Send a list of messages to the model and return the reply text."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

From here on we will assume both practices throughout the book without calling them out each time.

Now let's add a chat() function that handles the conversation loop, implementing the algorithm described above:

def chat() -> None:
    """Run an interactive chat loop, accumulating turns in a single messages list."""
    messages: list[dict] = []
    while True:
        user_input = input()
        messages.append({"role": "user", "content": user_input})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
 
        print(f"assistant: {reply}\n")
 

Then call chat() from the entry point:

if __name__ == "__main__":
    chat()

The if __name__ == "__main__": guard is a standard Python idiom: the body runs only when the file is executed directly (e.g., uv run main.py), not when it is imported by another module. This becomes important once your project grows beyond a single file. The Python docs page on __main__ is a short read [6].

Let's run our program and see how it works:

$ uv run main.py 
Hi there!
assistant: Hi there! How are you doing today? 😊 Is there something I can help you with?
 
My name is Olga, what is your name?
assistant: Nice to meet you, Olga! My name is Claude. I'm an AI assistant made by Anthropic. How can I help you today? 😊
 
What is my name?
assistant: Your name is Olga! You just told me a moment ago. 😊 Is there anything else you'd like to chat about or anything I can help you with?
 
^D
$

Our chat now remembers the conversation, but two edge cases will crash or waste tokens. The user might press Ctrl-D (or Ctrl-C) to exit, which currently raises an exception. The user might also submit an empty line by mistake, which sends a useless message to the model.

Let's also add a one-line greeting at startup so a fresh reader knows what they are looking at, and a you: prefix on the input prompt so the transcript is easy to follow.

def chat() -> None:
    """Run an interactive chat loop, accumulating turns in a single messages list."""
    messages: list[dict] = []
    print("chat — Ctrl-D or empty line to exit\n")  # <-- new: greeting
    while True:
        try:                                           # <-- new: clean exit on Ctrl-D / Ctrl-C
            user_input = input("you: ").strip()
        except (EOFError, KeyboardInterrupt):
            print()
            break
        if not user_input:                             # <-- new: an empty line exits instead of being sent
            break
        messages.append({"role": "user", "content": user_input})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
 
        print(f"assistant: {reply}\n")

Watching the list grow

Every turn appends two entries. After the three exchanges from the run above, our messages list looks like this:

[
    {"role": "user",      "content": "Hi there!"},
    {"role": "assistant", "content": "Hi there! How are you doing today? ..."},
    {"role": "user",      "content": "My name is Olga, what is your name?"},
    {"role": "assistant", "content": "Nice to meet you, Olga! My name is Claude. ..."},
    {"role": "user",      "content": "What is my name?"},
    {"role": "assistant", "content": "Your name is Olga! You just told me a moment ago. ..."},
]

If you add print(messages) at the top of the loop, you can watch this list grow as you talk. Notice that each assistant reply is appended as plain text, indistinguishable in shape from the user messages above it.
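Printing the raw list gets noisy after a few turns. As an alternative, here is a small sketch of a more compact view. describe is a hypothetical helper, not part of main.py, and it assumes each content value is a plain string.

```python
def describe(messages: list[dict], width: int = 40) -> list[str]:
    """Return one short line per message: index, role, truncated content."""
    lines = []
    for index, message in enumerate(messages):
        content = message["content"]
        if len(content) > width:
            # Keep the line short; the full text is still in the list.
            content = content[: width - 3] + "..."
        lines.append(f"{index:>3} {message['role']:<9} {content}")
    return lines


sample = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hi there! How are you doing today? " * 3},
]
for line in describe(sample):
    print(line)
```

Calling it at the top of the loop in place of print(messages) shows the same growth, two lines per turn.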

The context window is where memory ends

The model can read every message in the list, but only until the list outgrows the context window. Context windows vary from 100k to 1M tokens, depending on the model. A million tokens is roughly seven hundred and fifty thousand English words, which is several long novels' worth of text, processed again every single time you make a request.

You will hit this limit if you:

  • Paste large files into the conversation.
  • Run an agent that calls many tools and accumulates large tool results in the history.
  • Let a long-running agent talk to itself for hours or days.
  • Or do all three at once, which is what real agents look like in production.

When you exceed the context window, the API does not silently truncate older messages for you. It returns a 400 error: prompt is too long. This is a good default because silent truncation would corrupt conversations in subtle, hard-to-debug ways. But it is also something we will eventually have to handle.
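Until we build proper handling, a cheap guard is to estimate the prompt size before sending it. The sketch below is a rough illustration under stated assumptions: the four-characters-per-token ratio is a crude rule of thumb for English text and not how the API counts, estimate_tokens and fits_budget are hypothetical helpers, and the reserve of 1024 simply mirrors the max_tokens value we pass in llm.

```python
CHARS_PER_TOKEN = 4  # assumption: crude rule of thumb for English text


def estimate_tokens(messages: list[dict]) -> int:
    """Very rough token estimate; the API's own count is authoritative."""
    total_chars = sum(len(message["content"]) for message in messages)
    return total_chars // CHARS_PER_TOKEN + 1


def fits_budget(messages: list[dict], context_window: int, reserve: int = 1024) -> bool:
    """True if the estimated prompt leaves room for a reply of `reserve` tokens."""
    return estimate_tokens(messages) + reserve <= context_window


conversation = [{"role": "user", "content": "x" * 400}]
print(estimate_tokens(conversation))
print(fits_budget(conversation, context_window=2000))
```

An estimate like this can only warn early. The 400 error remains the ground truth, so real code still has to be prepared for it.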

Chapter 15 introduces session memory: the layer that watches the conversation grow and summarizes old turns into a shorter form before they overflow the window. Chapters 16 and 17 then add long-term memory, allowing the agent to remember things between separate conversations, after the in-conversation list has been thrown away.

The cost shape of a conversation

Beyond the hard limit of the context window, there is a soft cost: every turn re-sends the entire history. On turn N of a conversation, you pay to process all the tokens from turns 1 through N−1, plus the new user message, plus generating the new reply.

Counted in messages, the input to each turn of an N-turn conversation grows like this:

turn 1: 1 message
turn 2: 3 messages (turn 1's user, turn 1's assistant, turn 2's user)
turn 3: 5 messages
...
turn N: 2N - 1 messages

The total input-token cost across N turns is therefore proportional to N², not to N. A conversation that is twice as long costs roughly four times as much. This is fine for short interactions but becomes a real problem when an agent is doing something multi-step over hundreds of turns.
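The arithmetic is easy to check. The sketch below counts messages rather than tokens, which amounts to assuming every message is the same size; the function names are ours, for illustration only.

```python
def messages_sent_on_turn(turn: int) -> int:
    """Messages in the request on a given 1-based turn: 2N - 1."""
    return 2 * turn - 1


def total_messages_sent(turns: int) -> int:
    """Sum of the request sizes over a whole conversation."""
    return sum(messages_sent_on_turn(t) for t in range(1, turns + 1))


for turns in (10, 20, 40):
    print(turns, total_messages_sent(turns))
```

The totals come out to 100, 400, and 1600: the sum of the first N odd numbers is exactly N², so each doubling of the conversation length quadruples the total.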

Two things eventually save us from this:

  1. Prompt caching. Most providers offer a cache: if you re-send a prompt prefix that was sent recently, you pay roughly 10% of the normal input-token rate for the cached portion. We will turn this on in Chapter 4 once we have streaming in place.

  2. Memory layers that compress old turns instead of carrying every word forward. This will be covered in Chapters 15 through 17.
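To get a feel for how much caching could help, here is a back-of-the-envelope sketch. It rests on several simplifying assumptions: every message is the same size, everything sent on the previous turn is served from the cache on the next one, the cached portion is billed at the roughly 10% rate mentioned above, and any cache-write surcharge or cache expiry is ignored. Real pricing differs by provider, so treat the numbers as an illustration, not a quote.

```python
CACHED_RATE = 0.10  # from the text: roughly 10% of the normal input rate


def effective_input_cost(turns: int, cached: bool) -> float:
    """Total input cost in 'message units' over a conversation."""
    total = 0.0
    for turn in range(1, turns + 1):
        sent = 2 * turn - 1
        # Assumption: the previous turn's request (2N - 3 messages) is a cache hit.
        previously_sent = max(0, 2 * turn - 3) if cached else 0
        fresh = sent - previously_sent
        total += fresh + CACHED_RATE * previously_sent
    return total


print(effective_input_cost(10, cached=False))
print(round(effective_input_cost(10, cached=True), 1))
```

Under these assumptions a 10-turn conversation drops from 100 units to about 27. The quadratic term does not disappear; it is only scaled down, which is why the memory layers in item 2 are still needed.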

For Chapter 2 the takeaway is just: a long conversation is expensive and slow, and the cost grows quadratically. Notice it in the tokens-per-turn report from Exercise 2.

Production reference

In nanobot, the equivalent of our messages list lives inside nanobot/nanobot/session/manager.py. The file defines two classes that, together, do the work our chat() function does inline today:

  • Session wraps the messages list with a few extra fields a real agent needs: a key ("telegram:12345" or "cli:default" — which channel and which user), created_at and updated_at timestamps, a metadata dict for per-session state, and last_consolidated, an index used by the memory layers in later chapters to remember how much of the history has already been summarized.
  • SessionManager is the layer above. It owns a _cache dict mapping keys to Session objects, so multiple turns from the same user hit the same session in memory. It knows how to load a session from disk (_load) and save one back (save). Its get_or_create(key) method is the single entry point an incoming message goes through.

Three methods are worth tracing once you have written your own version. Each maps directly to something we wrote or discussed in this chapter:

  • Session.add_message(role, content, **kwargs) is the production version of our messages.append({"role": ..., "content": ...}). It additionally stamps each message with a timestamp and accepts arbitrary **kwargs (for image attachments, tool-call IDs, channel-delivery flags) without changing the call site. Same shape, more room.
  • Session.get_history(max_messages=120, max_tokens=0, ...) is what we will eventually need instead of "send the entire list every turn." It slices the tail of the conversation to a message count, then optionally to a token budget, and is careful never to start the slice on an assistant turn or on an orphan tool result — both of which would confuse the model. Chapter 15 builds the version that decides what to send.
  • SessionManager.save(session, fsync=False) persists a session to a JSONL file. The pattern worth noticing is the atomic write: it writes to a .jsonl.tmp file first, then os.replace()s it onto the real path. That keeps the on-disk file from ever being half-written if the process is killed mid-save.
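To illustrate the tail-slicing idea behind get_history, here is a simplified sketch. It is not nanobot's implementation: tail_history is a hypothetical function, it handles only plain user and assistant messages, and it ignores token budgets and tool results entirely.

```python
def tail_history(messages: list[dict], max_messages: int) -> list[dict]:
    """Return at most max_messages from the end, starting on a user turn."""
    if max_messages <= 0:
        # Guard: messages[-0:] would return the whole list.
        return []
    tail = messages[-max_messages:]
    # Drop leading entries until the slice opens with a user message.
    start = 0
    while start < len(tail) and tail[start]["role"] != "user":
        start += 1
    return tail[start:]


conversation = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
    {"role": "assistant", "content": "a3"},
]
print([m["content"] for m in tail_history(conversation, 3)])
```

With a limit of 3, the raw slice would open on the assistant reply a2, so the sketch drops it and returns only u3 and a3.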

Two production nuances worth flagging now, because they are invisible from a single-user CLI but inevitable the moment a real agent serves more than one conversation:

  1. Sessions are keyed, not global. Our messages list is a single Python local variable, but nanobot's sessions are looked up by channel:chat_id. Even a personal agent ends up with several conversation threads going at once: the CLI you are using to test it, your Telegram chat with it, a scheduled cron job that wakes the agent up to summarize your morning. Without keys, a Telegram message and a cron-triggered turn would land in the same list, and the model would see one as a continuation of the other.
  2. Persistence is JSONL, not JSON. JSONL (JSON Lines) is a small convention: each line of the file is its own self-contained JSON object, separated by newlines, instead of the whole file being one big JSON array [7]. In nanobot's case, each message is one such line, with a single metadata line at the top of the file. The first benefit is that a crashed or killed write loses at most the last line, not the whole file. Also, you can tail -f a live session file in another terminal and watch it grow message by message, which is a surprisingly useful debugging tool once an agent is running on its own.
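The JSONL layout is simple enough to try in a few lines. The sketch below is an illustration, not nanobot's file format: the shape of the metadata line (the "kind" and "key" fields) and the function names are assumptions made up for this example, and the write is deliberately a plain one rather than the atomic version described above.

```python
import json
import tempfile
from pathlib import Path


def write_session(path: Path, key: str, messages: list[dict]) -> None:
    """Write one metadata line, then one JSON object per message."""
    with path.open("w", encoding="utf-8") as f:
        # Hypothetical metadata shape, for illustration only.
        f.write(json.dumps({"kind": "metadata", "key": key}) + "\n")
        for message in messages:
            f.write(json.dumps(message, ensure_ascii=False) + "\n")


def read_session(path: Path) -> tuple[dict, list[dict]]:
    """Parse each non-empty line on its own; the first one is the metadata."""
    with path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return records[0], records[1:]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cli_default.jsonl"
    original = [
        {"role": "user", "content": "My name is Olga."},
        {"role": "assistant", "content": "Nice to meet you, Olga!"},
    ]
    write_session(path, "cli:default", original)
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    metadata, restored = read_session(path)

print(line_count, metadata["key"], len(restored))
```

Because every line parses independently, a reader can stop at the first line that fails to parse and still keep everything before it.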

The closest single file to read after this chapter is nanobot/nanobot/session/manager.py. Try to recognize, on first read, the two-line append pattern from our chat() loop hiding inside Session.add_message.

Exercises

  1. Watch the list grow. Add print(f"--- {len(messages)} messages so far ---") at the top of the chat loop. Run a 10-turn conversation. Each turn, the count should grow by exactly two.

  2. Watch the cost grow. Modify llm to also return response.usage and have chat print usage.input_tokens after every reply. Run a conversation of at least 8 turns. Plot the numbers. They should grow roughly linearly with the turn number; the slope is the average number of tokens added per turn (your message plus the model's reply). Summing the per-turn numbers over the whole conversation then gives you the quadratic total.

  3. Wrap your messages list in a Session. Replace the bare messages: list[dict] = [] in chat with a small dataclass that records key: str, messages: list[dict], created_at: datetime, and updated_at: datetime, and exposes an add_message(role, content) method that appends and bumps updated_at. Use key="cli:default" for now. Then open Session in nanobot/nanobot/session/manager.py and identify two fields and one method your version is missing. We will elaborate on this in Chapters 7 and 15.

  4. Stretch: slash commands. Extend chat to recognize input that starts with /. Implement /reset (clear messages and start over), /undo (pop the last user/assistant pair), and /print (display the current messages list). Slash commands are a small but real piece of agent UX and most production assistants have them.

  5. Stretch: durable sessions. Add /save <filename> and /load <filename> slash commands that persist the messages list to disk and read it back. Then, for the production version of the same idea, write to a .tmp file and os.replace() it onto the final path so that a crash mid-save can never leave a half-written file. Compare your code to SessionManager.save in nanobot/nanobot/session/manager.py. While you are there, notice that nanobot writes one JSON object per line (JSONL) rather than one big JSON array and think about why that matters when sessions get long.

References

[1] Using the Messages API — Multiple conversational turns. Claude API documentation. https://platform.claude.com/docs/en/build-with-claude/working-with-messages

[2] Messages — API reference. Claude API documentation. https://platform.claude.com/docs/en/api/messages

[3] PEP 257 — Docstring Conventions. https://peps.python.org/pep-0257/

[4] Sphinx — Python Documentation Generator. https://www.sphinx-doc.org/

[5] Google Python Style Guide — Type Annotations. https://google.github.io/styleguide/pyguide.html#319-type-annotations

[6] __main__ — Top-level code environment. Python documentation. https://docs.python.org/3/library/__main__.html

[7] JSON Lines. https://jsonlines.org/