Conversation · AI Agent Engineering

The code written in Chapter 1 has no memory: every call to the LLM is a fresh start.

This chapter folds the current code into a small CLI tool that holds a conversation across turns by accumulating messages in a list.

The accumulation pattern

The pattern behind building a conversation looks like this:

1. Start with an empty list.
2. User says something — append a {"role": "user", "content": ...} entry.
3. Send the whole list to the model, get a reply back.
4. Append a {"role": "assistant", "content": reply} entry.
5. Go to step 2.

A few important concepts are worth noticing here:

The model has no separate record of its own previous responses. It sees them as part of the input, the same way it sees user messages.
There is no concept of a "session ID." The conversation is the list.
Roles alternate in this pattern. The Anthropic API combines consecutive same-role messages automatically [2], but other providers vary, and some reject that shape outright. Because the pattern alternates by construction, the issue stays out of the way until tool calls arrive in Chapter 7.
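For a provider that rejects consecutive same-role messages, one option is to merge them before sending. The following is a minimal sketch under assumptions, not part of any SDK: merge_consecutive is a hypothetical helper, and it assumes every content value is a plain string, as in this chapter.

```python
def merge_consecutive(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with adjacent same-role entries joined.

    Hypothetical helper: assumes every "content" value is a plain string.
    """
    merged: list[dict] = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Same role as the previous entry: join the text instead of
            # sending two consecutive turns with the same role.
            merged[-1]["content"] += "\n\n" + message["content"]
        else:
            # Copy the dict so the caller's list is never mutated.
            merged.append(dict(message))
    return merged


history = [
    {"role": "user", "content": "Hi there!"},
    {"role": "user", "content": "Are you still there?"},
    {"role": "assistant", "content": "Yes, I am here."},
]
cleaned = merge_consecutive(history)
print([m["role"] for m in cleaned])  # ['user', 'assistant']
```

The chat loop built in this chapter never produces such a list, so the helper only becomes relevant once messages can arrive from more than one source.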

Building the CLI

Back in main.py, the existing call to the model can be wrapped into an llm function like this:

def llm(messages):
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

One improvement worth introducing is a docstring: a one-line summary at the top of the function body, following PEP 257 [3], that tells anyone reading the function what it does. Tools like Sphinx [4] can turn docstrings into browsable HTML docs. Type annotations, while not enforced at runtime, document what a function expects and returns, and let static checkers like mypy catch a class of bugs before the code runs [5]. Applying both yields the following result:

def llm(messages: list[dict]) -> str:
    """Send a list of messages to the model and return the reply text."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

From here on both practices are assumed throughout the book without being called out.
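To make "not enforced at runtime" concrete, here is a small standalone sketch. The word_count function is a made-up example, not part of main.py. It shows where Python stores the docstring and the annotations, and that a call with the wrong argument type is not stopped by the annotation itself.

```python
def word_count(text: str) -> int:
    """Return the number of whitespace-separated words in text."""
    return len(text.split())


# The docstring is stored on the function object, where help() can read it.
print(word_count.__doc__)

# Annotations are stored as metadata; Python does not check them on a call.
print(word_count.__annotations__)

# A list is not a str, yet the annotation raises no error. The call only
# fails inside the body, because lists have no .split() method.
try:
    word_count(["not", "a", "string"])
except AttributeError as error:
    print(f"failed inside the body: {error}")
```

A static checker such as mypy would flag that last call before the program runs, which is the class of bug the annotations are there to catch.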

Next comes a chat() function that handles the conversation loop, implementing the algorithm described in the previous section:

def chat() -> None:
    """Run an interactive chat loop, accumulating turns in a single messages list."""
    messages: list[dict] = []
    while True:
        user_input = input()
        messages.append({"role": "user", "content": user_input})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
 
        print(f"assistant: {reply}\n")

Then chat() is called from the entry point:

if __name__ == "__main__":
    chat()

Running the program shows how it works:

$ uv run main.py 
Hi there!
assistant: Hi there! How are you doing today? 😊 Is there something I can help you with?
 
My name is Olga, what is your name?
assistant: Nice to meet you, Olga! My name is Claude. I'm an AI assistant made by Anthropic. How can I help you today? 😊
 
What is my name?
assistant: Your name is Olga! You just told me a moment ago. 😊 Is there anything else you'd like to chat about or anything I can help you with?
 
^D
$

The chat now remembers the conversation, but two edge cases will crash or waste tokens. The user might press Ctrl-D (or Ctrl-C) to exit, which currently raises an exception. The user might also submit an empty line by mistake, which sends a useless message to the model. The following modifications address these issues:

def chat() -> None:
    """Run an interactive chat loop, accumulating turns in a single messages list."""
    messages: list[dict] = []
    print("chat — Ctrl-D or empty line to exit\n")  # <-- new: greeting
    while True:
        try:                                           # <-- new: clean exit on Ctrl-D / Ctrl-C
            user_input = input("you: ").strip()
        except (EOFError, KeyboardInterrupt):
            print()
            break
        if not user_input:                             # <-- new: exit on an empty line instead of sending it
            break
        messages.append({"role": "user", "content": user_input})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
 
        print(f"assistant: {reply}\n")

The one-line greeting at startup tells a new user what they are looking at and how to exit, and the you: prefix on the input prompt makes the transcript easy to follow.

Watching the list grow

Every turn appends two entries. After the three exchanges from the run above, the messages list looks like this:

[
    {"role": "user",      "content": "Hi there!"},
    {"role": "assistant", "content": "Hi there! How are you doing today? ..."},
    {"role": "user",      "content": "My name is Olga, what is your name?"},
    {"role": "assistant", "content": "Nice to meet you, Olga! My name is Claude. ..."},
    {"role": "user",      "content": "What is my name?"},
    {"role": "assistant", "content": "Your name is Olga! You just told me a moment ago. ..."},
]

Adding print(messages) at the top of the loop makes it possible to watch this list grow turn by turn. Notice that each assistant reply is appended as plain text, indistinguishable in shape from the user messages above it.

The context window is where memory ends

The model can read every message in the list, but only as long as the list fits in the context window. Context windows vary by model, from roughly 100k to 1M tokens. A million tokens is roughly 750,000 English words, which is several long novels' worth of text, re-read every single time a request is made.

This limit gets hit when a conversation:

Pastes large files into the history.
Runs an agent that calls many tools and accumulates large tool results.
Lets a long-running agent talk to itself for hours or days.

When the context window is exceeded, the API does not silently truncate older messages. It returns a 400 error: prompt is too long. This is a good default, because silent truncation would corrupt conversations in subtle, hard-to-debug ways. Chapter 15 introduces session memory: the layer that watches the conversation grow and summarizes old turns into a shorter form before they overflow the window. Chapters 16 and 17 then add long-term memory, allowing the agent to remember things between separate conversations, after the in-conversation list has been thrown away.
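Until session memory arrives, a rough pre-flight estimate can warn that a conversation is approaching the limit. The sketch below is an approximation under stated assumptions: it reuses the 0.75 words-per-token ratio from the paragraph above, estimate_tokens and fits_in_window are hypothetical helpers, and the reserve of 1024 mirrors the max_tokens value passed in llm. Real tokenizers will give different numbers.

```python
WORDS_PER_TOKEN = 0.75  # rough ratio from the text above; real tokenizers vary


def estimate_tokens(messages: list[dict]) -> int:
    """Roughly estimate the token count of a messages list from its word count."""
    words = sum(len(m["content"].split()) for m in messages)
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(messages: list[dict], context_window: int, reserve: int = 1024) -> bool:
    """Return True if the estimate still leaves `reserve` tokens for the reply."""
    return estimate_tokens(messages) + reserve <= context_window


messages = [
    {"role": "user", "content": "My name is Olga, what is your name?"},
    {"role": "assistant", "content": "Nice to meet you, Olga! My name is Claude."},
]
print(estimate_tokens(messages))
print(fits_in_window(messages, context_window=100_000))
```

An estimate like this is only an early warning. The API's own 400 error remains the authoritative signal that the window was exceeded.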

The cost shape of a conversation

Beyond the hard limit of the context window, there is a soft cost: every turn re-sends the entire history. On turn N of a conversation, the bill covers all the tokens from turns 1 through N−1, plus the new user message, plus generating the new reply.

Counting messages as a stand-in for tokens, the input sent on each turn of an N-turn conversation grows like this:

turn 1: 1 message
turn 2: 3 messages (turn 1's user, turn 1's assistant, turn 2's user)
turn 3: 5 messages
...
turn N: 2N - 1 messages

The total input-token cost across N turns is therefore proportional to N², not to N. A conversation that is twice as long costs roughly four times as much. This is fine for short interactions but becomes a real problem when an agent is doing something multi-step over hundreds of turns.
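The quadratic total can be checked with a few lines of arithmetic. This sketch assumes every message is about the same size, so that message counts stand in for tokens. The function names are illustrative only.

```python
def messages_sent_on_turn(turn: int) -> int:
    """Messages in the request on a given 1-based turn: 2 * turn - 1."""
    return 2 * turn - 1


def total_messages_sent(turns: int) -> int:
    """Messages sent across all requests in a conversation of `turns` turns."""
    return sum(messages_sent_on_turn(t) for t in range(1, turns + 1))


for turns in (10, 20, 40):
    print(turns, total_messages_sent(turns))
# 10 100
# 20 400
# 40 1600
```

The sum of the first N odd numbers is exactly N², which is why each doubling of the conversation length quadruples the total.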

Prompt caching (Chapter 4) and memory layers (Chapters 15-17) address these issues. For now, the takeaway is just that a long conversation is expensive and slow, and the cost grows quadratically. It shows up in the tokens-per-turn report from Exercise 2.

Production reference

In nanobot, the equivalent of the messages list lives inside nanobot/nanobot/session/manager.py. The file defines two classes that, together, do the work the chat() function does inline today:

Session wraps the messages list with a few extra fields: a key ("telegram:12345" or "cli:default" — which channel and which user), created_at and updated_at timestamps, a metadata dict for per-session state, and last_consolidated, an index used by the memory layers in later chapters to remember how much of the history has already been summarized.
SessionManager is the layer above. It owns a _cache dict mapping keys to Session objects, so multiple turns from the same user hit the same session in memory. It knows how to load a session from disk (_load) and save one back (save), and get_or_create(key) is the single entry point an incoming message goes through.

Pay attention to the following methods and map them onto the implementation from this chapter:

Session.add_message(role, content, **kwargs) is the production version of messages.append({"role": ..., "content": ...}). It additionally stamps each message with a timestamp and accepts arbitrary **kwargs (for image attachments, tool-call IDs, channel-delivery flags) without changing the call site.
Session.get_history(max_messages=120, max_tokens=0, ...) is what eventually replaces "send the entire list every turn." It slices the tail of the conversation to a message count, then optionally to a token budget, and is careful never to start the slice on an assistant turn or on an orphan tool result.
SessionManager.save(session, fsync=False) persists a session to a JSONL file. The pattern worth noticing is the atomic write: it writes to a .jsonl.tmp file first, then os.replace()s it onto the real path. That keeps the on-disk file from ever being half-written if the process is killed mid-save.
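The tail-slicing idea behind get_history can be illustrated with a much smaller function. This is a simplified sketch, not nanobot's implementation: tail_history is a hypothetical name, it handles only the message-count limit, and it ignores the token budget and tool results entirely.

```python
def tail_history(messages: list[dict], max_messages: int) -> list[dict]:
    """Return at most max_messages from the end, starting on a user turn."""
    # messages[-0:] would return the whole list, so treat 0 as "nothing".
    tail = messages[-max_messages:] if max_messages > 0 else []
    # Skip leading entries until the slice begins with a user message, so
    # the history never opens with an assistant reply.
    start = 0
    while start < len(tail) and tail[start]["role"] != "user":
        start += 1
    return tail[start:]


history = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "My name is Olga."},
    {"role": "assistant", "content": "Nice to meet you, Olga!"},
    {"role": "user", "content": "What is my name?"},
    {"role": "assistant", "content": "Your name is Olga."},
]
print([m["role"] for m in tail_history(history, 3)])  # ['user', 'assistant']
```

With a limit of 3, the raw slice would start on an assistant reply, so the sketch drops it and returns two messages instead of three.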

In this chapter's code, the messages list is a single local variable, whereas nanobot's sessions are looked up by a channel:chat_id key. Without keying, a Telegram message and a cron-triggered turn would land in the same list, and the model would see one as a continuation of the other.
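Keying can be sketched with nothing more than a dict of lists. The get_or_create name below echoes the SessionManager method described above, but the body is an assumption for illustration: it works on bare message lists rather than Session objects and does no loading from disk.

```python
sessions: dict[str, list[dict]] = {}


def get_or_create(key: str) -> list[dict]:
    """Return the messages list for key, creating an empty one on first use."""
    return sessions.setdefault(key, [])


# Two channels write to two separate histories.
get_or_create("telegram:12345").append({"role": "user", "content": "Hi from Telegram"})
get_or_create("cli:default").append({"role": "user", "content": "Hi from the terminal"})
get_or_create("telegram:12345").append({"role": "assistant", "content": "Hello!"})

print({key: len(msgs) for key, msgs in sessions.items()})
# {'telegram:12345': 2, 'cli:default': 1}
```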

nanobot also persists sessions as JSONL rather than as a single JSON document. JSONL (JSON Lines) is a small convention: each line of the file is its own self-contained JSON object, separated by newlines, instead of the whole file being one big JSON array [7]. In nanobot's case, each message is one such line, with a single metadata line at the top of the file. The first benefit is that a crashed or killed write loses at most the last line, not the whole file. The second is that tail -f on a live session file in another terminal shows it grow message by message, which is a surprisingly useful debugging tool once an agent is running on its own.
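The "loses at most the last line" property can be demonstrated in isolation. This sketch illustrates the format only and is not how nanobot saves sessions: append_message and read_messages are hypothetical helpers, the metadata line is omitted, and the file is appended to directly rather than written through a temporary file.

```python
import json
import os
import tempfile


def append_message(path: str, message: dict) -> None:
    """Append one message to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(message, ensure_ascii=False) + "\n")


def read_messages(path: str) -> list[dict]:
    """Read every complete line, skipping lines that are not valid JSON."""
    messages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                messages.append(json.loads(line))
            except json.JSONDecodeError:
                # A write that was cut off mid-line damages only that line.
                continue
    return messages


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "session.jsonl")
    append_message(path, {"role": "user", "content": "Hi there!"})
    append_message(path, {"role": "assistant", "content": "Hello!"})
    # Simulate a crash partway through writing a third message.
    with open(path, "a", encoding="utf-8") as f:
        f.write('{"role": "user", "cont')
    recovered = read_messages(path)

print(len(recovered))  # 2
```

With a single JSON array, the same truncated write would leave the whole file unparseable. Here the two complete lines are still readable.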

Exercises

1. Watch the list grow. Add print(f"--- {len(messages)} messages so far ---") at the top of the chat loop. Run a 10-turn conversation. Each turn, the count should grow by exactly two.
2. Watch the cost grow. Modify llm to also return response.usage and have chat print usage.input_tokens after every reply. Run a conversation of at least 8 turns and plot the numbers. They should grow roughly linearly with the turn number, and the slope is the average tokens added per turn (the user message plus the model's reply). Summing the per-turn numbers over the whole conversation gives the quadratic total.
3. Wrap the messages list in a Session. Replace the bare messages: list[dict] = [] in chat with a small dataclass that records key: str, messages: list[dict], created_at: datetime, and updated_at: datetime, and exposes an add_message(role, content) method that appends and bumps updated_at. Use key="cli:default" for now. Then open Session in nanobot/nanobot/session/manager.py and identify two fields and one method the dataclass is missing. Chapters 7 and 15 elaborate on this.
4. Stretch: slash commands. Extend chat to recognize input that starts with /. Implement /reset (clear messages and start over), /undo (pop the last user/assistant pair), and /print (display the current messages list). Slash commands are a small but real piece of agent UX, and most production assistants have them.
5. Stretch: durable sessions. Add /save <filename> and /load <filename> slash commands that persist the messages list to disk and read it back. Then, for the production version of the same idea, write to a .tmp file and os.replace() it onto the final path so that a crash mid-save can never leave a half-written file. Compare the result to SessionManager.save in nanobot/nanobot/session/manager.py. While reading, notice that nanobot writes one JSON object per line (JSONL) rather than one big JSON array, and consider why that matters when sessions get long.

References

[1] Using the Messages API — Multiple conversational turns. Claude API documentation. https://platform.claude.com/docs/en/build-with-claude/working-with-messages

[2] Messages — API reference. Claude API documentation. https://platform.claude.com/docs/en/api/messages

[3] PEP 257 — Docstring Conventions. https://peps.python.org/pep-0257/

[4] Sphinx — Python Documentation Generator. https://www.sphinx-doc.org/

[5] Google Python Style Guide — Type Annotations. https://google.github.io/styleguide/pyguide.html#319-type-annotations

[6] __main__ — Top-level code environment. Python documentation. https://docs.python.org/3/library/__main__.html

[7] JSON Lines. https://jsonlines.org/