Chapter 2
Conversation
The code we wrote in Chapter 1 has no memory: every call to the LLM is a fresh start. If you tell the model your name, then in a second call ask "what did I just tell you," it has no idea.
Some products go further and remember things about you across entirely separate sessions: your name, your job, the project you mentioned last week. That continuity is what makes them feel like a real assistant rather than a fancy autocomplete.
In this chapter we will fold our current code into a small CLI tool that holds a conversation across turns by accumulating messages in a list.
The model is stateless
In Chapter 1 we already briefly mentioned that the model has no memory. Each call to messages.create is a function from a list of messages to a reply, and forgets everything the moment it returns [1]. Re-sending the whole history every turn is, on its face, wasteful, and we will pay for it in tokens. But it is also what makes the API trivial to reason about because the model has no hidden state.
The accumulation pattern
The pattern behind a conversation looks like this:
1. Start with an empty list.
2. User says something — append a {"role": "user", "content": ...} entry.
3. Send the whole list to the model, get a reply back.
4. Append a {"role": "assistant", "content": reply} entry.
5. Go to step 2.
Some important concepts to notice here:
- The model never sees its own previous responses as assistant events generated by some external system. It sees them as part of the input, in the same way it sees user messages.
- There is no concept of a "session ID." The conversation is the list.
- Roles strictly alternate. The Anthropic API combines consecutive same-role messages for you [2]. Other providers vary, with some rejecting that shape outright. Our pattern alternates by construction, so we will not have to worry about this until we start receiving tool calls in Chapter 7.
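The five steps above can be sketched without touching the network. In this minimal sketch, fake_llm is a hypothetical stand-in for the real model call (it is not part of any SDK). It only reports how many messages it was given, which makes the statelessness visible: everything it "knows" arrives through the list.

```python
def fake_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for the model: a pure function of its input."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(messages)} message(s); you last said: {user_turns[-1]!r}"


def run_turns(user_inputs: list[str]) -> list[dict]:
    """Apply the accumulation pattern to a scripted list of user inputs."""
    messages: list[dict] = []                                      # step 1
    for text in user_inputs:                                       # step 5: loop
        messages.append({"role": "user", "content": text})         # step 2
        reply = fake_llm(messages)                                 # step 3
        messages.append({"role": "assistant", "content": reply})   # step 4
    return messages


history = run_turns(["My name is Olga.", "What is my name?"])
for entry in history:
    print(f'{entry["role"]}: {entry["content"]}')
```

Swapping fake_llm for a real model call changes nothing about the loop, which is the point: the conversation lives entirely in the list.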
Building the CLI
Let's come back to main.py and wrap our existing call to the model into an llm function like this:
def llm(messages):
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

Two small improvements before we move on. A docstring, a one-line summary at the top of the function that follows PEP 257 [3], explains what the function does to anyone reading it, and tools like Sphinx [4] can turn those into browsable HTML docs. And type annotations, while not enforced at runtime, document expectations at the call site and let static checkers like mypy catch a class of bugs before you run the code [5]. Both together yield the following result:
def llm(messages: list[dict]) -> str:
    """Send a list of messages to the model and return the reply text."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

From here on we will assume both practices throughout the book without calling them out each time.
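If you want a static checker to catch more than list[dict] allows, one option is to describe the shape of a message itself. The sketch below is an assumption about how you might tighten the annotation, not something the rest of the book relies on. Message and make_message are hypothetical names introduced only for illustration.

```python
from typing import Literal, TypedDict


class Message(TypedDict):
    """Shape of one conversation entry (hypothetical helper type)."""
    role: Literal["user", "assistant"]
    content: str


def make_message(role: Literal["user", "assistant"], content: str) -> Message:
    """Build one entry; a static checker can flag a misspelled role or key."""
    return {"role": role, "content": content}


messages: list[Message] = [make_message("user", "Hi there!")]
print(messages)
```

At runtime nothing changes: a Message is still a plain dict, so it can be passed to the model call as before. The benefit shows up only when a checker such as mypy reads the code.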
Now let's add a chat() function that handles the conversation loop, implementing the algorithm described above:
def chat() -> None:
    """Run an interactive chat loop, accumulating turns in a single messages list."""
    messages: list[dict] = []
    while True:
        user_input = input()
        messages.append({"role": "user", "content": user_input})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        print(f"assistant: {reply}\n")
Then call chat() from the entry point:
if __name__ == "__main__":
    chat()

The if __name__ == "__main__": guard is a standard Python idiom: the body runs only when the file is executed directly (e.g., uv run main.py), not when it is imported by another module. This becomes important once your project grows beyond a single file. The Python docs page on __main__ is a short read [6].
Let's run our program and see how it works:
$ uv run main.py
Hi there!
assistant: Hi there! How are you doing today? 😊 Is there something I can help you with?
My name is Olga, what is your name?
assistant: Nice to meet you, Olga! My name is Claude. I'm an AI assistant made by Anthropic. How can I help you today? 😊
What is my name?
assistant: Your name is Olga! You just told me a moment ago. 😊 Is there anything else you'd like to chat about or anything I can help you with?
^D
$

Our chat now remembers the conversation, but two edge cases will crash or waste tokens. The user might press Ctrl-D (or Ctrl-C) to exit, which currently raises an exception. The user might also submit an empty line by mistake, which sends a useless message to the model.
Let's also add a one-line greeting at startup so a new user knows what they are looking at, and a you: prefix on the input prompt so the transcript is easy to follow.
def chat() -> None:
    """Run an interactive chat loop, accumulating turns in a single messages list."""
    messages: list[dict] = []
    print("chat: Ctrl-D or empty line to exit\n")  # <-- new: greeting
    while True:
        try:  # <-- new: clean exit on Ctrl-D / Ctrl-C
            user_input = input("you: ").strip()
        except (EOFError, KeyboardInterrupt):
            print()
            break
        if not user_input:  # <-- new: an empty line ends the chat instead of being sent
            break
        messages.append({"role": "user", "content": user_input})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        print(f"assistant: {reply}\n")

Watching the list grow
Every turn appends two entries. After the three exchanges from the run above, our messages list looks like this:
[
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hi there! How are you doing today? ..."},
    {"role": "user", "content": "My name is Olga, what is your name?"},
    {"role": "assistant", "content": "Nice to meet you, Olga! My name is Claude. ..."},
    {"role": "user", "content": "What is my name?"},
    {"role": "assistant", "content": "Your name is Olga! You just told me a moment ago. ..."},
]

If you add print(messages) at the top of the loop, you can watch this list grow as you talk. Notice that each assistant reply is appended as plain text, indistinguishable in shape from the user messages above it.
The context window is where memory ends
The model can read every message in the list, but only as long as the list fits in the context window. Context windows vary by model, commonly from about 100k to 1M tokens. A million tokens is roughly seven hundred and fifty thousand English words, which is several long novels' worth of text, re-read every single time you make a request.
You will hit this limit if you:
- Paste large files into the conversation.
- Run an agent that calls many tools and accumulates large tool results in the history.
- Let a long-running agent talk to itself for hours or days.
- Or do all three at once, which is what real agents look like in production.
When you exceed the context window, the API does not silently truncate older messages for you. It returns a 400 error: prompt is too long. This is a good default because silent truncation would corrupt conversations in subtle, hard-to-debug ways. But it is also something we will eventually have to handle.
Chapter 15 introduces session memory: the layer that watches the conversation grow and summarizes old turns into a shorter form before they overflow the window. Chapters 16 and 17 then add long-term memory, allowing the agent to remember things between separate conversations, after the in-conversation list has been thrown away.
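Until then, a rough pre-flight check can tell you when a conversation is getting close to the limit. The sketch below uses the words-to-tokens ratio quoted above (about 0.75 words per token) as a crude heuristic. Real tokenizers behave differently, especially on code and non-English text, and estimate_tokens, fits_in_window, and the 200,000-token default are illustrative assumptions rather than properties of any particular model. It also assumes every message's content is a plain string.

```python
WORDS_PER_TOKEN = 0.75  # heuristic from the text: 1M tokens is roughly 750k words


def estimate_tokens(messages: list[dict]) -> int:
    """Crude token estimate: count whitespace-separated words, then scale."""
    words = sum(len(m["content"].split()) for m in messages)
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(messages: list[dict], window: int = 200_000,
                   reserve_for_reply: int = 1024) -> bool:
    """True if the estimated input plus room for the reply stays under the window."""
    return estimate_tokens(messages) + reserve_for_reply <= window


history = [
    {"role": "user", "content": "word " * 300},
    {"role": "assistant", "content": "word " * 450},
]
print(estimate_tokens(history))               # 750 words -> 1000 estimated tokens
print(fits_in_window(history))                # True
print(fits_in_window(history, window=1500))   # False: 1000 + 1024 > 1500
```

An estimate like this is only good for an early warning. The API's own error remains the authoritative signal that the prompt was too long.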
The cost shape of a conversation
Beyond the hard limit of the context window, there is a soft cost: every turn re-sends the entire history. On turn N of a conversation, you pay to process all the tokens from turns 1 through N−1, plus the new user message, plus generating the new reply.
Counted in messages, the input sent on each turn of an N-turn conversation grows like this:
turn 1: 1 message
turn 2: 3 messages (turn 1's user, turn 1's assistant, turn 2's user)
turn 3: 5 messages
...
turn N: 2N - 1 messages
Summing 1 + 3 + 5 + ... + (2N − 1) gives exactly N² messages processed as input over the whole conversation. If messages are of similar size, the total input-token cost across N turns is therefore proportional to N², not to N. A conversation that is twice as long costs roughly four times as much. This is fine for short interactions but becomes a real problem when an agent is doing something multi-step over hundreds of turns.
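The arithmetic is easy to check in a few lines. This sketch counts messages rather than tokens, so it assumes all messages are about the same size; the function names are made up for the illustration.

```python
def messages_sent_on_turn(turn: int) -> int:
    """Messages in the request on a given turn (1-based): 2N - 1."""
    return 2 * turn - 1


def total_messages_sent(turns: int) -> int:
    """Messages processed as input across the whole conversation."""
    return sum(messages_sent_on_turn(t) for t in range(1, turns + 1))


for n in (10, 20, 40):
    print(n, total_messages_sent(n))
# 10 100
# 20 400
# 40 1600
```

Each doubling of the conversation length quadruples the total, which is the quadratic growth described above.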
Two things eventually save us from this:
- Prompt caching. Most providers offer a cache: if you re-send a prompt prefix that was sent recently, you pay a reduced input-token rate for the cached portion (roughly 10% of the normal rate on the Anthropic API; other providers discount differently). We will turn this on in Chapter 4 once we have streaming in place.
- Memory layers that compress old turns instead of carrying every word forward. This will be covered in Chapters 15 through 17.
For Chapter 2 the takeaway is just: a long conversation is expensive and slow, and the cost grows quadratically. Notice it in the tokens-per-turn report from Exercise 2.
Production reference
In nanobot, the equivalent of our messages list lives inside nanobot/nanobot/session/manager.py. The file defines two classes that, together, do the work our chat() function does inline today:
- Session wraps the messages list with a few extra fields a real agent needs: a key ("telegram:12345" or "cli:default", identifying which channel and which user), created_at and updated_at timestamps, a metadata dict for per-session state, and last_consolidated, an index used by the memory layers in later chapters to remember how much of the history has already been summarized.
- SessionManager is the layer above. It owns a _cache dict mapping keys to Session objects, so multiple turns from the same user hit the same session in memory. It knows how to load a session from disk (_load) and save one back (save), and get_or_create(key) is the single entry point an incoming message goes through.
Three methods are worth tracing once you have written your own version. Each maps directly to something we wrote or discussed in this chapter:
- Session.add_message(role, content, **kwargs) is the production version of our messages.append({"role": ..., "content": ...}). It additionally stamps each message with a timestamp and accepts arbitrary **kwargs (for image attachments, tool-call IDs, channel-delivery flags) without changing the call site. Same shape, more room.
- Session.get_history(max_messages=120, max_tokens=0, ...) is what we will eventually need instead of "send the entire list every turn." It slices the tail of the conversation to a message count, then optionally to a token budget, and is careful never to start the slice on an assistant turn or on an orphan tool result, both of which would confuse the model. Chapter 15 builds the version that decides what to send.
- SessionManager.save(session, fsync=False) persists a session to a JSONL file. The pattern worth noticing is the atomic write: it writes to a .jsonl.tmp file first, then moves it onto the real path with os.replace(). That keeps the on-disk file from ever being half-written if the process is killed mid-save.
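To make the idea behind get_history concrete, here is a minimal sketch of tail slicing by message count. It is not nanobot's implementation: tail_history is a hypothetical helper, it ignores token budgets entirely, and it treats any message whose role is not "user" as an unsafe starting point, which is a simplification of the rule about assistant turns and orphan tool results.

```python
def tail_history(messages: list[dict], max_messages: int) -> list[dict]:
    """Return at most max_messages from the end, starting on a user turn."""
    if max_messages <= 0:
        return []
    tail = messages[-max_messages:]
    # Drop leading entries until the slice opens with a user message.
    start = 0
    while start < len(tail) and tail[start]["role"] != "user":
        start += 1
    return tail[start:]


history = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "My name is Olga."},
    {"role": "assistant", "content": "Nice to meet you, Olga!"},
    {"role": "user", "content": "What is my name?"},
    {"role": "assistant", "content": "Your name is Olga!"},
]

# A budget of 3 would start on an assistant reply, so that reply is dropped.
print([m["role"] for m in tail_history(history, 3)])   # ['user', 'assistant']
print(len(tail_history(history, 4)))                   # 4
```

Note the trade-off this makes visible: once the slice drops the turn where the name was given, the model can no longer answer the question about it. That gap is what the memory layers in later chapters are meant to fill.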
Two production nuances worth flagging now, because they are invisible from a single-user CLI but inevitable the moment a real agent serves more than one conversation:
- Sessions are keyed, not global. Our messages list is a single Python local variable, but nanobot's sessions are looked up by channel:chat_id. Even a personal agent ends up with several conversation threads going at once: the CLI you are using to test it, your Telegram chat with it, a scheduled cron job that wakes the agent up to summarize your morning. Without keys, a Telegram message and a cron-triggered turn would land in the same list, and the model would see one as a continuation of the other.
- Persistence is JSONL, not JSON. JSONL (JSON Lines) is a small convention: each line of the file is its own self-contained JSON object, separated by newlines, instead of the whole file being one big JSON array [7]. In nanobot's case, each message is one such line, with a single metadata line at the top of the file. The first benefit is that a crashed or killed write loses at most the last line, not the whole file. Also, you can tail -f a live session file in another terminal and watch it grow message by message, which is a surprisingly useful debugging tool once an agent is running on its own.
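The JSONL layout and the atomic write described above can be combined in a few lines of standard-library Python. This is a sketch under stated assumptions, not nanobot's code: save_jsonl and load_jsonl are hypothetical names, the metadata line holds only a key, and the file name is invented for the demo.

```python
import json
import os
import tempfile


def save_jsonl(path: str, metadata: dict, messages: list[dict]) -> None:
    """Write metadata plus one message per line, then swap the file into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(metadata) + "\n")
        for message in messages:
            f.write(json.dumps(message) + "\n")
    # os.replace swaps the finished file in; a crash before this line
    # leaves the previous version of the file untouched.
    os.replace(tmp_path, path)


def load_jsonl(path: str) -> tuple[dict, list[dict]]:
    """Read the file back: the first line is metadata, the rest are messages."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return rows[0], rows[1:]


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "cli_default.jsonl")
    save_jsonl(target, {"key": "cli:default"},
               [{"role": "user", "content": "Hi there!"}])
    meta, restored = load_jsonl(target)
    leftover_tmp = os.path.exists(target + ".tmp")

print(meta, restored, leftover_tmp)
```

Because json.dumps escapes newlines inside strings, a multi-line message still occupies exactly one line of the file, which is what keeps the format line-oriented.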
The closest single file to read after this chapter is nanobot/nanobot/session/manager.py. Try to recognize, on first read, the two-line append pattern from our chat() loop hiding inside Session.add_message.
Exercises
1. Watch the list grow. Add print(f"--- {len(messages)} messages so far ---") at the top of the chat loop. Run a 10-turn conversation. Each turn, the count should grow by exactly two.
2. Watch the cost grow. Modify llm to also return response.usage and have chat print usage.input_tokens after every reply. Run a conversation of at least 8 turns. Plot the numbers. They should grow roughly linearly with the turn number, and the slope is the average number of tokens added per turn (your message plus the model's reply). Summing those per-turn numbers over the whole conversation gives the quadratic total.
3. Wrap your messages list in a Session. Replace the bare messages: list[dict] = [] in chat with a small dataclass that records key: str, messages: list[dict], created_at: datetime, and updated_at: datetime, and exposes an add_message(role, content) method that appends and bumps updated_at. Use key="cli:default" for now. Then open Session in nanobot/nanobot/session/manager.py and identify two fields and one method your version is missing. We will elaborate on this in Chapters 7 and 15.
4. Stretch: slash commands. Extend chat to recognize input that starts with /. Implement /reset (clear messages and start over), /undo (pop the last user/assistant pair), and /print (display the current messages list). Slash commands are a small but real piece of agent UX and most production assistants have them.
5. Stretch: durable sessions. Add /save <filename> and /load <filename> slash commands that persist the messages list to disk and read it back. Then, for the production version of the same idea, write to a .tmp file and os.replace() it onto the final path so that a crash mid-save can never leave a half-written file. Compare your code to SessionManager.save in nanobot/nanobot/session/manager.py. While you are there, notice that nanobot writes one JSON object per line (JSONL) rather than one big JSON array, and think about why that matters when sessions get long.
References
[1] Using the Messages API — Multiple conversational turns. Claude API documentation. https://platform.claude.com/docs/en/build-with-claude/working-with-messages
[2] Messages — API reference. Claude API documentation. https://platform.claude.com/docs/en/api/messages
[3] PEP 257 — Docstring Conventions. https://peps.python.org/pep-0257/
[4] Sphinx — Python Documentation Generator. https://www.sphinx-doc.org/
[5] Google Python Style Guide — Type Annotations. https://google.github.io/styleguide/pyguide.html#319-type-annotations
[6] __main__ — Top-level code environment. Python documentation. https://docs.python.org/3/library/__main__.html
[7] JSON Lines. https://jsonlines.org/