Rob Napier - TIL:AI. Thoughts on AI

I use AI a lot for work, pretty much all day every day. I use coding assistants and custom agents I’ve built. I use AI to help code review changes, dig into bugs, and keep track of my projects. I’ve found lots of things it’s very helpful with, and lots of things it’s terrible at. If there’s one thing I have definitely learned: it does not work the way I imagined. And the more folks I talk with about it, the more I find it doesn’t work like they imagine, either.

This is a collection of various things I’ve learned about AI in the time I’ve spent working with it. It’s not exhaustive, and I expect to keep updating it from time to time as I learn more things and as things change.

An AI is not a computer. But an AI can use a computer. That is probably the most important lesson I’ve learned about these systems. A huge number of misconceptions about what AI is good for come from the assumption that it is a computer with a natural language interface. It absolutely is not that. It is a terrible computer in very much the same way that you are a terrible computer. It is pretty good at math, but it is not perfect at math in the way that a computer is. It is pretty good at remembering things, but it does not have perfect memory like a computer does.

AIs have a limited block of “context” that they can operate on. These range from a few tens of thousands of tokens up to around a million tokens. A token is a bit less than an English word worth of information, and a typical novel of a couple of hundred pages is on the order of 100,000 tokens. Even a moderately-sized project can be in the millions of tokens.And not all of the context is available for your task. Substantial parts of the context window may be devoted to one or more system prompts that instruct the model how to behave before your prompt even is looked at.

If you tell an AI “rename PersonRecord to Person everywhere in my codebase,” it sounds really straightforward. But an obvious way for the AI to do this includes reading all the files. That can overflow its context, and it may forget what it was working on. Even if it’s successful, it will be very slow. It’s very similar to asking an intern to print out all the files in the repository and go through each of them looking for PersonRecord and then retyping any file that needs changes. AI reads and writes very quickly, but not that quickly. They are not computers.

The better approach is to tell the AI “write a script to rename PersonRecord to Person, and then run that script.” This they can do very well, just like the intern. (I’m going to talk about interns a bit here.) Now it only needs to keep track of a small script, not every word of the entire repository. Scripts are fast and consistent. AIs are not. If you want an AI to use a computer, you often need to tell it to do so.

If you use a coding assistant, it may have a system prompt that tells the model about tools so you don’t have to. In Cline the majority of the 50KB prompt is devoted to explaining the available tools and when and how each should be used. Even though all the major models include extensive information about tools that exist, it is not obvious to AIs that they can or should use them. They have to be told. And in my experience, they often forget in the middle of a job. “Remember to use your tools” is a pretty common prompt.

Context windows are not like RAM. A common question is “can’t you just make the context window larger?” Basically, no. The size of the context window is set when the model is created. Essentially the model has a certain total size, and its size has a lot of impacts, both in cost to train and run, and even whether it works well or not. Making models bigger doesn’t always make them work better.

A part of that total size is the context window, the space where all the input text lives while the AI is working. This window isn’t some untrained chunk that can be expanded on the fly; it’s fully integrated into the model architecture, baked in during training. You can’t just bolt on more capacity like adding RAM or a bigger hard drive. And it’s a bit like human memory. Sometimes the AI forgets things or gets distracted, especially when you pack in too much unrelated stuff. Ideally, the AI should have just what it needs for the task, no more.

Context windows also include everything in kind of a big pile. When you upload a document to ChatGPT and type “proofread this,” there’s no deep separation between the document and the instruction. Even keeping things in order doesn’t come for free. It can be difficult for an AI to distinguish between which parts of the context it’s supposed to follow and which parts it’s supposed to just read. This allows prompt injection attacks, but even in more controlled settings, it can lead to unexpected behaviors.

Unlike SQL Injection, there’s no clear solution to this problem. You can add more structure to your prompts to make things clearer, but it’s a deep problem of how LLMs are designed today. Today the answer is mostly “guardrails,” which is basically “its secure, there’s a firewall” for AI. As a former telecom engineer and security assessor, this is the thing that makes me most ask “have we learned nothing?”

AIs do not learn. It is easy to imagine that a model like Claude is constantly adapting and learning from its many daily interactions, but this isn’t how AIs generally work. The model was frozen at the point it was created. That, plus its context, is all it has to work with. New information does not change the model day to day. And every time you start a new task, all of the previous context is generally lost. When you say “don’t do X anymore” and the AI responds “I’ll remember not to X in the future,” it’s critical to understand it has no built-in way to remember that.

In most systems, the only way for the AI to remember something is for it be written down somewhere and then read back into its context later. Think Leonard from Memento, and you’ll have the right idea. This leads to a bunch of memory tools that work in a variety of ways. It might be a human, writing new things into the system prompt. It might be a more persistent “session,” but most often it’s some external data store that the model can read, write, and search. It might be an advanced database, or they might just be a bunch of markdown files. But the key point is it’s outside the model. The model generally doesn’t change without a pretty big investment by the whoever maintains the model.

Even with the rise of memory systems, most interactions today have very limited memory. Systems like the Cline Memory Bank can work much better in theory than practice. It’s challenging to get AIs to update their memory without nagging them about it (kind of like getting people to write status reports). More advanced systems that provide backend databases don’t just drop in and work. You need to develop the agent to use them effectively. Even the most basic memory systems (long running sessions) require context management to keep things working smoothly. You should generally assume that tomorrow your AI will not remember today’s conversation well if at all.

AIs are like infinite interns. Rather than thinking of AIs as natural language interfaces to super-intelligent computers, which they are not, it can be helpful to think of them as an infinite pool of amazingly bright interns who all work remotely and you can assign any task for a week.

You can ask them to read things, write things in response to what they’ve read, write tools, run tools, do just about any remote-work task you like. But next week, this batch will be gone, and you’ll get a new batch of interns.

How should you manage them? You can ask them to read all your source code, but next week they’ll need to read it all again. You can ask them to read your source code and write explanations. That’s better. Then the next group can read the explanations rather than starting from scratch. But what if they misunderstand and write the wrong explanation? Then you’ve poisoned all your interns. They’ll all be confused. You need to read what they write and correct it. Better, maybe you should write the explanations in the first place. If you don’t know the system well enough to explain it to the interns, then you’re going to be in trouble. You’d better learn more. Maybe a bunch of interns could help you research?

You can assign them tasks, but remember, they’re interns. They’re really smart, but they’ve never really worked before. They know stuff, but not the stuff you need them to know. And they do not learn very well. How do you make them useful? You need to be pretty precise about what you want. They’re distractible. They don’t know how to coordinate their efforts, so even though you have infinite interns, there are only so many you can use together. It’s up to you to help them structure their work. Maybe you can train one of them to be in charge and organize (“orchestrate”) the others. Maybe you need someone to orchestrate the orchestrators. It’s starting to feel like system design.

Wait a minute. Wasn’t AI supposed to do all of this for me? Oh, sweet summer child. You thought that AI would mean less work? No. AI means leverage. You can get more out of your work, but you’ll work harder for it. You can get them to write you code, but you’ll spend that time writing more precise design specs. You can get them to write design specs, but you had better have your requirements nailed down perfectly. Leverage means you have to keep control. AI will take you very far, very fast. Make sure you’re pointed in exactly the right direction.

Reviewing AI code requires special care. When reviewing human-written code, we often look for certain markers that raise our trust in it. Is it thoroughly documented? Does it seem to be aware of common corner cases. Are there ample tests?

But AI is really good at writing extensive docs and tests. It takes a bit of study to realize that the docs are just restating the method signatures in more words, and the tests are so over-mocked that they don’t really test anything. And everything is so professional and exhaustive, it puts you in a mind that “whoever wrote this must have known what they’re doing.” And that has definitely bitten me. You can say “always be careful,” but when reviewing thousands of lines of code, you have to make choices about what you focus on.

It’s especially important because AI makes very different kinds of mistakes than humans make, and makes radically different kinds of mistakes than what you would expect given how meticulous the code looks. So knowing you’re looking at AI-generated code is an important part of reviewing it properly. AI is much more likely to do outrageous “return 4 here even though it’s wrong to make the tests pass” (and comment that they’re doing it!) than any human.

Conversely, AI is pretty good at reviewing code. I actually like it better as a code reviewer than a code writer, and I currently have it code review in parallel everything I review. It’s completely wrong about 30% of the time. And 50% of the time it doesn’t find anything I didn’t find. But about 20% of the time it finds things I missed, and that’s valuable. Just be careful about making it part of your automated process. That 30% “totally wrong and sometimes completely backwards” will lead junior developers astray. You need to be capable of judging its output.

I do find that AI writes small functions very well, and I use it for that a lot. I often build up algorithms piecemeal and by the time it’s done, it’s kind of a mess and I want to refactor it down into something simpler. One of my most common prompts is “simplify this code: …pasted function…” More than once in the process, it’s found corner cases I missed. And when it turns my 30-line function into 5 lines, it’s generally very easy to review.

AI is emergent behavior. Almost everything interesting about AI is due to behaviors no one programmed, and (at least today) no one understands. LLMs can do arithmetic, but they don’t do it perfectly, which surprises people. But what’s surprising is that they can do it at all. We didn’t “program” the models to do arithmetic. They just started doing it when they got to a certain size. When they “hallucinate,” it’s not because there’s a subroutine called make_stuff_up() that we could disable. All of these things are emergent behaviors that we don’t directly control, and neither does the AI.

We try, through prompting, to adjust the behaviors to align with what we want, but it’s not like programming a computer. “Prompt engineering” is mostly hunches today. Giving more precise prompts seems to help, but even the most detailed and exacting prompt may not ensure an AI does what you expect. See “infinite interns.” Or as Douglas Adams says, “a common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.”

AI does not understand itself. LLMs have no particular mechanism to self-inspect. It mostly knows how it works through its training set and sometimes through prompting. Humans also do not innately know much about the brain or how it works, but may have learned about it in school. Humans have no particular tool for inspecting what their brains are doing. An AI is in the same boat. When you ask an AI “why” it did something, it’s similar to asking a human. You may get a plausible story, but it may or may not be accurate. Sometimes there’s a clear line of reasoning, but sometimes there isn’t.

Similarly, asking an AI to improve its own prompts is a mixed bag. The only thing it really has to work with is suggestions that were part of its training or prompt. So at best they know what people told them would help, which mostly boils down to be “be structured and be precise,” which we hope will help, but doesn’t always.

AI is nondeterministic. Just because a prompt worked once does not mean it will work the same way a second time. Even with the same prompt, context and model, an LLM will generally produce different results. This is intentional. How random the results will be is a tunable property called temperature, so for activities that should have consistent behaviors, it can be helpful to reduce the temperature. Reducing the temperature prevents the model from straying as far from its training data, so setting it too low can make it unable to adapt to novel inputs.

But ultimately, if you need reliable, reproducible, testable behavior, AI alone is the wrong tool. You may be better off having an AI help you write a script, or to create a hybrid system where deterministic parts of the solution are handled by traditional automation, with an AI interpreting the results.

AI is changing quickly. I’ve tried to keep this series focused on things I think will be true for the foreseeable future. I’ve avoided talking about specific issues with specific tools, because the tools are changing at an incredible pace. That doesn’t mean they’re always getting better. Sometimes, they’re just different, and the trade-offs aren’t always clear.

But overall, things are getting better, and things that weren’t possible just a few months ago are now common practice. Agentic systems have been revolutionary, and I fully expect multi-agent systems to radically expand what’s possible. One-bit models may finally make it practical to run large models locally, which would also completely change the use-cases. I expect the landscape to be very different a year from now, and if AI does not solve a problem today, you should re-evaluate in six months. But I also expect a lot of things that I’ve said here to stay more or less the same.

AI is not a silver-bullet. It does not, and I expect will not, be an effective drop-in replacement for people. It’s leverage. Today, in my experience, it’s challenging to get easy productivity gains from it because it’s hard to harness all that leverage. I expect that to eventually change as the tools improve and we learn to use them better. But anyone expecting to just “add AI” and make a problem go away today will quickly find they have two problems.