
Security of AI-generated code: Thoughts on a process

I've been experimenting with AI code generation for a side project written in Golang. The project has been implemented by Opus 4.6 (Claude Code) under my direction. This is the first time I've used Golang, so I'm pretty slow and can't scrutinise the output as thoroughly as I could with PHP.

I've been thinking a lot about security. Are there processes we can follow to reduce risk, when working with machine-generated code? I think so. My high-level process has been to:

  • Have a discussion with the model about a feature or a change, to identify a good approach. It often comes up with better ideas or refinements.
  • Explicitly ask for an implementation plan, causing the model to break up the problem into a structured series of small steps, which I sanity check (read) and adjust.
  • Ask the model to implement the plan (if it is complex, perhaps one phase at a time). You need to read the output as sometimes the model will get a bit lazy and take shortcuts, or notice an issue but declare it "too minor" to warrant fixing.
  • Manually check that the change functions as expected.
  • Explicitly ask the model to review the changes and evaluate if it is a robust solution (repeat if necessary).

This process works well for two reasons. Firstly, it breaks up the work into small, carefully scoped chunks that fit within the model's context window, keeping it focussed. Secondly, the review aspects (the manual check, and the instruction to critically review the work) remove a lot of bugs, so you maintain a solid foundation to work from. Most of the time Opus will find a few bugs in its implementation if you ask it to check, and it may take two or three rounds before it stops finding problems.

Small iterative code reviews

Conducting code reviews on each chunk of work is essential. Opus makes a lot of mistakes, and it's a lot better at finding problems when working within its context window. Given a few files to review there's a good chance it will find most of the bugs. But if you wait until the end of the project to clean up the whole code base, it's going to miss a huge amount of stuff, and you'll have a huge mess on your hands.

I did find quite a few situations where experience led me to reject a design that was technically correct but would have unintended consequences. For example, Opus built a blazingly fast thumbnail generation and caching system. Worked great! But instead of deriving image width from values hard-coded into the template (my specification), it deviated from the plan and exposed a public endpoint where you could pass in the desired thumbnail size as a parameter. This meant that an attacker could DoS the server by asking for a few thousand different size variants of each image.
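
As an illustration of the safer shape, here is a minimal Go sketch (not my project's actual code) in which thumbnail widths come from a small hard-coded whitelist mirroring the template values, so the set of cacheable variants stays bounded no matter what a client requests. The route, size names and widths are hypothetical.

    package main

    import (
        "fmt"
        "net/http"
    )

    // The only widths the templates actually use (hypothetical values).
    var allowedSizes = map[string]int{
        "thumb": 200,
        "card":  400,
        "hero":  1200,
    }

    func thumbnailHandler(w http.ResponseWriter, r *http.Request) {
        // Clients name a size; they never supply a free-form pixel value.
        width, ok := allowedSizes[r.URL.Query().Get("size")]
        if !ok {
            http.Error(w, "unknown thumbnail size", http.StatusBadRequest)
            return
        }
        // The real handler would resize and cache here; this sketch just
        // reports the decision.
        fmt.Fprintf(w, "would serve %s at %d px wide\n", r.URL.Query().Get("img"), width)
    }

    func main() {
        http.HandleFunc("/thumb", thumbnailHandler)
        http.ListenAndServe(":8080", nil)
    }

The exact mapping doesn't matter; the point is that the client never gets to choose an arbitrary number.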

Cross-model reviews

I only have one paid subscription, but I've found some value in asking other models (e.g. Gemini) to look at particular issues or files. This provides a bit of additional coverage, but there seem to be diminishing returns once a file has already been reviewed by Opus, as it seems a lot stronger than the competition.

Local LLM reviews

I hooked up a couple of local LLMs (Nemotron, Gemma) using LM Studio's API server and used these to run additional reviews of my code base. The models are good, but their usefulness is limited by heavy hardware requirements, slow processing speed and small context windows.

My AMD 5950X has 32 GB of RAM and a 9070XT GPU, and it's barely enough to run a mid-tier local model. Processing a single file can take quite a few minutes and a code base can take all day. But the main issue is that the local models can only really analyse one or two files at a time.

Operations that require work across the code base will generally fail due to the limited context window. So I wrote a Python script that walks a project directory and passes files individually to the API server for review. Each file is passed in a fresh context window and the model's response is appended to a report file.
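
My script was Python, but the idea is simple enough to sketch in Go, the project's own language. This is only an illustration under a few assumptions: LM Studio's default OpenAI-compatible endpoint on localhost:1234, and placeholder values for the model name, prompt and report path.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "io"
        "io/fs"
        "net/http"
        "os"
        "path/filepath"
        "strings"
    )

    const endpoint = "http://localhost:1234/v1/chat/completions" // LM Studio default

    // reviewFile sends one source file to the local model and returns its review.
    func reviewFile(path string) (string, error) {
        src, err := os.ReadFile(path)
        if err != nil {
            return "", err
        }
        body, _ := json.Marshal(map[string]any{
            "model": "local-model", // whatever model is loaded in LM Studio
            "messages": []map[string]string{
                {"role": "system", "content": "You are a security-focused code reviewer."},
                {"role": "user", "content": "Review this file for bugs and security issues:\n\n" + string(src)},
            },
        })
        resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        raw, _ := io.ReadAll(resp.Body)
        var parsed struct {
            Choices []struct {
                Message struct{ Content string } `json:"message"`
            } `json:"choices"`
        }
        if err := json.Unmarshal(raw, &parsed); err != nil || len(parsed.Choices) == 0 {
            return "", fmt.Errorf("unexpected response for %s", path)
        }
        return parsed.Choices[0].Message.Content, nil
    }

    func main() {
        report, _ := os.Create("review-report.txt")
        defer report.Close()
        // Walk the project and review each source file in its own request,
        // i.e. a fresh context window, appending the result to the report.
        filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
            if err != nil || d.IsDir() || !strings.HasSuffix(path, ".go") {
                return err
            }
            review, rerr := reviewFile(path)
            if rerr != nil {
                review = "review failed: " + rerr.Error()
            }
            fmt.Fprintf(report, "=== %s ===\n\n%s\n\n", path, review)
            return nil
        })
    }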

The report was certainly thorough in terms of flagging potential issues, but the findings were entirely false positives. When the model can only review one file at a time, it doesn't have any context about that file's interaction with other parts of the system. For example, it will flag input as 'unvalidated' all over the place because it is not aware of the upstream validation that lives in another file. Out of curiosity, I fed the local LLM report back into Opus file by file and asked it to review the findings, and it also rejected them.

The local LLMs didn't catch any problems that Opus had missed. So: Useful for working with small projects and scripts, but not practical for anything larger, and not a substitute for a strong coding model like Opus.

Test driven development

Test driven development may also be useful. To be honest I never bother with this in my own projects because it's way too painful. But the tests form a specification, and for the AI, writing the tests is also an extension of the plan. Tests provide the model with context in small structured chunks, which is great for keeping it on track, and they provide a benchmark for success.
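
As a rough illustration of tests-as-specification, here is a Go table test for a hypothetical Slugify helper. In practice you would agree the table of cases first and then ask the model to make them pass; a minimal implementation is included only so the example compiles on its own.

    package util

    import (
        "regexp"
        "strings"
        "testing"
    )

    // Slugify is the function under specification (a stand-in example).
    func Slugify(s string) string {
        s = strings.ToLower(s)
        s = regexp.MustCompile(`[^a-z0-9]+`).ReplaceAllString(s, "-")
        return strings.Trim(s, "-")
    }

    // The table below is the specification: each row is a small, checkable
    // chunk of behaviour the model has to satisfy.
    func TestSlugify(t *testing.T) {
        cases := []struct{ name, in, want string }{
            {"lowercases", "Hello World", "hello-world"},
            {"strips punctuation", "a/b?c", "a-b-c"},
            {"collapses whitespace and dashes", "a  -  b", "a-b"},
        }
        for _, c := range cases {
            t.Run(c.name, func(t *testing.T) {
                if got := Slugify(c.in); got != c.want {
                    t.Errorf("Slugify(%q) = %q, want %q", c.in, got, c.want)
                }
            })
        }
    }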

Final security sweep

Lastly, despite my previous comments, I have experimented with using Claude Code to sweep an entire hand-crafted code base: Tuskfish. I have invested a lot of time and effort over the years in making Tuskfish as secure as I can (acknowledging the limits of my ability). So I was curious to see if the AI would find anything that I missed.

I ran the sweep twice, once with Sonnet 4.5 and a few months later with Opus 4.6. The good news is nothing serious was found, but the results were quite different. After explaining the false positives and a few design tradeoffs, Sonnet basically found nothing and wrote a glowing, somewhat sycophantic report. Opus provided more sober feedback, with a couple of issues it classed as 'medium' (nothing exploitable, more like things you could tighten) and a bunch of 'low' issues that are trivialities. But most of these were things I would never have spotted myself.

Conclusion

I've thrown everything I have at my side project's AI-generated code. As far as I can tell, if you work in small scoped chunks and test/review them incrementally, you can get a solid result. However, if you go full YOLO and blast out code without rigorous incremental planning and review, I expect your project will be riddled with bugs, and very difficult to clean up.

Which makes me wonder about the people that are running fleets of agents in parallel 24/7. Leaving aside the question of how you can queue up work faster than the AIs can do it, if nobody is looking at the code or keeping an eye on what the models are actually doing, I don't see how that ends well. I suppose with really good job specifications and dedicated AI review processes refined over time you could get reasonable quality output that mostly works. But I wouldn't want to be running that code in a hostile environment. Or using it on anything important, like flying a plane.
