Skip to main content

Quality and security of AI-generated code: Thoughts on a process

I've been experimenting with AI code generation for a side project written in Golang. The project has been implemented by Opus 4.6 (Claude Code) under my direction. This is the first time I've used Golang so I'm pretty slow and can't scrutinise the output as thoroughly as I could PHP.

Are there processes we can follow to reduce security risks, when working with machine-generated code? I think so. My high-level process has been to:

  • Have a discussion with the model about a feature or a change, to identify a good approach. It often comes up with better ideas or refinements.
  • Explicitly ask for an implementation plan, causing the model to break up the problem into a structured series of small steps, which I sanity check (read) and adjust.
  • Ask the model to implement the plan (if it is complex, perhaps one phase at a time).
  • Manually review that the change functions as expected.
  • Critical: Explicitly ask the model to review the changes and evaluate if it is a robust solution (repeat if necessary).

This process works well for two reasons. Firstly, it breaks up the work into small, carefully scoped chunks that fit within the model's context window, keeping it focussed. Secondly, the review aspects (the manual check, and instruction to critically review the work) removes a lot of bugs, so you maintain a solid foundation to work from. Most of the time Opus will find a few bugs in its implementation, if you ask it to check, and it may take two or three rounds before it stops finding problems.

Before you start: Set guards

You can completely eliminate certain classes of security issue by setting explicit guards in your project CLAUDE.md or AGENTS.md instruction file. Want to eradicate SQL injection? Set an explicit requirement to use prepared statements with bound parameters for all queries. There are many others you can set requirements for: Protocols for safe handling of file uploads, XSS escaping, and so on. The model may have occasional lapses with issues that have hundreds of instances such as XSS, but enforcing prepared statements seems to be something it is particularly good at.

Small iterative code reviews

Conducting code reviews on each chunk of work is essential. Opus makes a lot of mistakes, and it's a lot better at finding problems when working within the first 50% of its context window. Given a few files to review there's a good chance it will find most of the bugs. But if you wait until the end of the project to clean up the whole code base, it's going to miss a huge amount of stuff, and you'll have a huge mess on your hands.

I did find quite a few situations where experience lead me to reject a design that was technically correct but would have unintended consequences. For example, Opus built a blazingly fast thumbnail generation and caching system. Worked great! But instead of deriving image width from values hard coded into the template (my specification), it deviated from the plan and made a public endpoint where you could pass in the desired thumbnail size as as parameter. This meant that an attacker could DOS the server by asking for a few thousand different size variants of each image, each of which would trivially crush the CPU.

Read the output!

You must read the output as the model works on the task. As the model chats away to itself you will find a lot of critical information:

  • Bugs it noticed, but decided not to address because they are "out of scope" for the current task, "pre-existing" or "too minor" to warrant fixing.
  • Inefficient code that it noticed, but decided not to optimise because it is "acceptable at the present scale".
  • Bad design it noticed but chose not to address because it represents a "design decision", "trade off" [insert excuse here] or was "deliberate".
  • Taking shortcuts and circumventing assigned work justified as quickly reviewing the "key files" (when you told it to review all files).

The bold terms are some of the keywords you will find in Claude Code output when it has done something lazy, unhelpful or shoddy. If you challenge these cases it will almost always acknowledge that it had been shirking and will go back and complete the task.

Cross-model reviews

I only have one paid subscription, but I've found some value in asking other models (eg. Gemini) to look at particular issues or files. This provides a bit of additional coverage, but there seems to be diminishing returns where a file has already been reviewed by Opus, as it seems a lot stronger than the competition.

Local LLM reviews: Good for individual files, useless for projects

I hooked up a couple of local LLMs (Nemotron, Gemma) using LM Studio's API server and used these to run additional reviews of my code base. The models are good, but their usefulness is limited by the heavy hardware requirements, slow processing speed and their limited context window.

My PC is an AMD 5950X with 32 GB of RAM and a 9070XT GPU, and it's barely enough to run a mid-tier local model. Processing a single file can take quite a few minutes and a code base can take all day. But the main issue is the local models can only really analyse one or two files at a time.

Operations that require work across the code base will generally fail due to the limited context window. So I wrote a python script that will walk a project directory and pass files individually to the API server for review. Each file is passed in a new context window and the model's response was appended to a report file.

The report was certainly thorough in terms of flagging potential issues, but they were entirely false positives. When the model can only review one file at a time, it doesn't have any context about its interaction with other parts of the system. For example, it will flag input as 'unvalidated' all over the place because it is not aware of the upstream validation that lives in another file. Out of curiosity, I fed the local LLM report back into Opus file by file and asked it to review the findings, and it also rejected them.

The local LLMs didn't catch any problems that Opus had missed. So: Useful for working with small projects and scripts, but not practical for anything larger, and not a substitute for a strong coding model like Opus.

Test driven development

Test driven development may also be useful. To be honest I never bother with this in my own (manual) projects because it's way to painful. But the tests form a specification, and for the AI writing the tests is also an extension of the plan. Tests provide the model with context in small structured chunks, which is great for keeping it on track, and provides a benchmark for success.

Final security sweep

Lastly, despite my previous comments, I have experimented with using Claude Code to sweep an entire hand-crafted code base: Tuskfish. I have invested a lot of time and effort over the years in making Tuskfish as secure as I can (acknowledging the limits of my ability). So I was curious to see if the AI would find anything that I missed.

I ran the sweep twice, once with Sonnet 4.5 and a few months later with Opus 4.6. The good news is nothing serious was found, but the results were quite different. After explaining the false positives and a few design tradeoffs, Sonnet basically found nothing and wrote a glowing, somewhat sycophantic report. Opus provided more sober feedback, with a couple of issues it classed as 'medium' (nothing exploitable, more like "things you could tighten") and a bunch of issues that were "low" which are trivialities. But most of these were things I would never have spotted myself.

As security sweeps must address different issues occurring across a range of subsystems, it's a good idea to get your AI to make a plan for the review - you can explicitly add issues and checks you want it to make - then execute it one subsystem or major issue at a time. This keeps the AI focussed, you will get better results.

Attacker vs defender: Managing the context window to maintain advantage

We now live in a world where both attackers and defenders can sweep an entire code base in minutes to hours. But AI review is not perfect, it misses things, especially when its context window starts to get full, which is inevitable with any signficant project.

Both attackers and defenders can use this to their advantage: Whoever does a better job at constraining the context window will find more bugs. So: both attackers and defenders should review code in small chunks, or narrow their focus to individual issues, iterating over multiple passes instead of dumping the whole code base on the model.

Conclusion

I've thrown everything I have at my side project's AI-generated code. As far as I can tell, if you work in small scoped chunks and test/review them incrementally, you can get a solid result. However, if you go the full YOLO and blast out code without rigorous incremental planning and review I expect your project will be riddled with bugs, and very difficult to clean up.

Don't buy the bullshit of the "AI bros": You can't fully automate

All of this makes me wonder about the people that claim to be running "dark factories" with fleets of agents building software in parallel 24/7 and producing tens of thousands of lines of code per day. Leaving aside the question of how you can queue up work faster than the AIs can do it, if nobody is looking at the code or keeping an eye on what the models are actually doing, I don't see how that ends well. The Dark Factory is a beautiful dream, but it doesn't match the observed reality. It smells like hype and bullshit to lure investors.

My own experience is that code generated by strong models is full of bugs. You never hear people talk about that on YouTube! Even with a small chunk of work, Opus will typically find two or three bugs, shortcomings or unaddressed edge cases if you ask it to review the output. It does seem to be able to clean most of them up with a bit of pushing, but not all. Sometimes it just makes a poor decision, or makes something that works great but is a DOS vector, or uses a pattern it found elsewhere in the code base which is not appropriate for the present task.

I suppose with really good job specifications including TDD and AI review processes that are refined over time you could get reasonable quality output that mostly works and meets your objectives, in a partly optimised way. But that absolutely does not scale because the cleanup will require manual review and assessment by the end user. Without it, that code is going to be as fragile as hell. There isĀ no way it will survive in a hostile environment.

The 80/20 rule applies, and the devil really is in the detail. What is the point in automating code generation at scale (20% of your time) if you don't have the resources to bring it up to production grade (80% of your time)? I thought I had "finished" my side project within about three weeks...then spent another six weeks polishing and testing it! That was for a project with about 20K lines of original code.

The AI is quite good at filling in little details you didn't ask for that add value, but there's so many things that need to be adjusted. There's no way to automate that, unless your projects are all the same. Or unless you're willing to lower your quality bar to 'slop'.

Seems like there has been a lot of magical thinking and people wanted to default to slop, but now that the AI Apocalypse has arrived, that is probably going to get fixed real quick. I am curious to see how the Dark Factory bros fare. Not well, I think. But hey, they'll snag plenty of bucks before moving onto the next bullshit story.

Copyright, all rights reserved.