I feel there’s definitely a blog post in here. I don’t want to get distracted from my testing series, but my head’s just churning and I need to write stuff, so maybe I’ll just sketch a story until I can turn it into something more coherent.

I’ve been working on this small program to convert close-scored choral music into separate rehearsal tracks. I currently do it by hand in MuseScore, but…come on. I’m a programmer. I build stuff.

But really I wouldn’t build this. It’s too much trouble. There are so many little headaches dealing with music notation formats and, seriously, it would save like 10 minutes at best every so often if I could get it working perfectly, and way less if there are any manual steps. So, you know, it goes on the shelf?

But then I’m thinking that AI can help me do it quick. And then I’ll have a thing that I literally wouldn’t have built otherwise. And that’s a really great story. I’ve built a lot of small AI things at work. They’ve all been net-negative in terms of productivity, but I can see how we’d get to a worthwhile system. Every technology I’ve ever learned has been net-negative for a while. I’m used to that. Learning curves. And the tooling around AI is still in its infancy.

This just feels like a great use case. I don’t really know the format. I don’t know the frameworks. It’s going to be Python, which I kind of know, but it’s not my strength. And Claude 4 just came out. And I have Cline. And I pull out my wallet to buy some credits and let’s do this thing.

And it starts so well. So well. Like magic. It’s amazing. Cline is designing things to solve my problem and all I need to do is look over some code. I’m not quite “vibe coding,” but I’m not writing anything, and often I just let it go and review it when it finishes a phase. I design a whole project plan for it and it’s chewing through it.

Now Cline has some really annoying problems. And one of them is that it is incredibly inefficient cost-wise, because it constantly rewrites files from scratch rather than editing them. I finally get sick of burning cash and switch to Roo. And it’s like magic riding a unicorn. It is so much better. I am blown away. It is designing things and coding things and debugging things, and it’s amazing and glorious.

And I start to wonder… should this be glorious? It doesn’t feel like that hard a problem. I mean, it has a bunch of little corner cases, but, still. It’s not that hard. But Roo has found a bug and is running it to ground with test after test after test. It’s tweaking things. It’s getting more tests to pass.

But wait a minute. What is this “more tests to pass”? It’s using passing tests as a measure of success. Moving from 2% passing to 10% passing makes it think it’s on the right path. But I realize it’s getting there by building one work-around after another, trying to make progress. Any time a method doesn’t return what it expects, it adds another fallback, trying approach after approach during every run, with a comment indicating “sometimes the framework doesn’t return the expected value.” It’s not experimenting and then using its experiment to drive the software. It’s putting everything it thinks might work in the function, bracketed with else and catch, and hoping. At one point it literally tries to write a function fix_measure_29() where it just adjusts everything in one specific measure of one specific song so the test will pass. I reject that change.
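A rough sketch of that pattern, and of the alternative, might look like this. Nothing here comes from the actual project: `FakeMeasure`, `duration_with_fallbacks`, and `duration_strict` are made-up names, and the "framework" is just a stand-in dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMeasure:
    """Hypothetical stand-in for a framework object; not a real library type."""
    number: int
    duration: Optional[float] = None  # None simulates an unexpected return value


def duration_with_fallbacks(measure: FakeMeasure) -> float:
    """The fallback style: keep guessing until something comes back."""
    try:
        if measure.duration is not None:
            return measure.duration
        else:
            # "sometimes the framework doesn't return the expected value"
            return 4.0  # silently assume a full bar of four beats
    except Exception:
        return 0.0  # swallow everything, so nothing ever fails loudly


def duration_strict(measure: FakeMeasure) -> float:
    """The alternative: fail with a clear error until the cause is understood."""
    if measure.duration is None:
        raise ValueError(f"measure {measure.number} has no duration")
    return measure.duration
```

The first version always returns a number, so a test that only checks "did I get a duration" passes while the real problem stays hidden. The second one stops at the first surprise, which is annoying but tells you where to look.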

At this point I’ve been working with it for around 2-2.5 days. I’m about $90 deep into tokens. We got to 80%-ish functionality really fast, but then stalled. The architecture is such a mess that every improvement takes forever and breaks other things. Claude constantly tells me it “works perfectly” when it has exactly the same bug as before. It constantly reaches for private APIs and internal properties. And it’s built so many fallbacks that nothing ever fails with a clear error. It’s built confidence values for each computation, with configurable options to let you override its heuristics. I told it the exact layout of the voice parts (soprano is part 1, voice 1; alto is part 1, voice 2; etc.). Even so, Claude is doing analysis on the note ranges to decide what notes go to what part. That’s a problem, because in this song the soprano part goes well into the tenor range. I didn’t ask for any of that.
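To show why the range analysis goes wrong, here is a small sketch that puts a fixed layout lookup next to a pitch-range guess. It is only an illustration: the function names and the MIDI cutoffs in `voice_by_range` are invented, and the table lists only the two voices spelled out above rather than guessing at the rest.

```python
# Layout as stated: soprano is part 1, voice 1; alto is part 1, voice 2.
# The remaining voices are left out here rather than guessed.
VOICE_LAYOUT = {
    (1, 1): "soprano",
    (1, 2): "alto",
}


def voice_by_layout(part: int, voice: int) -> str:
    """Look the voice up from the known layout; fail clearly if it is missing."""
    try:
        return VOICE_LAYOUT[(part, voice)]
    except KeyError:
        raise KeyError(f"no voice configured for part {part}, voice {voice}") from None


def voice_by_range(midi_pitch: int) -> str:
    """Hypothetical range heuristic with made-up cutoffs, for contrast only."""
    if midi_pitch >= 67:
        return "soprano"
    if midi_pitch >= 60:
        return "alto"
    if midi_pitch >= 53:
        return "tenor"
    return "bass"


# A low note (MIDI 57, the A below middle C) written in part 1, voice 1.
low_soprano_note = {"part": 1, "voice": 1, "midi_pitch": 57}

print(voice_by_layout(low_soprano_note["part"], low_soprano_note["voice"]))
print(voice_by_range(low_soprano_note["midi_pitch"]))
```

The lookup says soprano because that is where the note is written. The heuristic says tenor because of how low the note is, which is exactly the kind of wrong answer a soprano line dipping into tenor range produces.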

The program has grown first to about 800 lines and now over 2800 lines. There’s a ContextualUnifier, a SimpleSlurFixer, a StaffSimplifier, a VoiceAwareSlurFixer, a DeterministicVoiceIdentifier, a VoiceRemover, the list goes on. At first, I was amazed at all of the deep design documentation it was writing. There’s about 6000 words of design specs, bug analysis, project plans. It’s really quite amazing. But the program doesn’t work and I’m burning a lot of money.

This morning, I thought I’d try something different. I started reading the docs for the library I’m using. Just…reading them. All of them. And then I kind of thought about the problem for a while. I asked Roo to scaffold me out a Python command-line tool, and write some loading logic. I started directing it to build a really simple app based on what I knew about how the framework is supposed to be used. It overcomplicated things, and I had to tell it to stop, and then just delete the code. But I got something working in a really simple way in maybe half an hour for about $1 worth of tokens. I used ChatGPT ($20/month Plus plan) to write a couple of small functions where I couldn’t remember the syntax in Python.

Eventually I just started coding things by hand. I probably spent 2-3 hours on it today. It’s 162 lines of Python so far, which feels like the right kind of size. It has a few things more I need to fix, but it mostly works, and I mostly understand the bits that don’t.

And I have no idea what lesson you’re supposed to take away from this little story. I know many will say “duh, magic autocomplete sux!” And if that’s your opinion, that’s cool, and I really don’t want to argue with you about it. But I don’t think it’s the right lesson for me. The right lesson for me is that AI is a weird tool and it’s changing incredibly quickly, and right now I don’t think we have any idea when to use it or how to use it. Three months ago, I didn’t see much promise at all. Then I played with my first agentic system and it changed my mind. Even in the total mess Claude made, I saw how it could work. I can see where the improvements can happen.

I’ve seen so many semi-programmers and even non-programmers build automation tools for themselves to make their lives easier. That’s real stuff. I want that for them. I may never write a shell script by hand again. AI is just too good at it and the stakes are so low. AI coding will matter, and I think in the end it’ll be a lot like high-level languages. They made more things possible and so we needed more programmers.

But today, “I don’t need to read the docs; I have an AI” may not work out as smoothly as you’re hoping.