I've been playing with these genAI systems both as code producers and as helpers on various tasks.
Overall, I find the models quite brittle unless they are fine-tuned on the precise task you want.
The main problem I see is that the tool is fundamentally string in, string out. And without proper fine-tuning, those strings could be absolutely anything, including completely insane things.
Today, I am writing a simple automatic typo correction tool. The difficult bit is making sure the tool didn't crap out. I mean, it is easy to check that you actually got an answer from the tool. The problem is that sometimes the tool will tell you: "Sure, I can fix typos. Here is your text corrected: ". And so you probably have to toss that output out. But how do you figure out that it shat the bed? Well, you can't really; in some cases that is just as hard as the original task. So you bake in various heuristics, or you get a different LLM to check the work of the first one.
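To make "heuristics" concrete, here is a minimal sketch in Python of the kind of sanity checks I mean. The preamble phrases, the length ratio, and the line-count check are all assumptions on my part about what a reasonable typo fix looks like; they are illustrative, not a recipe.

```python
import re

# Hypothetical heuristics for deciding whether an LLM's "typo-corrected" output
# is usable at all. Thresholds and phrases are illustrative assumptions.

PREAMBLE_PATTERNS = [
    re.compile(r"^\s*(sure|certainly|of course|here is|here's)\b", re.IGNORECASE),
    re.compile(r"corrected (text|version)\s*:", re.IGNORECASE),
]

def looks_sane(original: str, corrected: str) -> bool:
    """Return False if the output smells like chatter instead of a correction."""
    if not corrected.strip():
        return False
    # Reject obvious conversational preambles ("Sure, I can fix typos...").
    for pattern in PREAMBLE_PATTERNS:
        if pattern.search(corrected):
            return False
    # A typo fix should not change the length of the text dramatically.
    ratio = len(corrected) / max(len(original), 1)
    if not 0.8 <= ratio <= 1.25:
        return False
    # A typo fix should preserve the line structure of the input.
    if corrected.count("\n") != original.count("\n"):
        return False
    return True

if __name__ == "__main__":
    original = "Thiss sentence has a typo."
    good = "This sentence has a typo."
    bad = "Sure, I can fix typos. Here is your text corrected: This sentence has a typo."
    print(looks_sane(original, good))  # True
    print(looks_sane(original, bad))   # False
```

None of this proves the correction is right, of course; it only catches the most obvious failure modes, which is exactly the point.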
At the end of the day, you really can't trust anything these tools do. They are way too erratic and unpredictable, and you should treat any of them as possibly adversarial. It's exhausting to use, really.