You assume that a standards document exists and is also sufficiently specific for all scenarios. Other than some very fundamental IETF stuff have I seen a standards document that pretty much covers the scope specifically. Even more severely, "specifications" for an internal project have been so traditionally bad, a whole methodology cropped up basically saying that getting specifications that specifically correct is a waste of time because during the coding it will turn out to not be workable.
Yes, it can write hundreds of tests, but if the same mediocre engine that can't code it right is also generating tests, the tests will be mediocre. Leading to bizarre things like a test case to make sure '1234' comes back as 'abcd' and the function just always returns the fixed string 'abcd' and passes the test because it decided to make a test and pass it instead of trying to implement the logic. I have seen people almost superstitiously add to a prompt "and test everything to make sure it's correct" and declare "that'll fix the problems". The superstitious prompting is a big problem in my mind, that people think they add a magic phrase and suddenly the LLM won't make the mistakes LLMs tend to make. I have seen people take an LLM at their word when the LLM "promises" to not make a specific mistake, and then confounded the first time they hit the LLM making the mistake anyway. "It specifically said it wouldn't do that!", it doesn't understand promises, the thing just will generate the 'consistent' followup to a demand for a promise which is text indicating making the promise.
Take the experiment where they took Opus 4.6 and made it produce a C compiler. To do so, the guy at Anthropic said point blank he had to invest a great deal of effort in a test harness, that the process needed an already working gcc to use as a reference on top of that, and specified the end game as a bootable, compiled kernel. Even then he had to intervene to fix it and it couldn't do the whole thing and when people reviewed the published result, it failed to compile other valid code and managed to compile things that shouldn't have been compilable. This is Anthropic with their best model doing a silly stunt to create a knock off of an existing open source project with full access to said project and source code and *still* it being a lot of human work for mediocre output.
Yes, it has utility, but there's a lot of people overestimating capabilities and underestimating risks and it's hard for the non-technical decision makers to tell the difference until much further down the line. Mileage varies greatly depending on the nature of the task at hand as to whether LLM is barely useful at all or it can credibly almost generate the whole thing.