Does the Git usage of SHA-1 *really* cause silent problems? I'm not sure how Git works internally but I was under the impression that it hashes whole objects, like individual source files at least.
The individual objects inside git aren't just files.
The objects that matter here are commits: each one records a snapshot of the source tree, plus some metadata such as pointers to the parent commits it builds on.
To make things easier, a handy number designates each commit. This is currently generated by SHA-1.
(Git is a content-addressable store. You don't access objects by name, you access them by their content. But instead of using the whole content to look them up, you use addresses generated by SHA-1.
So to say which parent commits a commit builds on, you just mention them by the SHA-1 sum of their content.)
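To make the "address generated from content" idea concrete, here is a minimal Python sketch of how git derives the ID of a file's content (a "blob" object): it hashes a short header (`blob`, the byte length, a NUL byte) followed by the content itself. The function name `git_blob_id` is made up for this illustration.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes, not the bytes alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello world\n"))
```

You can compare the result with `echo 'hello world' | git hash-object --stdin`. Same bytes in, same ID out, which is exactly why two different contents with the same SHA-1 would be indistinguishable to git.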
A theoretical attack would be:
- Try to generate 2 commits:
one adds a clean piece of code, the other adds a backdoored piece of code,
but both commits hash to the same SHA-1, so git would consider them "the same content".
- Then try to force your target to re-download the whole repo from scratch from your backdoored history (otherwise git will simply ignore any commit whose SHA-1 sum it already has, because it thinks it has that content already).
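The "git ignores what it already has" part can be sketched with a toy content-addressed store in Python. This is not git's code: `ToyStore`, `weak_hash` and `find_collision` are hypothetical names, and the hash is deliberately crippled (only 2 hex digits of SHA-1) so that a collision can be brute-forced in an instant, which is not possible with the real thing.

```python
import hashlib

def weak_hash(content: bytes) -> str:
    # Deliberately weak stand-in: only 256 possible values, so collisions are trivial.
    return hashlib.sha1(content).hexdigest()[:2]

class ToyStore:
    """Toy content-addressed store: the key is derived from the content."""
    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        key = weak_hash(content)
        # If the key already exists, the incoming content is silently ignored.
        self.objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]

def find_collision(target: bytes, prefix: bytes) -> bytes:
    """Brute-force a different content with the same weak hash as target."""
    wanted = weak_hash(target)
    counter = 0
    while True:
        candidate = prefix + str(counter).encode("ascii")
        if candidate != target and weak_hash(candidate) == wanted:
            return candidate
        counter += 1

store = ToyStore()
clean = b"print('hello')"
key = store.put(clean)
evil = find_collision(clean, b"backdoor ")
store.put(evil)                   # same key, so the store keeps the clean version
print(store.get(key) == clean)
```

Whoever gets their version into the store first wins, which is why the attacker needs the target to fetch everything from scratch from the poisoned history.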
In practice it's currently not doable.
The only thing that Google managed to generate is a pair of block series. Each series contains what looks like completely random junk. Both series end up generating the exact same SHA-1 sum even though the junk is different.
- That is exploitable in a PDF (or any other binary format that supports scripting; you could even do it in an EXE): using the embedded scripting, present 2 different contents depending on which random junk is present.
- That is not exploitable in a source code commit: you would need a believable explanation for why the random junk is present in the patched source code.
AND you would need a piece of code which reacts differently (normal vs. backdoor) depending on which random junk is present. Pulling that off unnoticed would require "Underhanded C Contest"-level ingenuity.
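A rough Python sketch of the binary-format trick, to show why the same thing would stick out in source code. The two junk blocks here are plain stand-ins, NOT real colliding blocks, and all names (`JUNK_CLEAN`, `JUNK_EVIL`, `build_file`, `embedded_script`) are invented for the illustration.

```python
# Stand-ins for the two junk blocks. Real ones would have the same SHA-1; these do not.
JUNK_CLEAN = b"\x01" * 64
JUNK_EVIL = b"\x02" * 64

# Everything after the junk is byte-for-byte identical in both files.
SHARED_TAIL = b"<embedded script: pick what to display based on the junk above>"

def build_file(junk: bytes) -> bytes:
    return junk + SHARED_TAIL

def embedded_script(file_bytes: bytes) -> str:
    """What the script inside the file does: branch on which junk block is present."""
    junk = file_bytes[:64]
    if junk == JUNK_CLEAN:
        return "harmless content"
    return "malicious content"

print(embedded_script(build_file(JUNK_CLEAN)))
print(embedded_script(build_file(JUNK_EVIL)))
```

In a binary file nobody reads the 64 bytes of junk or the branching script. In a source commit, both the garbage and the `if` that inspects it sit right there in the diff.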
That's it, you only have blocks of random garbage.
Google currently can't produce colliding hashes from arbitrary pieces of data ("Hey Google: here is legit script A, and here is malicious script B. Add a small nonce at the end so they both end up having the same SHA-1 sum") ("Actually don't add a nonce, that would be too conspicuous, try to tweak the punctuation in the comments instead").
Also, as you mention, further edits will be problematic:
if I edit script A and submit a patch, this patch will be valid, but will completely fail on top of script B.