These aren't placeholders. The video I've seen shows a corrupted image arguably from another user's camera, and not something you'd show as a placeholder.
Xiaomi's explanation doesn't make much sense given the symptoms -- why would caching produce issues only when the network is problematic, and then return corrupted images? (Usually, when I see "caching," I think: You messed up your CDN configuration and are allowing the CDN to cache private data, but that isn't the case here.)
My best guess given the symptoms is they weren't checking for an error condition on the server. Basically, the server expects the camera to upload a still image; they allocate a buffer based on the Content-Length header, but due to the aforementioned network issues the connection dies during the upload, leaving the upload buffer only partly initialized. The beginning of the buffer has some uploaded data, but the remainder is whatever was in memory before. Because they were ignoring the error condition, the whole buffer would be stored as the image you uploaded.
If access patterns are very predictable (e.g., this server is only used to handle uploaded images, and the allocator hands back the same pointer it returned for the previous request), you could very well see the image that was previously uploaded. This assumes they're using C or C++ (or features of another language that allow uninitialized memory access) on the server; that's not exactly a common way of coding web servers today, but it's not infeasible.
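To make the guess concrete, here's a minimal sketch in C of the kind of bug I mean. Everything in it is hypothetical (acquire_buffer, read_body, store_upload_buggy, store_upload_checked, the 64-byte slot); I have no idea what Xiaomi's server code looks like. Reading truly uninitialized heap memory is undefined behavior, so the sketch stands in for it with a static buffer that is reused and never cleared, which behaves the same way for the purposes of the argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-slot buffer pool: every request gets the same memory
   back and it is never cleared. This stands in for an allocator that keeps
   returning the same pointer for same-sized uploads. */
#define SLOT_SIZE 64
static unsigned char slot[SLOT_SIZE];

static unsigned char *acquire_buffer(size_t len) {
    assert(len <= SLOT_SIZE);
    return slot; /* contents of the previous upload are still in here */
}

/* Simulated socket read: the client promised more, but only `delivered`
   bytes arrived before the connection died. Returns the bytes copied. */
static size_t read_body(unsigned char *dst, const char *src, size_t delivered) {
    memcpy(dst, src, delivered);
    return delivered;
}

/* Buggy handler: sizes the buffer from Content-Length, ignores the short
   read, and stores the whole buffer as the uploaded image. */
static size_t store_upload_buggy(const char *body, size_t content_length,
                                 size_t delivered, unsigned char *stored) {
    unsigned char *buf = acquire_buffer(content_length);
    (void)read_body(buf, body, delivered); /* return value ignored: the bug */
    memcpy(stored, buf, content_length);
    return content_length;
}

/* Checked handler: rejects the upload when fewer bytes arrived than the
   Content-Length header promised. Returns 0 on failure. */
static size_t store_upload_checked(const char *body, size_t content_length,
                                   size_t delivered, unsigned char *stored) {
    unsigned char *buf = acquire_buffer(content_length);
    if (read_body(buf, body, delivered) != content_length) {
        return 0; /* truncated upload: store nothing */
    }
    memcpy(stored, buf, content_length);
    return content_length;
}
```

If camera A uploads 16 bytes successfully and camera B's 16-byte upload then dies after 4 bytes, store_upload_buggy stores B's first 4 bytes followed by 12 bytes of A's image, while store_upload_checked stores nothing.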