With so little detail, it's hard to say for sure, but it sounds like you were likely wrong.
First, the fact that something is flushed from a register is NO guarantee that it will be visible to other processors at any particular time in the future. Sure, most architectures and implementations with most workloads will move it "fairly quickly" from caches to 'main' memory shared by all processors. But "fairly quickly" is undefined -- one instruction? Two? 20? Six clock cycles? Seven? Worse, once the register is actually flushed to main memory by the processor doing the write, there's no guarantee that other processors will see the updated value when accessing that physical address at any particular time in the future -- the 'stale' version of the data may be sitting in an L1 or L2 cache for an arbitrary period of time. That delay is usually even longer and more unpredictable, and, in my experience, it's the more common source of concurrency problems in code written by amateurs who have never carefully read the memory-model section of even one hardware architecture spec.
Second, C makes no guarantees about when a register is flushed to the L1/L2/L? cache or memory -- or even about what a 'register' is on a particular architecture. A particular implementation on a particular architecture at a particular optimization level may do so when you expect -- but it may also do something quite different.
When you're programming in a high-level language, you should adhere to the rules of that language and not make assumptions about how the compiler, runtime, and underlying architecture work. You have no idea when a new version of the compiler, a new version of the runtime, or a new revision of the hardware will be used with your code and invalidate your assumptions -- and that's not even considering a port to a different architecture in the future.
The fact that an OS vendor coded something a particular way does not mean it's a good idea for you to do the same. They know what hardware their OS targets (including multiple architectures, each of which will likely have different implementations of key concurrency-control primitives) and will patch their code if needed. As well, in Microsoft's case, both the microprocessor vendors and (to a lesser extent) the system builders test their implementations on Windows before alpha release, so if there's a discrepancy, either the hardware will likely be changed or Microsoft will issue a bug fix for that hardware. Note that the underlying OS almost certainly contains all sorts of "hacks" to work around errata in specific steppings of Intel microprocessors. Depending on the OS, just stepping through the assembly code for a library or system call may be especially misleading unless you have verified that that's really the code that runs on EVERY implementation of the hardware, OS, etc.
Of course, if you know that your code will never need to run on another compiler, runtime, architecture, implementation of that architecture (including clock rates and memory speeds), or even stepping of a microprocessor, and you understand all the layers well (perhaps implemented them yourself), have at it -- but please, do us all a favor and just write your code in assembly, so those who follow you can see what you did rather than be left guessing what you assumed about the underlying infrastructure.
For the rest of us who know things change: when writing in a high-level language, be professional, follow the specs, and use the available OS libraries for synchronization where they exist. I've had to implement my own in the long-distant past (including replacing a horrible implementation of CriticalSections back in the early days of Windows NT), but those days are mostly in the distant rearview mirror now that most everything has been multicore for at least 15 years.
If you're even thinking about 'registers' when considering correctness of concurrency when programming in a high level language, you are almost certainly doing something wrong.