The way some bank do it, is that the authification asker (a 2F-protected service provider) sends a signed/encrypted message, that the security token decodes/verifies/displays. That message can't be tampered with (cryptography).
So the token will display the message (something like "Authentication required to access GMail.com").
so if an attacker tries to intercept your credential by opening an actual google page in the background, you'll notice that what the thing pretends to be on screen and what the dongle register as an asker aren't the same.
The way to fool the user would be to try to look actually like the page you're trying to spoof. So an attacker needs to look like GMail, so the user thinks he's on Gmail, whereas actually it's a malware page maskarading as it and relying security tokens from the real Gmail.
Now the way that banks counter-act that, is that any critical action (payment, etc.) needs to be confirmed again by the security token system. So the theoretic man-in-the-middle can't inject payment for 10'000$ for his Cayman Islands account. Because every payment needs to be confirmed again. And the bank will issue confirmation message regarding transaction.
You'll notice if when paying a phone bill, the confirmation message instead is 10'000$ for Cayman Islands.
Overall, it works as if the security token is its very own separate device, designed to work over non-reliable non-trusty channel.
(The device doesn't implement a full TCP/IP stack. Most example device accepts only:
- a string of caracters as an input (i.e.: you need to type the last five digit of the account you need to send funds too. The bank will notice when you type the digit of your utility company, but the man-in-the-middle has tried to inject a cayman island account from your browser).
- a 2D flashing barcode to automate string input.
- for the most crazy solution: writing a string to file on a flash-disk, this flashdisk is shared with the security token's microcontroller.
Each time, the attack surface is very small. Only a short string of data is passed. You can't get much exploitable bugs.
For the output, only a string again:
- that you read and type from the token's screen.
- that the token can type on your behalf, communicating with a HID chip on the same device.
- the token can send it to a flash device that makes it visible inside a file.
Again, the security token it self is limited to send just a string. Very small attack surface. All the funny "stuff" are implemented outside, and thus very low risk of remote exploitability)