binaural = stereo
Actually, in the audio world "binaural" specifically means a recording intended to be played directly into the ears.
I was once present for a binaural recording session. The guy doing the recording had brought a fake human head, and the two microphones for the recording were positioned in the two ears. The idea was to reproduce as fully as possible what you would have heard if you had been sitting in that spot in the room, with your head in that position.
You can listen to a binaural recording on speakers of course, but for the best experience you should use headphones.
For the absolute best experience, the recording should use a fake head that is exactly like your head. Not many people are ever going to experience that.
Audio does funny things as it travels around your head. For the absolute best 3D experience with headphones, you want to measure what happens to audio around your own head; this is called your "Head-Related Transfer Function" or "HRTF". Instead of recording the audio with a fake head shaped just like yours, companies can just make a good 5.1 or 7.1 recording, and then, if you have your HRTF, you can mix that down to a binaural stereo mix that is tailored to your head. According to the article, the AES is standardizing a file format for HRTF data, so the software you get will be more likely to work with your HRTF data if you have it measured.
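To make that mixdown concrete, here's a minimal sketch of HRTF-based binaural rendering: each virtual speaker channel gets convolved with that speaker direction's left-ear and right-ear head-related impulse responses (the time-domain form of the HRTF), and the results are summed into two ear signals. The HRIR values below are made up for illustration; real ones are measured per person and per direction.

```python
# Toy HRTF binaural downmix: convolve each channel with a pair of
# head-related impulse responses (HRIRs) and sum into left/right ear signals.
# All HRIR numbers here are invented for illustration.

def convolve(signal, impulse):
    """Direct-form FIR convolution (fine for short HRIRs)."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out

def binaural_downmix(channels, hrirs):
    """channels: {name: samples}; hrirs: {name: (left_hrir, right_hrir)}."""
    length = max(len(sig) for sig in channels.values()) + \
             max(len(h) for pair in hrirs.values() for h in pair) - 1
    left = [0.0] * length
    right = [0.0] * length
    for name, sig in channels.items():
        hl, hr = hrirs[name]
        for k, v in enumerate(convolve(sig, hl)):
            left[k] += v
        for k, v in enumerate(convolve(sig, hr)):
            right[k] += v
    return left, right

# One front-left channel with made-up 3-tap HRIRs (louder in the near ear):
channels = {"FL": [1.0, 0.0, 0.0, 0.5]}
hrirs = {"FL": ([0.9, 0.1, 0.0], [0.3, 0.2, 0.1])}
left, right = binaural_downmix(channels, hrirs)
```

A real renderer would do the same thing with measured HRIRs for every speaker position in the 5.1 or 7.1 layout, which is exactly why a standard HRTF file format matters.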
The ultimate in VR audio will be headphones with motion tracking, and real-time mixing that uses your HRTF and changes the mix as you turn your head. If something is supposed to be coming from your left, and you turn your head to the left, that sound should get louder; then if you turn your head away from it, it should get quieter. If this is done right it should be incredible. People have been working on this for years and I'm sure someone somewhere has done it right, but I haven't seen any commonly available products to do it yet.
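The head-tracking part can be sketched with a toy panner: on every tracker update, recompute each sound's direction relative to the current head yaw and re-pan it. This uses a simple constant-power gain law instead of real HRTF filtering, and all the angles are hypothetical.

```python
import math

def head_relative_gains(source_azimuth_deg, head_yaw_deg):
    """Toy head-tracked panner. Angles in degrees: 0 = straight ahead,
    +90 = to the listener's left. Real systems would re-filter through
    the HRTF, not just adjust gains."""
    rel = math.radians(source_azimuth_deg - head_yaw_deg)
    # Map relative azimuth to a pan position in [-1 (right ear), +1 (left ear)].
    pan = max(-1.0, min(1.0, math.sin(rel)))
    theta = (pan + 1.0) * math.pi / 4.0  # 0 .. pi/2
    left_gain = math.sin(theta)
    right_gain = math.cos(theta)
    return left_gain, right_gain

# A sound 90 degrees to the left, head facing forward: almost all left ear.
l, r = head_relative_gains(90, 0)
# Turn the head to face the sound: it moves to center, equal in both ears.
l2, r2 = head_relative_gains(90, 90)
```

The point is that the mix is a function of head orientation, recomputed continuously, so the sound stays anchored in the room instead of spinning with your head.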
But with VR goggles you should totally have VR audio like I described above. It would be really immersive.
3d audio = surround sound (5.1/7.1/8.1/etc)
Pretty much. 3D audio is intended to add speakers above the plane of the 5.1 or 7.1 speaker setup; the industry calls these "height speakers". DTS 11.1 audio, for example, has a standard 7.1 setup plus 4 height speakers: two in the front and two in the back.
The current ultimate in 3D audio is a 22.2 setup, where the ceiling has a 3x3 array of speakers, there are speakers at mid height, and there are speakers at ground level. However, IMHO there is zero chance that 22.2 will catch on as an audio standard.
Before the 5.1 and 7.1 digital standards, there was Dolby Surround, which was encoded within a stereo soundtrack; a simple audio decoder could "upmix" from stereo to surround. DTS Neural Upmix can make a very clean 7.1 from a stereo signal, and it works from an analog signal (it's not something tricky inside a digitally encoded format). You can't get 8 kilograms of flour into a 2-kilo bag, and Neural Upmix 7.1 can't completely reproduce the same mix you would get from 8 discrete channels, but it can provide a good experience.
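For a feel of how matrix surround hides extra channels in a stereo pair, here's a rough sketch of classic 4-2-4 matrix encoding in the spirit of Dolby Surround. Real encoders apply a +/-90 degree phase shift to the surround channel; this toy version just inverts the sign on one side, which is a crude stand-in.

```python
# 4-2-4 matrix sketch: fold L, C, R, S into a two-channel Lt/Rt pair,
# then recover approximations with a passive sum/difference decoder.

G = 0.7071  # ~ -3 dB, the usual matrix coefficient

def matrix_encode(L, C, R, S):
    Lt = [l + G * c + G * s for l, c, s in zip(L, C, S)]
    Rt = [r + G * c - G * s for r, c, s in zip(R, C, S)]
    return Lt, Rt

def passive_decode(Lt, Rt):
    C_out = [0.5 * (lt + rt) for lt, rt in zip(Lt, Rt)]
    S_out = [0.5 * (lt - rt) for lt, rt in zip(Lt, Rt)]
    return Lt, C_out, Rt, S_out  # front L and R pass through

# A center-only signal decodes mostly to the center output:
n = 4
L = [0.0] * n; R = [0.0] * n; S = [0.0] * n
C = [1.0, 0.5, 0.0, 0.0]
Lt, Rt = matrix_encode(L, C, R, S)
_, C_out, _, S_out = passive_decode(Lt, Rt)
```

Notice the center signal also leaks into the pass-through front channels; that crosstalk is why active steering decoders (Pro Logic and friends), and smarter upmixers like Neural Upmix, exist.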
DTS 11.1, as I understand it, uses technology similar to DTS Neural Upmix to encode the 4 "height" channels within the other 7.1 channels. Turning 7.1 into 11.1 should be a lot easier than turning 2.0 into 7.1 so it should provide a good experience.
I expect the industry to go to "object oriented" audio. This means that audio will carry metadata tags saying what direction each sound is coming from, and then a real-time mixer renders from the digital format with the metadata tags to whatever mix you need (e.g. if you have 11.1 speakers you get an 11.1 mix, if you actually have 22.2 speakers you get that, if you have 7.1 you get that, etc.). I believe Dolby Atmos works this way, and I believe DTS will be coming out with something similar.
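A toy version of that renderer: each "object" is a mono signal plus an azimuth tag, and the same object renders onto whatever speaker layout you actually have. The layouts and the simple two-nearest-speaker cross-fade are illustrative; real renderers like Atmos use full 3D positions and more elaborate panning.

```python
# Toy object-audio renderer: pan each (signal, azimuth) object onto an
# arbitrary speaker layout given as a list of speaker azimuths in degrees.

def render(objects, speaker_azimuths):
    """objects: list of (samples, azimuth_deg); returns {azimuth: samples}."""
    n = max(len(sig) for sig, _ in objects)
    out = {az: [0.0] * n for az in speaker_azimuths}
    for sig, az in objects:
        def angdist(a):
            d = abs(a - az) % 360
            return min(d, 360 - d)
        # Cross-fade linearly between the two speakers nearest the object.
        near = sorted(speaker_azimuths, key=angdist)[:2]
        d0, d1 = angdist(near[0]), angdist(near[1])
        total = d0 + d1
        w0, w1 = (d1 / total, d0 / total) if total else (1.0, 0.0)
        for i, v in enumerate(sig):
            out[near[0]][i] += w0 * v
            out[near[1]][i] += w1 * v
    return out

# The same tagged object renders to a stereo layout or a quad layout:
obj = [([1.0, 0.5], 30)]             # a sound 30 degrees to the left
stereo = render(obj, [30, -30])      # lands entirely in the left speaker
quad = render(obj, [45, -45, 135, -135])
```

One bitstream, any speaker count: the channel layout becomes a playback-time decision instead of an encoding-time one.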
Few people even want a standard with 24 discrete channels of audio. It just makes more sense to encode the audio you need in a digital format and then mix it on the fly. In a 22.2 audio mix, if there is no sound coming from overhead, you have 9 channels being used just to play silence; with object-oriented audio you would simply not have any signal tagged as coming from overhead.