GPT-4o (o for "omni") is an important step towards a more natural interaction between humans and computers. It accepts any combination as input, including text, audio and image, and can generate any combination of outputs, including text, audio and image. It can respond to audio in as little as 228 milliseconds with an average of 325 milliseconds. This is similar to the human response time in a conversation (opens in new window). It is as fast and cheaper than GPT-4 Turbo on text in English or code. However, it has a significant improvement in text in non-English language. GPT-4o performs better than existing models at audio and vision understanding.