Amazon Elastic Inference Description
Amazon Elastic Inference lets you attach low-cost GPU-powered acceleration to Amazon EC2 instances, Amazon SageMaker instances, or Amazon ECS tasks, which can reduce the cost of running deep learning inference by up to 75%. Amazon Elastic Inference supports TensorFlow and Apache MXNet models.

Inference is the process by which a trained model makes predictions, and it can account for as much as 90% of total operational expenses in deep learning applications, for two reasons. First, standalone GPU instances are designed for model training rather than inference: training jobs batch hundreds of data samples in parallel, while inference jobs typically process a single input in real time and use only a small fraction of the GPU's compute, so standalone GPU-based inference is wasteful and expensive. Second, standalone CPU instances aren't specialized for matrix operations and are therefore often too slow to perform deep learning inference.
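As a sketch of how the attachment works in practice: with SageMaker, an Elastic Inference accelerator is attached by adding an AcceleratorType field to a production variant in the create_endpoint_config request, so the accelerator size is chosen independently of the (cheaper, CPU) host instance. The endpoint, model, instance, and accelerator names below are hypothetical placeholders, not values from this document.

```python
# Sketch: request parameters for SageMaker's create_endpoint_config API,
# attaching an Elastic Inference accelerator to a low-cost CPU instance.
# In practice this dict would be passed to a boto3 SageMaker client, e.g.
#   boto3.client("sagemaker").create_endpoint_config(**endpoint_config_params)
endpoint_config_params = {
    "EndpointConfigName": "my-tf-endpoint-config",  # hypothetical name
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-tf-model",       # hypothetical trained TF model
            "InstanceType": "ml.m5.large",    # inexpensive CPU host instance
            "InitialInstanceCount": 1,
            # The accelerator is sized independently of the host instance;
            # decoupling the two is the source of the cost savings above.
            "AcceleratorType": "ml.eia2.medium",
        }
    ],
}

variant = endpoint_config_params["ProductionVariants"][0]
print(variant["InstanceType"], variant["AcceleratorType"])
```

Because the GPU acceleration is right-sized to the inference workload while a CPU instance handles the rest, you avoid paying for a full standalone GPU instance that would sit mostly idle.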