Fabric for Deep Learning (FfDL) Description
Deep learning frameworks like TensorFlow and PyTorch, Torch and Torch, Theano and MXNet have helped to increase the popularity of deep-learning by reducing the time and skills required to design, train and use deep learning models. Fabric for Deep Learning (pronounced "fiddle") is a consistent way of running these deep-learning frameworks on Kubernetes. FfDL uses microservices architecture to reduce the coupling between components. It isolates component failures and keeps each component as simple and stateless as possible. Each component can be developed, tested and deployed independently. FfDL leverages the power of Kubernetes to provide a resilient, scalable and fault-tolerant deep learning framework. The platform employs a distribution and orchestration layer to allow for learning from large amounts of data in a reasonable time across multiple compute nodes.