Hybrid Cloud and AI Integration for Scalable Data Engineering: Innovations in Enterprise AI Infrastructure
Abstract
Training and deploying Deep Learning models in real-world applications often involves processing large amounts of data. An active research community is building software and hardware infrastructure to address these big data challenges, focusing in particular on highly optimized solutions and large-footprint installations of parallel computers. Hyper focuses on a complementary set of problems in the Deep Learning ecosystem, aiming to lower the barrier of entry to the field. Hyper is a hybrid distributed cloud framework that simplifies the hardware and software infrastructure required for large-scale distributed computing tasks.
The Hyper framework offers a unified view of multiple clouds and on-premise infrastructure for processing tasks on both CPU and GPU compute instances at scale. The proposed system implements a distributed file system and a failure-tolerant task-processing scheduler, both independent of the programming language and Deep Learning framework used. As a result, the framework helps researchers exploit idle and inexpensive compute resources which, taken together, are becoming an increasingly powerful tool for the community. To demonstrate the cost-efficiency of the system, a detailed table with a quantitative evaluation of Hyper usage costs is provided. In real-world applications, deploying Deep Learning models is often non-trivial and can involve multiple steps, ranging from extensive post-processing of the obtained scores to the encapsulation of the numerous pre-processing transformations applied to the data. The portability and generality of the framework are demonstrated by discussing the scalability of several non-trivial real-life setups, covering pre-processing, distributed training, hyperparameter search, and large-scale inference tasks, together with their usage costs and total running times.
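To make the notion of a failure-tolerant task-processing scheduler concrete, the following minimal Python sketch shows the generic retry-on-failure pattern such a scheduler relies on. All names here (run_on_worker, schedule, MAX_RETRIES) are hypothetical illustrations under assumed semantics and are not part of Hyper's actual API.

    # Illustrative-only sketch: tasks are queued, dispatched to a worker,
    # and re-queued on failure up to a retry limit (not Hyper's real API).
    import queue
    import random

    MAX_RETRIES = 3

    def run_on_worker(task):
        # Stand-in for dispatching a task to a remote CPU/GPU instance;
        # occasional worker failure is simulated here.
        if random.random() < 0.2:
            raise RuntimeError(f"worker failed while running {task}")
        return f"result of {task}"

    def schedule(tasks):
        # Process all tasks, re-queuing failed ones until MAX_RETRIES is reached.
        pending = queue.Queue()
        for t in tasks:
            pending.put((t, 0))                 # (task, retries so far)
        results, failed = {}, []
        while not pending.empty():
            task, retries = pending.get()
            try:
                results[task] = run_on_worker(task)
            except RuntimeError:
                if retries + 1 < MAX_RETRIES:
                    pending.put((task, retries + 1))   # retry the task
                else:
                    failed.append(task)                # give up after MAX_RETRIES
        return results, failed

    if __name__ == "__main__":
        done, dropped = schedule([f"shard-{i}" for i in range(8)])
        print(f"{len(done)} tasks succeeded, {len(dropped)} exhausted retries")

In this sketch the retry loop is what makes the scheduling tolerant to individual worker failures, which is the property the abstract attributes to Hyper's scheduler; a production system would additionally track worker health and persist task state.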