High-technology computing cluster for artificial intelligence Olympiads
Abstract and keywords
Abstract:
Relevance. The growing popularity of artificial intelligence and machine learning competitions and the increasing number of participants call for specialized computing infrastructure that gives all participants equal conditions and isolates them from one another when they work with GPU accelerators. Objective. To describe the architecture and practical implementation of a computing cluster built for artificial intelligence and machine learning Olympiads. Research methods. Analysis of the requirements of an onsite competition: participant isolation, identical software environments, equal allocation of GPU accelerators, persistent user data, and controlled network access. Results. The technical solution uses RKE2/Kubernetes, NVIDIA A100 GPUs partitioned with Multi-Instance GPU (MIG) technology, personal JupyterLab workstations, NFS storage, secure web access, and Prometheus/Grafana monitoring. The target configuration supports 66 isolated workstations on 22 worker nodes and assigns a separate 2g.20gb MIG instance to each participant. Conclusion. The proposed infrastructure provides reproducible deployment and manageable operation of an environment for mass machine learning competitions. The novelty of the case lies in adapting cloud-native GPU infrastructure management to a mass onsite school Olympiad in artificial intelligence.

Keywords:
artificial intelligence, machine learning, computing cluster, Kubernetes, RKE2, NVIDIA A100, MIG, JupyterLab, competitive programming, Olympiad, information technology in sports
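
The target configuration named in the abstract can be illustrated with a short Python sketch. It checks the capacity arithmetic (22 worker nodes, 66 workstations) and builds one minimal Pod manifest per participant that requests a single 2g.20gb MIG instance. Only the node count, workstation count, and MIG profile come from the abstract. The figure of three MIG instances per node (for example, one A100 80GB card per node), the use of the "mixed" MIG strategy that exposes the `nvidia.com/mig-2g.20gb` resource name, the pod naming scheme, the label, and the container image are assumptions made for illustration and may differ from the authors' deployment.

```python
# Sketch only: values marked "assumption" are not taken from the article.

WORKER_NODES = 22          # from the abstract
WORKSTATIONS = 66          # from the abstract
MIG_PROFILE = "2g.20gb"    # from the abstract

# Assumption: each worker node exposes three 2g.20gb instances
# (the maximum that a single A100 80GB card supports).
MIG_INSTANCES_PER_NODE = 3

# Assumption: the NVIDIA device plugin runs with the "mixed" MIG strategy,
# which advertises each profile as its own extended resource.
MIG_RESOURCE = f"nvidia.com/mig-{MIG_PROFILE}"


def cluster_capacity(nodes: int, instances_per_node: int) -> int:
    """Total number of MIG instances available across all worker nodes."""
    return nodes * instances_per_node


def workstation_pod(index: int) -> dict:
    """Build a minimal Pod manifest (as a dict) for one participant."""
    name = f"participant-{index:02d}"  # hypothetical naming scheme
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "jupyterlab"}},
        "spec": {
            "containers": [
                {
                    "name": "jupyterlab",
                    # Placeholder image, not the one used by the authors.
                    "image": "example.org/olympiad/jupyterlab:latest",
                    # One dedicated MIG instance per participant.
                    "resources": {"limits": {MIG_RESOURCE: 1}},
                }
            ]
        },
    }


pods = [workstation_pod(i) for i in range(1, WORKSTATIONS + 1)]
capacity = cluster_capacity(WORKER_NODES, MIG_INSTANCES_PER_NODE)

# Every participant must fit onto a dedicated MIG instance.
if len(pods) > capacity:
    raise RuntimeError("not enough MIG instances for all workstations")
```

Under these assumptions the capacity equals the number of workstations exactly, so the configuration leaves no spare MIG instance; a real deployment would probably also need the persistent storage, network policy, and ingress settings that the abstract mentions and that this sketch omits.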
