Dolgoprudny, Moscow, Russian Federation
Russian Federation
Dolgoprudny, Moscow, Russian Federation
Dolgoprudny, Russian Federation
VAK Russia 5.8.5
VAK Russia 5.8.7
UDC 004.048
UDC 347.514.3
UDC 004.75
UDC 004.8
CSCSTI 77.00
CSCSTI 20.00
Russian Classification of Professions by Education 02.00.00
Russian Classification of Professions by Education 09.00.00
Russian Classification of Professions by Education 44.00.00
Russian Library and Bibliographic Classification 3
Russian Library and Bibliographic Classification 74
BISAC COM014000 Computer Science
BISAC COM023000 Educational Software
BISAC COM074000 Hardware / Mobile Devices
BISAC COM004000 Intelligence (AI) & Semantics
BISAC COM051300 Programming / Algorithms
Relevance. The growing popularity of artificial intelligence and machine learning competitions and the increasing number of participants require specialized computing infrastructure that ensures equal conditions and isolation when working with GPU accelerators. Objective. To describe the architecture and practical implementation of a computing cluster prepared for artificial intelligence and machine learning Olympiads. Research methods. Analysis of the requirements of an onsite competition: participant isolation, identical software environments, equal allocation of GPU accelerators, persistent user data, and controlled network access. Results. The technical solution uses RKE2/Kubernetes, NVIDIA A100 GPUs partitioned with Multi-Instance GPU (MIG) technology, personal JupyterLab workstations, NFS storage, secure web access, and Prometheus/Grafana monitoring. The target configuration supports 66 isolated workstations on 22 worker nodes and assigns a separate 2g.20gb MIG instance to each participant. Conclusion. The proposed infrastructure provides reproducible deployment and manageable operation of an environment for mass machine learning competitions. The novelty of the case lies in adapting cloud-native GPU infrastructure management to the tasks of a mass onsite school Olympiad in artificial intelligence.
artificial intelligence, machine learning, computing cluster, Kubernetes, RKE2, NVIDIA A100, MIG, JupyterLab, competitive programming, Olympiad, information technology in sports
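The capacity figures in the abstract (66 workstations on 22 worker nodes, one 2g.20gb instance per participant) can be checked with a short calculation. The sketch below is illustrative only. The abstract does not state how many GPUs each node has, so `GPUS_PER_NODE = 1` is an assumption, as is the use of the 80 GB A100 model, which exposes at most three 2g.20gb instances per GPU. The function name is hypothetical.

```python
# Capacity check for the target configuration described in the abstract.
WORKER_NODES = 22          # stated in the abstract
TARGET_WORKSTATIONS = 66   # stated in the abstract

# Assumptions (not stated in the abstract):
GPUS_PER_NODE = 1          # hypothetical: one A100 80GB per worker node
MIG_2G20GB_PER_GPU = 3     # an A100 80GB exposes at most three 2g.20gb instances


def workstation_capacity(nodes: int, gpus_per_node: int, instances_per_gpu: int) -> int:
    """Number of participants that can each receive a dedicated MIG instance."""
    return nodes * gpus_per_node * instances_per_gpu


capacity = workstation_capacity(WORKER_NODES, GPUS_PER_NODE, MIG_2G20GB_PER_GPU)
print(f"capacity={capacity}, target={TARGET_WORKSTATIONS}, "
      f"fits={capacity >= TARGET_WORKSTATIONS}")
```

Under these assumptions the capacity equals the target exactly, so there would be no spare instances. A real deployment may differ if the nodes carry more GPUs or reserve some instances.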
1. All-Russian Olympiad in Artificial Intelligence. Official website. (in Russ.) URL: https://ai.edu.gov.ru/
2. Ministry of Education explanations on participation in Olympiad subjects with profiles. ConsultantPlus. (in Russ.) URL: https://www.consultant.ru/law/hotdocs/90793.html
3. Zacharov I., Arslanov R., Gunin M., Stefonishin D., Pavlov S. et al. “Zhores” — Petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Engineering, 2019, 9(1), pp. 512–520. https://doi.org/10.1515/eng-2019-0059
4. Reed D.A., Dongarra J. Exascale computing and big data. Communications of the ACM, 2015, 58(7), pp. 56–68. https://doi.org/10.1145/2699414
5. Armbrust M., Fox A., Griffith R., Joseph A.D., Katz R., Konwinski A. et al. A view of cloud computing. Communications of the ACM, 2010, 53(4), pp. 50–58. https://doi.org/10.1145/1721654.1721672
6. Kubernetes Documentation. Production-Grade Container Orchestration. URL: https://kubernetes.io/docs/
7. JupyterLab Documentation. URL: https://jupyterlab.readthedocs.io/
8. RKE2 Documentation. URL: https://docs.rke2.io/
9. Ansible Documentation. URL: https://docs.ansible.com/
10. NVIDIA. Multi-Instance GPU User Guide. URL: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
11. NVIDIA. GPU Operator Documentation. URL: https://docs.nvidia.com/datacenter/cloudnative/gpu-operator/latest/
12. Kubernetes Documentation. StatefulSets. URL: https://kubernetes.io/docs/concepts/workloads/controll
13. Kubernetes Documentation. Persistent Volumes. URL: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
14. Jupyter Docker Stacks Documentation. URL: https://jupyter-docker-stacks.readthedocs.io/
15. Haynes T., Noveck D. Network File System (NFS) Version 4 Minor Version 1 Protocol, RFC 8881, 2020. https://doi.org/10.17487/RFC8881
16. Kubernetes Documentation. Network Policies. URL: https://kubernetes.io/docs/concepts/services-networking/network-policies/
17. Cert-manager Documentation. URL: https://cert-manager.io/docs/
18. Prometheus Documentation. URL: https://prometheus.io/docs/
19. Grafana Documentation. URL: https://grafana.com/docs/grafana/latest/
20. NVIDIA. DCGM Exporter. URL: https://github.com/NVIDIA/dcgm-exporter
21. Burns B., Grant B., Oppenheimer D., Brewer E., Wilkes J. Borg, Omega, and Kubernetes. Communications of the ACM, 2016, 59(5), pp. 50–57. https://doi.org/10.1145/2890784
22. Merkel D. Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014, (239). URL: https://www.linuxjournal.com/content/docker-lightweight-linux-containers-consistent-development-and-deployment
23. Kluyver T., Ragan-Kelley B., Perez F., Granger B., Bussonnier M., Frederic J. et al. Jupyter Notebooks — a publishing format for reproducible computational workflows. In: Positioning and Power in Academic Publishing: Players, Agents and Agendas, IOS Press, 2016, pp. 87–90. https://doi.org/10.3233/978-1-61499-649-1-87
24. Zaharia M., Chowdhury M., Franklin M.J., Shenker S., Stoica I. Spark: Cluster Computing with Working Sets. Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing, 2010. URL: https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets



