
(230929) Diary: Deep Learning Compiler & Optimization Notes

[Deep Learning Compiler & Optimization] Lately I've been swamped at work preparing a new service launch.. in the process I realized my knowledge of model serving is lacking, so here I organize the basic materials I studied

Deep Learning Compiler

In deep learning, compilation means converting a model written in a deep learning framework such as TensorFlow or PyTorch into code that can run on specific hardware (GPU, NPU..) or backends (CUDA, cuDNN..)
(Apache TVM Introduction)
(MLC LLM Introduction)
1. Introduction — Machine Learning Compilation 0.0.1 documentation

Machine learning applications have undoubtedly become ubiquitous. We get smart home devices powered by natural language processing and speech recognition models, computer vision models serve as backbones in autonomous driving, and recommender systems help us discover new content as we explore. Observing the rich environments where AI apps run is also quite fun. Recommender systems are usually deployed on cloud platforms by the companies that provide the services. When we talk about autonomous driving, the natural things that pop up in our heads are powerful GPUs or specialized computing devices on vehicles. We use intelligent applications on our phones to recognize flowers in our garden and learn how to tend them. An increasing number of IoT sensors also come with AI built into those tiny chips.

If we drill down deeper into those environments, there is an even greater amount of diversity involved. Even for environments that belong to the same category (e.g. cloud), there are questions about the hardware (ARM or x86), operating system, container execution environment, runtime library variants, or the kind of accelerators involved. Quite a lot of heavy lifting is needed to bring a smart machine learning model from the development phase to these production environments. Even for the environments we are most familiar with (e.g. on GPUs), extending machine learning models to use a non-standard set of operations would involve a good amount of engineering.

Many of the above examples are related to machine learning inference — the process of making predictions after obtaining model weights. We also start to see an important trend of deploying the training processes themselves onto different environments. These applications come from the need to keep model updates local to users' devices for privacy protection reasons, or to scale the learning of models onto a distributed cluster of nodes. The different modeling choices and inference/training scenarios add even more complexity to the productionisation of machine learning.
Compilation not only lets a model run on the desired hardware.. it can also improve inference speed
PyTorch is one of the most popular deep learning frameworks and is used widely across the field, but it inherits the downside of Python as an interpreted language (slow execution speed)
This can be overcome by compiling the PyTorch model with TorchScript, i.e. the PyTorch JIT (Just-in-Time) compiler, or by reimplementing the model in pure C/C++ (something like llama.cpp..?)
(a good blog post explaining TorchScript)
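For reference, a minimal TorchScript sketch; the SmallNet module here is a made-up toy example, not something from the linked post:

```python
import torch

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SmallNet().eval()

# torch.jit.script compiles the Python source into the TorchScript IR,
# while torch.jit.trace records the ops executed for an example input.
scripted = torch.jit.script(model)
traced = torch.jit.trace(model, torch.randn(1, 16))

# The compiled module can be serialized and later loaded without Python,
# e.g. from C++ via libtorch.
scripted.save("small_net.pt")
```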
Now that PyTorch 2.0 is out, using torch.compile is recommended over TorchScript
(a good blog post explaining PyTorch 2.0)
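A minimal torch.compile sketch (the function below is just a toy); on the first call TorchDynamo captures the graph, and TorchInductor, the default backend, generates optimized kernels:

```python
import torch

def f(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_f = torch.compile(f)  # requires PyTorch >= 2.0

x = torch.randn(1024)
# The first call triggers compilation; subsequent calls reuse the compiled code.
print(compiled_f(x).mean())
```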
Inference speedups can come not just from compiling the code itself.. but also from optimizing the model's computation graph that the compiler derives
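A classic example of such a graph-level optimization is folding a BatchNorm into the preceding convolution, so two ops become one at inference time; a minimal sketch using PyTorch's built-in fuse_conv_bn_eval helper (the layer shapes are arbitrary):

```python
import torch
from torch.nn.utils.fusion import fuse_conv_bn_eval

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).eval()
bn = torch.nn.BatchNorm2d(16).eval()

# Fold the BatchNorm statistics into the conv's weight and bias:
# one fused conv replaces the conv + bn pair.
fused = fuse_conv_bn_eval(conv, bn)

x = torch.randn(1, 3, 32, 32)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```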
Deep learning compilers such as TensorRT, TVM, and MLC LLM generally also provide optimization features like quantization
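PyTorch also exposes a simple form of this on its own; a minimal dynamic-quantization sketch (the toy model is an assumption, not tied to any of the compilers above):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

# Dynamic quantization stores Linear weights as int8 and quantizes
# activations on the fly, shrinking the model and speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```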