
(230929) Diary: Deep Learning Compiler & Optimization Notes

[Deep Learning Compiler & Optimization] Lately I've been swamped at work preparing a new service launch.. in the process I realized my knowledge of model serving is lacking, so here I organize the basic materials I studied

Deep Learning Compiler

In deep learning, compilation means converting a model written in a deep learning framework such as TensorFlow or PyTorch into code that can run on specific hardware (GPU, NPU..) or backends (CUDA, cuDNN..)
(Apache TVM Introduction)
(MLC LLM Introduction)
1. Introduction — Machine Learning Compilation 0.0.1 documentation

Machine learning applications have undoubtedly become ubiquitous. We get smart home devices powered by natural language processing and speech recognition models, computer vision models serve as backbones in autonomous driving, and recommender systems help us discover new content as we explore. Observing the rich environments where AI apps run is also quite fun. Recommender systems are usually deployed on cloud platforms by the companies that provide the services. When we talk about autonomous driving, the natural things that pop up in our heads are powerful GPUs or specialized computing devices on vehicles. We use intelligent applications on our phones to recognize flowers in our garden and learn how to tend them. An increasing number of IoT sensors also come with AI built into those tiny chips.

If we drill down deeper into those environments, there is an even greater amount of diversity involved. Even for environments that belong to the same category (e.g. cloud), there are questions about the hardware (ARM or x86), operating system, container execution environment, runtime library variants, or the kind of accelerators involved. Quite a lot of heavy lifting is needed to bring a smart machine learning model from the development phase to these production environments. Even for the environments we are most familiar with (e.g. on GPUs), extending machine learning models to use a non-standard set of operations would involve a good amount of engineering.

Many of the above examples are related to machine learning inference — the process of making predictions after obtaining model weights. We also start to see an important trend of deploying the training processes themselves onto different environments. These applications come from the need to keep model updates local to users' devices for privacy protection reasons, or to scale the learning of models onto a distributed cluster of nodes. The different modeling choices and inference/training scenarios add even more complexity to the productionisation of machine learning.
Compilation not only lets a model run on the desired hardware.. it can also improve inference speed
PyTorch is one of the most popular deep learning frameworks and is used widely across the field, but it inherits the downside of Python as an interpreted language (slow execution speed)
This can be overcome by compiling the PyTorch model with TorchScript, i.e. the PyTorch JIT (Just-in-Time) compiler, or by reimplementing the model in pure C/C++ (something like llama.cpp..?)
(a good blog post explaining TorchScript)
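For reference, a minimal TorchScript sketch; the SmallNet module here is a made-up toy example, not something from the linked post:

```python
import torch

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SmallNet().eval()

# torch.jit.script compiles the Python source into the TorchScript IR,
# while torch.jit.trace records the ops executed for an example input.
scripted = torch.jit.script(model)
traced = torch.jit.trace(model, torch.randn(1, 16))

# The compiled module can be serialized and later loaded without Python,
# e.g. from C++ via libtorch.
scripted.save("small_net.pt")
```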
Now that PyTorch 2.0 is out, using torch.compile is recommended over TorchScript
(a good blog post explaining PyTorch 2.0)
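A minimal torch.compile sketch (the function below is just a toy); on the first call TorchDynamo captures the graph, and TorchInductor, the default backend, generates optimized kernels:

```python
import torch

def f(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_f = torch.compile(f)  # requires PyTorch >= 2.0

x = torch.randn(1024)
# The first call triggers compilation; subsequent calls reuse the compiled code.
print(compiled_f(x).mean())
```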
Inference speedups can come not just from compiling the code itself.. but also from optimizing the model's computation graph that the compiler derives
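A classic example of such a graph-level optimization is folding a BatchNorm into the preceding convolution, so two ops become one at inference time; a minimal sketch using PyTorch's built-in fuse_conv_bn_eval helper (the layer shapes are arbitrary):

```python
import torch
from torch.nn.utils.fusion import fuse_conv_bn_eval

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).eval()
bn = torch.nn.BatchNorm2d(16).eval()

# Fold the BatchNorm statistics into the conv's weight and bias:
# one fused conv replaces the conv + bn pair.
fused = fuse_conv_bn_eval(conv, bn)

x = torch.randn(1, 3, 32, 32)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```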
Deep learning compilers such as TensorRT, TVM, and MLC LLM generally also provide optimization features like quantization
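PyTorch also exposes a simple form of this on its own; a minimal dynamic-quantization sketch (the toy model is an assumption, not tied to any of the compilers above):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

# Dynamic quantization stores Linear weights as int8 and quantizes
# activations on the fly, shrinking the model and speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```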