Optimum Intel int4 on iGPU UHD 770

I’d like to share the results of running inference with the Optimum Intel library on the Starling-LM-7B Chat model, quantized to int4 with NNCF, on an Intel UHD Graphics 770 iGPU (i5 12600) using OpenVINO.

I think 16 tk/s with a CPU load of 25-30% is quite good. Performance is the same with int8 (NNCF) quantization.

This is inside a Proxmox VM with an SR-IOV virtualized GPU, 16GB of RAM and 6 cores. I also found that the ballooning device might crash the VM, so I disabled it; swap is on a zram device.

free -h output during inference:

        total    used     free     shared   buff/cache   available
Mem:    15Gi     6.2Gi    573Mi    4.7Gi    13Gi         9.3Gi
Swap:   31Gi     256Ki    31Gi

Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino
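
The code from the linked repository isn't reproduced here. As a rough sketch of this kind of setup, and not the exact script I ran, the following shows one way to load a model with int4 weight compression through Optimum Intel, move it to the GPU device and measure tokens per second. The function names `tokens_per_second` and `run_benchmark` are made up for illustration, the model ID in the usage comment is an assumption (the post only says "Starling-LM-7B Chat"), and `OVWeightQuantizationConfig(bits=4)` assumes a reasonably recent optimum-intel release.

```python
import time


def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    """Throughput: number of generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return new_tokens / elapsed_s


def run_benchmark(model_id: str, prompt: str, max_new_tokens: int = 128) -> float:
    """Hypothetical helper: export to OpenVINO with int4 weights, run on GPU, return tk/s."""
    # Heavy imports stay local so tokens_per_second() is usable without them.
    from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,  # convert the original checkpoint to OpenVINO IR
        quantization_config=OVWeightQuantizationConfig(bits=4),  # NNCF weight compression
    )
    model.to("gpu")   # target the OpenVINO GPU plugin (the iGPU here)
    model.compile()   # compile up front so compilation is not part of the timing

    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    # Count only newly generated tokens, not the prompt tokens.
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return tokens_per_second(new_tokens, elapsed)


# Example call (model ID is an assumption, not taken from the post):
# rate = run_benchmark("berkeley-nest/Starling-LM-7B-alpha", "Hello, how are you?")
# print(f"{rate:.1f} tk/s")
```

The measured time includes prompt processing, so short generations will under-report the steady-state rate; a warm-up run before timing usually gives a more stable number.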

What are your thoughts on this?

  • fakezeta (OP) · 1 year ago

    I hope something similar emerges on Linux.

    SYCL could be a candidate, much like Vulkan is for 3D acceleration: it’s a PITA to deal with CUDA, ROCm, etc.

    • fallingdowndizzyvr · 1 year ago

      That’s why Intel is pitching oneAPI. They want it to be the single API that brings everything together, which is why it also supports Nvidia GPUs, AMD GPUs, CPUs and even FPGAs.