GMKTec Evo-X2 Ryzen AI Max 395+ Benchmarks

I recently got my hands on a GMKTec Evo X2 for local model inference.

Here are the hardware details:

nish@gmktec-evo-x2:~$ sudo lshw -short
H/W path          Device          Class       Description
=========================================================
                                  system      NucBox_EVO-X2 (EVO-X2-001)
/0                                bus         GMKtec
/0/0                              memory      64KiB BIOS
/0/b                              memory      1280KiB L1 cache
/0/c                              memory      16MiB L2 cache
/0/d                              memory      64MiB L3 cache
/0/e                              processor   AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
/0/11                             memory      128GiB System Memory

The box came with Windows 11 Pro preinstalled which I didn't bother with and quickly replaced with Ubuntu Server.

nish@gmktec-evo-x2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble

Out of the box performance

I installed ollama and tested a few models using the verbose option. It's worth noting these runs were completely out of the box, with no additional drivers or tooling installed. My prompt was "What is the distance between the earth and the Sun?"

I started with gpt-oss:20b.
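
Each run used ollama run with the --verbose flag, so for gpt-oss:20b:

$ ollama run gpt-oss:20b --verbose
>>> What is the distance between the earth and the Sun?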

total duration:       11.558212329s
load duration:        98.524563ms
prompt eval count:    76 token(s)
prompt eval duration: 39.185462ms
prompt eval rate:     1939.49 tokens/s
eval count:           270 token(s)
eval duration:        11.346835974s
eval rate:            23.80 tokens/s

gpt-oss:120b was next, and it showed decent performance.

total duration:       24.107760366s
load duration:        218.341745ms
prompt eval count:    77 token(s)
prompt eval duration: 4.975562972s
prompt eval rate:     15.48 tokens/s
eval count:           277 token(s)
eval duration:        18.757952199s
eval rate:            14.77 tokens/s

I tested qwen3:32b and was quite disappointed with the performance.

total duration:       5m16.79871007s
load duration:        47.442582ms
prompt eval count:    19 token(s)
prompt eval duration: 924.354931ms
prompt eval rate:     20.55 tokens/s
eval count:           1393 token(s)
eval duration:        5m15.38993452s
eval rate:            4.42 tokens/s

ROCm and AMD GPU driver installation

Next I installed ROCm and the AMD GPU driver by following the instructions here. I was quite surprised that the ROCm installation required 23GB of disk space.

$ sudo apt install rocm

...

Need to get 5345 MB of archives.
After this operation, 23.0 GB of additional disk space will be used.
Do you want to continue? [Y/n]
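
If you're following AMD's Ubuntu quick start, the usual post-install step is adding your user to the render and video groups (plus a reboot) so ROCm can talk to the GPU without root:

$ sudo usermod -a -G render,video $LOGNAME
$ sudo reboot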

I verified the installation using rocm-smi.

$ rocm-smi

======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
==================================================================================================================
0       1     0x1586,   40251  27.0°C  5.083W    N/A, N/A, 0         N/A   N/A   0%   auto  N/A     0%     0%
==================================================================================================================
============================================== End of ROCm SMI Log ===============================================

Testing qwen3:32b showed improved performance. I assume this is the result of the updated drivers.

total duration:       1m49.043363047s
load duration:        51.078806ms
prompt eval count:    20 token(s)
prompt eval duration: 202.439512ms
prompt eval rate:     98.79 tokens/s
eval count:           1021 token(s)
eval duration:        1m48.316184545s
eval rate:            9.43 tokens/s

gpt-oss:120b also showed improved performance.

total duration:       9.300572016s
load duration:        100.106345ms
prompt eval count:    77 token(s)
prompt eval duration: 144.640986ms
prompt eval rate:     532.35 tokens/s
eval count:           295 token(s)
eval duration:        8.925786695s
eval rate:            33.05 tokens/s

Just under 50tps for gpt-oss:20b!

total duration:       7.016576027s
load duration:        96.902471ms
prompt eval count:    77 token(s)
prompt eval duration: 159.437642ms
prompt eval rate:     482.95 tokens/s
eval count:           305 token(s)
eval duration:        6.602954724s
eval rate:            46.19 tokens/s

Llama.cpp

Next I built llama.cpp from source. First, the build dependencies:

sudo apt install build-essential cmake libcurl4-openssl-dev

When building llama.cpp for AMD GPUs, the HIP build instructions require AMDGPU_TARGETS to be set. I found the right value using rocminfo.

$ rocminfo
ROCk module version 6.14.14 is loaded

...

Agent 2                  
*******                  
  Name:                    gfx1151                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics            
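
If you just want the target string, grepping the rocminfo output gets you there quicker:

$ rocminfo | grep -o -m 1 'gfx[0-9a-f]*'
gfx1151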

Then I ran a build using

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1151 \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
&& cmake --build build --config Release -- -j 16

Then I tested it against ggml-org/gemma-3-1b-it-GGUF.

$  ./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

llama_perf_sampler_print:    sampling time =      26.93 ms /   366 runs   (    0.07 ms per token, 13590.29 tokens per second)
llama_perf_context_print:        load time =     491.51 ms
llama_perf_context_print: prompt eval time =      34.05 ms /    19 tokens (    1.79 ms per token,   557.99 tokens per second)
llama_perf_context_print:        eval time =    2141.20 ms /   347 runs   (    6.17 ms per token,   162.06 tokens per second)
llama_perf_context_print:       total time =   14215.85 ms /   366 tokens
llama_perf_context_print:    graphs reused =        345
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 63876 + (1314 =   762 +      38 +     514) +         345 |
llama_memory_breakdown_print: |   - Host               |                   318 =   306 +       0 +      12                |

I pulled the same model using ollama to compare. I believe llama.cpp uses the Q4_K_M quant by default, so it should be a fair comparison.

$ ollama run gemma3:1b --verbose

total duration:       1.979890959s
load duration:        124.143692ms
prompt eval count:    19 token(s)
prompt eval duration: 35.132589ms
prompt eval rate:     540.81 tokens/s
eval count:           271 token(s)
eval duration:        1.718437609s
eval rate:            157.70 tokens/s
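
The bigger gpt-oss models went through llama-cli the same way, pulled straight from Hugging Face; assuming the ggml-org MXFP4 uploads, the invocations look like:

$ ./build/bin/llama-cli -hf ggml-org/gpt-oss-20b-GGUF
$ ./build/bin/llama-cli -hf ggml-org/gpt-oss-120b-GGUF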

gpt-oss:120b managed to hit 45tps

llama_perf_sampler_print:    sampling time =      27.84 ms /   281 runs   (    0.10 ms per token, 10092.67 tokens per second)
llama_perf_context_print:        load time =   11317.00 ms
llama_perf_context_print: prompt eval time =     138.78 ms /    16 tokens (    8.67 ms per token,   115.29 tokens per second)
llama_perf_context_print:        eval time =    5828.50 ms /   264 runs   (   22.08 ms per token,    45.29 tokens per second)
llama_perf_context_print:       total time =  593306.52 ms /   280 tokens
llama_perf_context_print:    graphs reused =        262
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 4734 + (60421 = 59851 +     171 +     398) +         380 |
llama_memory_breakdown_print: |   - Host               |                   601 =   586 +       0 +      15                |

gpt-oss:20b reports 65tps

llama_perf_sampler_print:    sampling time =      12.83 ms /   263 runs   (    0.05 ms per token, 20495.64 tokens per second)
llama_perf_context_print:        load time =    1017.53 ms
llama_perf_context_print: prompt eval time =      72.33 ms /    16 tokens (    4.52 ms per token,   221.20 tokens per second)
llama_perf_context_print:        eval time =    3754.42 ms /   246 runs   (   15.26 ms per token,    65.52 tokens per second)
llama_perf_context_print:       total time =  186207.11 ms /   262 tokens
llama_perf_context_print:    graphs reused =        244
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 53694 + (11461 = 10949 +     114 +     398) +         380 |
llama_memory_breakdown_print: |   - Host               |                    601 =   586 +       0 +      15                |

iGPU tweaks

To ensure that the GPU can address as much of the 128GB of unified memory as possible for bigger models, I made a few BIOS adjustments sourced from here.

  1. Set UMA frame buffer size to 1G (this was the minimum in my BIOS). Interestingly, this is the value that gets reported by rocm-smi --showmeminfo vram[1]
    ============================ ROCm System Management Interface ============================
    ================================== Memory Usage (Bytes) ==================================
    GPU[0]          : VRAM Total Memory (B): 1073741824
    GPU[0]          : VRAM Total Used Memory (B): 163188736
    ==========================================================================================
    ================================== End of ROCm SMI Log ===================================
    
  2. Disable IOMMU

Next I added the following kernel boot options to GRUB to set the GTT and TTM sizes.

$ sudo nano /etc/default/grub

# Update GRUB_CMDLINE_LINUX_DEFAULT to 
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
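
The new command line only takes effect after regenerating the GRUB config and rebooting:

$ sudo update-grub
$ sudo reboot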

I verified this using

$ sudo dmesg | grep -i gtt

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
[    3.527605] [drm] amdgpu: 131072M of GTT memory ready.

and

$ sudo dmesg | grep -i ttm

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
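
Both parameters are meant to express the same 128GB limit, just in different units: amdgpu.gttsize is given in MiB, while ttm.pages_limit counts 4KiB pages.

$ echo $((131072 * 1024 * 1024))   # amdgpu.gttsize, MiB to bytes
137438953472
$ echo $((33554432 * 4096))        # ttm.pages_limit, 4KiB pages to bytes
137438953472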

Llama bench

To compare performance with the DGX Spark, I ran llama-bench with params I found here.

./build/bin/llama-bench -m model.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 --mmap 0
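
Roughly, the flags mean: a 2048-token prompt-processing test (-p) and a 32-token generation test (-n), each repeated at context depths of 0 to 32768 tokens (-d), with flash attention enabled (-fa 1), a physical batch size of 2048 (-ub) and mmap disabled so the weights are loaded outright (--mmap 0). For each model I swapped in the corresponding GGUF for -m, e.g. (the path here is illustrative):

$ ./build/bin/llama-bench -m ~/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 --mmap 0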

They all share the same preamble

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32

and end with

build: 03792ad9 (6816)

gpt-oss:20b

| model                 | size      | params  | backend | ngl | n_ubatch | fa | test            | t/s              |
| --------------------- | --------- | ------- | ------- | --- | -------- | -- | --------------- | ---------------- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | pp2048          | 1621.75 ± 122.61 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | tg32            | 65.73 ± 0.07     |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | pp2048 @ d4096  | 1172.54 ± 1.82   |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | tg32 @ d4096    | 59.53 ± 0.06     |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | pp2048 @ d8192  | 950.99 ± 1.95    |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | tg32 @ d8192    | 57.25 ± 0.06     |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | pp2048 @ d16384 | 695.44 ± 0.78    |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | tg32 @ d16384   | 53.79 ± 0.05     |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | pp2048 @ d32768 | 451.42 ± 0.54    |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm    | 99  | 2048     | 1  | tg32 @ d32768   | 47.71 ± 0.06     |

gpt-oss:120b

| model                  | size      | params   | backend | ngl | n_ubatch | fa | mmap | test            | t/s           |
| ---------------------- | --------- | -------- | ------- | --- | -------- | -- | ---- | --------------- | ------------- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048          | 818.11 ± 9.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | tg32            | 46.05 ± 0.18  |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d4096  | 650.83 ± 2.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d4096    | 42.45 ± 0.03  |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d8192  | 542.66 ± 1.71 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d8192    | 40.88 ± 0.04  |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d16384 | 411.87 ± 1.60 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d16384   | 38.39 ± 0.06  |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d32768 | 274.69 ± 0.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d32768   | 34.15 ± 0.01  |

Qwen3 Coder 30B A3B

| model                 | size      | params  | backend | ngl | n_ubatch | fa | mmap | test            | t/s            |
| --------------------- | --------- | ------- | ------- | --- | -------- | -- | ---- | --------------- | -------------- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048          | 773.45 ± 44.44 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | tg32            | 50.16 ± 0.22   |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d4096  | 534.51 ± 1.19  |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d4096    | 44.28 ± 0.03   |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d8192  | 407.36 ± 0.54  |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d8192    | 40.25 ± 0.03   |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d16384 | 274.46 ± 0.34  |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d16384   | 34.89 ± 0.03   |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | pp2048 @ d32768 | 166.77 ± 0.24  |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm    | 99  | 2048     | 1  | 0    | tg32 @ d32768   | 27.59 ± 0.01   |

  1. From what I can tell, rocm-smi doesn't report VRAM usage correctly. A more accurate reflection of VRAM consumption seems to be given by free -h.
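
While a model is loaded, you can watch the shared memory being consumed with, for example:

$ watch -n 1 free -h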
