GMKTec Evo-X2 Ryzen AI Max 395+ Benchmarks
I recently got my hands on a GMKTec Evo-X2 for local model inference.
Here are the hardware details:
nish@gmktec-evo-x2:~$ sudo lshw -short
H/W path          Device          Class       Description
=========================================================
                                  system      NucBox_EVO-X2 (EVO-X2-001)
/0                                bus         GMKtec
/0/0                              memory      64KiB BIOS
/0/b                              memory      1280KiB L1 cache
/0/c                              memory      16MiB L2 cache
/0/d                              memory      64MiB L3 cache
/0/e                              processor   AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
/0/11                             memory      128GiB System Memory
The box came with Windows 11 Pro preinstalled, which I didn't bother with and quickly replaced with Ubuntu Server.
nish@gmktec-evo-x2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble
Out-of-the-box performance
I installed Ollama and tested a few models using the verbose option. It's worth noting these runs were out of the box, with no additional drivers or tooling installed. My prompt was "What is the distance between the earth and the Sun?"
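For reference, each result below was produced with an interactive Ollama run along these lines, swapping in the model under test:
$ ollama run gpt-oss:20b --verbose
>>> What is the distance between the earth and the Sun?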
I started with gpt-oss:20b.
total duration:       11.558212329s
load duration:        98.524563ms
prompt eval count:    76 token(s)
prompt eval duration: 39.185462ms
prompt eval rate:     1939.49 tokens/s
eval count:           270 token(s)
eval duration:        11.346835974s
eval rate:            23.80 tokens/s
gpt-oss:120b was next, which showed decent performance.
total duration:       24.107760366s
load duration:        218.341745ms
prompt eval count:    77 token(s)
prompt eval duration: 4.975562972s
prompt eval rate:     15.48 tokens/s
eval count:           277 token(s)
eval duration:        18.757952199s
eval rate:            14.77 tokens/s
I tested qwen3:32b and was quite disappointed with the performance.
total duration:       5m16.79871007s
load duration:        47.442582ms
prompt eval count:    19 token(s)
prompt eval duration: 924.354931ms
prompt eval rate:     20.55 tokens/s
eval count:           1393 token(s)
eval duration:        5m15.38993452s
eval rate:            4.42 tokens/s
ROCm and AMD GPU driver installation
Next, I installed ROCm and the AMD GPU driver by following the instructions here. I was quite surprised that the ROCm installation required 23 GB of disk space.
$ sudo apt install rocm
...
Need to get 5345 MB of archives.
After this operation, 23.0 GB of additional disk space will be used.
Do you want to continue? [Y/n]
I verified the installation using rocm-smi.
$ rocm-smi
======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
==================================================================================================================
0       1     0x1586,   40251  27.0°C  5.083W    N/A, N/A, 0         N/A   N/A   0%   auto  N/A     0%     0%
==================================================================================================================
============================================== End of ROCm SMI Log ===============================================
Testing qwen3:32b showed improved performance. I assume this is the result of the updated drivers.
total duration:       1m49.043363047s
load duration:        51.078806ms
prompt eval count:    20 token(s)
prompt eval duration: 202.439512ms
prompt eval rate:     98.79 tokens/s
eval count:           1021 token(s)
eval duration:        1m48.316184545s
eval rate:            9.43 tokens/s
gpt-oss:120b also showed improved performance.
total duration:       9.300572016s
load duration:        100.106345ms
prompt eval count:    77 token(s)
prompt eval duration: 144.640986ms
prompt eval rate:     532.35 tokens/s
eval count:           295 token(s)
eval duration:        8.925786695s
eval rate:            33.05 tokens/s
Just under 50 tps for gpt-oss:20b!
total duration:       7.016576027s
load duration:        96.902471ms
prompt eval count:    77 token(s)
prompt eval duration: 159.437642ms
prompt eval rate:     482.95 tokens/s
eval count:           305 token(s)
eval duration:        6.602954724s
eval rate:            46.19 tokens/s
llama.cpp
First, I installed the build dependencies:
sudo apt install build-essential cmake libcurl4-openssl-dev
When building llama.cpp for AMD GPUs, the HIP build instructions require an AMDGPU_TARGETS value to be set. I found mine using rocminfo.
$ rocminfo
ROCk module version 6.14.14 is loaded
...
Agent 2                  
*******                  
  Name:                    gfx1151                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics            
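If you just want the target string without reading the full rocminfo dump, a quick (unofficial) one-liner pulls it out:
# prints gfx1151 on this machine
$ rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u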
Then I ran a build using
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1151 \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
&& cmake --build build --config Release -- -j 16
Then I tested it against ggml-org/gemma-3-1b-it-GGUF.
$  ./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
llama_perf_sampler_print:    sampling time =      26.93 ms /   366 runs   (    0.07 ms per token, 13590.29 tokens per second)
llama_perf_context_print:        load time =     491.51 ms
llama_perf_context_print: prompt eval time =      34.05 ms /    19 tokens (    1.79 ms per token,   557.99 tokens per second)
llama_perf_context_print:        eval time =    2141.20 ms /   347 runs   (    6.17 ms per token,   162.06 tokens per second)
llama_perf_context_print:       total time =   14215.85 ms /   366 tokens
llama_perf_context_print:    graphs reused =        345
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 63876 + (1314 =   762 +      38 +     514) +         345 |
llama_memory_breakdown_print: |   - Host               |                   318 =   306 +       0 +      12                |
I pulled the same model using Ollama to compare. I believe llama.cpp uses the Q4_K_M quant by default, so it should be a fair comparison.
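To double-check which quantization Ollama pulled, ollama show lists the model details, including the quant:
$ ollama show gemma3:1b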
$ ollama run gemma3:1b --verbose
total duration:       1.979890959s
load duration:        124.143692ms
prompt eval count:    19 token(s)
prompt eval duration: 35.132589ms
prompt eval rate:     540.81 tokens/s
eval count:           271 token(s)
eval duration:        1.718437609s
eval rate:            157.70 tokens/s
gpt-oss:120b managed to hit 45 tps.
llama_perf_sampler_print:    sampling time =      27.84 ms /   281 runs   (    0.10 ms per token, 10092.67 tokens per second)
llama_perf_context_print:        load time =   11317.00 ms
llama_perf_context_print: prompt eval time =     138.78 ms /    16 tokens (    8.67 ms per token,   115.29 tokens per second)
llama_perf_context_print:        eval time =    5828.50 ms /   264 runs   (   22.08 ms per token,    45.29 tokens per second)
llama_perf_context_print:       total time =  593306.52 ms /   280 tokens
llama_perf_context_print:    graphs reused =        262
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 4734 + (60421 = 59851 +     171 +     398) +         380 |
llama_memory_breakdown_print: |   - Host               |                   601 =   586 +       0 +      15                |
gpt-oss:20b reports 65 tps.
llama_perf_sampler_print:    sampling time =      12.83 ms /   263 runs   (    0.05 ms per token, 20495.64 tokens per second)
llama_perf_context_print:        load time =    1017.53 ms
llama_perf_context_print: prompt eval time =      72.33 ms /    16 tokens (    4.52 ms per token,   221.20 tokens per second)
llama_perf_context_print:        eval time =    3754.42 ms /   246 runs   (   15.26 ms per token,    65.52 tokens per second)
llama_perf_context_print:       total time =  186207.11 ms /   262 tokens
llama_perf_context_print:    graphs reused =        244
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 53694 + (11461 = 10949 +     114 +     398) +         380 |
llama_memory_breakdown_print: |   - Host               |                    601 =   586 +       0 +      15                |
iGPU tweaks
To ensure that all of the 128 GB of unified memory is addressable by the GPU for bigger models, I made a few adjustments in the BIOS, sourced from here.
- Set UMA frame buffer size to 1G (this was the minimum in my BIOS). Interestingly, this is the value that gets reported by rocm-smi --showmeminfo vram[1]:
============================ ROCm System Management Interface ============================
================================== Memory Usage (Bytes) ==================================
GPU[0] : VRAM Total Memory (B): 1073741824
GPU[0] : VRAM Total Used Memory (B): 163188736
==========================================================================================
================================== End of ROCm SMI Log ===================================
- Disable IOMMU
Next I added the following kernel boot options to GRUB to set the GTT and TTM sizes.
$ sudo nano /etc/default/grub
# Update GRUB_CMDLINE_LINUX_DEFAULT to 
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
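For these options to take effect, the GRUB configuration needs to be regenerated and the box rebooted, the usual way on Ubuntu:
$ sudo update-grub
$ sudo reboot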
I verified this using
$ sudo dmesg | grep -i gtt
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
[    3.527605] [drm] amdgpu: 131072M of GTT memory ready.
and
$ sudo dmesg | grep -i ttm
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
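As an aside, on this setup model allocations should land in GTT rather than the 1 GB VRAM carve-out (my understanding; I mostly relied on free -h, as noted in the footnote), so rocm-smi can also be asked for GTT usage specifically while a model is loaded:
$ rocm-smi --showmeminfo gtt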
llama-bench
To compare performance with the DGX Spark, I ran llama-bench with params I found here.
./build/bin/llama-bench -m model.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 --mmap 0
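For reference, my reading of those flags (worth double-checking against llama-bench --help):
# -fa 1      enable flash attention
# -d 0,...   KV-cache depths (context fill) to benchmark at
# -p 2048    prompt-processing test length
# -n 32      token-generation test length
# -ub 2048   micro-batch size
# --mmap 0   disable mmap so the model is fully loaded into memory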
They all have the same preamble
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
and end with
build: 03792ad9 (6816)
gpt-oss:20b
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s | 
|---|---|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 | 1621.75 ± 122.61 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 | 65.73 ± 0.07 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 1172.54 ± 1.82 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 59.53 ± 0.06 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 950.99 ± 1.95 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 57.25 ± 0.06 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d16384 | 695.44 ± 0.78 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d16384 | 53.79 ± 0.05 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d32768 | 451.42 ± 0.54 | 
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d32768 | 47.71 ± 0.06 | 
gpt-oss:120b
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s | 
|---|---|---|---|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 818.11 ± 9.03 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 46.05 ± 0.18 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 650.83 ± 2.16 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.45 ± 0.03 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 542.66 ± 1.71 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 40.88 ± 0.04 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 411.87 ± 1.60 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 38.39 ± 0.06 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 274.69 ± 0.65 | 
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 34.15 ± 0.01 | 
Qwen3 Coder 30B A3B
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s | 
|---|---|---|---|---|---|---|---|---|---|
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 773.45 ± 44.44 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 50.16 ± 0.22 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 534.51 ± 1.19 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 44.28 ± 0.03 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 407.36 ± 0.54 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 40.25 ± 0.03 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 274.46 ± 0.34 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.89 ± 0.03 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 166.77 ± 0.24 | 
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 27.59 ± 0.01 | 
[1] From what I can tell, rocm-smi doesn't report VRAM usage correctly. A more accurate reflection of VRAM consumption seems to be given by free -h.