Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant

发布时间：2026-04-19T00:23:09.336433 阅读：6224 分类：tech

https://w418ufqpha7gzj-80.proxy.runpod.net

Started for myself, but since Im not using it continuously, sharing it:

Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (TheTom/llama-cpp-turboquant) on RTX 3090 (Runpod spot instance).

5 parallel requests supported.. full context available (please don't misuse..there are no safety guards in place)

Open till spot instance lasts or max 4 hours.

And yes, no request logging (I don't even know how to do it with llama-server)

Prompt processing and generation speeds (at 8K context): 900t/s and 60t/s. And at 100K context: 450t/s and 30t/s.

Command used:

    ./build/bin/llama-server \
      -m ../Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
      --alias 'Qwen3-6-35B-A3B-turbo' \
      --ctx-size 262144 \
      --no-mmproj \
      --host 0.0.0.0 \
      --port 80 \
      --jinja \
      --flash-attn on \
      --cache-type-k turbo3 \
      --cache-type-v turbo3 \
      --reasoning off \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0 \
      --parallel 5.0 \
      --cont-batching \
      --threads 16 \
      --threads-batch 16

延伸阅读