DevQuasar


AI & ML interests

Open-Source LLMs, Local AI Projects: https://pypi.org/project/llm-predictive-router/

Recent Activity

csabakecskemeti
posted an update 16 days ago
Just sharing a result of a homelab infrastructure experiment:

I've managed to set up a distributed inference infra at home using a DGX Spark (128GB unified LPDDR5x) and a Linux workstation with an RTX 6000 Pro (96GB GDDR7), connected via 100Gbps RoCEv2. The model I've used (https://lnkd.in/gx6J7YuB) is about 140GB, so it could not fit on either GPU alone. Full setup and tutorial soon on devquasar.com.



Screen recording:
https://lnkd.in/gKM9H5GJ
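
The post doesn't include the launch configuration (the full tutorial is promised for devquasar.com), and the actual serving stack isn't stated, but as a minimal sketch of the idea, assuming vLLM with a Ray cluster spanning the two boxes, splitting a model that is too large for either GPU could look like this (the model path and settings are placeholders, not the actual setup):

```python
# Minimal sketch, assuming a Ray cluster already spans both machines
# (`ray start --head` on one, `ray start --address=<head-ip>` on the other)
# so vLLM can shard the model across the two GPUs over the RoCE link.
# The model path below is a placeholder, not the actual checkpoint used.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/some-140GB-model",        # placeholder
    tensor_parallel_size=2,              # one GPU per node
    distributed_executor_backend="ray",  # cross-node execution via Ray
)

outputs = llm.generate(["Hello from the homelab!"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```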
csabakecskemeti
posted an update about 1 month ago
Looking for some help to test an INT8 DeepSeek 3.2:
SGLang supports channel-wise INT8 quants on CPUs with AMX instructions (Xeon 5th gen and above, AFAIK):
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/

Currently uploading an INT8 version of DeepSeek 3.2 Speciale:
DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8

I cannot test this myself since I'm on AMD:
"AssertionError: W8A8Int8LinearMethod on CPU requires that CPU has AMX support"
(I assumed it could fall back to some non-optimized kernel, but it seems not.)

If anyone with the required resources (Intel Xeon 5th/6th gen + ~768GB-1TB RAM) can help test this, that would be awesome.

If you have hints on how to make this work on an AMD Threadripper 7000 Pro series, please guide me.

Thanks all!
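
For anyone who picks this up, a minimal launch sketch might look like the following; the flag names follow the linked lmsys/Intel blog post, but the exact set can differ between SGLang versions, so treat this as a starting point rather than a known-good command:

```python
# Hypothetical test-launch sketch (assumes an AMX-capable Xeon and a
# recent SGLang build with the CPU backend); adjust flags to your version.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8",
    "--quantization", "w8a8_int8",  # channel-wise INT8 path
    "--device", "cpu",              # CPU backend; this is where AMX is required
    "--host", "0.0.0.0",
    "--port", "30000",
])
```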
csabakecskemeti
posted an update 2 months ago
Recently there has been so much activity around token-efficient formats that I've also built a package (inspired by TOON).

Deep-TOON

My goal was to handle JSON structures with complex embeddings in a token-efficient way.

This is what I built over the weekend. Feel free to try it:

https://pypi.org/project/deep-toon/0.1.0/
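
The post doesn't show the package's API, so the snippet below is not deep-toon usage; it's just a toy illustration of the token-efficiency idea behind TOON-style formats: state the keys once as a header instead of repeating them for every object.

```python
# Toy illustration of the TOON-style idea (NOT the deep-toon API).
import json

records = [
    {"id": 1, "name": "a", "score": 0.91},
    {"id": 2, "name": "b", "score": 0.87},
]

# Plain JSON repeats every key for every object.
verbose = json.dumps(records)

# A header+rows layout states the keys once, then emits one row per object.
keys = list(records[0])
compact = "|".join(keys) + "\n" + "\n".join(
    "|".join(str(rec[k]) for k in keys) for rec in records
)

print(f"json: {len(verbose)} chars, toon-style: {len(compact)} chars")
```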

csabakecskemeti
posted an update 3 months ago
Christmas came early this year
csabakecskemeti
posted an update 7 months ago
Has anyone ever backed up a model to a sequential tape drive, or am I the world's first? :D
Just played around with my retro PC that has a tape drive. Did it just because I can.
csabakecskemeti
posted an update 10 months ago
I'm collecting llama-bench results for inference with Llama 3.1 8B Q4 and Q8 reference models on various GPUs. The results are the average of 5 executions.
The systems vary (different motherboards and CPUs, but that probably has little effect on inference performance).

https://devquasar.com/gpu-gguf-inference-comparison/
The exact models used are listed on the page.

I'd welcome results from other GPUs if you have access to anything else; everything you need is in the post. Hopefully this is useful information for everyone.
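
To contribute a comparable number, a run along these lines should reproduce the pp512/tg128 rows; this is a minimal sketch assuming a local llama.cpp build, and the binary and model paths are placeholders:

```python
# Minimal repro sketch, assuming a local llama.cpp build and a GGUF
# of the reference model; paths below are placeholders.
# llama-bench handles the averaging itself and reports mean ± stddev.
import subprocess

subprocess.run([
    "./llama-bench",
    "-m", "Llama-3.1-8B-Q8_0.gguf",  # placeholder GGUF path
    "-p", "512",   # prompt-processing test (the pp512 rows)
    "-n", "128",   # token-generation test (the tg128 rows)
    "-r", "5",     # 5 repetitions, averaged
])
```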
csabakecskemeti
posted an update 10 months ago
Managed to get my hands on a 5090FE, it's beefy

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | pp512 | 12207.44 ± 481.67 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 143.18 ± 0.18 |

Comparison with other GPUs:
http://devquasar.com/gpu-gguf-inference-comparison/