41 articles from r/LocalLLaMA
I'm looking for models that can run on my GPU and actually do something useful. Because they are all so small, I think any small difference could be a "big" improvement. So I went to the LM Studio database and searched many variants from the same family, trying to select the newer models. Then I asked Claude to select known benchmarks and run some qualitative tests. Next I'll test with real use cases and then select a "team". Most people here run local models on more powerful machines, but the majority of people barely have a 6GB GPU, so this review may help them. The report follows.

The problem. I want local models doing repetitive overnight work (file organization, tagging, log triage) on a 6GB laptop GPU: zero cost, private, no rate limits. The real question isn't "which model is best" but "which of these specific quants actually fit in 6GB and behave correctly on my tasks." Leaderboard scores don't answer that: they're run on full-precision weights and generic benchmarks, not the Q4/Q6 GGUF you'll actually load.

Why qualitative probing instead of full benchmark suites. Running BFCL-v3/v4 + IFEval + MMLU across 20 models on one 6GB GPU takes on the order of days to a week of compute, and most of that signal is already published per model family. What's not published is how a given quant behaves on the exact behaviors I need. So I built a fixed 6-probe set targeting those behaviors: (1) parseable tool-call, (2) multi-turn tool-call (does it chain with the real tool result or hallucinate a placeholder), (3) strict JSON, (4) instruction adherence (IFEval-style), (5) plan decomposition, (6) no path hallucination, plus a GSM8K-style arithmetic check. I judged the outputs directly and triangulated against published BFCL/IFEval scores to catch quant-level regressions. That turns a week into ~1 hour and tests the thing that actually matters. Then a separate performance pass measured prefill (prompt-processing) speed
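Two of the probes above (strict JSON and no path hallucination) can be checked mechanically. The following is a minimal sketch, not the poster's actual harness: the function names, the required-key set, and the crude regex used to spot path-like tokens are all assumptions made for illustration.

```python
import json
import re


def probe_strict_json(output: str, required_keys: set[str]) -> bool:
    """Pass only if the whole reply is a single JSON object containing the required keys."""
    try:
        parsed = json.loads(output.strip())
    except json.JSONDecodeError:
        # Any chatter around the JSON ("Sure! {...}") fails the strict probe.
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def probe_no_path_hallucination(output: str, known_paths: set[str]) -> bool:
    """Pass only if every path-like token in the reply was in the listing given to the model.

    The regex is a rough heuristic (something ending in ".ext"); it is an
    assumption of this sketch and will misfire on tokens such as "e.g".
    """
    mentioned = set(re.findall(r"[\w./-]+\.\w+", output))
    return mentioned.issubset(known_paths)
```

Running each quant through the same fixed prompts and recording pass/fail per probe gives a small comparison matrix without a full benchmark suite.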
I'm currently working on a Chinese/CCP AI bias benchmark, and this has stood out as an outlier. All the other Minimax models are censored, as is typical for Chinese LLMs.   submitted by   /u/DingyAtoll [link]   [comments]
so we have StepFun MTP, before Gemma MTP (https://github.com/ggml-org/llama.cpp/pull/23398) :)   submitted by   /u/jacek2023 [link]   [comments]
Most agent framework debates skip the first question: Do you need a framework at all? For one agent calling one or two tools, I would usually skip LangGraph, CrewAI, AutoGen, and most orchestration layers. Raw model calls plus structured outputs are easier to inspect, cheaper to run, and less painful to debug. Frameworks start earning their complexity when you need branching control flow, persistent state, retries, human approval gates, memory, multi-agent coordination, or long-running execution.

My rough 2026 map (use case: pick):
Stateful production workflow: LangGraph
Fast multi-agent prototype: CrewAI
RAG-heavy agent: LlamaIndex
Deterministic retrieval pipeline: Haystack
Type-safe Python service: Pydantic AI
Persistent memory assistant: Letta
Code-executing lightweight agents: Smolagents
Browser automation: Browser Use
Open-source coding agent: OpenHands / Goose
TypeScript product: Mastra
Streaming AI UI: Vercel AI SDK

My personal rule:
If the workflow is simple, avoid the framework.
If the workflow needs state, approvals, retries, audit trails, or complex routing, use LangGraph.
If the goal is to prototype a multi-agent role pipeline quickly, use CrewAI.
If retrieval is the real problem, start with LlamaIndex or Haystack before adding an agent layer.
If long-term memory is the product, look at Letta.
If browser control is the job, Browser Use is the more relevant category.

The biggest mistake I see is choosing an agent framework before defining the job. A good agent spec should say what the agent can do, which tools it can call, what state it needs, when a human must approve, and what failure looks like. Without that, the framework debate is mostly noise.   submitted by   /u/Straight_Stomach812 [link]   [comments]
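The agent spec described in the post can be written down before any framework is chosen. Here is a minimal sketch, assuming a plain dataclass is enough to hold it. The field names and the `needs_framework` heuristic are illustrative assumptions that loosely encode the post's rule of thumb; they are not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Hypothetical spec covering the five questions listed in the post."""
    purpose: str                                                     # what the agent can do
    tools: list[str]                                                 # which tools it can call
    state_keys: list[str] = field(default_factory=list)              # what state it needs
    approval_required_for: list[str] = field(default_factory=list)   # when a human must approve
    failure_conditions: list[str] = field(default_factory=list)      # what failure looks like

    def needs_framework(self) -> bool:
        """Rough heuristic: state, approval gates, or more than two tools
        suggest an orchestration layer may start to pay for itself."""
        return bool(self.state_keys or self.approval_required_for) or len(self.tools) > 2
```

A spec with one tool and no state stays on the "raw model calls plus structured outputs" side of the rule; adding an approval gate tips it toward a framework.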
  submitted by   /u/Helpful_Today7449 [link]   [comments]
https://prismml.com/news/bonsai-image-4b   submitted by   /u/Addyad [link]   [comments]
now you can enable/disable/limit thinking (check the video)   submitted by   /u/jacek2023 [link]   [comments]
I have been running a local setup for document QA, and the output quality varies a lot depending on what the PDF looks like when it hits the LLM. Clean prose docs are fine, but anything with tables or multi-column layouts comes out garbled, and the model just works with whatever broken input it got (no complaints, no demands sort of thing). I tried pymupdf and pdfplumber, and both were decent for simple stuff. Now I'm stuck trying to figure out whether to go with docling or llamaparse for the messier docs. Both keep coming up, but I can't tell which actually makes sense for my setup, or whether there's something else people are using locally that holds up better. What's your take on these? Which one would be more practical?   submitted by   /u/TangeloOk9486 [link]   [comments]
Just released a deep benchmark of 8 tiny LLMs (135M → ~1B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN.

Hardware:
- NVIDIA Ampere GPU: 1024 CUDA cores, 32 Tensor cores
- 6× Arm Cortex-A78AE CPU @ 1.728 GHz
- 8 GB LPDDR5 @ 204.8 GB/s (unified CPU + GPU, no VRAM split)
- Active fan cooling: peak junction temp stayed ≤ 73 °C across every run

Stack:
- JetPack R36.4.7 (Ubuntu 22.04), CUDA 12.6
- llama.cpp CUDA backend, all layers on GPU (-ngl 99)
- Load: NVIDIA aiperf, 20 requests per combo, 12 prompt × gen combos per model
- Power measured via the tegrastats VDD_CPU_GPU_CV rail at 500 ms intervals

Brief methodology:
- Sweep: prompt ∈ {128, 512, 1024, 2048} tokens × gen ∈ {64, 128, 256} tokens × 4 power modes = 48 benchmark cells per model, 384 across the 8 models.
- Key metric: output tok/J = tokens generated per joule of compute energy

Findings:
Key finding: 25W is the Pareto-optimal mode for every model we have tested.
- 36–47% more tok/s than 15W
- 3–26% better output tok/J than 15W
- 8–35% better output tok/J than MAXN
More clocks ≠ more efficiency. MAXN costs ~17% more power for marginal throughput gains.

Sub-1B standouts at 25W:
- SmolLM2-135M: 165 tok/s, 22.6 tok/J (best in suite), 101 MB, ~5.4W.
- LFM2.5-350M: 120 tok/s in 219 MB. Matches SmolLM2-360M (369 MB) at about 60% of the size.

~1B class at 25W (ctx=2048, gen=256):
- LFM2.5-1.2B: 54.1 tok/s, 5.26 tok/J, 698 MB. Fastest and best output tok/J in the ~1B class.
- Gemma3-1B: edges ahead on total tok/J (118.5 vs 116.2); lower power draw (6.87W vs 8.46W) compensates for slower decode.
- Llama3.2-1B: 47.0 tok/s, 4.67 tok/J

Full blog with all charts, heatmaps, latency tables, and raw HuggingFace datasets (384 cells × 4 modes) linked in the blog! Do check it out, and if you have a Jetson, what are you running on it? Would love to know! Blog   submitted by   /u/East-Muffin-6472 [link]   [comments]
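The output tok/J metric can be reproduced from sampled rail power. This is a minimal sketch of one way to do it, not the poster's actual pipeline: it assumes the power samples are in milliwatts, evenly spaced at the 500 ms interval mentioned in the post, and integrated with the trapezoidal rule. Both function names are hypothetical.

```python
def energy_joules(power_mw: list[float], interval_s: float = 0.5) -> float:
    """Integrate evenly spaced power samples (milliwatts) into joules (trapezoidal rule)."""
    if len(power_mw) < 2:
        return 0.0
    watts = [p / 1000.0 for p in power_mw]
    # Each adjacent pair of samples contributes (average power) x (interval) joules.
    return sum((a + b) / 2.0 * interval_s for a, b in zip(watts, watts[1:]))


def output_tokens_per_joule(tokens_generated: int, power_mw: list[float],
                            interval_s: float = 0.5) -> float:
    """Tokens generated divided by energy consumed over the sampled window."""
    joules = energy_joules(power_mw, interval_s)
    if joules <= 0:
        raise ValueError("need at least two positive power samples")
    return tokens_generated / joules
```

Because the energy window covers the whole request, including prompt processing, tok/J computed this way will generally be lower than simply dividing decode tok/s by average watts.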
Hey everyone, looking for a sanity check before I commit to an architecture. The goal: a free, fully offline study assistant that runs on a student’s laptop and acts as a tutor for one specific textbook. Not an expert system — more a patient TA that “speaks the language of the book,” answers in its framing and notation, and points the student to where to look (chapter/section/page) and how to find related material. Part of the point is also introducing students to local LLMs as a real study tool. Constraints: offline, free (no API calls), packaged so a non-technical student can install and run it. Assuming a laptop with a dedicated GPU as the realistic minimum. My current thinking (poke holes please): for “grounded in the book + point to where info lives,” RAG looks like the workhorse — chunk the textbook, embed it, retrieve, and force answers from the passages with citations to section/page. I’m skeptical LoRA should carry content; I suspect its value is mostly stylistic/pedagogical (tone, Socratic vs. direct), and that pushing textbook facts into a LoRA is the wrong tool. Right that RAG is the core and LoRA optional? Questions: Best small model for laptop RAG? I’ve had decent luck with Qwen and Gemma — anything better for instruction-following + faithfulness at that size? Chunking a textbook is messy — figures, equations, tables, footnotes. Strategies that preserve structure and keep citations meaningful? Does a LoRA add anything over solid RAG, or is it just style? If it helps, fine-tune on Q&A pairs generated from the book? “Where/how to find it” — just surface retrieved chunk metadata, or something smarter? Packaging for non-technical users — Ollama + a simple local UI? Anything that bundles model + index into near one-click? Happy to report back once it works. Thanks!   submitted by   /u/HomoAgens1 [link]   [comments]
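The "retrieve, then cite section/page" idea in the post above comes down to keeping location metadata attached to every chunk. Below is a minimal sketch that stands in for a real embedding model with plain word overlap; that substitution is an assumption made to keep the example dependency-free. The `Chunk` fields, `retrieve`, and `cite` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    chapter: str
    section: str
    page: int


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a stand-in for an embedding in this sketch."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return up to k chunks ranked by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c.text)), reverse=True)
    # Drop chunks with no overlap at all so unrelated pages are never cited.
    return [c for c in ranked[:k] if q & tokenize(c.text)]


def cite(chunk: Chunk) -> str:
    """Format the pointer the student sees next to an answer."""
    return f"Ch. {chunk.chapter}, §{chunk.section}, p. {chunk.page}"
```

The retrieved chunks would be pasted into the prompt along with their `cite(...)` strings, so the model can only point at locations that actually came back from retrieval.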
I use local AI mainly for creative writing, and I feel like benchmarks are a bit iffy on that. I'd like to compare Gemma mainly to Gemini, as I like their writing the best. I do know that Qwen 3.6 is amazing, but mostly for coding and agentic work. I'd like to ask everyone how the new(er?) models feel to you personally, rather than looking at benchmarks, which they are likely optimised for. For me, Gemma 4 31B (even at Q4) still falls short of 2.5 Pro. I'm most familiar with 2.5 Pro since I used so much of it for free on AI Studio when it was a preview. The style and prose are there, but at long context it still misremembers minor details. I think it's actually better than GPT 4.5, but that could be personal preference since, again, I mostly do creative writing.   submitted by   /u/opoot_ [link]   [comments]
For two weeks I ran my multi-agent orchestrator entirely on Qwen3.6-27B via Ollama, on a single 3090. The goal: see if a local model could replace Claude as the reasoning layer for the lead/manager/sub-agent loop. Here's where it worked and where it broke.

Setup:
- RTX 3090, 24GB VRAM
- Qwen3.6-27B at Q6_K (~22GB on-GPU), 32k effective context
- Ollama as the inference engine
- Multi-agent orchestrator with structured-JSON plans, plan-approval modal, auto-review pass after sub-agent completion
- Tested across 47 multi-step coding workflows over two real repos

What worked (the reasoning layer):
- Plan generation. Qwen3.6 generated multi-step plans roughly as well as Claude on these tasks. Slightly more conservative (fewer unsolicited "let me also refactor X" steps), but coherent and schema-valid at ~95% after a few prompt tweaks. The remaining 5% were schema errors fixable with one re-prompt.
- Memory extraction. Mem0-style fact extraction every 6 turns worked fine. Qwen pulled out the same kinds of facts Claude does ("user prefers no comments unless they explain a 'why'") and stored them cleanly in Qdrant.
- Auto-review of sub-agent output. A second Qwen instance reviewing the first one's code caught roughly 60% of the bugs Claude's review caught on the same set. Less savage. Still useful and free.

Where it broke:
- Tool-call reliability. Qwen3.6's JSON tool-call output had a ~12% format error rate across the 47 tasks. Claude was ~0.5% on the same workload. The errors weren't malformed JSON; they were wrong field names, wrong types, hallucinated tool signatures. Outlines / strict-output mode reduced it but didn't kill it.
- Long-context drift. Past ~14k tokens of accumulated session context, Qwen started misremembering decisions it had made earlier ("you said use Postgres"; no, I said the opposite). Hard practical limit ~12k tokens, then aggressive summarize-and-reset.
- Cascade-failure handling. When a sub-agent failed, Claude's planner usuall
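The tool-call failure modes named above (wrong field names, wrong types, hallucinated tool signatures) are all detectable before a call is executed. What follows is a minimal sketch of such a gate, not the poster's orchestrator: the `TOOLS` registry, its two tools, and the `{"name": ..., "arguments": ...}` call shape are assumptions made for illustration.

```python
import json

# Hypothetical registry: tool name -> expected argument names and Python types.
TOOLS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def validate_tool_call(raw: str, tools: dict = TOOLS) -> list[str]:
    """Return a list of problems with a model-emitted tool call; empty means OK."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(call, dict):
        return ["tool call is not a JSON object"]

    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in tools:
        return [f"unknown tool: {name}"]  # hallucinated tool signature
    if not isinstance(args, dict):
        return ["arguments is not a JSON object"]

    expected = tools[name]
    errors = [f"unexpected field: {key}" for key in args if key not in expected]
    for key, typ in expected.items():
        if key not in args:
            errors.append(f"missing field: {key}")
        elif not isinstance(args[key], typ):
            errors.append(f"wrong type for {key}")
    return errors
```

A non-empty error list can be fed straight back to the model as the re-prompt, and counting those lists over a task set gives a format error rate like the one quoted in the post.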
Do you have config recommendations? Apparently you need to set up the following in models.json to get proper thinking: "compat": { "supportsDeveloperRole": false, "supportsReasoningEffort": true, "thinkingFormat": "qwen-chat-template", "supportsStrictMode": false, "maxTokensField": "max_tokens" }, Is that still true? https://github.com/earendil-works/pi/issues/2020 What do you use as extensions? I designed a few for my current assistant (nanobot). I checked pi-acp to use it with openacp and telegram, and I'm going to look at: - websearch - memory? - todo list? - others?   submitted by   /u/Nyghtbynger [link]   [comments]
Joining this community sparked a new hobby and an interest in software engineering that I had lost. So I made this dual RTX 3090 build, mostly for inference. I know I won't be replacing ChatGPT anytime soon, but what tool stack would help make it usable in a work environment? Any must-have MCP servers or custom tools/scripts? I'm currently using VS Code preview with Qwen3.6 27B and an nginx server. I'm mostly interested in agentic work with usable context, or at least better knowledge of the code base (RAG pipeline?). This has already been such a helpful community. Hopefully local LLMs continue to grow, because I fear cloud will become unaffordable at the consumer level.   submitted by   /u/Sufficient_Phone_242 [link]   [comments]
I have not used MTP yet. Are the mmproj files different, and could they give a speedup? Are they compatible between MTP and non-MTP models? E.g. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/mmproj-BF16.gguf vs. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/blob/main/mmproj-BF16.gguf They differ by kv_count (and, as far as I can tell, by nothing else in the metadata; the size is the same). Surprisingly, the older one has 35 while the "MTP" variant has fewer: 33.   submitted by   /u/alex20_202020 [link]   [comments]
Hey! Does anyone have experience with the model below? Its supposed to be an object detection model, and I am working on a research project that would involve counting sets of plants in a warehouse. Based on my limited testing, this thing seems to be working quite well, but I am looking for anyone who might have used this more extensively 😄. https://huggingface.co/nvidia/LocateAnything-3B   submitted by   /u/Scared-Tip7914 [link]   [comments]
GGUFs: https://huggingface.co/models?library=gguf&other=base_model:quantized:stepfun-ai%2FStep-3.7-Flash&sort=trending Next question probably .... when are we getting MTP support? We have an ongoing PR for Step-3.5-Flash https://github.com/ggml-org/llama.cpp/pull/23274   submitted by   /u/pmttyji [link]   [comments]
Llama benchmark results

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12 |

I've chucked all my notes into an LLM and created an article if you want to recreate the same setup. I am currently using this with oh my pi and it's very usable. I was able to create a well-designed poker game without it going into a loop or hanging/crashing. I've also tried Intel's vLLM before but couldn't get it to this kind of performance for a single request. I see that there are some updates, so I will give that another shot when I have the time. Would love to hear if anyone's running a similar setup with any optimizations I'm missing, or whether anything in there is actually doing nothing. Always looking to squeeze out more. Also, massive thanks to the llama.cpp contributors and everyone working to make local inferencing viable. The fact that I can do this kind of inferencing locally is only possible because of the people building and maintaining this stuff.   submitted by   /u/Atomynos_Atom [link]   [comments]
  submitted by   /u/DeltaSqueezer [link]   [comments]
I got a reminder e-Mail from eBay about a MI50 I had put on my watch list after quite a while. Aside from needing to jerryrig a blower into the back and bootstrapping ROCm - how is it? In fact, what's inference for LLMs like for non-CUDA? I know that image-gen is veeeeery hit or miss (although ComfyUI tries their very best) and TTS is, for all I know, CUDA bound right now. STT - like whisper.cpp - runs well enough on CPUs so that's a non-issue imo. Just curious; trying to spec a build out of curiosity for my homelab. All my previous ones would've blown way past 4k€ - so I keep looking and waiting, trying to hit 2-3k at most. I mostly just want 2-3 parallel inferences on a decent (~30B) model - doubtful I'll ever get good enough hardware for parallel 100B inference. xD So yeah, what's the current situation in non-CUDA-land? Thanks!   submitted by   /u/IngwiePhoenix [link]   [comments]
https://www.reddit.com/r/LocalLLM/comments/1tuf6l1/intel_arc_pro_b70_llamacpp_sycl_63_ts_on_qwen/   submitted by   /u/jacek2023 [link]   [comments]
https://preview.redd.it/xc0l68bj7t4h1.png?width=616&format=png&auto=webp&s=48a8b14bc4ae95700cd4efa76772f4e71fb2d41a https://huggingface.co/nvidia/LocateAnything-3B funny how they left this in the demo atleast it's honest   submitted by   /u/chocofoxy [link]   [comments]
https://huggingface.co/nvidia/Cosmos3-Super-Text2Image Nano: 16B Super: 64B Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning. Haven't seen much here yet. Some twitter discussion: https://x.com/victormustar/status/2061354267546427595   submitted by   /u/RobotRobotWhatDoUSee [link]   [comments]
Moss TTS 1.5 8B is better than Fish Audio S2 Pro and Qwen 3 TTS for voice-clone TTS. You can easily get even better quality if you set the duration of the output voice you want and adjust the temperature and other settings. This sample just used the default settings, so it can be improved further.   submitted by   /u/9r4n4y [link]   [comments]
initial version of official Gemma skills from Google   submitted by   /u/jacek2023 [link]   [comments]
Someone should create llama.ccp (not .cpp) that supports LLMs on Chinese-native hardware (like Huawei's Ascend 950PR), which has been advancing fast in recent months. Just thought the name would be funny.   submitted by   /u/Pancake502 [link]   [comments]
Now this is local AI innovation we can all get behind. https://x.com/stevencheng/status/2059836738449854898   submitted by   /u/No_Information9314 [link]   [comments]
3x 24GB VRAM. Qwen-coder-next is not bad. I'll continue to use it if you yell at me enough. I do a lot of front-end work, which develops rapidly, so the more recent the model, the better. Anything larger than 80B and I'll have to sacrifice either the decentish Q6 quant or the 256k context that I consider the minimum for coding. I do NOT believe that the latest 27-31B dense models can realistically beat an 80B model, even if I stomach the slowness, but change my mind. Slowness is an issue since I do NOT yolo. I micro-manage the heck out of the agent. It's actually more efficient than letting it rip, then having it rip again the next day because it had been climbing the wrong ladder.   submitted by   /u/ParaboloidalCrest [link]   [comments]
  submitted by   /u/Diablo-D3 [link]   [comments]
Currently using cloud models for my browser use and it’s great when it works but it’s one of the last things keeping me subscribed. What are you brilliant people doing to allow agentic browser use? For context M1 ultra Llamacpp w my own UI   submitted by   /u/AdInternational5848 [link]   [comments]
https://huggingface.co/openbmb/MiniCPM5-1B What even is this thing? MiniCPM 4.6 was a tuned Qwen 3.5 0.8B, but this looks like something else. It doesn't have vision, and it apparently has its own tokenizer. The model itself is aware of existence of Qwen 2.5, but says it's not that. Is it a new model from scratch? I don't use agents, but I checked out mradermacher's Heretic Q6_K a bit and it seems to work quite fine. Pretty reasonable and brief thinking, unlike the "but wait" infinite loop of newer Qwens. And its speech pattern seems different from other small models I've tried. Hey, does nobody here get hyped about new tiny models anymore? Where's everybody?   submitted by   /u/WhoRoger [link]   [comments]
I wasn't sure whether to post this here, but a friend of mine said that a lot of researchers lurk in this subreddit and that it might help them, and I think it might also help anyone tinkering with this stuff at home. I don't know how much post-training people do here, but I do see distills, fine-tunes, datasets, and benchmarks getting posted, so I think it might be interesting to you. For context, I work on post-training for agentic and tool-use capabilities, and a while ago I spent a few months almost literally living inside verl, ByteDance's RL post-training framework. I read most of the source and absorbed almost all of its knowledge. As I was working with it, I started wanting a "better" version, something with a better dev experience for me, so I forked it (non-public, and I have since abandoned it). I shipped a lot of fixes and built tooling around it, but at one point I had to stop. It left a hole in my chest, so I finally wrote the whole thing up, as an au revoir to it but also to inherit from it all the knowledge and skill I learned along the way. It's a close read of the parts that actually run an RLHF loop, plus some of the engineering a fork drags in (nothing major), and one debugging story I'm still a little proud of. A quick tour of what the blog post is about:
- The orchestration layer's internals: the data structure (DataProto) that every stage (rollout, reward, advantage, update) passes along, and the API gotchas its names don't warn you about. There's also a half-finished migration to a plain TensorDict underneath it.
- The single-controller pattern: one driver process holds the schedule and fans work out to GPU workers through a "magic attribute" dispatch system. That one is nasty. It took me a long time to wrap my head around it, but now that I have, it feels natural and lets me work on my own little orchestration-layer package with the utmost confidence a
Can we please ban the daily "I have an RTX 3060, what should I run?" slop threads? It's not complicated. As of right now, Hugging Face is empty and exactly two local models exist on this entire planet: Qwen 3.6 35b a3b and Qwen 3.6 27b. That is the entire list. Your specs don't matter. Your use case doesn't matter. Stop coping with your pristine, full-precision Q8s of tiny 1B models just because they "fit perfectly in your VRAM." You look ridiculous. Grab a heavily brain-damaged, ultra-low quant of the 35B, force-feed it to your GPU, and let your system RAM bleed. A garbage quant of a massive model is a bagillion times better than your precious micro-models anyway. Just cram it in. And if you're going to whine that open source is dead because a local model won't instantly rewrite your entire enterprise codebase? Fine. Give up, pull out your credit card, and go spend your money on Claude Code like the rest of the contrarians. Can we pin this so everyone can finally shut up and stop posting? Thanks. Now that that has been solved, let's go touch grass.   submitted by   /u/Wrong_Mushroom_7350 [link]   [comments]
What is the best open model quant that runs on a 3060 12GB for agentic coding, and which closed model is it equivalent to in speed and time?   submitted by   /u/Mother_Desk6385 [link]   [comments]
Check the slides from Computex. Every outlet that reported 600GB/s is completely wrong. That is the NvLink speed like everyone here said.   submitted by   /u/rpiguy9907 [link]   [comments]
System Prompt: You are an expert software developer. Prompt: Task: make a Sonic The Hedgehog-like platform game. Scaffold: none, just a single message in openwebui. Model: Stepfun 3.7 Flash official Q4_K_S. This was the first try; I don't think I've tried this prompt on other models before. Pretty impressed with the control feel and sense of speed.   submitted by   /u/-dysangel- [link]   [comments]
This is more of a quick appreciation post for Qwen 3.6 27B running locally (8-bit unsloth quant). I've been using it mainly alongside my 35B model in OpenCode for planning and coding. I also had it set up in Open WebUI, but until MTP support landed in llama.cpp about two weeks ago, the TPS was so painfully slow on OWUI that it was basically unusable for chat. Since then, I have paired them together and have been using Qwen 27B as a daily chat assistant alongside Gemini Pro, keeping a running mental comparison between the two. For straightforward questions, Gemini handles things fine. But over the weekend I dove into some career advice and company portfolio deep dives, plus some immigration research, and Gemini completely fell apart. It started hallucinating and fixating on things from earlier messages in the conversation and from my previous chats. I think this degradation started over the last couple of weeks or so, and I'd like to hear about others' experience with Gemini lately. I ended up doing a lot of manual research myself. Then I decided to try the same research with Qwen 3.6 27B. I was genuinely surprised by how much better it performed on both the career/company questions and the immigration research. The immigration results really stood out, because it had to actually go through official documentation and make sense of it rather than just regurgitating something. Side note: I've also tried Gemma 4 31B, which I heard is great for research and planning, but it's just too slow with an 8-bit quant on my M5 Max with 128GB. Curious to hear people's opinions on that; maybe once MTP is enabled for it I will try again.   submitted by   /u/Character_Split4906 [link]   [comments]
Them boys can cook, one big fix after another! If you're running --sm tensor on multi-gpu, this is the KV cache quantization fix: https://github.com/ggml-org/llama.cpp/releases/tag/b9455

JohannesGaessler commented 5 days ago:

This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements. The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened the meta backend uses segments to specify the data layout within the flattened dimension so that upon a further reshape the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.   submitted by   /u/Bulky-Priority6824 [link]   [comments]
https://www.scan.co.uk/shop/ai-and-robotics/workstations-ai/nvidia-dgx-station   submitted by   /u/X-N2O [link]   [comments]
https://www.neowin.net/news/computex-2026-intel-launches-crescent-island-gpu-with-up-to-480gb-vram/ Crescent Island is based on the company's Arc Xe 3P architecture which lies inside current Panther Lake iGPs as well. This is Intel's latest, most powerful card and it packs up to 480 GB of VRAM capacity. Unlike typical high-end professional GPUs which rely on HBM for improving power efficiency, the Intel GPU here has LPDDR5X. Cooling on the unit is handled by air cooler that can handle a TDP of 350 watts. Intel says that these cards can deal with next generation AI workloads and come with support for a wide range of datatypes and microscaling formats, from native FP4/MXFP4 to FP64, and more.   submitted by   /u/ANR2ME [link]   [comments]
We just assumed that since it's a GB10 variant, it would have the same memory bandwidth as DGX Spark, 273GB/s. But it's reported that it will have more than double that, 600GB/s. "The unified memory architecture brings up to 128GB of LPDDR5X RAM with a bandwidth of 600GB/s" https://wccftech.com/nvidia-enters-pc-space-with-rtx-spark/ "its memory bandwidth peaks at 600 GB/s" https://www.notebookcheck.net/Nvidia-N1X-officially-confirmed-to-arrive-as-the-RTX-Spark.1312010.0.html   submitted by   /u/fallingdowndizzyvr [link]   [comments]