6 articles
https://prismml.com/news/bonsai-image-4b   submitted by /u/Addyad
I have been running a local setup for document QA, and the output quality varies a lot depending on what the PDF looks like when it hits the LLM. Clean prose docs are fine, but anything with tables or multi-column layouts comes out garbled, and the model just works with whatever broken input it got (no complaints, no demands sort of thing). I had tried pymupdf and pdfplumber, and both were decent for simple stuff, though. Now I'm stuck trying to figure out whether to go with docling or llamaparse for the messier docs. Both keep coming up, but I can't tell which actually makes sense for my setup, or if there's something else people are using locally that holds up better. What's your take on these? Which one would be more practical?   submitted by /u/TangeloOk9486
I use local AI mainly for creative writing, and I feel like benchmarks are a bit iffy on that. I'd like to compare Gemma mainly to Gemini, as I like their writing the best. I do know that Qwen 3.6 is amazing, but mostly for coding and agentic work. I'd like to ask everyone how the new(er?) models feel to you personally, rather than looking at benchmarks, which they are likely optimised for. For me, Gemma 4 31B (even at q4) still falls short of 2.5 Pro. I'm most familiar with 2.5 Pro since I used so much of it for free on AI Studio when it was a preview. The style and prose are there, but at long context it still misremembers minor details. I think it's actually better than GPT 4.5, but that could be personal preference since, again, I mostly do creative writing.   submitted by /u/opoot_
For two weeks I ran my multi-agent orchestrator entirely on Qwen3.6-27B via Ollama, on a single 3090. The goal: see if a local model could replace Claude as the reasoning layer for the lead/manager/sub-agent loop. Here's where it worked and where it broke.

Setup:
- RTX 3090, 24GB VRAM
- Qwen3.6-27B at Q6_K (~22GB on-GPU), 32k effective context
- Ollama as the inference engine
- Multi-agent orchestrator with structured-JSON plans, plan-approval modal, auto-review pass after sub-agent completion
- Tested across 47 multi-step coding workflows over two real repos

What worked (the reasoning layer):
- Plan generation. Qwen3.6 generated multi-step plans roughly as well as Claude on these tasks. Slightly more conservative (fewer unsolicited "let me also refactor X" steps), but coherent and schema-valid at ~95% after a few prompt tweaks. The remaining 5% were schema errors fixable with one re-prompt.
- Memory extraction. Mem0-style fact extraction every 6 turns worked fine. Qwen pulled out the same kinds of facts Claude does ("user prefers no comments unless they explain a 'why'") and stored them cleanly in Qdrant.
- Auto-review of sub-agent output. A second Qwen instance reviewing the first one's code caught roughly 60% of the bugs Claude's review caught on the same set. Less savage. Still useful and free.

Where it broke:
- Tool-call reliability. Qwen3.6's JSON tool-call output had a ~12% format error rate across the 47 tasks. Claude was ~0.5% on the same workload. The errors weren't malformed JSON; they were wrong field names, wrong types, hallucinated tool signatures. Outlines / strict-output mode reduced it but didn't kill it.
- Long-context drift. Past ~14k tokens of accumulated session context, Qwen started misremembering decisions it had made earlier ("you said use Postgres"; no, I said the opposite). Hard practical limit ~12k tokens, then aggressive summarize-and-reset.
- Cascade-failure handling. When a sub-agent failed, Claude's planner usuall
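
The tool-call failure mode described above (valid JSON, but wrong field names, wrong types, or hallucinated tool signatures) can be caught before dispatch by checking each call against a registry of known tools. The following is a minimal sketch, not the orchestrator's actual code: the `TOOL_SCHEMAS` registry, the tool names `read_file` and `run_tests`, and the `name`/`arguments` call shape are all assumptions made for illustration.

```python
import json

# Hypothetical tool registry (an assumption, not from the post):
# tool name -> {argument name: expected Python type}.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of problems found in a raw tool call; empty means valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    args = call.get("arguments")
    # Hallucinated tool signature: the model named a tool that does not exist.
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]
    if not isinstance(args, dict):
        return ["'arguments' must be an object"]

    schema = TOOL_SCHEMAS[name]
    problems: list[str] = []
    # Wrong field names: anything the schema does not declare.
    for key in args:
        if key not in schema:
            problems.append(f"unexpected field: {key!r}")
    # Missing fields and wrong types (exact type match, so True is not an int).
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing field: {key!r}")
        elif type(args[key]) is not expected:
            actual = type(args[key]).__name__
            problems.append(
                f"field {key!r} should be {expected.__name__}, got {actual}"
            )
    return problems
```

A non-empty result could be fed back to the model as the body of a single re-prompt, in the same spirit as the one-re-prompt fix the post mentions for plan schemas; whether that would close the gap on tool calls is not something the post measures.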
Now this is local AI innovation we can all get behind. https://x.com/stevencheng/status/2059836738449854898   submitted by /u/No_Information9314