76 articles
I'm looking for models that can run on my GPU and actually do something useful. I think that any small difference could be a "big" improvement, because they are all so small. So I went to the LM Studio database and searched many variants from the same family, trying to select the newer models. Then I asked Claude to select known benchmarks and run some qualitative tests. Next I'll test with real use cases and then select a "team". Most people who run local models have more powerful machines, but the majority of people barely have a 6GB GPU, so this review may help them. The report follows.

The problem. I want local models doing repetitive overnight work (file organization, tagging, log triage) on a 6GB laptop GPU: zero cost, private, no rate limits. The real question isn't "which model is best" but "which of these specific quants actually fit in 6GB and behave correctly on my tasks." Leaderboard scores don't answer that: they're run on full-precision weights and generic benchmarks, not the Q4/Q6 GGUF you'll actually load.

Why qualitative probing instead of full benchmark suites. Running BFCL-v3/v4 + IFEval + MMLU across 20 models on one 6GB GPU is on the order of days-to-a-week of compute, and most of that signal is already published per model family. What's not published is how a given quant behaves on the exact behaviors I need. So I built a fixed 6-probe set targeting those behaviors: (1) parseable tool-call, (2) multi-turn tool-call (does it chain with the real tool result or hallucinate a placeholder), (3) strict JSON, (4) instruction adherence (IFEval-style), (5) plan decomposition, (6) no path hallucination, plus a GSM8K-style arithmetic check. I judged the outputs directly and triangulated against published BFCL/IFEval to catch quant-level regressions. That turns a week into ~1 hour and tests the thing that actually matters. Then a separate performance pass measured prefill (prompt-processing) speed
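A strict-JSON probe like (3) above could be scored with a check along these lines. This is only a sketch under assumptions: the function name `passes_strict_json` and the example keys are hypothetical, and the post does not say how its probes were actually judged.

```python
import json


def passes_strict_json(raw, required_keys):
    """Return True only if `raw` is a single JSON object with exactly the required keys.

    Surrounding prose or markdown fences make json.loads fail, which is the
    behaviour a strict-JSON probe is meant to catch.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == set(required_keys)


# Hypothetical model outputs for a tagging task.
clean = '{"path": "a.log", "tag": "error"}'
chatty = 'Sure! Here is the JSON: {"path": "a.log", "tag": "error"}'
print(passes_strict_json(clean, {"path", "tag"}))
print(passes_strict_json(chatty, {"path", "tag"}))
```

A pass/fail check of this kind is cheap enough to run on every quant, which fits the goal of replacing a week of benchmarking with a short probe pass.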
I'm currently working on a Chinese/CCP AI bias benchmark, and this has stood out as an outlier. All the other Minimax models are censored, as is typical for Chinese LLMs.   submitted by   /u/DingyAtoll [link]   [comments]
so we have StepFun MTP, before Gemma MTP (https://github.com/ggml-org/llama.cpp/pull/23398) :)   submitted by   /u/jacek2023 [link]   [comments]
Most agent framework debates skip the first question: Do you need a framework at all? For one agent calling one or two tools, I would usually skip LangGraph, CrewAI, AutoGen, and most orchestration layers. Raw model calls plus structured outputs are easier to inspect, cheaper to run, and less painful to debug. Frameworks start earning their complexity when you need branching control flow, persistent state, retries, human approval gates, memory, multi-agent coordination, or long-running execution.

My rough 2026 map:

Use case | Pick
Stateful production workflow | LangGraph
Fast multi-agent prototype | CrewAI
RAG-heavy agent | LlamaIndex
Deterministic retrieval pipeline | Haystack
Type-safe Python service | Pydantic AI
Persistent memory assistant | Letta
Code-executing lightweight agents | Smolagents
Browser automation | Browser Use
Open-source coding agent | OpenHands / Goose
TypeScript product | Mastra
Streaming AI UI | Vercel AI SDK

My personal rule:
- If the workflow is simple, avoid the framework.
- If the workflow needs state, approvals, retries, audit trails, or complex routing, use LangGraph.
- If the goal is to prototype a multi-agent role pipeline quickly, use CrewAI.
- If retrieval is the real problem, start with LlamaIndex or Haystack before adding an agent layer.
- If long-term memory is the product, look at Letta.
- If browser control is the job, Browser Use is the more relevant category.

The biggest mistake I see is choosing an agent framework before defining the job. A good agent spec should say what the agent can do, which tools it can call, what state it needs, when a human must approve, and what failure looks like. Without that, the framework debate is mostly noise.   submitted by   /u/Straight_Stomach812 [link]   [comments]
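The agent spec described in that post could be written down as a small data structure before any framework is chosen. The sketch below is an illustration only: `AgentSpec`, its field names, and the `needs_orchestration` heuristic are all hypothetical and are not taken from any of the frameworks listed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentSpec:
    """One record per agent: what it does, what it may call, and when it stops."""
    purpose: str
    allowed_tools: List[str]
    state_keys: List[str] = field(default_factory=list)          # state that must persist
    approval_required: List[str] = field(default_factory=list)   # actions needing a human
    failure_conditions: List[str] = field(default_factory=list)  # what failure looks like

    def needs_orchestration(self) -> bool:
        # Rough reading of the post's rule: persistent state or approval gates
        # are the point where a framework starts to earn its complexity.
        return bool(self.state_keys or self.approval_required)


simple = AgentSpec(purpose="summarise a log file", allowed_tools=["read_file"])
gated = AgentSpec(
    purpose="apply database migrations",
    allowed_tools=["run_sql"],
    state_keys=["applied_migrations"],
    approval_required=["run_sql"],
    failure_conditions=["migration fails twice"],
)
print(simple.needs_orchestration(), gated.needs_orchestration())
```

Filling in a record like this first makes the framework question concrete: an empty `state_keys` and `approval_required` suggests raw model calls are probably enough.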
  submitted by   /u/Helpful_Today7449 [link]   [comments]
https://prismml.com/news/bonsai-image-4b   submitted by   /u/Addyad [link]   [comments]
My paper is around 33 pages, but the TPAMI guideline says it should be 20 pages. Does anyone know which is correct? (There's a mistake above; it's TPAMI.)   submitted by   /u/Alternative_Art2984 [link]   [comments]
now you can enable/disable/limit thinking (check the video)   submitted by   /u/jacek2023 [link]   [comments]
Earlier I posted in a “Who wants to be hired?” thread, looking for a place where I could apply my experience in hospitality, food tech and automation. A couple hours later I received an email:

“Hi Ilia, I saw your comment on the June Who’s Hiring thread. I build production-ready TypeScript and Python systems that integrate LLMs into real workflows, with particular focus on RAG, agent orchestration, and clear blah-blah-blah”

Come on. I am a forced immigrant with a wife, a cat, rent and crushing debt, who’s been unemployed for 6 months. I am naturally an extremely optimistic person, but boy is energy on the low by now. And every e-mail in my inbox, especially one starting with something related to my job search, is a glimmer of hope. Just to be crushed by what comes next. Yes, it’s a minor cut, but those compound.

Please just don’t do this.

Maybe add a skill to your Claude Code called “empathy”? You can have your Claw access a “be considerate of other people’s experiences” MCP server! Or just ask your “Daily Grind Reminder” Telegram bot to recommend a good book of fiction from time to time. Just to develop some humanity.

Sorry for venting.

Comments URL: https://news.ycombinator.com/item?id=48370330 Points: 377 # Comments: 93
I was wondering if someone knew state of the art research about the hallucination problem for document search with LLMs. I know for example in math you can use some verifier to check a proof. What about document search with LLMs, when I feed them documents?   submitted by   /u/Saladino93 [link]   [comments]
Article URL: https://coveillance.org/a-walking-tour-of-surveillance-infrastructure-in-seattle/ Comments URL: https://news.ycombinator.com/item?id=48369980 Points: 142 # Comments: 48
I have been running a local setup for document QA, and the output quality varies a lot depending on what the PDF looks like when it hits the LLM. Clean prose docs are fine, but anything with tables or multi-column layouts comes out garbled, and the model just works with whatever broken input it got (no complaints, no demands sort of thing). I had tried pymupdf and pdfplumber, and both were decent for simple stuff. Now I'm stuck trying to figure out whether to go with docling or llamaparse for the messier docs. Both keep coming up, but I can't tell which actually makes sense for my setup, or if there's something else people are using locally that holds up better. What's your take on these? Which one would be more practical?   submitted by   /u/TangeloOk9486 [link]   [comments]
Just released a deep benchmark of 8 tiny LLMs (135M → ~1B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN.

Hardware:
- NVIDIA Ampere GPU: 1024 CUDA cores, 32 Tensor cores
- 6× Arm Cortex-A78AE CPU @ 1.728 GHz
- 8 GB LPDDR5 @ 204.8 GB/s (unified CPU + GPU, no VRAM split)
- Active fan cooling; peak junction temp stayed ≤ 73 °C across every run

Stack:
- JetPack R36.4.7 (Ubuntu 22.04), CUDA 12.6
- llama.cpp CUDA backend, all layers on GPU (-ngl 99)
- Load: NVIDIA aiperf, 20 requests per combo, 12 prompt × gen combos per model
- Power measured via tegrastats VDD_CPU_GPU_CV rail at 500ms intervals

Brief methodology:
- Sweep: prompt ∈ {128, 512, 1024, 2048} tokens × gen ∈ {64, 128, 256} tokens × 4 power modes = 48 benchmark cells per model, 384 cells across the 8 models.
- Key metric: output tok/J = tokens generated per joule of compute energy

Findings:

Key finding: 25W is the Pareto-optimal mode for every model we have tested.
- 36–47% more tok/s than 15W
- 3–26% better output tok/J than 15W
- 8–35% better output tok/J than MAXN
- More clocks ≠ more efficiency. MAXN costs ~17% more power for marginal throughput gains.

Sub-1B standouts at 25W:
- SmolLM2-135M: 165 tok/s, 22.6 tok/J (best in suite), 101 MB, ~5.4W.
- LFM2.5-350M: 120 tok/s in 219 MB. Matches SmolLM2-360M (369 MB) at less than half the size.

~1B class at 25W (ctx=2048, gen=256):
- LFM2.5-1.2B: 54.1 tok/s, 5.26 tok/J, 698 MB; fastest and best output tok/J in the ~1B class
- Gemma3-1B: edges ahead on total tok/J (118.5 vs 116.2); lower power draw (6.87W vs 8.46W) compensates for slower decode
- Llama3.2-1B: 47.0 tok/s, 4.67 tok/J

Full blog with all charts, heatmaps, latency tables, and raw HuggingFace datasets (384 cells across the 4 modes) linked in the blog! Do check it out, and if you have a Jetson, what are you running on it? Would love to know! Blog   submitted by   /u/East-Muffin-6472 [link]   [comments]
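The tok/J metric in that post can be illustrated with a small calculation from sampled power readings. This is a sketch under assumptions: it integrates evenly spaced power samples with a rectangle rule, the function name is hypothetical, and the post does not state how its energy was actually accounted (for example, whether prefill time or idle power is included). The sample values are invented.

```python
def output_tokens_per_joule(tokens_generated, power_samples_w, interval_s=0.5):
    """Tokens generated per joule, from evenly spaced power samples in watts.

    Energy is approximated as sum(P_i) * interval_s (rectangle rule), with
    interval_s = 0.5 matching the 500 ms sampling interval mentioned above.
    """
    energy_j = sum(power_samples_w) * interval_s
    if energy_j <= 0:
        raise ValueError("need positive measured energy")
    return tokens_generated / energy_j


# Invented example: 256 output tokens while the rail reads a steady 8 W for 5 s.
samples = [8.0] * 10
print(output_tokens_per_joule(256, samples))  # 256 tokens / 40 J
```

Because energy is power multiplied by time, a mode that draws more watts can still win on tok/J if it finishes the same generation proportionally faster.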
Third in a series of papers tracking learning rules vs. human fMRI (THINGS dataset, V1–IT, N=3 subjects). Previous finding: untrained CNNs match backprop at V1. This paper asks: when does training break that, and does the learning rule matter?

Setup: RSA alignment measured at 8 checkpoints (epochs 0, 1, 2, 5, 10, 20, 30, 40), 5 seeds per rule, same architecture throughout.

Main findings:
- BP drops 90% of V1 alignment after one epoch (r: 0.102 → 0.011, p = 0.031, consistent across all 5 seeds). FA drops 49%. PC and STDP drop only 25–31% and stabilise.
- By epoch 40: PC (r = 0.064) > STDP (0.059) >> BP (0.022) ≈ FA (0.019). Cohen's d > 5 for PC/STDP vs BP: extremely consistent across seeds.
- Opposing trend at LOC: BP shows a small increase in object-selective cortex alignment (+0.011) while local rules show nothing. Suggests a fundamental trade-off: global error signals build higher representations but destroy early ones.
- Degradation rate tracks error signal globality: exact gradients (BP) > random feedback (FA) > local prediction errors (PC, STDP).

Limitations worth noting:
- 5 seeds caps permutation test resolution at p ≈ 0.031
- Training on 32×32 CIFAR-10, evaluated on 224×224 THINGS; resolution/domain shift is a confound
- LOC increase not tested for significance, treated as suggestive

Paper: arxiv.org/abs/2605.30556 Companion: arxiv.org/abs/2604.16875 Code: github.com/nilsleut

Curious whether anyone has seen similar dynamics in larger architectures; the prediction would be that deeper models show the same pattern but more slowly.   submitted by   /u/ConfusionSpiritual19 [link]   [comments]
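The p ≈ 0.031 floor mentioned in the limitations is what an exact one-sided sign-flip permutation test gives with 5 paired observations: there are 2^5 = 32 sign patterns, so the smallest attainable p-value is 1/32 = 0.03125. The sketch below assumes that is the kind of test used (the post does not specify it), and the per-seed differences are invented for illustration.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact one-sided sign-flip permutation p-value for paired differences.

    Counts the sign patterns whose summed differences are at least as large
    as the observed sum, out of all 2**n patterns.
    """
    observed = sum(diffs)
    patterns = list(product((1, -1), repeat=len(diffs)))
    extreme = sum(
        1 for signs in patterns
        if sum(s * d for s, d in zip(signs, diffs)) >= observed
    )
    return extreme / len(patterns)


# Invented per-seed drops in alignment (epoch 0 minus epoch 1), all positive.
hypothetical_drops = [0.09, 0.08, 0.10, 0.095, 0.085]
print(sign_flip_p_value(hypothetical_drops))  # 0.03125, the 1/32 floor
```

When every seed moves in the same direction, only the unflipped pattern reaches the observed sum, so the test cannot report anything below 1/32 no matter how large the effect is. More seeds would be needed to resolve smaller p-values.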
Hey everyone, looking for a sanity check before I commit to an architecture.

The goal: a free, fully offline study assistant that runs on a student’s laptop and acts as a tutor for one specific textbook. Not an expert system; more a patient TA that “speaks the language of the book,” answers in its framing and notation, and points the student to where to look (chapter/section/page) and how to find related material. Part of the point is also introducing students to local LLMs as a real study tool.

Constraints: offline, free (no API calls), packaged so a non-technical student can install and run it. Assuming a laptop with a dedicated GPU as the realistic minimum.

My current thinking (poke holes please): for “grounded in the book + point to where info lives,” RAG looks like the workhorse: chunk the textbook, embed it, retrieve, and force answers from the passages with citations to section/page. I’m skeptical LoRA should carry content; I suspect its value is mostly stylistic/pedagogical (tone, Socratic vs. direct), and that pushing textbook facts into a LoRA is the wrong tool. Right that RAG is the core and LoRA optional?

Questions:
- Best small model for laptop RAG? I’ve had decent luck with Qwen and Gemma. Anything better for instruction-following + faithfulness at that size?
- Chunking a textbook is messy: figures, equations, tables, footnotes. Strategies that preserve structure and keep citations meaningful?
- Does a LoRA add anything over solid RAG, or is it just style? If it helps, fine-tune on Q&A pairs generated from the book?
- “Where/how to find it”: just surface retrieved chunk metadata, or something smarter?
- Packaging for non-technical users: Ollama + a simple local UI? Anything that bundles model + index into near one-click?

Happy to report back once it works. Thanks!   submitted by   /u/HomoAgens1 [link]   [comments]
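The "surface retrieved chunk metadata" option from that post can be sketched as follows. Everything here is an assumption for illustration: the `Chunk` fields, the sample passages, and the scoring, which uses plain word overlap as a stand-in for the embedding similarity a real RAG pipeline would use.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    chapter: int
    section: str
    page: int


def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query (stand-in for embeddings)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def cite(chunk):
    """Format the location a student should open in the book."""
    return f"Ch. {chunk.chapter}, §{chunk.section}, p. {chunk.page}"


# Invented textbook passages.
book = [
    Chunk("entropy measures uncertainty in a distribution", 2, "2.1", 34),
    Chunk("gradient descent updates parameters step by step", 5, "5.3", 120),
]
top = retrieve("what is entropy", book, k=1)[0]
print(cite(top))
```

Keeping chapter, section, and page on every chunk at indexing time means the citation never depends on the model's output, so the pointer stays correct even when the generated answer is imperfect.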
Article URL: https://www.mitmllc.com/blog/apple-rejected-my-dictation-app/ Comments URL: https://news.ycombinator.com/item?id=48369088 Points: 160 # Comments: 100
I use local AI mainly for creative writing, and I feel benchmarks are a bit iffy on that. I’d like to compare Gemma mainly to Gemini, as I like their writing the best. I do know that Qwen 3.6 is amazing, but mostly for coding and agentic work. I’d like to ask everyone how the new(er?) models feel to you personally, rather than looking at benchmarks, which they are likely optimised for. For me, Gemma 4 31B (even q4) still falls short of 2.5 Pro. I’m most familiar with 2.5 Pro since I used so much of it for free on AI Studio when it was a preview. The style and prose are there, but at long context it still misremembers minor details. I think it’s actually better than GPT 4.5, but that could be personal preference since, again, I mostly do creative writing.   submitted by   /u/opoot_ [link]   [comments]
For two weeks I ran my multi-agent orchestrator entirely on Qwen3.6-27B via Ollama, on a single 3090. The goal: see if a local model could replace Claude as the reasoning layer for the lead/manager/sub-agent loop. Here's where it worked and where it broke.

Setup:
- RTX 3090, 24GB VRAM
- Qwen3.6-27B at Q6_K (~22GB on-GPU), 32k effective context
- Ollama as the inference engine
- Multi-agent orchestrator with structured-JSON plans, plan-approval modal, auto-review pass after sub-agent completion
- Tested across 47 multi-step coding workflows over two real repos

What worked (the reasoning layer):
- Plan generation. Qwen3.6 generated multi-step plans roughly as well as Claude on these tasks. Slightly more conservative (fewer unsolicited "let me also refactor X" steps), but coherent and schema-valid at ~95% after a few prompt tweaks. The remaining 5% were schema errors fixable with one re-prompt.
- Memory extraction. Mem0-style fact extraction every 6 turns worked fine. Qwen pulled out the same kinds of facts Claude does ("user prefers no comments unless they explain a 'why'") and stored them cleanly in Qdrant.
- Auto-review of sub-agent output. A second Qwen instance reviewing the first one's code caught roughly 60% of the bugs Claude's review caught on the same set. Less savage. Still useful and free.

Where it broke:
- Tool-call reliability. Qwen3.6's JSON tool-call output had a ~12% format error rate across the 47 tasks. Claude was ~0.5% on the same workload. The errors weren't malformed JSON; they were wrong field names, wrong types, hallucinated tool signatures. Outlines / strict-output mode reduced it but didn't kill it.
- Long-context drift. Past ~14k tokens of accumulated session context, Qwen started misremembering decisions it had made earlier ("you said use Postgres"; no, I said the opposite). Hard practical limit ~12k tokens, then aggressive summarize-and-reset.
- Cascade-failure handling. When a sub-agent failed, Claude's planner usuall
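The tool-call failures described in that post (valid JSON but wrong field names, wrong types, or invented tools) can be caught before execution with a signature check. This is a sketch under assumptions: the `TOOLS` registry, the tool names, and the `name`/`arguments` call shape are hypothetical and do not reflect the poster's orchestrator or any specific API.

```python
import json

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOLS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "verbose": bool},
}


def validate_tool_call(raw):
    """Return a list of problems with a model-emitted tool call (empty if valid)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]
    name = call.get("name")
    if name not in TOOLS:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    signature = TOOLS[name]
    errors = [f"unexpected field: {key}" for key in args if key not in signature]
    for key, expected in signature.items():
        if key not in args:
            errors.append(f"missing field: {key}")
        elif not isinstance(args[key], expected):
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    return errors


print(validate_tool_call('{"name": "read_file", "arguments": {"file": "a.py"}}'))
```

Returning the error list rather than raising makes it easy to feed the problems back to the model in a single re-prompt, which is one way to handle format errors that strict-output modes do not fully eliminate.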
Do you have config recommendations? Apparently you need to set the following in models.json to get proper thinking:

"compat": { "supportsDeveloperRole": false, "supportsReasoningEffort": true, "thinkingFormat": "qwen-chat-template", "supportsStrictMode": false, "maxTokensField": "max_tokens" },

Is it still true? https://github.com/earendil-works/pi/issues/2020

What do you use as extensions? I designed a few for my current assistant (nanobot):
- I checked pi-acp to use it with openacp and telegram, and I'm going to look at websearch
- memory?
- todo list?
- others?

  submitted by   /u/Nyghtbynger [link]   [comments]
https://arxiv.org/abs/2605.29713 The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer Tianhua Chen Abstract: "This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students."   submitted by   /u/Nunki08 [link]   [comments]
Article URL: https://www.businessinsider.com/big-short-michael-burry-spacex-anthropic-ipo-ai-bubble-claude-2026-6 Comments URL: https://news.ycombinator.com/item?id=48368187 Points: 110 # Comments: 140
Article URL: https://blog.adafruit.com/ Comments URL: https://news.ycombinator.com/item?id=48368121 Points: 168 # Comments: 52
Joining this community sparked a new hobby and an interest in software engineering that I had lost. So I made this dual RTX 3090 build, mostly for inference. I know I won’t be replacing ChatGPT anytime soon, but what tool stack would help make it usable in a work environment? MCP servers, or custom tools/scripts? Currently using VS Code preview with qwen3.6 27b and an nginx server. I'm mostly interested in agentic work with usable context, or at least a better knowledge of the code base (RAG pipeline?). This has already been such a helpful community. Hopefully local LLMs continue to grow, because I fear cloud will become unaffordable at a consumer level.   submitted by   /u/Sufficient_Phone_242 [link]   [comments]
Article URL: https://seths.blog/2026/06/stop-ruining-it/ Comments URL: https://news.ycombinator.com/item?id=48368059 Points: 131 # Comments: 48
I have not used MTP yet. Are the mmproj files different, and could they give a speed-up? Are they compatible between MTP and non-MTP models? E.g. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/mmproj-BF16.gguf vs. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/blob/main/mmproj-BF16.gguf They differ by kv_count (and, as far as I can tell, by nothing else in the metadata; the size is the same). Surprisingly, the older one has 35 while the "MTP" variant has fewer: 33.   submitted by   /u/alex20_202020 [link]   [comments]
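The kv_count being compared in that post lives in the fixed-size GGUF file header, so it can be read without loading the model. The sketch below assumes the GGUF v2+ little-endian layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count); the function and class names are hypothetical, and the example bytes are synthetic rather than taken from either file.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    kv_count: int


def parse_gguf_header(data):
    """Parse the first 24 bytes of a GGUF file (assumes v2+ little-endian layout)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return GGUFHeader(version, tensor_count, kv_count)


def read_gguf_header(path):
    """Read only the header bytes from a file on disk."""
    with open(path, "rb") as fh:
        return parse_gguf_header(fh.read(24))


# Synthetic header for illustration: version 3, 10 tensors, 33 metadata entries.
fake = struct.pack("<4sIQQ", b"GGUF", 3, 10, 33)
print(parse_gguf_header(fake))
```

The header only gives the count. Finding which two keys differ between the files would require walking the metadata section itself, for example with a GGUF reader library.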
I'm presenting a poster there and have registration covered, but they are placing me on a waitlist for travel funds. As my travel depends on whether I get the travel grant, I need to get this off my mind: either invite me or just say no. I've been waiting forever for this, and now there's more waiting? Should I ask for a decision, or what should I do?   submitted by   /u/Active-Tip3130 [link]   [comments]
Hey! Does anyone have experience with the model below? It's supposed to be an object detection model, and I am working on a research project that would involve counting sets of plants in a warehouse. Based on my limited testing, this thing seems to be working quite well, but I am looking for anyone who might have used it more extensively 😄. https://huggingface.co/nvidia/LocateAnything-3B   submitted by   /u/Scared-Tip7914 [link]   [comments]
Article URL: https://ianthehenry.com/posts/why-janet/ Comments URL: https://news.ycombinator.com/item?id=48367907 Points: 103 # Comments: 35
Article URL: https://blog.tjll.net/you-dont-love-systemd-timers-enough/ Comments URL: https://news.ycombinator.com/item?id=48367904 Points: 127 # Comments: 71
GGUFs: https://huggingface.co/models?library=gguf&other=base_model:quantized:stepfun-ai%2FStep-3.7-Flash&sort=trending Next question probably .... when are we getting MTP support? We have an ongoing PR for Step-3.5-Flash https://github.com/ggml-org/llama.cpp/pull/23274   submitted by   /u/pmttyji [link]   [comments]
Article URL: https://eyeball.rory.codes/ Comments URL: https://news.ycombinator.com/item?id=48367723 Points: 134 # Comments: 50
I built CVE-Bench: 20 real-world CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, others), 5 frontier models, 3 prompt conditions, 300 runs total. Each agent runs in a sandboxed container and is scored against a hidden test_security.py derived from the maintainer's own fix. Binary pass/fail (a 90%-patched vulnerability is still a vulnerability). To better understand failure modes, I've tested three prompt conditions: advisory (full GHSA report), diagnose (exploit description only, no file or function), and locate (exact file and function, no description of the flaw). The three conditions test meaningfully different things. A model that does well on advisory but drops on diagnose can’t translate a behavioral description into a location in the codebase. A model that holds up on locate is recognizing dangerous code on its own. The leaderboard isn't the finding. Best solve rate is 50% overall, 60% under advisory. Cross-family separation (OpenAI vs Laguna) is confirmed under McNemar's test with continuity correction (all four pairs cross α = 0.05). Within-family gaps are noise: a power analysis puts the task count needed to detect a meaningful within-family edge at ~700. That cuts both ways: if the expensive models had a large true advantage, 20 tasks would have been enough to surface it. gpt-5.5 at 12× the cost of gpt-5.4-mini is not the rational choice. All four cross-family pairwise comparisons reach statistical significance at α = 0.05 (McNemar test with continuity correction, n = 60 tasks per model pair): gpt-5.5 vs laguna-m.1 (p = 0.015), gpt-5.4-nano vs laguna-m.1 (p = 0.017), gpt-5.5 vs laguna-xs.2 (p = 0.028), gpt-5.4-nano vs laguna-xs.2 (p = 0.040). Within-family comparisons remain far from significance; those rankings should be read as approximate. The failure taxonomy is the most interesting finding. Wrong-search drift — model finds the right file early, makes one incorrect inference, spends the remaining turns chasing it. Budget expi
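For readers unfamiliar with the test named in that post, McNemar's test with continuity correction compares two models on the same tasks using only the discordant pairs: b tasks solved by the first model but not the second, and c tasks solved by the second but not the first. The statistic is (|b − c| − 1)² / (b + c), referred to a chi-square distribution with 1 degree of freedom. The sketch below uses only the standard library; the counts are invented and are not the benchmark's data.

```python
import math


def mcnemar_corrected(b, c):
    """McNemar's test with continuity correction.

    b: pairs where only model A passed; c: pairs where only model B passed.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: of 60 paired runs, A alone passed 10 and B alone passed 2.
stat, p = mcnemar_corrected(10, 2)
print(round(stat, 3), round(p, 3))
```

Tasks that both models pass or both fail do not enter the statistic at all, which is part of why a 20-task benchmark has little power to separate models whose results mostly agree.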
Llama benchmark results

model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s
qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02
qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12

I've chucked all my notes in an LLM and created an article if you want to recreate the same setup. I am currently using this with oh my pi and it's very usable. I was able to create a well-designed poker game without it going in a loop or hanging/crashing. I've also tried Intel's vLLM before but couldn't get it to this kind of performance for a single request. I see that there are some updates, so I will give that another shot when I have the time. Would love to hear if anyone's running a similar setup with any optimizations I'm missing, or anything in there that's actually doing nothing. Always looking to squeeze out more. Also massive thanks to the llama.cpp contributors and everyone working to make local inferencing viable. The fact that I can do this kind of inferencing locally is only possible because of the people building and maintaining this stuff.   submitted by   /u/Atomynos_Atom [link]   [comments]
https://preview.redd.it/se5nr2z7tt4h1.png?width=3046&format=png&auto=webp&s=7db15b73afb749da236e5bb50ff96372f6a3239b Hi, Niels here from the open-source team at Hugging Face. It's been 2 weeks since I launched paperswithcode.co, a revival of the website we all loved. It allows us to keep track of the state-of-the-art (SOTA) across various domains of AI, from agents to computer vision and time-series forecasting. I've just added conference support as a new feature. The idea is that you should be able to easily browse all papers of major AI conferences like NeurIPS, CVPR, and ICML. As CVPR 2026 takes place next week in Denver, USA, I've indexed all papers with corresponding arXiv IDs. They are categorized by task, and tagged with linked GitHub and project page URLs, Hugging Face artifacts, and evals. You can also browse the papers which were accepted for an Oral presentation as well as the Spotlight papers. You can try it at https://paperswithcode.co/conferences! Feel free to leave feedback.   submitted by   /u/NielsRogge [link]   [comments]
  submitted by   /u/DeltaSqueezer [link]   [comments]
Over the past few months, I've been working on a high-scale scraping pipeline to aggregate listings directly from company job boards and applicant tracking systems. Mapping over 100,000 distinct companies to their career pages turned out to be a massive engineering headache, but it's finally stable. The result is a unified database of more than 2 million active job postings, which I'm opening up to everyone for free. I am running daily delta refreshes to keep it current. Dataset Overview Scale: 2M+ active job listings across 100,000+ unique companies. Format: Parquet. (To keep storage costs to minimum) Core Fields: job_title, company_name, company_website, job_description, location, post_date, and the original tracking URL. For more detailed info check here. Update Cadence: Refreshed daily straight from the source. View the stats here. (Currently it contains only minimal stats, but I plan on improving it based on the comments) Why I Built This Finding a clean, scaled, and up-to-date job dataset is surprisingly difficult. Most available options are either heavily gatekept by expensive subscription APIs or restricted to a single job board like LinkedIn. By scraping the actual employer sites directly, this collection sidesteps the noise and captures a much cleaner cross-section of the live market. How to Access It I set up a dedicated project space where you can grab the data directly: Open Job data Let me know what kind of analysis or projects you end up running with it. If you have questions about the engineering architecture behind handling this scale, or ideas for specific fields you'd like to see enriched next, let's discuss in the comments.   submitted by   /u/Invicto_50 [link]   [comments]
I got a reminder e-Mail from eBay about a MI50 I had put on my watch list after quite a while. Aside from needing to jerryrig a blower into the back and bootstrapping ROCm - how is it? In fact, what's inference for LLMs like for non-CUDA? I know that image-gen is veeeeery hit or miss (although ComfyUI tries their very best) and TTS is, for all I know, CUDA bound right now. STT - like whisper.cpp - runs well enough on CPUs so that's a non-issue imo. Just curious; trying to spec a build out of curiosity for my homelab. All my previous ones would've blown way past 4k€ - so I keep looking and waiting, trying to hit 2-3k at most. I mostly just want 2-3 parallel inferences on a decent (~30B) model - doubtful I'll ever get good enough hardware for parallel 100B inference. xD So yeah, what's the current situation in non-CUDA-land? Thanks!   submitted by   /u/IngwiePhoenix [link]   [comments]
https://www.reddit.com/r/LocalLLM/comments/1tuf6l1/intel_arc_pro_b70_llamacpp_sycl_63_ts_on_qwen/   submitted by   /u/jacek2023 [link]   [comments]
https://preview.redd.it/xc0l68bj7t4h1.png?width=616&format=png&auto=webp&s=48a8b14bc4ae95700cd4efa76772f4e71fb2d41a https://huggingface.co/nvidia/LocateAnything-3B Funny how they left this in the demo. At least it's honest.   submitted by   /u/chocofoxy [link]   [comments]
https://huggingface.co/nvidia/Cosmos3-Super-Text2Image Nano: 16B Super: 64B Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning. Haven't seen much here yet. Some twitter discussion: https://x.com/victormustar/status/2061354267546427595   submitted by   /u/RobotRobotWhatDoUSee [link]   [comments]
Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts for voice-clone TTS. You can easily get better quality if you set the output voice duration you want and adjust the temperature and other settings. This sample was generated with default settings, so it can be improved further.   submitted by   /u/9r4n4y [link]   [comments]
Article URL: https://blog.janestreet.com/strace-ui-bonsai-term-and-the-tui-renaissance/ Comments URL: https://news.ycombinator.com/item?id=48365904 Points: 101 # Comments: 55
initial version of official Gemma skills from Google   submitted by   /u/jacek2023 [link]   [comments]
Someone should create llama.ccp (not .cpp) that supports LLMs on Chinese-native hardware (like Huawei’s Ascend 950PR), which has been advancing fast in recent months. Just thought the name would be funny.   submitted by   /u/Pancake502 [link]   [comments]
Please post your personal projects, startups, product placements, collaboration needs, blogs etc. Please mention the payment and pricing requirements for products and services. Please do not post link shorteners, link aggregator websites, or auto-subscribe links. -- Any abuse of trust will lead to bans. Encourage others who create new posts for questions to post here instead! Thread will stay alive until the next one, so keep posting after the date in the title. -- Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is to encourage those in the community to promote their work without spamming the main threads.   submitted by   /u/AutoModerator [link]   [comments]
Now this is local AI innovation we can all get behind. https://x.com/stevencheng/status/2059836738449854898   submitted by   /u/No_Information9314 [link]   [comments]
Article URL: https://blog.hopefullyuseful.com/blog/macos-needs-its-grid-back/ Comments URL: https://news.ycombinator.com/item?id=48364800 Points: 138 # Comments: 66
79% of enterprises have adopted AI agents. Only 11% run them in production.

We've spent the past year building agent systems for banks, clinical operations teams, and engineering orgs. The problem isn't that agents don't work; they work fine. The problem is that every framework leaves compliance, cost governance, and crash recovery as exercises for the team, to be solved after the framework fails them in production.

We built MeshFlow to close that gap.

**The core idea:** treat governance as infrastructure, not middleware. Every agent step passes through a 15-step kernel that handles identity, rate limiting, budget enforcement, compliance profiles, input/output guardrails, PII detection, risk classification, tool permission, the LLM call itself, audit ledger write, and SLA recording, in that order, always, without configuration.

```python
from meshflow import Workflow, CostCap, Agent

wf = Workflow(cost_cap=CostCap(usd=5.00))
wf.add(Agent('researcher'), Agent('analyst'), Agent('writer'))
result = wf.run('Write a competitive analysis of our market')
# Compliant. Durable. Audited. Cost-capped. Done.
```

```bash
pip install meshflow
```

**What's technically interesting:**

**Token optimization layer**: five compounding mechanisms that reduce LLM spend 70-85%:

- `cache_control` on every system prompt and tool definition (Anthropic: 10% of normal price on cached tokens)
- `ModelRouter`: task-type classification routes simple tasks to nano models (keyword + token-count heuristic, zero LLM call)
- `ContextCompactor`: sliding window summarization activates at configurable token threshold
- `RAGTokenBudget`: hard `max_chars` cap on knowledge injection with truncate/drop/tail strategies
- `ContextDeduplicator`: shared context sent once for N parallel agents, not N times

**SHA-256 audit chain**: each step record stores `prev_hash` (SHA-256 of the previous record) and `entry_hash` (SHA-256 of its own canonical fields). Modify any log entry and `verify_chain()` breaks. This is the artifact
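The hash-chain idea described above can be sketched with the standard library alone. This is not MeshFlow's implementation. It is a minimal illustration that assumes each record's `entry_hash` covers its fields plus `prev_hash`, that `prev_hash` is simply the previous record's `entry_hash`, and that the first record uses an all-zero sentinel. `append_record`, `canonical_hash`, and `GENESIS_HASH` are hypothetical names. Only `verify_chain`, `prev_hash`, and `entry_hash` appear in the post.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed sentinel for the first record; not from the post


def canonical_hash(payload: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain: list, fields: dict) -> dict:
    """Append a step record linked to the previous record's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    record = {"fields": dict(fields), "prev_hash": prev_hash}
    record["entry_hash"] = canonical_hash(
        {"fields": record["fields"], "prev_hash": prev_hash}
    )
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash and check each link; False on any mismatch."""
    expected_prev = GENESIS_HASH
    for record in chain:
        if record["prev_hash"] != expected_prev:
            return False
        recomputed = canonical_hash(
            {"fields": record["fields"], "prev_hash": record["prev_hash"]}
        )
        if record["entry_hash"] != recomputed:
            return False
        expected_prev = record["entry_hash"]
    return True


chain = []
append_record(chain, {"step": "researcher", "cost_usd": 0.12})
append_record(chain, {"step": "analyst", "cost_usd": 0.30})
print(verify_chain(chain))  # True: chain is intact
chain[0]["fields"]["cost_usd"] = 0.01  # tamper with an earlier entry
print(verify_chain(chain))  # False: recomputed hash no longer matches
```

Because each hash covers the previous one, editing an old entry invalidates its own `entry_hash`, and recomputing that hash to hide the edit would break the next record's `prev_hash` link.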
Hey ML community,

We’ve just open-sourced **MeshFlow**, a code-first, framework-agnostic runtime designed for governing and optimizing multi-agent systems in production. Most agent frameworks focus on rapid prototyping, but ML and platform engineering teams usually run into hard bottlenecks around LLM cost scaling, evaluation alignment, and execution safety. MeshFlow tackles these from a runtime/infrastructure perspective.

Here are the key ML and system features:

* **Task-Based Model Routing**: Before an agent executes a node, MeshFlow evaluates task complexity and routes the execution to one of four model tiers (`nano`, `small`, `medium`, `large`). This cuts overall API costs by 50-60% by using smaller local models (e.g. LLaMA-3-8B) for standard formatting or extraction and reserving frontier models (e.g. Claude Opus) for high-complexity reasoning.
* **Context Compactor & Summary Pruning Middleware**: Implements sliding window summarization and context deduplication across parallel agent teams to limit prompt length growth.
* **System Prompt Caching**: Native injection of Anthropic `cache_control` tags when system prompts exceed 1024 tokens.
* **Cost Regression Evaluation Gate**: Integrates with CI pipelines to evaluate agent changes against a golden scenario baseline, failing the build if code updates introduce token cost regressions.
* **Resilient State Persistence**: Multi-backend state serialization (Redis, PostgreSQL, S3) that preserves checkpoint frames and allows resuming paused workflows.

Here is the basic API contract:

```python
from meshflow import Workflow, Agent, CostCap

wf = Workflow(cost_cap=CostCap(usd=5.00))
wf.add(Agent('researcher'), Agent('critic'), Agent('writer'))
result = wf.run('Compile comparative literature review of LLM reasoning pathways')
print(result)
```

We'd love to discuss:

1. How do you handle token budget enforcement and model routing in your agent loops?
2. What evaluation pipelines do you use to detec
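The post does not show how task complexity is scored. As a rough sketch of one cheap approach, a keyword and token-count heuristic that makes no LLM call could look like the following. The keyword sets, the threshold, and the `route` and `estimate_tokens` functions are all hypothetical. Only the four tier names come from the post.

```python
import re

TIERS = ("nano", "small", "medium", "large")  # tier names from the post

# Hypothetical keyword lists and threshold, chosen only for illustration.
SIMPLE_KEYWORDS = {"format", "extract", "classify", "translate"}
REASONING_KEYWORDS = {"analyze", "prove", "design", "compare", "debug"}
LONG_PROMPT_TOKENS = 1000


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token."""
    return max(1, len(text) // 4)


def route(task: str) -> str:
    """Pick a model tier from keywords and prompt length alone."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    score = 0
    if words & REASONING_KEYWORDS:
        score += 2  # reasoning-style verbs push toward larger tiers
    if estimate_tokens(task) > LONG_PROMPT_TOKENS:
        score += 1  # long prompts need more capacity
    if score == 0 and words & SIMPLE_KEYWORDS:
        return "nano"  # short, mechanical tasks go to the smallest tier
    return TIERS[min(score + 1, len(TIERS) - 1)]


print(route("Extract the dates from this invoice"))
print(route("Analyze the tradeoffs and design a migration"))
```

A heuristic like this is fast and free to run, but it misroutes tasks whose difficulty is not visible in the wording, so a production router would likely need evaluation data to tune the thresholds.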
Article URL: https://www.zach.be/p/how-the-hell-is-groq-raising-more Comments URL: https://news.ycombinator.com/item?id=48364620 Points: 101 # Comments: 47
https://archive.ph/nKEVw Comments URL: https://news.ycombinator.com/item?id=48364055 Points: 104 # Comments: 221
3x 24GB VRAM. Qwen-coder-next is not bad. I'll continue to use it if you yell at me enough. I do a lot of front-end work, which develops rapidly, so the more recent the model, the better. With anything larger than 80B I'll have to sacrifice either the decentish Q6 quant or the 256k context that I consider the minimum for coding. I do NOT believe that the latest 27-31B dense models can realistically beat an 80B model, even if I stomach the slowness, but change my mind. Slowness is an issue since I do NOT yolo. I micro-manage the heck out of the agent. It's actually more efficient than letting it rip, then having it rip again the next day because it had been climbing the wrong ladder.   submitted by   /u/ParaboloidalCrest [link]   [comments]
  submitted by   /u/Diablo-D3 [link]   [comments]
Article URL: https://mullvad.net/en/blog/age-verification-for-social-media-the-beginning-of-the-end-for-a-free-internet Comments URL: https://news.ycombinator.com/item?id=48363882 Points: 109 # Comments: 57
Hi everyone, I missed the ICML conference tickets because I was waiting for some travel funding confirmation and now they are sold out. Do you know any other ways I could still purchase one? There seems to be no waiting list… or if you know anyone who needs to cancel theirs, please let me know 🙏🏻   submitted by   /u/TopPerformance1255 [link]   [comments]
Article URL: https://github.com/cyberpapiii/chipotlai-max Comments URL: https://news.ycombinator.com/item?id=48363765 Points: 124 # Comments: 25
It seems that there are two ways to build voice AI:

- Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today.
- Full-duplex: two channels, both sides can talk at any time - no more waiting for your “turn”. ← This is the way humans actually talk.

In fact, there are three crucial things half-duplex voice models can't really do:

- Overlap - talking and listening at the same time without falling apart
- Backchannels - the "mhms," "rights," and "yeahs" you drop in while the other person is still going
- Barge-in - getting interrupted mid-sentence and recovering gracefully

These three features are a big reason why voice agents still feel “robotic” to this day.

But what exactly is the spectrum from half-duplex to full-duplex? Is a Moshi-style architecture the only way to approach full-duplex natural voice conversations? What are ways half-duplex systems could imitate full-duplex? Would love to hear others' thoughts on this.

submitted by /u/Chilly5 [link] [comments]
Currently using cloud models for my browser use and it’s great when it works, but it’s one of the last things keeping me subscribed. What are you brilliant people doing to allow agentic browser use? For context: M1 Ultra, llama.cpp with my own UI.   submitted by   /u/AdInternational5848 [link]   [comments]
https://huggingface.co/openbmb/MiniCPM5-1B What even is this thing? MiniCPM 4.6 was a tuned Qwen 3.5 0.8B, but this looks like something else. It doesn't have vision, and it apparently has its own tokenizer. The model itself is aware of the existence of Qwen 2.5, but says it's not that. Is it a new model from scratch? I don't use agents, but I checked out mradermacher's Heretic Q6_K a bit and it seems to work quite well. Pretty reasonable and brief thinking, unlike the "but wait" infinite loop of newer Qwens. And its speech pattern seems different from other small models I've tried. Hey, does nobody here get hyped about new tiny models anymore? Where's everybody?   submitted by   /u/WhoRoger [link]   [comments]
I wasn't sure whether to post this here, but a friend of mine said that a lot of researchers lurk in this subreddit and it might help them, and I think it might also help anyone trying to tinker with stuff at home. I don't know how much post-training people do here, but I do see distills, fine-tunes, datasets, and benchmarks getting posted, so I think it might be interesting to you.

For context, I work on post-training for agentic and tool-use capabilities, and a while ago I spent a few months almost literally living inside verl, ByteDance's RL post-training framework. I read most of the source and absorbed almost all of its knowledge, and as I was working with it, I started wanting a "better" version, something with a better dev experience for me. So I forked it (non-public, and since abandoned) to make it better (in my view). I shipped a lot of fixes and built tooling around it, but at one point I had to stop. It left a hole in my chest, and I finally wrote the whole thing up, as an au revoir to it but also to carry forward all the knowledge and skill that I've learned from it. It's a close read of the parts that actually run an RLHF loop, plus some of the engineering a fork drags in (nothing major, though), and one debugging story I'm still a little proud of.

A quick tour of what the blog post is about:

- The orchestration layer's internals: everything from the data structure (DataProto) that every stage (rollout, reward, advantage, update) passes, to the API gotchas its names don't warn you about. There's also a half-finished migration to a plain TensorDict underneath it.
- The single-controller pattern: one driver process holds the schedule and fans work out to GPU workers through a "magic attribute" dispatch system. That one is nasty; it took me so much time to wrap my head around it, but now that I have, it feels natural and helps me work on my own little package for the orchestration layer with utmost confidence a
Can we please ban the daily "I have an RTX 3060, what should I run?" slop threads? It’s not complicated. As of right now, Hugging Face is empty and exactly two local models exist on this entire planet:

- Qwen 3.6 35b a3b
- Qwen 3.6 27b

That is the entire list. Your specs don’t matter. Your use case doesn’t matter. Stop coping with your pristine, full-precision Q8s of tiny 1B models just because they "fit perfectly in your VRAM." You look ridiculous. Grab a heavily brain-damaged, ultra-low quant of the 35B, force-feed it to your GPU, and let your system RAM bleed. A garbage quant of a massive model is a bagillion times better than your precious micro-models anyway. Just cram it in.

And if you're going to whine that open source is dead because a local model won't instantly rewrite your entire enterprise codebase? Fine. Give up, pull out your credit card, and go spend your money on Claude Code like the rest of the contrarians.

Can we pin this so everyone can finally shut up and stop posting? Thanks. Now that that has been solved, let's go touch grass.

submitted by /u/Wrong_Mushroom_7350 [link] [comments]
Article URL: https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/ Comments URL: https://news.ycombinator.com/item?id=48363132 Points: 130 # Comments: 45
What is the best open model quant that runs on a 3060 12GB, and what is its closed-model equivalent in speed and time, for agentic coding?   submitted by   /u/Mother_Desk6385 [link]   [comments]
Check the slides from Computex. Every outlet that reported 600GB/s is completely wrong. That is the NVLink speed, like everyone here said.   submitted by   /u/rpiguy9907 [link]   [comments]
System Prompt: You are an expert software developer.
Prompt: Task: make a Sonic The Hedgehog-like platform game
Scaffold: none - just a single message in openwebui
Model: Stepfun 3.7 Flash official Q4_K_S

This was the first try; I don't think I've tried this prompt on other models before. Pretty impressed with the control feel and sense of speed.

submitted by /u/-dysangel- [link] [comments]
Article URL: https://abc.xyz/investor/news/news-details/2026/Alphabet-Announces-Proposed-80-Billion-Equity-Capital-Raise-to-Expand-AI-Infrastructure-and-Compute-2026-b0myAMewCa/default.aspx Comments URL: https://news.ycombinator.com/item?id=48362515 Points: 108 # Comments: 105
Article URL: https://debug.com/ Comments URL: https://news.ycombinator.com/item?id=48362347 Points: 122 # Comments: 50
This is more of a quick appreciation post for Qwen 3.6 27B running locally (8-bit unsloth quant). I've been using it mainly alongside my 35B model in OpenCode for planning and coding. I also had it set up in Open WebUI, but until MTP support arrived in llama.cpp about two weeks ago, the TPS was so painfully slow on OWUI that it was basically unusable for chat. Since then, I paired them together and have been using Qwen 27B as a daily chat assistant alongside Gemini Pro.

I've been keeping a running mental comparison between the two. For straightforward questions, Gemini handles things fine. But over the weekend I dove into some career advice and company portfolio deep dives, plus some immigration research. Gemini completely fell apart on this. It started hallucinating and fixating on stuff based on earlier messages in the conversation and my previous chats. I think this degradation started over the last couple of weeks or so, and I'd like to hear about others' experience with Gemini lately. I ended up doing a lot of manual research myself.

Then I decided to try the same research with Qwen 3.6 27B. I was genuinely surprised by how much better it performed on both the career/company stuff and the immigration research. The immigration results really stood out because it had to actually go through official documentation and make sense of it rather than just regurgitating something.

Side note: I've also tried Gemma 4 31B, which I heard is great for research and planning, but it's just too slow on my M5 Max with 128GB at an 8-bit quant. Curious to hear folks' opinions on that; maybe once MTP is enabled for it I will try it.

submitted by /u/Character_Split4906 [link] [comments]
Hey everyone, hope this is ok to post here. I built a free EU AI Act risk assessment tool and would love some feedback from people who actually know this space. You fill out a 10-question form describing your AI system, it classifies your EU AI Act risk tier, and emails you a PDF report with your applicable Articles and priority actions. Takes about 2 minutes, no account required. https://assessment.aiella.com Eventually I want to build a monitoring SDK that works like a Python library and automatically documents compliance of the technically measurable requirements at inference time. Looking for design partners for that down the road. Genuine feedback welcome, especially from anyone who has been through a real EU AI Act compliance process. Happy to answer questions about the classification methodology or the AWS architecture behind it.   submitted by   /u/aiandi [link]   [comments]
Them boys can cook, one big fix after another! If you're running --sm tensor on multi-GPU, this is the KV cache quantization fix: https://github.com/ggml-org/llama.cpp/releases/tag/b9455

JohannesGaessler commented 5 days ago:

This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements.

The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened the meta backend uses segments to specify the data layout within the flattened dimension so that upon a further reshape the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.

submitted by /u/Bulky-Priority6824 [link] [comments]
https://www.scan.co.uk/shop/ai-and-robotics/workstations-ai/nvidia-dgx-station   submitted by   /u/X-N2O [link]   [comments]
https://www.neowin.net/news/computex-2026-intel-launches-crescent-island-gpu-with-up-to-480gb-vram/ Crescent Island is based on the company's Arc Xe 3P architecture, which is also found in current Panther Lake iGPUs. This is Intel's latest, most powerful card and it packs up to 480 GB of VRAM capacity. Unlike typical high-end professional GPUs, which rely on HBM for improving power efficiency, the Intel GPU here has LPDDR5X. Cooling on the unit is handled by an air cooler that can handle a TDP of 350 watts. Intel says that these cards can deal with next generation AI workloads and come with support for a wide range of datatypes and microscaling formats, from native FP4/MXFP4 to FP64, and more.   submitted by   /u/ANR2ME [link]   [comments]
Article URL: https://eblog.fly.dev/githubbad.html Comments URL: https://news.ycombinator.com/item?id=48361064 Points: 115 # Comments: 42
Article URL: https://discuss.grapheneos.org/d/36001-grapheneos-speech-services-version-2-released Comments URL: https://news.ycombinator.com/item?id=48360871 Points: 102 # Comments: 16
We recently hit a classic gradient boosting trap with our pricing engine (Flyback), and I wanted to share the ablation data. We run LightGBM quantile regression to forecast secondary market watch prices. We engineered a variant-conditioned Bayesian target encoder to isolate within-reference pricing dynamics. LightGBM absolutely loved it. It ranked #1 in feature importance at q90 by a wide margin, with gains several times the next-highest feature, across all our multi-seed runs. But when we ran a strict 4-seed × 3-variant ablation on the hold-out set, the results inverted. Test MAPE regressed by +0.28pp and the between-variant delta was 7x the within-variant standard deviation. The encoder was finding effective splits that completely failed to generalize because the signal it was learning was driven by irreducible label variance: unobserved factors like condition nuance, seller behavior, and timing that no feature can capture. I wrote a full post breaking down the architecture, the ablation methodology, and the mechanism behind the divergence. Happy to discuss LightGBM split mechanics, target encoding leakage, or the ablation setup. Full post and ablation results: https://flyback.ai/engineering/target-encoding-divergence   submitted by   /u/Nj-yeti [link]   [comments]
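For readers unfamiliar with the technique, a Bayesian (smoothed) target encoder replaces a category with its mean target, shrunk toward the global mean in proportion to how few samples the category has. The sketch below is a generic textbook version, not Flyback's variant-conditioned encoder. The function name and the `prior_weight` parameter are hypothetical.

```python
from collections import defaultdict


def fit_smoothed_target_encoder(categories, targets, prior_weight=10.0):
    """Return ({category: encoded value}, global_mean).

    Each category mean is shrunk toward the global mean:
        encoded = (sum_c + prior_weight * global_mean) / (n_c + prior_weight)
    so rare categories stay close to the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + prior_weight * global_mean)
        / (counts[category] + prior_weight)
        for category in counts
    }
    return encoding, global_mean
```

Fitting this on the same rows the model trains on leaks label noise into the feature, which is one common way an encoder can rank highly in training-time feature importance yet fail on a hold-out set. Fitting out-of-fold is the usual mitigation.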
We just assumed that since it's a GB10 variant that it would have the same memory bandwidth as DGX Spark, 273GB/s. But it's reported that it will have double that, 600GB/s. "The unified memory architecture brings up to 128GB of LPDDR5X RAM with a bandwidth of 600GB/s" https://wccftech.com/nvidia-enters-pc-space-with-rtx-spark/ "its memory bandwidth peaks at 600 GB/s" https://www.notebookcheck.net/Nvidia-N1X-officially-confirmed-to-arrive-as-the-RTX-Spark.1312010.0.html   submitted by   /u/fallingdowndizzyvr [link]   [comments]
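To put the two bandwidth figures in perspective, a common rule of thumb is that single-stream decoding of a dense model is memory-bound, so tokens per second is at most the memory bandwidth divided by the bytes read per token (roughly the size of the model weights). The sketch below applies that rule. The 30 GB model size is a hypothetical example, and the estimate ignores KV cache reads, compute limits, and MoE sparsity.

```python
def decode_tokens_per_second_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed for a dense, memory-bound model."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 30.0  # hypothetical quantized dense model, for illustration only

for label, bandwidth in (("DGX Spark (273 GB/s)", 273.0), ("reported 600 GB/s", 600.0)):
    ceiling = decode_tokens_per_second_upper_bound(bandwidth, MODEL_SIZE_GB)
    print(f"{label}: at most about {ceiling:.1f} tokens/s")
```

Under this rule, the ceiling scales linearly with bandwidth, so whether the 600GB/s figure is accurate matters roughly 2.2x for decode speed.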