8 articles
President Donald Trump signed an executive order Tuesday creating a "voluntary framework" for AI companies to share their frontier models with the federal government before they're released "to promote secure innovation and strengthen the cybersecurity of critical infrastructure." The order says the US AI industry has succeeded in part "because we refuse to stifle this […]
Just released a deep benchmark of 8 tiny LLMs (135M → ~1B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN.

Hardware:
- NVIDIA Ampere GPU: 1024 CUDA cores, 32 Tensor cores
- 6× Arm Cortex-A78AE CPU @ 1.728 GHz
- 8 GB LPDDR5 @ 204.8 GB/s (unified CPU + GPU, no VRAM split)
- Active fan cooling: peak junction temp stayed ≤ 73 °C across every run

Stack:
- JetPack R36.4.7 (Ubuntu 22.04), CUDA 12.6
- llama.cpp CUDA backend, all layers on GPU (-ngl 99)
- Load: NVIDIA aiperf, 20 requests per combo, 12 prompt × gen combos per model
- Power measured via tegrastats VDD_CPU_GPU_CV rail at 500ms intervals

Brief methodology:
- Sweep: prompt ∈ {128, 512, 1024, 2048} tokens × gen ∈ {64, 128, 256} tokens × 4 power modes = 48 benchmark cells per model, 384 across the 8 models.
- Key metric: output tok/J = tokens generated per joule of compute energy

Findings:
Key finding: 25W is the Pareto-optimal mode for every model we have tested.
- 36–47% more tok/s than 15W
- 3–26% better output tok/J than 15W
- 8–35% better output tok/J than MAXN
More clocks ≠ more efficiency. MAXN costs ~17% more power for marginal throughput gains.

Sub-1B standouts at 25W:
- SmolLM2-135M: 165 tok/s, 22.6 tok/J (best in suite), 101 MB, ~5.4W.
- LFM2.5-350M: 120 tok/s in 219 MB. Matches SmolLM2-360M (369 MB) at about 60% of the size.

~1B class at 25W (ctx=2048, gen=256):
- LFM2.5-1.2B: 54.1 tok/s, 5.26 tok/J, 698 MB. Fastest + best output tok/J in ~1B class.
- Gemma3-1B: edges ahead on total tok/J (118.5 vs 116.2). Lower power draw (6.87W vs 8.46W) compensates for slower decode.
- Llama3.2-1B: 47.0 tok/s, 4.67 tok/J

The full blog has all charts, heatmaps, and latency tables, plus links to the raw HuggingFace datasets (384 cells × 4 modes). Do check it out, and if you have a Jetson, what are you running on it? Would love to know!

submitted by /u/East-Muffin-6472
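For readers who want to reproduce the headline metric from the benchmark post above, here is a minimal sketch of how the sweep grid and output tok/J could be computed. It is not the author's script: the function names are hypothetical, the power samples are assumed to be already parsed into watts (no tegrastats parsing is shown), and energy is approximated by holding each 500 ms sample constant for its interval.

```python
from itertools import product

# Sweep dimensions as stated in the post.
PROMPT_TOKENS = (128, 512, 1024, 2048)
GEN_TOKENS = (64, 128, 256)
POWER_MODES = ("7W", "15W", "25W", "MAXN")


def sweep_cells():
    """Enumerate every (prompt, gen, power mode) cell for one model."""
    return list(product(PROMPT_TOKENS, GEN_TOKENS, POWER_MODES))


def energy_joules(power_samples_w, interval_s=0.5):
    """Approximate energy by treating each power sample (watts) as
    constant over one sampling interval (assumed 500 ms here)."""
    return sum(power_samples_w) * interval_s


def output_tok_per_joule(output_tokens, power_samples_w, interval_s=0.5):
    """Tokens generated divided by the energy drawn during the run."""
    energy = energy_joules(power_samples_w, interval_s)
    if energy <= 0:
        raise ValueError("no energy recorded for this run")
    return output_tokens / energy


if __name__ == "__main__":
    print(len(sweep_cells()))  # cells per model
    # Made-up illustration: 10 samples at a steady 8 W, 256 generated tokens.
    print(output_tok_per_joule(256, [8.0] * 10))
```

The sample values in the usage lines are invented for illustration and are not taken from the benchmark. Because the energy window covers the whole request (prefill plus decode), a figure computed this way will generally come out lower than simply dividing decode tok/s by average watts.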
A new AI compliance service sits between AI models and end users to flag and replace any messages that might present a compliance problem.
I use local AI mainly for creative writing, and I feel like benchmarks are a bit iffy on that. I’d like to compare Gemma mainly to Gemini, as I like their writing the best. I do know that qwen 3.6 is amazing, but mostly for coding and agentic work. I’d like to ask everyone how the new(er?) models feel to you personally, rather than looking at benchmarks, which they are likely optimised for. For me, Gemma 4 31B (even q4) still falls short of 2.5 pro. I’m most familiar with 2.5 pro since I used so much of it for free on ai studio when it was a preview. The style and prose are there, but over long context it still misremembers minor details. I think it’s actually better than gpt 4.5, but that could be personal preference since, again, I mostly do creative writing.   submitted by   /u/opoot_
Why is Gautam Adani betting so heavily on data centers right now, and what does he see coming that others might be missing? As AI reshapes industries, the real race may not be about building models, but building the infrastructure that powers them. From m LinkedIn
3x 24GB vram. Qwen-coder-next is not bad. I'll continue to use it if you yell enough at me. I do a lot of front-end work, which develops rapidly, so the more recent the model, the better. Larger than 80B and I'll have to sacrifice the decentish Q6 quant, or the minimum (for coding) 256k context. I do NOT believe that the latest 27-31B dense models can realistically beat an 80B model, even if I stomach the slowness, but change my mind. Slowness is an issue since I do NOT yolo. I micro-manage the heck out of the agent. It's actually more efficient than letting it rip, then having it rip again the next day because it had been climbing the wrong ladder.   submitted by   /u/ParaboloidalCrest
It seems that there are two ways to build voice AI:
- Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today.
- Full-duplex: two channels, both sides can talk at any time, no more waiting for your “turn”. ← This is the way humans actually talk.

In fact, there are three crucial things half-duplex voice models can't really do:
- Overlap: talking and listening at the same time without falling apart
- Backchannels: the "mhms," "rights," and "yeahs" you drop in while the other person is still going
- Barge-in: getting interrupted mid-sentence and recovering gracefully

These three features are a big reason why voice agents still feel “robotic” to this day. But what exactly is the spectrum from half-duplex to full-duplex? Is a Moshi-style architecture the only way to approach full-duplex natural voice conversations? What are ways half-duplex systems could imitate full-duplex? Would love to hear others' thoughts on this.

submitted by /u/Chilly5
Article URL: https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/ Comments URL: https://news.ycombinator.com/item?id=48363132 Points: 130 # Comments: 45