15 articles from r/MachineLearning
Sometime late last year, a company called Logical Intelligence developed an EBM called Kona. What do people make of the company’s claim that they have a close-to-functioning EBM? And if true, what impact would this have on existing AI? submitted by /u/Treey1234
My paper is around 33 pages in total, but the TPAMI guideline says it should be 20 pages. Does anyone know which is correct? (Correction: the journal is TPAMI.) submitted by /u/Alternative_Art2984
I was wondering if anyone knows of state-of-the-art research on the hallucination problem for document search with LLMs. I know that in math, for example, you can use a verifier to check a proof. What about document search with LLMs, when I feed them documents? submitted by /u/Saladino93
Third in a series of papers tracking learning rules vs. human fMRI (THINGS dataset, V1–IT, N=3 subjects). Previous finding: untrained CNNs match backprop at V1. This paper asks: when does training break that, and does the learning rule matter?

Setup: RSA alignment measured at 8 checkpoints (epochs 0, 1, 2, 5, 10, 20, 30, 40), 5 seeds per rule, same architecture throughout.

Main findings:
- BP drops 90% of V1 alignment after one epoch (r: 0.102 → 0.011, p = 0.031, consistent across all 5 seeds). FA drops 49%. PC and STDP drop only 25–31% and stabilise.
- By epoch 40: PC (r = 0.064) > STDP (0.059) >> BP (0.022) ≈ FA (0.019). Cohen's d > 5 for PC/STDP vs BP: extremely consistent across seeds.
- Opposing trend at LOC: BP shows a small increase in object-selective cortex alignment (+0.011) while local rules show nothing. Suggests a fundamental trade-off: global error signals build higher representations but destroy early ones.
- Degradation rate tracks error signal globality: exact gradients (BP) > random feedback (FA) > local prediction errors (PC, STDP).

Limitations worth noting:
- 5 seeds caps permutation test resolution at p ≈ 0.031
- Training on 32×32 CIFAR-10, evaluated on 224×224 THINGS, resolution/domain shift is a confound
- LOC increase not tested for significance, treated as suggestive

Paper: arxiv.org/abs/2605.30556
Companion: arxiv.org/abs/2604.16875
Code: github.com/nilsleut

Curious whether anyone has seen similar dynamics in larger architectures, the prediction would be that deeper models show the same pattern but more slowly.

submitted by /u/ConfusionSpiritual19
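The p ≈ 0.031 floor listed under the limitations can be reproduced with a small sketch. It assumes the test is an exact one-sided sign-flip permutation test on per-seed differences; the post does not say which permutation scheme was used, and the per-seed values below are hypothetical, not taken from the paper.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact one-sided sign-flip permutation p-value for mean(diffs) > 0.

    Assumed test design: each per-seed difference may have its sign
    flipped under the null, giving 2**n equally likely outcomes.
    """
    n = len(diffs)
    observed = sum(diffs) / n
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = sum(s * d for s, d in zip(signs, diffs)) / n
        # Small tolerance so the observed labelling always counts itself.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical per-seed drops in V1 alignment (epoch 0 minus epoch 1).
# All five are positive, i.e. the effect is consistent across seeds.
hypothetical_drops = [0.090, 0.100, 0.080, 0.095, 0.091]

p = sign_flip_p_value(hypothetical_drops)
print(p)  # 1 / 2**5 = 0.03125
```

With 5 seeds there are only 2^5 = 32 sign patterns, so even a perfectly consistent effect cannot go below 1/32 under this scheme; more seeds would be needed to report a smaller p-value.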
https://arxiv.org/abs/2605.29713
The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer
Tianhua Chen
Abstract: "This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students."
submitted by /u/Nunki08
I'm presenting a poster there and have registration covered, but they are placing me on the waitlist for travel funds. As my travel depends on whether I get the travel grant, I need to get this off my mind: either invite me or just say no. I've been waiting forever for this, and now there's more waiting? Should I ask for a decision, or what should I do? submitted by /u/Active-Tip3130
I built CVE-Bench: 20 real-world CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, others), 5 frontier models, 3 prompt conditions, 300 runs total. Each agent runs in a sandboxed container and is scored against a hidden test_security.py derived from the maintainer's own fix. Binary pass/fail (a 90%-patched vulnerability is still a vulnerability). To better understand failure modes, I've tested three prompt conditions: advisory (full GHSA report), diagnose (exploit description only, no file or function), and locate (exact file and function, no description of the flaw). The three conditions test meaningfully different things. A model that does well on advisory but drops on diagnose can’t translate a behavioral description into a location in the codebase. A model that holds up on locate is recognizing dangerous code on its own. The leaderboard isn't the finding. Best solve rate is 50% overall, 60% under advisory. Cross-family separation (OpenAI vs Laguna) is confirmed under McNemar's test with continuity correction (all four pairs cross α = 0.05). Within-family gaps are noise: a power analysis puts the task count needed to detect a meaningful within-family edge at ~700. That cuts both ways: if the expensive models had a large true advantage, 20 tasks would have been enough to surface it. gpt-5.5 at 12× the cost of gpt-5.4-mini is not the rational choice. All four cross-family pairwise comparisons reach statistical significance at α = 0.05 (McNemar test with continuity correction, n = 60 tasks per model pair): gpt-5.5 vs laguna-m.1 (p = 0.015), gpt-5.4-nano vs laguna-m.1 (p = 0.017), gpt-5.5 vs laguna-xs.2 (p = 0.028), gpt-5.4-nano vs laguna-xs.2 (p = 0.040). Within-family comparisons remain far from significance; those rankings should be read as approximate. The failure taxonomy is the most interesting finding. Wrong-search drift — model finds the right file early, makes one incorrect inference, spends the remaining turns chasing it. Budget expi
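For readers unfamiliar with the statistic quoted in this post, here is a minimal sketch of McNemar's test with continuity correction, using only the standard library. The function name and the discordant counts are hypothetical; they are not the benchmark's data or code. The p-value uses the fact that a chi-square variable with 1 degree of freedom has survival function erfc(sqrt(x / 2)).

```python
import math


def mcnemar_continuity(b, c):
    """McNemar's test with continuity correction for paired pass/fail runs.

    b: paired runs where model A passed and model B failed
    c: paired runs where model A failed and model B passed
    Returns (chi-square statistic, two-sided p-value), 1 degree of freedom.
    """
    if b + c == 0:
        # No discordant pairs: no evidence of a difference.
        return 0.0, 1.0
    # Clamp at zero so b == c gives a statistic of 0 rather than 1 / (b + c).
    corrected = max(abs(b - c) - 1, 0)
    chi2 = corrected ** 2 / (b + c)
    # Survival function of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical discordant counts for one model pair (not from the post).
chi2, p = mcnemar_continuity(b=15, c=4)
print(round(chi2, 3), round(p, 4))
```

Only the discordant pairs enter the statistic, which is why a benchmark with few tasks can separate models with very different solve rates but cannot resolve small within-family gaps.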
Hi, Niels here from the open-source team at Hugging Face. It's been 2 weeks since I launched paperswithcode.co, a revival of the website we all loved. It allows us to keep track of the state-of-the-art (SOTA) across various domains of AI, from agents to computer vision and time-series forecasting. I've just added conference support as a new feature. The idea is that you should be able to easily browse all papers of major AI conferences like NeurIPS, CVPR, and ICML. As CVPR 2026 takes place next week in Denver, USA, I've indexed all papers with corresponding arXiv IDs. They are categorized by task, and tagged with linked GitHub and project page URLs, Hugging Face artifacts, and evals. You can also browse the papers which were accepted for an Oral presentation as well as the Spotlight papers. You can try it at https://paperswithcode.co/conferences! Feel free to leave feedback. submitted by /u/NielsRogge
Over the past few months, I've been working on a high-scale scraping pipeline to aggregate listings directly from company job boards and applicant tracking systems. Mapping over 100,000 distinct companies to their career pages turned out to be a massive engineering headache, but it's finally stable. The result is a unified database of more than 2 million active job postings, which I'm opening up to everyone for free. I am running daily delta refreshes to keep it current.

Dataset Overview
- Scale: 2M+ active job listings across 100,000+ unique companies.
- Format: Parquet (to keep storage costs to a minimum).
- Core Fields: job_title, company_name, company_website, job_description, location, post_date, and the original tracking URL. For more detailed info check here.
- Update Cadence: Refreshed daily straight from the source. View the stats here. (Currently it contains only minimal stats, but I plan on improving it based on the comments.)

Why I Built This
Finding a clean, scaled, and up-to-date job dataset is surprisingly difficult. Most available options are either heavily gatekept by expensive subscription APIs or restricted to a single job board like LinkedIn. By scraping the actual employer sites directly, this collection sidesteps the noise and captures a much cleaner cross-section of the live market.

How to Access It
I set up a dedicated project space where you can grab the data directly: Open Job data

Let me know what kind of analysis or projects you end up running with it. If you have questions about the engineering architecture behind handling this scale, or ideas for specific fields you'd like to see enriched next, let's discuss in the comments.

submitted by /u/Invicto_50
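The post does not describe how the daily delta refresh works, so the following is only a sketch of one common approach: comparing two snapshots keyed by the original tracking URL. The function name `compute_delta`, the `tracking_url` key choice, and the sample records are all assumptions for illustration; only the field names `job_title` and `location` come from the post.

```python
def compute_delta(previous, current):
    """Compare two snapshots of listings keyed by tracking URL.

    previous, current: dict mapping tracking URL -> dict of listing fields.
    Returns (added, removed, changed) as sorted lists of tracking URLs.
    """
    added = sorted(url for url in current if url not in previous)
    removed = sorted(url for url in previous if url not in current)
    changed = sorted(
        url
        for url in current
        if url in previous and current[url] != previous[url]
    )
    return added, removed, changed


# Hypothetical snapshots from two consecutive days.
yesterday = {
    "https://jobs.example.com/1": {"job_title": "ML Engineer", "location": "Berlin"},
    "https://jobs.example.com/2": {"job_title": "Data Analyst", "location": "Remote"},
}
today = {
    "https://jobs.example.com/2": {"job_title": "Senior Data Analyst", "location": "Remote"},
    "https://jobs.example.com/3": {"job_title": "Research Scientist", "location": "Paris"},
}

added, removed, changed = compute_delta(yesterday, today)
print(added, removed, changed)
```

Keying on the tracking URL keeps the comparison cheap, at the cost of treating a reposted job with a new URL as a fresh listing.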
Please post your personal projects, startups, product placements, collaboration needs, blogs etc. Please mention the payment and pricing requirements for products and services. Please do not post link shorteners, link aggregator websites, or auto-subscribe links. -- Any abuse of trust will lead to bans. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. -- Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is to encourage those in the community to promote their work by not spamming the main threads. submitted by /u/AutoModerator
79% of enterprises have adopted AI agents. Only 11% run them in production.

We've spent the past year building agent systems for banks, clinical operations teams, and engineering orgs. The problem isn't that agents don't work; they work fine. The problem is that every framework leaves compliance, cost governance, and crash recovery as exercises for the team. After the framework fails them in production.

We built MeshFlow to close that gap.

**The core idea:** treat governance as infrastructure, not middleware. Every agent step passes through a 15-step kernel that handles identity, rate limiting, budget enforcement, compliance profiles, input/output guardrails, PII detection, risk classification, tool permission, the LLM call itself, audit ledger write, and SLA recording, in that order, always, without configuration.

```python
from meshflow import Workflow, CostCap, Agent

wf = Workflow(cost_cap=CostCap(usd=5.00))
wf.add(Agent('researcher'), Agent('analyst'), Agent('writer'))
result = wf.run('Write a competitive analysis of our market')
# Compliant. Durable. Audited. Cost-capped. Done.
```

```bash
pip install meshflow
```

**What's technically interesting:**

**Token optimization layer**: five compounding mechanisms that reduce LLM spend 70-85%:

- `cache_control` on every system prompt and tool definition (Anthropic: 10% of normal price on cached tokens)
- `ModelRouter`: task-type classification routes simple tasks to nano models (keyword + token-count heuristic, zero LLM call)
- `ContextCompactor`: sliding window summarization activates at configurable token threshold
- `RAGTokenBudget`: hard `max_chars` cap on knowledge injection with truncate/drop/tail strategies
- `ContextDeduplicator`: shared context sent once for N parallel agents, not N times

**SHA-256 audit chain**: each step record stores `prev_hash` (SHA-256 of the previous record) and `entry_hash` (SHA-256 of its own canonical fields). Modify any log entry and `verify_chain()` breaks. This is the artifact
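The audit-chain idea in this post can be illustrated with a generic hash chain built on the standard library. This is a sketch under stated assumptions, not MeshFlow's implementation: the record layout, `append_record`, the all-zero genesis value, and the canonical JSON encoding are hypothetical, and `prev_hash` here simply reuses the previous record's `entry_hash`. Only the names `prev_hash`, `entry_hash`, and `verify_chain` come from the post.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first record


def _hash_fields(fields):
    """SHA-256 over a canonical JSON encoding (sorted keys, no whitespace)."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(chain, payload):
    """Append a record that commits to its payload and to its predecessor."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    fields = {"index": len(chain), "payload": payload, "prev_hash": prev_hash}
    record = dict(fields, entry_hash=_hash_fields(fields))
    chain.append(record)
    return record


def verify_chain(chain):
    """Return True only if every record's hash and back-link are intact."""
    prev_hash = GENESIS_HASH
    for position, record in enumerate(chain):
        fields = {
            "index": record["index"],
            "payload": record["payload"],
            "prev_hash": record["prev_hash"],
        }
        if record["index"] != position:
            return False
        if record["prev_hash"] != prev_hash:
            return False
        if record["entry_hash"] != _hash_fields(fields):
            return False
        prev_hash = record["entry_hash"]
    return True


# Hypothetical audit log of three agent steps.
log = []
append_record(log, {"agent": "researcher", "cost_usd": 0.12})
append_record(log, {"agent": "analyst", "cost_usd": 0.30})
append_record(log, {"agent": "writer", "cost_usd": 0.21})
intact = verify_chain(log)

# Tampering with an earlier entry invalidates the chain.
log[1]["payload"]["cost_usd"] = 0.01
tampered_ok = verify_chain(log)
print(intact, tampered_ok)
```

A chain like this detects edits to stored records, but it does not by itself stop someone who can rewrite the whole log from recomputing every hash; anchoring the latest hash somewhere external is the usual complement.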
Hey ML community,

We’ve just open-sourced **MeshFlow**, a code-first, framework-agnostic runtime designed for governing and optimizing multi-agent systems in production.

Most agent frameworks focus on rapid prototyping, but ML and platform engineering teams usually run into hard bottlenecks around LLM cost scaling, evaluation alignment, and execution safety. MeshFlow tackles these from a runtime/infrastructure perspective.

Here are the key ML and system features:

* **Task-Based Model Routing**: Before an agent executes a node, MeshFlow runs an evaluation on task complexity, routing the execution to one of four model tiers (`nano`, `small`, `medium`, `large`). This cuts overall API costs by 50-60% by using smaller local models (e.g. LLaMA-3-8B) for standard formatting or extraction and reserving frontier models (e.g. Claude Opus) for high-complexity reasoning.
* **Context Compactor & Summary Pruning Middleware**: Implements sliding window summarization and context deduplication across parallel agent teams to limit prompt length growth.
* **System Prompt Caching**: Native injection of Anthropic `cache_control` tags when system prompts exceed 1024 tokens.
* **Cost Regression Evaluation Gate**: Integrates with CI pipelines to evaluate agent changes against a golden scenario baseline, throwing failures if code updates introduce token cost regressions.
* **Resilient State Persistence**: Multi-backend state serialization (Redis, PostgreSQL, S3) that preserves checkpoint frames and allows resuming paused workflows.

Here is the basic API contract:

```python
from meshflow import Workflow, Agent, CostCap

wf = Workflow(cost_cap=CostCap(usd=5.00))
wf.add(Agent('researcher'), Agent('critic'), Agent('writer'))
result = wf.run('Compile comparative literature review of LLM reasoning pathways')
print(result)
```

We'd love to discuss:

1. How do you handle token budget enforcement and model routing in your agent loops?
2. What evaluation pipelines do you use to detec
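As a starting point for the routing question, here is a minimal sketch of a heuristic router that picks one of the four tiers named in the post without calling an LLM. It is not MeshFlow's router: the keyword set, the token thresholds, the 4-characters-per-token estimate, and the function names are all assumptions chosen for illustration.

```python
TIERS = ("nano", "small", "medium", "large")

# Hypothetical keywords that hint at multi-step reasoning.
REASONING_KEYWORDS = {"prove", "analyze", "compare", "plan", "debug"}


def estimate_tokens(text):
    """Crude token estimate, assuming roughly 4 characters per token."""
    return max(1, len(text) // 4)


def route(task):
    """Pick a model tier from keyword and length signals only."""
    tokens = estimate_tokens(task)
    words = {word.strip(".,:;!?").lower() for word in task.split()}
    score = 0
    if words & REASONING_KEYWORDS:
        score += 2  # reasoning-style request
    if tokens > 200:
        score += 1  # long prompt
    if tokens > 2000:
        score += 1  # very long prompt
    return TIERS[min(score, len(TIERS) - 1)]


short_task = route("Format this date as ISO 8601")
reasoning_task = route("Compare these two proofs and debug the flawed step")
long_reasoning_task = route("Analyze the following logs: " + "entry " * 2000)
print(short_task, reasoning_task, long_reasoning_task)
```

A heuristic like this is cheap but brittle, so logging routed tier against downstream quality would be needed before trusting any cost-saving figure it produces.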
Hi everyone, I missed the ICML conference tickets because I was waiting for some travel funding confirmation and now they are sold out. Do you know any other ways I could still purchase one? There seems to be no waiting list… or if you know anyone who needs to cancel theirs, please let me know 🙏🏻 submitted by /u/TopPerformance1255
It seems that there are two ways to build voice AI:

- Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today.
- Full-duplex: two channels, both sides can talk at any time - no more waiting for your “turn”. ← This is the way humans actually talk.

In fact, there are three crucial things half-duplex voice models can't really do:

- Overlap - talking and listening at the same time without falling apart
- Backchannels - the "mhms," "rights," and "yeahs" you drop in while the other person is still going
- Barge-in - getting interrupted mid-sentence and recovering gracefully

These three features are a big reason why voice agents still feel “robotic” to this day. But what exactly is the spectrum from half-duplex to full-duplex? Is a Moshi-style architecture the only way to approach full-duplex natural voice conversations? What are ways half-duplex systems could imitate full-duplex? Would love to hear others' thoughts on this.

submitted by /u/Chilly5
Hey everyone, hope this is ok to post here. I built a free EU AI Act risk assessment tool and would love some feedback from people who actually know this space. You fill out a 10-question form describing your AI system, it classifies your EU AI Act risk tier, and emails you a PDF report with your applicable Articles and priority actions. Takes about 2 minutes, no account required. https://assessment.aiella.com Eventually I want to build a monitoring SDK that works like a Python library and automatically documents compliance of the technically measurable requirements at inference time. Looking for design partners for that down the road. Genuine feedback welcome, especially from anyone who has been through a real EU AI Act compliance process. Happy to answer questions about the classification methodology or the AWS architecture behind it. submitted by /u/aiandi