智能助手网
Tag aggregation: agents

/tag/agents

linux.do · 2026-04-18 15:48:47+08:00 · tech

Because I needed to optimize the multi-level AGENTS.md files used by long-running tasks, I first ran a round of research with codex. Based on the findings I built the framework for this skill, then put it through a fairly complete skill-optimization and evaluation workflow, including scripted codex-cli runs that simulate a real repository environment and compare results with and without the skill. The skill is now ready for initial use. If you hit problems while using it, feel free to improve it yourself and share your changes in this thread. You can also try having the AI read its complete memory (e.g. codex's memory in full, rather than an rg search over part of it) and decide which rules should be distilled into the global AGENTS.md, and see how it performs. The skill: agents-md-improver.zip (86.7 KB). I'll leave the fuller introduction to codex's summary. 1 post - 1 participant. Read the full topic

linux.do · 2026-04-18 15:30:01+08:00 · tech

The Cloudflare Blog – 17 Apr 26 Agents that remember: introducing Agent Memory Cloudflare Agent Memory is a managed service that gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time. Quoted: "Today we are announcing the private beta of Agent Memory, a managed service that extracts information from agent conversations and supplies it when needed, without filling up the context window. It gives AI agents persistent memory, enabling them to remember important information, forget unimportant information, and get smarter over time." 1 post - 1 participant. Read the full topic
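The extract-and-recall idea described above can be sketched in a few lines. This toy store is purely illustrative (the class and method names are my own, not Cloudflare's actual API): facts stay out of the prompt, and only the ones matching the current query are surfaced.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy sketch of extract/recall agent memory: store facts outside
    the context window, return only the most relevant ones on demand."""
    facts: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Score each stored fact by word overlap with the query;
        # a real service would use embeddings plus decay/forgetting.
        q = set(query.lower().split())
        scored = sorted(self.facts,
                        key=lambda f: len(q & set(f.lower().split())),
                        reverse=True)
        return scored[:k]

store = MemoryStore()
store.remember("user prefers TypeScript over JavaScript")
store.remember("deploys run on Friday mornings")
print(store.recall("which language does the user prefer", k=1))
```

A production service would also handle forgetting (TTLs, relevance decay) and multi-tenant isolation, which this sketch omits.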

hnrss.org · 2026-04-18 04:00:04+08:00 · tech

Paper Lantern is an MCP server that lets coding agents ask for personalized techniques / ideas from 2M+ CS research papers. Your coding agent tells PL what problem it is working on --> PL finds the most relevant ideas from 100+ research papers for you --> gives it to your coding agent including trade-offs and implementation instructions. We had previously shown that this helps research work and want to understand whether it helps everyday software engineering tasks. We built out 9 tasks to measure this and compared using only a Coding Agent (Opus 4.6) (baseline) vs Coding Agent + Paper Lantern access. (Blog post with full breakdown: https://www.paperlantern.ai/blog/coding-agent-benchmarks ) Some interesting results: 1. we asked the agent to write tests that maximize mutation score (fraction of injected bugs caught). The baseline caught 63% of injected bugs. Baseline + Paper Lantern found mutation-aware prompting from recent research (MuTAP, Aug 2023; MUTGEN, Jun 2025), which suggested enumerating every possible mutation via AST analysis and then writing tests to target each one. This caught 87%. 2. extracting legal clauses from 50 contracts. The baseline sent the full document to the LLM and correctly extracted 44% of clauses. Baseline + Paper Lantern found two papers from March 2026 (BEAVER for section-level relevance scoring, PAVE for post-extraction validation). Accuracy jumped to 76%. Five of nine tasks improved by 30-80%. The difference was technique selection. 10 of the 15 most-cited papers across all experiments were published in 2025 or later. Everything is open source: https://github.com/paperlantern-ai/paper-lantern-challenges Each experiment has its own README with detailed results and an approach.md showing exactly what Paper Lantern surfaced and how the agent used it. Quick setup: `npx paperlantern@latest` Comments URL: https://news.ycombinator.com/item?id=47809920 Points: 3 # Comments: 4
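The mutation-enumeration step from result 1 can be sketched roughly. Assuming a Python target, a MuTAP-style pipeline might first walk the AST and list candidate operator swaps for tests to target; the SWAPS table and function below are illustrative, not Paper Lantern's actual code.

```python
import ast

# Candidate operator mutations a mutation-testing tool might inject.
# Hypothetical subset; real tools cover many more operators.
SWAPS = {ast.Add: "-", ast.Sub: "+", ast.Lt: ">=", ast.GtE: "<"}

def enumerate_mutants(source: str) -> list[str]:
    """List every mutation site so a test can be written per mutant."""
    mutants = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp) and type(node.op) in SWAPS:
            mutants.append(f"line {node.lineno}: swap binary op to {SWAPS[type(node.op)]}")
        if isinstance(node, ast.Compare):
            for op in node.ops:
                if type(op) in SWAPS:
                    mutants.append(f"line {node.lineno}: swap comparison to {SWAPS[type(op)]}")
    return mutants

src = "def clamp(x, lo):\n    if x < lo:\n        return lo\n    return x + 0"
for m in enumerate_mutants(src):
    print(m)
```

Handing the agent this explicit mutant list, rather than asking it to "write thorough tests", is the technique the cited papers suggest.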

hnrss.org · 2026-04-18 03:11:06+08:00 · tech

I've grown increasingly skeptical that public coding benchmarks tell me much about which model is actually worth paying for, and worried that as demand continues to spike, model providers will silently drop performance. I did a few manual analyses but found it non-trivial to compare across models due to differences in token caching and tool-use efficiency, so I wanted a tool for repeatable evaluations. The goal was an OSS tool to get data that helps answer questions like: "Would Sonnet have solved most of the issues we gave Opus?" "How much would that have actually saved?" "What about OSS models like Kimi K2.5 or GLM-1?" "The vibes are off, did model performance just regress from last month?" Right now the project is a bit medium-rare, but it works end-to-end. I've run it successfully against itself, and I'm waiting for my token limits to reset so I can add support for more languages and do a broader run. I'm already seeing a few cases where I could've used 5.4-mini instead of 5.4 for some parts of implementation. I'd love any feedback, criticism, and ideas. I am especially interested in whether this is something you might pay for as a managed service, or whether you would contribute your private testcases to a shared hold-out commons to hold AI providers a bit more accountable. https://repogauge.org [email protected] https://github.com/s1liconcow/repogauge Thanks! David Comments URL: https://news.ycombinator.com/item?id=47809457 Points: 1 # Comments: 0

hnrss.org · 2026-04-17 21:52:08+08:00 · tech

Once subagents start spawning other subagents, basic questions get hard to answer: what is running right now, what tool did it just call, did the child agent actually do what the parent asked. I wanted a way to verify that each agent is doing the work that fits its role, and to spot when a run goes off track. Lazyagent is a terminal TUI that collects events from Claude Code, Codex, and OpenCode and shows them in one place. It groups sessions from different runtimes by working directory, so Claude and Codex runs on the same repo appear under the same project. Features: - Filter events by type: tool calls, user prompts, session lifecycle, system events, or code changes only. - See which agent or subagent is responsible for each action. The agent tree shows parent-child relationships, so you can trace exactly what a spawned subagent did vs what the parent delegated. - View code diffs at a glance. Editing events render syntax-highlighted diffs inline, with addition/deletion stats. - Search across all events. You know a file was touched but not which agent did it -- type `/` and find it. - Check token usage per session. A single overlay shows cost, model calls, cache hit rate, per-model breakdowns, and which tools ran the most. - Watch a run in real time, or go back through a completed session to audit what happened. Please let me know if there's any feature you want! Comments URL: https://news.ycombinator.com/item?id=47805963 Points: 1 # Comments: 0

hnrss.org · 2026-04-17 05:53:00+08:00 · tech

Hey HN, I made agent-hub, an open-source tool that lets you talk to all your AI agents running locally or on remote machines. It works with setups where you already have agents running (Claude Code, Codex, Hermes, OpenClaw, etc.) and just want a simple way to access and use them in one place. Why I built this: I run agents across a few remote machines + my local computer, and switching between them was painful. Existing tools like Conductor felt too tied to specific workflows (e.g. Git-based), and I couldn't find anything that handled GTM tasks, coding tasks, and remote agents over SSH. The vision: build a mobile app to accompany this as well, since I find myself talking to my agents on mobile too. I am Omar and I vibe coded this over the past weekend :) Comments URL: https://news.ycombinator.com/item?id=47799990 Points: 1 # Comments: 0

hnrss.org · 2026-04-17 05:39:58+08:00 · tech

Hi HN, I’ve been working on an open-source project to explore a problem I keep running into with LLM systems in production: we give models the ability to call tools, access data, and make decisions, but we don’t have a real runtime security layer around them. So I built a system that acts as a control plane for AI behavior, not just infrastructure. GitHub: https://github.com/dshapi/AI-SPM

What it does: the system sits around an LLM pipeline and enforces decisions in real time:
- Detects and blocks prompt injection (including obfuscation attempts)
- Forces structured tool calls (no direct execution from the model)
- Validates tool usage against policies
- Prevents data leakage (PII / sensitive outputs)
- Streams all activity for detection + audit

Architecture (high-level):
- Gateway layer for request control
- Context inspection (prompt analysis + normalization)
- Policy engine (using Open Policy Agent)
- Runtime enforcement (tool validation + sandboxing)
- Streaming pipeline (Apache Kafka + Apache Flink)
- Output filtering before the response leaves the system

The key idea: treat the LLM as untrusted, and enforce everything externally.

What broke during testing. Some things that surprised me:
- Simple pattern-based prompt injection detection is easy to bypass
- Obfuscated inputs (base64, unicode tricks) are much more common than expected
- Tool misuse is the biggest real risk (not the model itself)
- Most “guardrails” don’t actually enforce anything at runtime

What I’m unsure about. Would really appreciate feedback from people who’ve worked on similar systems:
- Is a general-purpose policy engine like OPA the right abstraction here?
- How are people handling prompt injection detection beyond heuristics?
- Where should enforcement actually live (gateway vs execution layer)?
- What am I missing in terms of attack surface?

Why I’m sharing: this space feels a bit underdeveloped compared to traditional security. We have CSPM, KSPM, etc., but nothing equivalent for AI systems yet. Trying to explore what that should look like in practice. Would love any feedback, especially critical takes. Comments URL: https://news.ycombinator.com/item?id=47799856 Points: 1 # Comments: 0
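The "validate tool usage against policies" step might look roughly like this, written in plain Python rather than OPA's Rego for brevity. The policy shape and field names are my assumptions, not the AI-SPM project's actual schema.

```python
# Hypothetical policy table: per-tool allow flag plus path constraints.
POLICY = {
    "read_file": {"allowed": True, "deny_paths": ["/etc", "/root/.ssh"]},
    "shell":     {"allowed": False},
}

def authorize(tool: str, args: dict) -> tuple[bool, str]:
    """Enforce the policy on a structured tool call before execution."""
    rule = POLICY.get(tool)
    if rule is None or not rule["allowed"]:
        # Unknown tools are denied by default: the LLM is untrusted.
        return False, f"tool '{tool}' is not permitted"
    for prefix in rule.get("deny_paths", []):
        if str(args.get("path", "")).startswith(prefix):
            return False, f"path blocked by policy: {prefix}"
    return True, "ok"

print(authorize("shell", {"cmd": "rm -rf /"}))        # denied outright
print(authorize("read_file", {"path": "/etc/passwd"}))  # blocked prefix
print(authorize("read_file", {"path": "README.md"}))    # allowed
```

In the real system this decision would come from the OPA policy engine at the gateway, with the deny-by-default stance preserved.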

hnrss.org · 2026-04-17 00:03:04+08:00 · tech

Dear diary, this is my story: I'd been sharing MCP configs with other devs at work a lot - templates in shared repos, credentials in Bitwarden, everyone cowboying their own env vars. That's a lot of manual wiring and a lack of any real control, so a problem statement was already forming in my mind. Then three weeks ago I was putting my kids to sleep and reading about Jensen Huang saying every company will run 100 agents per employee, and the math started mathing. That evening I kept thinking about what agents actually need to operate in the real world and eventually landed on the same answer as every spy movie ever: basically, a passport suitable for the mission and clever drop-off locations. So I built STACK. True story. - The passport: a signed JWT (EdDSA) that proves which agent is acting, who authorized it, and what it's allowed to do. Works offline - any service can verify it without calling STACK. Agents can delegate to sub-agents, but the scope only ever narrows. Max 4 hops. - The drop-off: an encrypted handoff between agents. Agent A drops off a package with a JSON schema contract, encryption at rest, and a TTL. Agent B collects it, custody transfers, and the payload gets deleted. Neither agent needs to trust the other. Just like in the movies! All credentials are KMS-encrypted. In proxy mode they are injected at the network boundary, so agents can make API calls through STACK without ever seeing the raw key. To try it, sign up at https://getstack.run , grab your API key, and connect: claude mcp add stack --transport http https://mcp.getstack.run/mcp --header "Authorization: Bearer YOUR_API_KEY" I want to provide a generous free tier and I hope people get value out of it. Keycard ($38M, a16z) does scoped agent credentials, Descope ($88M) does auth flows, Composio ($29M) does tool integrations. I'm a solo founder in Stockholm without funding, but I'm betting the full control plane is where the market is heading. I may be naive about that, but that's the bet. I like betting. Docs at https://getstack.run/docs . Comments URL: https://news.ycombinator.com/item?id=47795391 Points: 2 # Comments: 0
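The "scope only ever narrows, max 4 hops" rule for delegated passports can be sketched without real JWT signing. This checks only the invariant described above; the function name and chain representation are assumptions, and a real verifier would also check the EdDSA signatures and expiry on each token.

```python
MAX_HOPS = 4  # the delegation depth limit described in the post

def valid_chain(chain: list[set[str]]) -> bool:
    """chain[0] holds the root agent's scopes; each later entry is a
    delegated sub-agent's scopes. Valid iff every hop is a subset of
    its parent and the chain is at most MAX_HOPS deep."""
    if len(chain) > MAX_HOPS:
        return False
    return all(child <= parent for parent, child in zip(chain, chain[1:]))

root = {"repo:read", "repo:write", "deploy"}
print(valid_chain([root, {"repo:read", "repo:write"}, {"repo:read"}]))  # narrowing: OK
print(valid_chain([root, {"repo:read", "billing"}]))  # widened scope: rejected
```

Because each token carries its whole chain of scopes, any service can run this check offline, which matches the "verify without calling STACK" property.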

hnrss.org · 2026-04-16 19:07:51+08:00 · tech

I built HAINDY, a CLI that gives coding agents computer use across desktop, Android, and iOS. You install it as a normal CLI tool and can optionally install skills for it in Claude Code, Codex, and OpenCode. It gives agents human-like control of devices: when the agent instructs it to click a button, it runs a screenshot + coordinate based computer-use loop. No DOM or accessibility APIs are used. I built it because I wanted agents to test things the way a human would: watching the screen and using it from a user's perspective. Would love feedback on onboarding, real use cases, and whether this fits naturally into existing agent workflows. Comments URL: https://news.ycombinator.com/item?id=47791360 Points: 1 # Comments: 0
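The screenshot + coordinate loop might look roughly like this. All three helpers (`capture`, `locate`, `click`) are stand-ins for whatever HAINDY actually calls, and `locate` stubs out the vision model that would normally map an instruction to pixel coordinates.

```python
def capture() -> bytes:
    # Stand-in: a real implementation would grab a screenshot of the
    # target desktop or mobile device.
    return b"<png bytes of the current screen>"

def locate(screenshot: bytes, target: str) -> tuple[int, int]:
    # Stand-in: a vision model would return pixel coordinates for the
    # described UI element; hard-coded here for illustration.
    return (640, 360)

def click(x: int, y: int) -> None:
    # Stand-in: a real implementation would synthesize an input event.
    print(f"click at ({x}, {y})")

def computer_use_step(target: str) -> tuple[int, int]:
    shot = capture()              # 1. observe the screen as pixels
    x, y = locate(shot, target)   # 2. ground the instruction in coordinates
    click(x, y)                   # 3. act; the agent then re-observes
    return (x, y)

computer_use_step("Submit button")
```

The point of the design is step 2: no DOM or accessibility tree is consulted, so the agent only ever sees what a human user would see.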