I've grown increasingly skeptical that public coding benchmarks tell me much about which model is actually worth paying for, and I worry that as demand continues to spike, model providers will silently degrade performance. I did a few manual analyses, but comparing across models turned out to be non-trivial because of differences in token caching and tool-use efficiency, so I wanted a tool for repeatable evaluations.

The goal is an OSS tool that gathers data to help answer questions like: "Would Sonnet have solved most of the issues we gave Opus?" "How much would that have actually saved?" "What about OSS models like Kimi K2.5 or GLM-1?" "The vibes are off; did model performance just regress from last month?"

Right now the project is a bit medium-rare, but it works end-to-end. I've run it successfully against itself, and I'm waiting for my token limits to reset so I can add support for more languages and do a broader run. I'm already seeing a few cases where I could have used 5.4-mini instead of 5.4 for some parts of the implementation.

I'd love any feedback, criticism, and ideas. I'm especially interested in whether this is something you might pay for as a managed service, or whether you would contribute your private test cases to a shared-commons hold-out set to hold AI providers a bit more accountable.

https://repogauge.org
[email protected]
https://github.com/s1liconcow/repogauge

Thanks!
David

Comments URL: https://news.ycombinator.com/item?id=47809457
Points: 1
# Comments: 0