Is there an established benchmark for building a full product?
- SWE-bench leaderboard: https://www.swebench.com/
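SWE-bench's headline number is the "resolved" rate: an instance counts only if the model's patch applies and the issue's failing tests now pass while the existing tests keep passing (SWE-bench calls these FAIL_TO_PASS and PASS_TO_PASS). A minimal sketch of that scoring, with an illustrative data layout rather than SWE-bench's real report format:

```python
# Toy sketch of SWE-bench-style scoring. An instance is "resolved"
# only if the patch applied AND every FAIL_TO_PASS test now passes
# AND every PASS_TO_PASS test still passes. Field names here are
# illustrative, not the harness's actual schema.

def resolved(instance: dict) -> bool:
    return (
        instance["patch_applied"]
        and all(instance["fail_to_pass"].values())
        and all(instance["pass_to_pass"].values())
    )

def resolved_rate(instances: list[dict]) -> float:
    return sum(resolved(i) for i in instances) / len(instances)

results = [
    {"patch_applied": True,
     "fail_to_pass": {"test_bug_fixed": True},
     "pass_to_pass": {"test_existing": True}},
    {"patch_applied": True,
     "fail_to_pass": {"test_bug_fixed": False},   # patch didn't fix the issue
     "pass_to_pass": {"test_existing": True}},
]
print(resolved_rate(results))  # 0.5
```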
- Which metrics does e.g. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork" use? https://news.ycombinator.com/item?id=43101314
- MetaGPT, MGX: https://github.com/FoundationAgents/MetaGPT :
> Software Company as Multi-Agent System
> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc.
> Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.
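The SOP idea above can be sketched as a pipeline of role stages, each turning the previous stage's artifact into the next. This is a toy illustration only; in MetaGPT each role is an LLM agent and the artifacts are real PRDs, designs, and code:

```python
# Toy sketch of a MetaGPT-style SOP pipeline: a one-line requirement
# flows PM -> architect -> engineer, each role enriching the artifact.
# All outputs here are hard-coded placeholders, not LLM calls.

def product_manager(requirement: str) -> dict:
    return {"requirement": requirement,
            "user_stories": [f"As a user, I want {requirement}."]}

def architect(prd: dict) -> dict:
    return {**prd, "apis": ["POST /items", "GET /items"]}

def engineer(design: dict) -> dict:
    return {**design, "code": "# implementation goes here"}

PIPELINE = [product_manager, architect, engineer]

def run(requirement: str) -> dict:
    artifact = requirement
    for role in PIPELINE:
        artifact = role(artifact)
    return artifact

print(run("a todo list")["user_stories"])
```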
- Mutation-Guided LLM-based Test Generation: https://news.ycombinator.com/item?id=42953885
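The signal behind mutation-guided test generation is classic mutation testing: inject small faults ("mutants") into the code under test, rerun the suite, and count a mutant as killed if some test now fails. Surviving mutants mark spots where an LLM can be prompted for new tests. A toy sketch (the mutants and tests are made up for illustration):

```python
# Toy mutation-testing loop: each mutant is a faulty variant of the
# function under test; a mutant is "killed" if any test fails on it.

def add(a, b):
    return a + b

MUTANTS = [
    lambda a, b: a - b,      # operator flip
    lambda a, b: a + b + 1,  # off-by-one
    lambda a, b: b,          # dropped operand
]

# A weak suite: one test that never exercises the first operand.
TESTS = [
    lambda f: f(0, 2) == 2,
]

def kill_score(mutants, tests):
    killed = [m for m in mutants if any(not t(m) for t in tests)]
    return len(killed), len(mutants)

killed, total = kill_score(MUTANTS, TESTS)
print(f"{killed}/{total} mutants killed")  # 2/3 -- `lambda a, b: b` survives
```

The surviving `lambda a, b: b` mutant is exactly the kind of gap a mutation-guided generator would target with a new test such as `f(3, 2) == 5`.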
- https://news.ycombinator.com/item?id=41333249 :
- codefuse-ai/Awesome-Code-LLM > Analysis of AI-Generated Code, Benchmarks: https://github.com/codefuse-ai/Awesome-Code-LLM :
> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding
- underlines/awesome-ml/tools.md > Benchmarking: https://github.com/underlines/awesome-ml/blob/master/llm-too...
- formal methods workflows, coverage-guided fuzzing: https://news.ycombinator.com/item?id=40884466
- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350
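Coverage-guided fuzzing, the technique both links above build on, reduces to a small loop: mutate inputs from a corpus and keep any input that reaches a branch not seen before. A toy sketch with hand-recorded coverage; real fuzzers (AFL, libFuzzer) get the same signal from compile-time instrumentation:

```python
import random

# Toy coverage-guided fuzzer. `target` records each branch it takes
# into a coverage set; inputs that reach new branches join the corpus.

def target(data: bytes, coverage: set) -> None:
    coverage.add("enter")
    if len(data) > 3:
        coverage.add("len>3")
        if data and data[0] == ord("F"):
            coverage.add("byte0=F")
            if len(data) > 1 and data[1] == ord("U"):
                coverage.add("byte1=U")

def mutate(data: bytes) -> bytes:
    buf = bytearray(data or b"\x00")
    if random.random() < 0.5:
        buf[random.randrange(len(buf))] = random.randrange(256)
    else:
        buf.append(random.randrange(256))
    return bytes(buf)

def fuzz(rounds: int = 20000) -> set:
    random.seed(0)                      # deterministic for illustration
    corpus, seen = [b"seed"], set()
    for _ in range(rounds):
        candidate = mutate(random.choice(corpus))
        cov = set()
        target(candidate, cov)
        if not cov <= seen:             # new branch reached: keep input
            seen |= cov
            corpus.append(candidate)
    return seen

print(sorted(fuzz()))
```

The deep `byte1=U` branch is only reachable via an input already starting with `F`, which is why saving coverage-increasing inputs back into the corpus matters: random mutation of the original seed alone would almost never hit both bytes at once.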