AgentBench: Evaluating LLMs as Agents

Evaluating Large Language Models (LLMs) as agents in interactive environments, highlighting the performance gap between API-based and open-source models, and introducing the AgentBench benchmark.

arxiv.org

LLM Powered Autonomous Agents

Lilian Weng

lilianweng.github.io

GitHub - kingjulio8238/memary: Longterm Memory for Autonomous Agents.

GitHub - confident-ai/deepeval: The LLM Evaluation Framework