AgentBench: Evaluating LLMs as Agents

Evaluating Large Language Models (LLMs) as agents in interactive environments, highlighting the performance gap between API-based and open-source models, and introducing the AgentBench benchmark.

arxiv.org

LLM Powered Autonomous Agents

Lilian Weng

lilianweng.github.io

GitHub - kingjulio8238/memary: Longterm Memory for Autonomous Agents.

GitHub - confident-ai/deepeval: The LLM Evaluation Framework