The Meta research team has unveiled MLGym, an experimental framework, and MLGym-Bench, an accompanying benchmark, to train and evaluate AI research agents on a range of AI research tasks. This comes days after Google introduced its co-scientist system.
The researchers describe it as the first Gym environment for machine-learning tasks, enabling research on reinforcement learning algorithms for training AI research agents.
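In a Gym-style setup, an agent interacts with an environment through a reset-and-step loop. The sketch below is a minimal, self-contained illustration of that interaction pattern only; the environment, agent, and all names in it are toy assumptions, not MLGym's actual API.

```python
# Minimal sketch of a Gym-style interaction loop. The environment and agent
# here are toy stand-ins; MLGym's real task environments and agent harness
# differ, but the reset/step pattern is the same basic idea.
import random


class ToyResearchEnv:
    """Stand-in environment: the agent tunes one hyperparameter to maximise a score."""

    def reset(self):
        self.best_score = 0.0
        return {"baseline_score": self.best_score}  # initial observation

    def step(self, action):
        # Pretend the chosen learning rate determines validation accuracy.
        score = 1.0 - abs(action["learning_rate"] - 0.01) * 10
        score += random.uniform(-0.02, 0.02)  # noisy evaluation
        self.best_score = max(self.best_score, score)
        reward = score
        done = self.best_score > 0.95
        return {"last_score": score}, reward, done, {}


def random_search_agent(observation):
    # A trivial "agent" that proposes a random hyperparameter each step.
    return {"learning_rate": random.uniform(0.001, 0.1)}


env = ToyResearchEnv()
obs = env.reset()
for step in range(20):
    action = random_search_agent(obs)
    obs, reward, done, info = env.step(action)
    if done:
        break
print(f"best score after {step + 1} steps: {env.best_score:.3f}")
```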
As per the research paper, MLGym-Bench consists of 13 diverse and open-ended AI research tasks from various domains, including computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analysing the results, and iterating through this process to improve on a given task.
“Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks,” the research team mentioned.
“We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements,” the researchers added. With this framework, they aim to build LLM agents that can independently generate scientific hypotheses, write scientific papers, analyse results, and more.
The MLGym framework is designed to be modular and extensible, allowing researchers to easily add new tasks, datasets, and tools.
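As a rough illustration of what that kind of extensibility could look like, the sketch below registers a new task through a simple config-plus-registry pattern. All names here (ResearchTask, TASK_REGISTRY, register_task) are hypothetical and not taken from MLGym's codebase.

```python
# Hypothetical sketch of a modular task registry; the names and fields are
# illustrative assumptions, not MLGym's actual interfaces.
from dataclasses import dataclass, field


@dataclass
class ResearchTask:
    name: str
    domain: str                      # e.g. "computer vision", "game theory"
    dataset_path: str                # where the task's data lives
    baseline_score: float            # score the agent is asked to beat
    evaluation_metric: str = "accuracy"
    tools: list = field(default_factory=list)  # extra tools the agent may call


TASK_REGISTRY: dict[str, ResearchTask] = {}


def register_task(task: ResearchTask) -> None:
    """Add a task so the framework (and its agents) can discover it by name."""
    if task.name in TASK_REGISTRY:
        raise ValueError(f"task '{task.name}' is already registered")
    TASK_REGISTRY[task.name] = task


# Adding a new task is then a single call with a config object.
register_task(
    ResearchTask(
        name="cifar10-classification",
        domain="computer vision",
        dataset_path="data/cifar10",
        baseline_score=0.85,
    )
)
print(sorted(TASK_REGISTRY))
```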
The framework also provides a default agentic harness that can be used to evaluate any base model. The research team evaluated several LLMs on the MLGym-Bench benchmark, including Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. In their tests, Gemini-1.5 Pro came out as the most cost-effective option for this kind of research.
The team hopes that open-sourcing the framework and benchmark will facilitate future research and advance the AI research capabilities of LLM agents. You can find the code on its GitHub page.
If the future of research is indeed agentic, frameworks and benchmarks like these will make a difference by providing a standardised way to evaluate and compare AI research agents.