Research preprint · 2026

ModelLens

Finding the Best for Your Task from Myriads of Models

Rui Cai, Weijie Jacky Mo, Xiaofei Wen, Qiyao Ma,

Wenhui Zhu, Xiwen Chen§, Muhao Chen, Zhe Zhao

University of California, Davis · Arizona State University · §Morgan Stanley

ruicai@ucdavis.edu

Learned model–dataset atlas (left) and a model recommendation example for MMMU (right).
Left: the learned model–dataset atlas — every model and every dataset embedded in a single space, organised by family and domain. Right: for an unseen target dataset (MMMU), ModelLens returns multimodal-LM candidates such as Gemini-2.5-Pro and Qwen3-VL-235B — in stark contrast to text-similarity neighbours such as DeBERTa-MNLI.

Motivation

Hundreds of thousands of models. Which one should you use?

HuggingFace alone hosts hundreds of thousands of pretrained models, and new ones appear every day. For any new task, the very first question — "which model should I use?" — has become genuinely hard.

Existing answers each fall short:

  • Fine-tune-and-rank needs a forward pass per candidate — infeasible at scale.
  • Transferability estimation still requires per-candidate compute on the target.
  • Model routing assumes a tiny, hand-curated pool of ~5–30 models.
  • Text-embedding retrieval matches descriptions, not behaviour.

ModelLens learns model–dataset compatibility directly from 1.62M public leaderboard records, then ranks unseen models on unseen datasets zero-shot — using only metadata.

By the numbers

  • 1.62M eval records
  • 47K models
  • 9.6K datasets
  • +21–81% improvement on five representative routers across QA benchmarks, using ModelLens-recommended candidate pools.

Method

A three-stage pipeline

From noisy public leaderboard records to zero-shot recommendations for unseen datasets and unseen models.

Stage 1

Collect interactions

Aggregate large-scale model–dataset evaluations from public leaderboards and curated sources into a unified (model, dataset, task, metric, score) corpus.
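The unified corpus can be pictured as a flat table of evaluation records, deduplicated across leaderboards. A minimal sketch of that schema and the aggregation step — the field names and the keep-best-score policy are illustrative assumptions, not the paper's exact pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRecord:
    """One leaderboard entry: (model, dataset, task, metric, score)."""
    model: str      # e.g. "meta-llama/Llama-3.3-70B"
    dataset: str    # e.g. "hotpotqa"
    task: str       # e.g. "multi-hop-qa"
    metric: str     # e.g. "exact_match"
    score: float    # metric value as reported

def deduplicate(records):
    """Collapse duplicate reports: keep the best score seen for each
    (model, dataset, metric) key across all source leaderboards."""
    best = {}
    for r in records:
        key = (r.model, r.dataset, r.metric)
        if key not in best or r.score > best[key].score:
            best[key] = r
    return list(best.values())
```

Noisy public records often report the same pair several times with different numbers; some conflict-resolution policy like the above is needed before training.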

Stage 2

Learn compatibility

A multi-view ranker fuses learned IDs, name tokens, model-card descriptions, size buckets and architecture families — trained with a listwise + pairwise + pointwise objective and ID dropout for cold-start generalisation.
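The combined objective and the ID-dropout trick can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions — additive view fusion, a ListNet-style listwise term, hinge pairwise, MSE pointwise — not the paper's actual architecture or loss weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def id_dropout(id_emb, p=0.3):
    """Zero the learned-ID view for a random subset of examples so the
    ranker must rely on text/metadata views -- the cold-start trick."""
    keep = rng.random(id_emb.shape[0]) >= p
    return id_emb * keep[:, None]

def fuse(id_emb, text_emb, meta_emb):
    """Simple additive fusion of the views (illustrative only)."""
    return id_emb + text_emb + meta_emb

def listwise_loss(scores, rel):
    """ListNet-style cross-entropy between the softmax of predicted
    scores and the softmax of ground-truth relevance over one list."""
    p = np.exp(scores - scores.max()); p /= p.sum()
    q = np.exp(rel - rel.max()); q /= q.sum()
    return float(-np.sum(q * np.log(p + 1e-12)))

def pairwise_loss(scores, rel, margin=1.0):
    """Hinge loss over every pair where rel[i] > rel[j]."""
    loss, n = 0.0, 0
    for i in range(len(rel)):
        for j in range(len(rel)):
            if rel[i] > rel[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
                n += 1
    return loss / max(n, 1)

def pointwise_loss(scores, rel):
    """Plain regression onto the observed scores."""
    return float(np.mean((scores - rel) ** 2))
```

The training objective would then be a weighted sum of the three terms; the relative weights are a tuning choice the page does not specify.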

Stage 3

Recommend candidates

Given a new task or dataset (text + metadata), ModelLens returns a ranked Top-K candidate pool — no forward passes on the target task required.
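Inference reduces to a similarity lookup in the learned space. A minimal sketch — the dot-product scorer and function names are assumptions; the point is that ranking is pure matrix maths, with no forward pass of any candidate model on the target task:

```python
import numpy as np

def recommend(dataset_emb, model_embs, model_names, k=3):
    """Score every model embedding against the target dataset
    embedding and return the Top-K (name, score) candidates."""
    scores = model_embs @ dataset_emb          # one score per model
    top = np.argsort(-scores)[:k]              # indices of best k
    return [(model_names[i], float(scores[i])) for i in top]
```

For a pool of 47K models this is a single matrix–vector product, which is what makes Top-K recommendation cheap at ecosystem scale.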

What it learns

The model–dataset atlas

Same projection, different latent sources. Text similarity gives a tangle; interaction-aware learning recovers family and domain structure on its own.

Semantic-only baseline

Frozen text-embedding similarity between model cards and dataset descriptions — what a metadata-only retriever sees. Families heavily overlap in the centre.

ModelLens (ours)

Same projection, learned latents. Speech (orange) detaches cleanly; retrieval embedders form their own arc; vision and multimodal models bridge the text–vision boundary.

Live demo

Try ModelLens in your browser

Type a task description, and ModelLens returns a ranked pool of candidate foundation models from the open-source ecosystem.

🤗 Open on HuggingFace Space

If the embed is slow to load, the Space may be cold-starting from sleep.

Examples

Top-K candidates, at a glance

A few illustrative recommendations across task domains. The live demo returns full ranked Top-K pools with scores.

Multi-hop QA

HotpotQA

Text
  1. LLaMA-3.3-70B (70B)
  2. GPT-OSS-20B (20B)
  3. Qwen3-32B (32B)

Multimodal QA

MMMU

Vision-Language
  1. Gemini-2.5-Pro (VLM)
  2. Qwen3-VL-235B (235B)
  3. Step-3-VL-108B (108B)

Math reasoning

GSM8K

Math
  1. Qwen2.5-Math-72B (72B)
  2. DeepSeek-Math-7B (7B)
  3. LLaMA-3.1-70B-Instruct (70B)

Code generation

HumanEval

Code
  1. DeepSeek-Coder-V2 (236B)
  2. Qwen2.5-Coder-32B (32B)
  3. Code-LLaMA-70B (70B)

Resources

Find ModelLens online

Citation

If you find ModelLens useful in your research, please cite:

@article{cai2026modellens,
  title   = {{ModelLens}: Finding the Best for Your Task from Myriads of Models},
  author  = {Cai, Rui and Mo, Weijie Jacky and Wen, Xiaofei and Ma, Qiyao
             and Zhu, Wenhui and Chen, Xiwen and Chen, Muhao and Zhao, Zhe},
  journal = {arXiv preprint},
  year    = {2026}
}