Meet Anton.

Your Automatic Search Evaluator at Scale.

What’s the big idea?

Empower any developer to judge the relevance of a set of search results — from localhost to massive production search systems — with near-human quality.

What is Anton?

Anton is the first in a suite of AI agents we’re building that are specifically trained to enable great search experiences. For website and app developers, delivering relevant search results is crucial to driving business success. We’ve seen over and over that relevance is correlated with online business metrics like engagement, conversion rates, and revenue. You can read more about the launch over on the blog.

Evaluating search relevance at scale, however, is challenging and time-consuming. The process typically involves engineers or search quality raters spending hours reviewing hundreds or thousands of results. Or, for the busiest teams, it means a quick “looks good to me” spot check that too often leaves serious gaps in search quality.

What’s a relevance judgement?

In relevance judgement, a judge, typically a trained person, classifies a search result as relevant or non-relevant with respect to a user intent. The intent usually exists only in the search user’s mind and has to be inferred from the query (and sometimes also the context). As humans, we unpack this intent by combining a semantic understanding of the query (“coat with big buttons”) with its social context (“dress shirt for a formal wedding”).

The relevance judgement process assigns grades that can be binary (good or bad), graded on a multi-point scale (1-5 stars), or on a ternary scale (great, neutral, bad). Ternary scales are common in information retrieval shared tasks such as TREC. Recent research (Zhuang et al., 2024) has shown that asking for nuanced relevance levels, such as “somewhat relevant”, leads to better results than binary relevance assessments.
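To make these scales concrete, here is a minimal sketch in Python (not Anton’s internal representation) of how binary, ternary, and 1-5 star grades might be normalized to numeric gain values before computing metrics:

```python
# Illustrative only: one way to normalize different grading scales
# to a common numeric "gain" in [0, 1] before computing metrics.

def gain_from_binary(grade: str) -> float:
    # "good" / "bad"
    return 1.0 if grade == "good" else 0.0

def gain_from_ternary(grade: str) -> float:
    # "great" / "neutral" / "bad"
    return {"great": 1.0, "neutral": 0.5, "bad": 0.0}[grade]

def gain_from_stars(stars: int) -> float:
    # 1-5 stars, rescaled to [0, 1]
    return (stars - 1) / 4

print(gain_from_ternary("neutral"))  # 0.5
print(gain_from_stars(4))            # 0.75
```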

This process of relevance judgement is applied to a sample large enough to provide trustworthy results, producing a database of grades that we can use to compute efficacy metrics such as Precision, Recall, and NDCG. These metrics serve as an important north star for teams building search or monitoring the evolution of their search systems over time.
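As an illustration (a simplified sketch, not how Anton computes its metrics), here is how graded judgements for a single query can be turned into Precision@k and NDCG@k:

```python
import math

def precision_at_k(gains, k, threshold=0.5):
    """Fraction of the top-k results judged relevant (gain >= threshold)."""
    top = gains[:k]
    return sum(1 for g in top if g >= threshold) / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k results."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG normalized by the best possible ordering of the same grades."""
    ideal = sorted(gains, reverse=True)
    best = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / best if best > 0 else 0.0

# Gains for one query's results, in ranked order (e.g. from ternary grades).
gains = [1.0, 0.5, 0.0, 1.0, 0.0]
print(precision_at_k(gains, k=5))  # 0.6
print(ndcg_at_k(gains, k=5))       # ~0.93
```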


If you’re interested in a deeper dive into the basics of search evaluation, an oldie but a goodie is Introduction to Information Retrieval by Manning, Raghavan, and Schütze (2008).

Building judgement around an LLM foundation

Historically, relevance judgements have relied on human evaluators. However, the rise of Large Language Models (LLMs) like GPT-4 has shifted the landscape. These models have demonstrated impressive zero-shot performance on a variety of NLP tasks, including relevance judgements, offering a scalable alternative to traditional methods (Zhuang et al., 2024).

So you're probably wondering: why not just use GPT-4 myself? That's where we started, too. We used a prompt similar to RankGPT (which won the Outstanding Paper Award at EMNLP 2023) with GPT-4V, one of the best general-purpose LLMs available. The results were already impressive, but not reliable enough: we hit timeouts, parsing errors, and quality that fell short of human judges.
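For context, the naive setup looks roughly like the sketch below: an illustrative pointwise prompt against the OpenAI chat API. This is not the RankGPT prompt or Anton's implementation, and the model name and grading instructions are placeholders.

```python
# Illustrative sketch of "direct LLM usage" for relevance judgement.
# Not Anton's implementation; prompt wording and model are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_relevance(query: str, result_text: str) -> str:
    prompt = (
        "You are a search quality rater.\n"
        f"Query: {query}\n"
        f"Result: {result_text}\n"
        "Grade the result as one of: great, neutral, bad. "
        "Answer with the grade only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(judge_relevance("coat with big buttons", "Wool coat with oversized horn buttons"))
```

In practice, free-form responses like this need parsing, retries, and timeout handling, which is a big part of why simple prompting fell short for us.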

We tried prompt engineering and in-context learning, evaluated every LLM we could get our hands on, and compared the economics & ROI of each. With so much technology developing so quickly, the real challenge is developer paralysis: there are simply too many ways to reach a similar outcome.

We arrived at a solution we like a lot, and we'd love to save search teams like yours from the R&D pain we went through to get relevance judgements into your toolbelt. So, we built Anton.

Near-human quality, at software scale.

We wanted something that would perform at near-human quality, with the kind of speed and scale that unlocks iteration & experimentation. Anton has allowed the team here at Objective HQ to focus our trained experts on the tasks that require the most human precision, while evaluating huge chunks of search logs in minutes to find issues & patterns. The more you measure, the more you can improve. And the faster you measure, the faster your team can move.

| Feature | Human | Anton | Direct LLM Usage |
| --- | --- | --- | --- |
| Grades/min | 1.4 | 10,000 | 500 |
| % of Human Quality | 100% | 92% | 77% |
| Cohen’s Kappa Coefficient | 0.67 | 0.62 | 0.53 |
| Cost per 1k Grades | ~$420 | ~$3 | ~$5 |

This table shows how Anton and Naive GPT-4 (GPT-4 with a prompt similar to RankGPT) compare with human judges, using Cohen's Kappa Coefficient across three datasets: Fashion, Hotel Supplies, and Design. The Fashion dataset is a subset of the publicly available H&M Personalized Fashion Recommendations dataset. The Hotel Supplies and Design datasets are proprietary and represent e-commerce search for hotel supply products and social media search for design assets, respectively. You can find the corresponding judgements for the H&M Fashion data here on GitHub.
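If you want to run the same kind of agreement check on your own judgements, Cohen's Kappa measures how much two judges agree beyond what chance alone would produce. Here is a minimal sketch using scikit-learn, with made-up example grades (our evaluation pipeline may differ in the details):

```python
# Agreement between two judges on the same results, corrected for chance.
# Requires scikit-learn: pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

# Ternary grades assigned to the same ten results by two judges
# (e.g. a human rater and an LLM judge). Example data only.
human = ["great", "bad", "neutral", "great", "bad", "great", "neutral", "bad", "great", "neutral"]
llm   = ["great", "bad", "great",   "great", "bad", "neutral", "neutral", "bad", "great", "bad"]

kappa = cohen_kappa_score(human, llm)
print(f"Cohen's Kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```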

We routinely use Anton to compare different search solutions and to make informed decisions about the search experiences we provide our customers. And now you can too! You can make an account and get started in just a few minutes today.