The "LLM-as-a-judge" paradigm refers to the use of a Large Language Model (LLM) as an automated evaluator of machine learning outputs. In this setting, results produced by a decision system are submitted to a third-party LLM together with an explicit evaluation prompt. The LLM then returns a quantitative score, a ranking, or a textual justification according to predefined evaluation criteria. This study investigates this paradigm in the context of evaluating clothing recommendation systems, a domain in which the definition of quality is highly subjective (Evidently AI, 2024).
LLMs applied as evaluators can operate continuously and at scale, in contrast to the limited availability and attention capacity of human judges. Previous work demonstrates that, with appropriate configuration and prompt engineering, LLMs can yield evaluations whose reliability and reproducibility approximate those of average human annotators (Evidently AI, 2024). A salient benefit arises from the consistency of judgments provided by a single model, which is less susceptible to the inter-rater variability inherent in human panels.
Prompt-based modulation further enables researchers to pivot between different axes of quality, such as context appropriateness, originality, or stylistic adherence (UniTnX Team, 2024). Notably, the use of LLMs as automated reviewers obviates the necessity for domain expertise or reference answers, facilitating scalable and cost-effective assessment pipelines. In addition, LLMs may accompany quantitative ratings with natural-language rationales, supporting transparency and ease of interpretation (Statistician in Stilettos, 2023).
Nevertheless, important limitations must be recognized. LLMs lack the rich multimodal contextual grounding characteristic of human evaluators, which can impair aesthetic or context-sensitive judgments (Statistician in Stilettos, 2023). Their internal models of fashion, for instance, may lag behind current trends unless regularly fine-tuned on relevant data (Boise et al., 2024). The fidelity and informativeness of the evaluation process are also critically dependent on the structure and clarity of the prompt supplied to the LLM. Like all deep learning systems, LLMs may perpetuate or amplify training data biases; these risks are addressed separately in the bias evaluation section.
The present study aims to compare the evaluative capabilities of an LLM with those of human judges in the assessment of outfit recommendations. We examine and contrast two recommendation systems: a classical algorithmic recommender (System A) and a generative AI-based method (System B). Synthetic user profiles—twenty in total—are constructed to vary in gender, age, stylistic preferences, and situational context. Each system generates a complete outfit for every user profile, resulting in matched sets of recommendations.
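The study design above can be sketched as a simple data model. The field names and example values below are illustrative assumptions, not the exact schema used in the study:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Synthetic user profile varying along the four dimensions named above."""
    gender: str
    age: int
    style_preferences: list = field(default_factory=list)
    context: str = ""

# Hypothetical example: one of the twenty synthetic profiles.
profile = UserProfile(
    gender="female",
    age=34,
    style_preferences=["minimalist", "business casual"],
    context="winter office meeting",
)

# Each system (A: classical, B: generative) would produce one outfit per profile,
# yielding matched pairs of recommendations for comparison.
```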
Evaluation proceeds according to four principal criteria: contextual relevance, aesthetic coherence, originality, and conformity to user preferences. Each criterion is rated on a five-point Likert scale ranging from "very unsatisfactory" to "very satisfactory." Both LLM-based and human assessments adhere to these criteria to ensure comparability.
As an illustrative example, the prompt used to instruct the LLM provides a user profile, a proposed outfit, and explicit definitions for each evaluation dimension. The model is instructed to assign a rating for each criterion and to justify its assessment via free text.
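A prompt of the kind described could be assembled as follows. The template wording is an illustrative reconstruction, not the exact prompt used in the study:

```python
# Assumed prompt template combining profile, outfit, and criterion definitions.
PROMPT_TEMPLATE = """You are an expert evaluator of outfit recommendations.

User profile:
{profile}

Proposed outfit:
{outfit}

Rate the outfit on each criterion below, from 1 (very unsatisfactory)
to 5 (very satisfactory), and justify each rating in free text.

Criteria:
- contextual relevance: suitability for the user's situational context
- aesthetic coherence: how well the pieces work together visually
- originality: novelty relative to obvious or generic choices
- conformity to user preferences: fit with the stated stylistic preferences
"""

def build_prompt(profile: str, outfit: str) -> str:
    """Fill the template with one profile/outfit pair."""
    return PROMPT_TEMPLATE.format(profile=profile, outfit=outfit)
```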
The results of both recommendation systems are independently evaluated by human experts and the LLM-judge. Human experts utilize the same rubric as the LLM in order to facilitate direct comparison. Statistical analysis is performed to determine the correlation between LLM-provided scores and human evaluations. Discrepancies that exceed acceptable thresholds indicate the need for further refinement in model design, prompt formulation, or evaluation context.
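The correlation analysis between LLM and human scores can be computed, for example, with Spearman's rank correlation, which suits ordinal Likert data. A self-contained sketch (the study does not specify the exact statistic, so this choice is an assumption):

```python
def _ranks(xs):
    """Average ranks (ties share the mean rank), 1-based."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

Pairs with low correlation, or individual discrepancies above a chosen threshold, would then flag items for prompt or rubric refinement.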
In the analysis phase, mean scores for each system and evaluation criterion are compared. Additionally, the qualitative justifications produced by the LLM are examined to identify potential strengths and weaknesses in its judgment process.
Established metrics in the field of recommendation, such as Hit Rate@10 (Alsini et al., 2020) and the Normalized Discounted Cumulative Gain (NDCG; Wang et al., 2013), are employed to provide baseline quantitative assessments of system performance. Nevertheless, these metrics primarily capture the historical relevance and ranking effectiveness of recommendations, without adequately reflecting stylistic quality or user-perceived coherence. By contrast, subjective evaluations—whether supplied by LLMs or humans—offer insights into perceived quality and aesthetic appropriateness.
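The two baseline metrics can be sketched as follows, assuming binary relevance judgments per user (graded-relevance variants of NDCG also exist; this simplification is our assumption):

```python
import math

def hit_rate_at_k(recommended, relevant, k=10):
    """Fraction of users with at least one relevant item in their top-k list."""
    hits = sum(
        1
        for recs, rel in zip(recommended, relevant)
        if any(item in rel for item in recs[:k])
    )
    return hits / len(recommended)

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@k, averaged over users."""
    scores = []
    for recs, rel in zip(recommended, relevant):
        # DCG: relevant items discounted by log2 of their (1-based) position + 1.
        dcg = sum(
            1.0 / math.log2(i + 2)
            for i, item in enumerate(recs[:k])
            if item in rel
        )
        # Ideal DCG: all relevant items placed at the top.
        idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
        scores.append(dcg / idcg if idcg else 0.0)
    return sum(scores) / len(scores)
```

As the section notes, such ranking metrics capture historical relevance but say nothing about stylistic coherence, which is why the subjective evaluation layer is needed alongside them.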
The combination of quantitative evaluation and subjective assessment thus affords a holistic view of recommendation system performance, encompassing both algorithmic efficacy and nuanced, user-centered quality.
References
Boise et al. (2024).
Evidently AI. (2024). Wrong but useful: an LLM-as-a-judge tutorial.
Statistician in Stilettos. (2023).
UniTnX Team. (2024). Unitxt: Flexible data preparation and evaluation for Generative AI.
Wang, Y., Wang, L., Li, Y., He, D., & Liu, T. Y. (2013). Learning to Rank by Optimizing NDCG.