Relevance assessment: are judges exchangeable and does it matter?
Dr Paul Thomas
Tuesday 9th September 2008 at 11am
Abstract
Reusable test collections for information retrieval rely on relevance judgements: decisions about which documents are good answers to each query in the test. We investigate to what extent the people making relevance judgements are interchangeable. Analysis shows low levels of agreement between judges, and we report on experiments to determine whether this disagreement is sufficient to invalidate the use of a test collection. We find that both system scores and system rankings are subject to consistent but small differences across different relevance judges. It appears that collections are not completely robust to changes of judge when these judges vary widely in task and topic expertise. This has implications for anyone building a reusable test collection. (This is joint work with colleagues at NIST, Microsoft, and CWI.)

Short resume
Paul is a postdoctoral fellow at the CSIRO ICT Centre, working largely on problems in distributed and personal information retrieval. He completed his PhD at the ANU earlier this year.