TP Cluster-Based Search (MSRD Experiments)

This repository contains code and data used for experiments on cluster-based search and intent-aware reranking for the MSRD dataset.

Project Overview

  • Goal: Evaluate cluster-based search strategies (including Trustpilot Cluster-Based Search — TP-CBS) and intent-aware reranking on the MSRD movie recommendation / retrieval dataset.
  • Approach: Precompute movie and cluster embeddings, run multiple retrieval methods (BM25, embedding cosine, hybrid, cluster-based reranking, LLM-based relevance scoring), compute metrics (NDCG, latency), and compare methods per intent.
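The headline quality metric, NDCG@k, can be computed directly from graded relevance labels. A minimal sketch follows; the exact gain/discount variant used in the experiments is an assumption, not confirmed by this repository:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: sum of rel_i / log2(i + 2) over the top k."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal reordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# A ranking that surfaces the most relevant movies first scores close to 1.0.
print(ndcg_at_k([3, 2, 0, 1], k=4))
```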

Repository Structure

  • README.md — this file.
  • requirements.txt — Python dependencies required to run scripts.
  • data/MSRD/ — dataset and derived files used by experiments:
    • movies.jsonl — movie records (JSON Lines).
    • movies_with_clusters.csv — movie metadata joined with cluster assignments.
    • movie_clusters.csv — cluster assignments for movies.
    • queries.csv — query set used in evaluation.
    • query_intent.csv — ground-truth query intents.
    • genres.csv — genre metadata and descriptions.
    • msrd_relevance_predictions.csv — LLM-based relevance scores for (query, movie) pairs; computed using the Gemini model with prompts from the prompts/ directory.
  • prompts/ — system instructions and prompt templates for the Gemini-based relevance scorer.
  • scripts/ — command-line scripts used in experiments:
    • 01_query_intent_classifier.py — training and inference routines for the query intent classifier used for intent-aware routing.
    • 02_gemini_rel_scores_MSRD.py — computes LLM-based relevance scores via direct Gemini API calls and saves results to CSV.
    • 03_benchmark_MSRD.py — runs full end-to-end benchmarks (loading data, retrieval, reranking, and metric computation).

Detailed Script Purposes

  • scripts/01_query_intent_classifier.py

    • Purpose: Infer query intent labels used to select intent-aware reranking policies in experiments.
    • Authentication: controlled by the USE_VERTEX flag at the top of the script (default: False).
      • USE_VERTEX = False (default): set the GOOGLE_API_KEY environment variable.
      • USE_VERTEX = True: set PROJECT and LOCATION in the script and authenticate via gcloud.
    • Example:
      export GOOGLE_API_KEY="your_key_here"
      python scripts/01_query_intent_classifier.py --dataset msrd
  • scripts/02_gemini_rel_scores_MSRD.py

    • Purpose: Compute soft relevance scores for (query, movie) pairs using the Gemini API. Results are saved directly to data/MSRD/msrd_relevance_predictions.csv for use with script 03.
    • Authentication: controlled by the USE_VERTEX flag at the top of the script (default: False).
      • USE_VERTEX = False (default): set the GOOGLE_API_KEY environment variable.
      • USE_VERTEX = True: set VERTEX_PROJECT and VERTEX_LOCATION in the script and authenticate via gcloud.
    • Example (Gemini API key):
      export GOOGLE_API_KEY="your_key_here"
      python scripts/02_gemini_rel_scores_MSRD.py
  • scripts/03_benchmark_MSRD.py

    • Purpose: End-to-end experiment driver. Loads MSRD data, computes or loads embeddings, runs retrieval methods (BM25, embedding cosine, hybrid, TP-CBS), and reports evaluation metrics (NDCG@k, latency, summary tables used in the paper).
    • Example:
      python scripts/03_benchmark_MSRD.py --data-dir data/MSRD --output results/benchmark.json
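Both Gemini-backed scripts switch authentication modes via the USE_VERTEX flag. The selection logic can be sketched as below; the function name and placeholder defaults are hypothetical, and the returned keyword arguments are an assumption about how the scripts configure their Gemini client:

```python
import os

USE_VERTEX = False  # flag at the top of each script (default: False)

def gemini_client_kwargs(use_vertex, project="PROJECT", location="LOCATION"):
    """Pick client settings for the two auth modes described above.

    use_vertex=False: authenticate with the GOOGLE_API_KEY env variable.
    use_vertex=True:  target a Vertex AI project/location and rely on
                      gcloud application-default credentials.
    """
    if use_vertex:
        return {"vertexai": True, "project": project, "location": location}
    return {"api_key": os.environ["GOOGLE_API_KEY"]}
```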

Quickstart

  1. Create and activate a virtual environment: A Python virtual environment isolates project dependencies from your system Python installation.

    python3 -m venv .venv
    source .venv/bin/activate
  2. Install dependencies: Install all required packages listed in requirements.txt:

    pip install -r requirements.txt
  3. Prepare data: Ensure MSRD files are present under data/MSRD/. The scripts assume the filenames listed above.

    Important: This repository already contains all data necessary to reproduce the benchmarks. You can skip directly to step 4. If you wish to recreate the intermediate data (query intents and relevance scores), see Optional Data Recreation below.

  4. Run the main benchmark:

    python scripts/03_benchmark_MSRD.py --data-dir data/MSRD --output results/benchmark.json
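The hybrid method run by the benchmark fuses lexical (BM25) and embedding (cosine) scores. One common fusion is min-max normalization followed by a weighted sum; the weight and normalization below are assumptions, not necessarily the repository's exact scheme:

```python
def hybrid_scores(bm25, cosine, alpha=0.5):
    """Blend BM25 and cosine scores after min-max normalizing each list."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [alpha * b + (1 - alpha) * c
            for b, c in zip(minmax(bm25), minmax(cosine))]

# Documents ranked oppositely by the two signals end up tied at alpha=0.5.
print(hybrid_scores([0.0, 4.0], [9.0, 3.0], alpha=0.5))
```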

Optional: Recreate Intermediate Data

If you wish to recreate query intents and relevance predictions from scratch:

  1. Predict query intents:

    export GOOGLE_API_KEY="your_key_here"
    python scripts/01_query_intent_classifier.py --dataset msrd
  2. Compute LLM-based relevance scores:

    export GOOGLE_API_KEY="your_key_here"
    python scripts/02_gemini_rel_scores_MSRD.py
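Step 2 writes one relevance score per (query, movie) pair to data/MSRD/msrd_relevance_predictions.csv. The output loop can be sketched with a stub in place of the Gemini call; the column names here are assumptions:

```python
import csv
import io

def write_relevance_csv(pairs, score_fn, out):
    """Write one (query, movie_id, relevance) row per scored pair."""
    writer = csv.writer(out)
    writer.writerow(["query", "movie_id", "relevance"])
    for query, movie_id in pairs:
        writer.writerow([query, movie_id, score_fn(query, movie_id)])

# Stub scorer standing in for the real Gemini API call.
buf = io.StringIO()
write_relevance_csv([("space opera", 42)], lambda q, m: 0.8, buf)
print(buf.getvalue())
```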

Reproducibility & Notes

  • Hardware: Experiments were conducted on an Apple M3 Pro with 36 GB RAM.
  • Set random seeds in scripts to reproduce results.
  • Use the versions in requirements.txt to ensure a consistent environment.
  • For GPU runs, install an appropriate CUDA-enabled PyTorch build and select devices via script options or environment variables.
  • For full reproducibility, use scripts/03_benchmark_MSRD.py as the primary entry point.
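For the seeding note above, a minimal stdlib sketch; scripts that use numpy or torch would need to seed those libraries as well:

```python
import random

def set_seed(seed: int) -> None:
    """Seed the stdlib RNG for reproducible runs."""
    random.seed(seed)
    # If the scripts also use numpy or torch, seed those too, e.g.:
    #   numpy.random.seed(seed); torch.manual_seed(seed)

set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
# Re-seeding with the same value reproduces the exact draws: first == second
```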

Development & Extensions

  • To swap in a different reranker or retrieval model, adapt the corresponding section in scripts/03_benchmark_MSRD.py.

Citation

If you use this code in your research, please cite the accompanying ECML PKDD 2026 submission and include a link to this repository.
