TP Cluster-Based Search (MSRD Experiments)

This repository contains code and data used for experiments on cluster-based search and intent-aware reranking for the MSRD dataset.

Project Overview

Goal: Evaluate cluster-based search strategies (including Trustpilot Cluster-Based Search — TP-CBS) and intent-aware reranking on the MSRD movie recommendation / retrieval dataset.
Approach: Precompute movie and cluster embeddings, run multiple retrieval methods (BM25, embedding cosine, hybrid, cluster-based reranking, LLM-based relevance scoring), compute metrics (NDCG, latency), and compare methods per intent.
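As a rough illustration of the "hybrid" retrieval idea mentioned above, the sketch below fuses a lexical score (such as BM25) with embedding cosine similarity through a weighted sum of min-max normalised scores. This is one common fusion scheme and is not taken from the repository's scripts; the function names, the `alpha` weight, and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def min_max(scores):
    """Rescale a dict of scores to [0, 1]; a constant input maps every key to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_rank(lexical_scores, query_vec, doc_vecs, alpha=0.5):
    """Rank doc ids by alpha * lexical + (1 - alpha) * cosine.

    Hypothetical helper: assumes lexical_scores has an entry for every id in doc_vecs.
    """
    dense = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    lex_n, dense_n = min_max(lexical_scores), min_max(dense)
    fused = {d: alpha * lex_n[d] + (1 - alpha) * dense_n[d] for d in doc_vecs}
    return sorted(fused, key=fused.get, reverse=True)

# Toy example with made-up scores and 2-dimensional "embeddings".
lexical = {"a": 2.0, "b": 1.0, "c": 0.0}
vectors = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [1.0, 1.0]}
ranking = hybrid_rank(lexical, [1.0, 0.0], vectors, alpha=0.5)
```

Setting `alpha=1.0` reduces this to the lexical ranking and `alpha=0.0` to the pure embedding ranking, which makes it easy to compare the three variants on the same inputs.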

Repository Structure

README.md — this file.
requirements.txt — Python dependencies required to run scripts.
data/MSRD/ — dataset and derived files used by experiments:
- movies.jsonl — movie records (JSON Lines).
- movies_with_clusters.csv — movie metadata joined with cluster assignments.
- movie_clusters.csv — cluster assignments for movies.
- queries.csv — query set used in evaluation.
- query_intent.csv — ground-truth query intents.
- genres.csv — genre metadata and descriptions.
- msrd_relevance_predictions.csv — LLM-based relevance scores for (query, movie) pairs; computed using the Gemini model with prompts from the prompts/ directory.
prompts/ — prompts used to obtain LLM-based relevance scores (system instructions and prompt templates for the Gemini-based relevance scorer).
scripts/ — command-line scripts used in experiments:
- 01_query_intent_classifier.py — training and inference routines for the query intent classifier used for intent-aware routing.
- 02_gemini_rel_scores_MSRD.py — computes LLM-based relevance scores via direct Gemini API calls and saves results to CSV.
- 03_benchmark_MSRD.py — runs full end-to-end benchmarks (loading data, retrieval, reranking, and metric computation).
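For readers who want to inspect the data files directly, the following is a minimal sketch of reading a JSON Lines file such as movies.jsonl with the standard library. The helper name `read_jsonl` is hypothetical, and no assumption is made about which fields each movie record contains.

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

A call such as `records = list(read_jsonl("data/MSRD/movies.jsonl"))` would load all records into memory; iterating over the generator instead keeps memory use low for large files.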

Detailed Script Purposes

scripts/01_query_intent_classifier.py
- Purpose: Infer query intent labels used to select intent-aware reranking policies in experiments.
- Authentication: controlled by the USE_VERTEX flag at the top of the script (default: False).
  - USE_VERTEX = False (default): set the GOOGLE_API_KEY environment variable.
  - USE_VERTEX = True: set PROJECT and LOCATION in the script and authenticate via gcloud.
- Example:
```
export GOOGLE_API_KEY="your_key_here"
python scripts/01_query_intent_classifier.py --dataset msrd
```
scripts/02_gemini_rel_scores_MSRD.py
- Purpose: Compute soft relevance scores for (query, movie) pairs using the Gemini API. Results are saved directly to data/MSRD/msrd_relevance_predictions.csv for use with script 03.
- Authentication: controlled by the USE_VERTEX flag at the top of the script (default: False).
  - USE_VERTEX = False (default): set the GOOGLE_API_KEY environment variable.
  - USE_VERTEX = True: set VERTEX_PROJECT and VERTEX_LOCATION in the script and authenticate via gcloud.
- Example (Gemini API key):
```
export GOOGLE_API_KEY="your_key_here"
python scripts/02_gemini_rel_scores_MSRD.py
```
scripts/03_benchmark_MSRD.py
- Purpose: End-to-end experiment driver. Loads MSRD data, computes or loads embeddings, runs retrieval methods (BM25, embedding cosine, hybrid, TP-CBS), and reports evaluation metrics (NDCG@k, latency) along with the summary tables used in the paper.
- Example:
```
python scripts/03_benchmark_MSRD.py --data-dir data/MSRD --output results/benchmark.json
```
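For reference, NDCG@k, the ranking metric reported by the benchmark, can be sketched as follows. This is a generic textbook formulation with linear gains and a log2 discount; whether scripts/03_benchmark_MSRD.py uses linear or exponential gains is not stated here, so treat that choice and the function names as assumptions.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the system returned them.
    relevance:  dict mapping document id to a graded relevance (missing ids count as 0).
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal

# Toy judgments: "a" is most relevant, "c" is not relevant.
judgments = {"a": 3.0, "b": 2.0, "c": 0.0}
perfect = ndcg_at_k(["a", "b", "c"], judgments, k=3)
reversed_order = ndcg_at_k(["c", "b", "a"], judgments, k=3)
```

A ranking that matches the ideal order scores 1.0, and any ranking that pushes relevant items lower scores strictly less.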

Quickstart

Create and activate a virtual environment: A Python virtual environment isolates project dependencies from your system Python installation.
```
python3 -m venv .venv
source .venv/bin/activate
```
Install dependencies: Install all required packages listed in requirements.txt:
```
pip install -r requirements.txt
```
Prepare data: Ensure MSRD files are present under data/MSRD/. The scripts assume the filenames listed above.
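One way to confirm the files are in place before running anything is a small check like the one below. It is a convenience sketch and not part of the repository; the helper name `missing_files` is hypothetical, and the file list simply mirrors the names given under Repository Structure.

```python
from pathlib import Path

# File names as listed in the Repository Structure section.
EXPECTED_FILES = [
    "movies.jsonl",
    "movies_with_clusters.csv",
    "movie_clusters.csv",
    "queries.csv",
    "query_intent.csv",
    "genres.csv",
    "msrd_relevance_predictions.csv",
]

def missing_files(data_dir="data/MSRD"):
    """Return the expected file names that are not present under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    print("All files present." if not missing else f"Missing: {missing}")
```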

Important: This repository already contains all data necessary to reproduce the benchmarks, so you can skip directly to running the main benchmark. If you wish to recreate the intermediate data (query intents and relevance scores), see "Optional: Recreate Intermediate Data" below.

Run the main benchmark:

```
python scripts/03_benchmark_MSRD.py --data-dir data/MSRD --output results/benchmark.json
```

Optional: Recreate Intermediate Data

If you wish to recreate query intents and relevance predictions from scratch:

Predict query intents:

```
export GOOGLE_API_KEY="your_key_here"
python scripts/01_query_intent_classifier.py --dataset msrd
```

Compute LLM-based relevance scores:

```
export GOOGLE_API_KEY="your_key_here"
python scripts/02_gemini_rel_scores_MSRD.py
```

Reproducibility & Notes

Hardware: Experiments were conducted on an Apple M3 Pro with 36 GB RAM.
Set random seeds in scripts to reproduce results.
Use the versions in requirements.txt to ensure a consistent environment.
For GPU runs, install an appropriate CUDA-enabled PyTorch build and select devices via script options or environment variables.
For full reproducibility, use scripts/03_benchmark_MSRD.py as the primary entry point.
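The advice above about random seeds can be followed with a small helper along these lines. It is a sketch, not code from the repository: the name `set_seeds` and the default seed are assumptions, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random

def set_seeds(seed=42):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default generator
    except ImportError:
        pass
```

Calling `set_seeds(...)` once at the start of a script, before any data shuffling or model initialisation, is usually enough for the standard library and NumPy. Some GPU operations can remain non-deterministic even with fixed seeds.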

Development & Extensions

To swap in a different reranker or retrieval model, adapt the corresponding section in scripts/03_benchmark_MSRD.py.

Citation

If you use this code in your research, please cite the accompanying ECML PKDD 2026 submission and include a link to this repository.
