
๐Ÿท๏ธ Fast Topic Analysis

A tool for analyzing text against predefined topics using average weight embeddings and cosine similarity.


Maintained by eQuill Labs

Overview

Fast Topic Analysis is a powerful tool for identifying topic matches in text with high precision. It uses embedding-based semantic analysis with an advanced clustering approach to detect nuanced topic variations.

Key features:

  • Multiple Embeddings Per Topic: Creates several weighted average embeddings for each topic instead of a single representation, capturing different semantic variations
  • Embedding Clustering: Groups similar phrases within topics to form coherent semantic clusters using agglomerative or HDBSCAN algorithms
  • Cohesion & Silhouette Scoring: Measures cluster quality via per-cluster cohesion and global silhouette score
  • Configurable Precision: Offers preset configurations for different use cases (high precision, balanced, performance)
  • Fast Processing: Optimized for efficient text analysis with minimal processing time
  • Powered by embedding-utils: All vector math, clustering, similarity, and embedding generation provided by the embedding-utils library

The project has two main .js files:

  1. A generator (generate.js) that creates topic embeddings from training data
  2. An interactive demo (run-demo.js) that analyzes text against these topic embeddings

New Feature: Embedding Clustering and Cohesion Score

The tool now supports clustering of embeddings within each topic. Instead of collapsing a topic into a single averaged vector, similar phrases are grouped into clusters, each with its own centroid embedding and a cohesion score that measures cluster quality.

Setup

Install dependencies:

npm install

Usage

Generating Topic Embeddings

node generate.js

This will:

  • Clean the data/topic_embeddings directory
  • Process training data from data/training_data.jsonl
  • Generate embeddings for each topic defined in labels-config.js
  • Cluster similar embeddings within each topic
  • Save multiple embeddings per topic as JSON files in data/topic_embeddings/
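A generated topic file might look roughly like the following. This is a hypothetical sketch only: the field names are illustrative, the embedding vector is truncated, and the actual schema is whatever generate.js writes.

```json
{
  "label": "frogs",
  "model": "Xenova/all-MiniLM-L12-v2",
  "clusters": [
    {
      "embedding": [0.0213, -0.0467, 0.0851],
      "size": 12,
      "cohesion": 0.94
    }
  ],
  "silhouetteScore": 0.41
}
```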

Clustering Configuration

You can customize the clustering behavior using command-line arguments:

# Use a predefined configuration preset
node generate.js --preset high-precision

# Customize individual parameters
node generate.js --similarity-threshold 0.92 --max-clusters 3

Available presets:

  • high-precision: Optimized for maximum accuracy with more granular clusters
    • CLUSTERING_SIMILARITY_THRESHOLD=0.95
    • CLUSTERING_MIN_CLUSTER_SIZE=3
    • CLUSTERING_MAX_CLUSTERS=8
  • balanced: Default settings for good precision and performance
    • CLUSTERING_SIMILARITY_THRESHOLD=0.9
    • CLUSTERING_MIN_CLUSTER_SIZE=5
    • CLUSTERING_MAX_CLUSTERS=5
  • performance: Optimized for speed with fewer clusters
    • CLUSTERING_SIMILARITY_THRESHOLD=0.85
    • CLUSTERING_MIN_CLUSTER_SIZE=10
    • CLUSTERING_MAX_CLUSTERS=3
  • legacy: Disables clustering for backward compatibility
    • ENABLE_CLUSTERING=false

Command-line options for generate.js:

  • --preset, -p <name>: Use a predefined configuration preset
  • --enable-clustering <bool>: Enable or disable clustering (true/false)
  • --similarity-threshold <num>: Set similarity threshold for clustering (0-1)
  • --min-cluster-size <num>: Set minimum cluster size
  • --max-clusters <num>: Set maximum number of clusters per topic
  • --algorithm <name>: Clustering algorithm to use (default or hdbscan)
  • --incremental: Update clusters incrementally (new JSONL entries only)
  • --help: Show help message

For more options, run:

node generate.js --help

Incremental Updates

After an initial full generation, you can incrementally update clusters when new training phrases are added:

  1. Append new phrases to data/training_data.jsonl
  2. Run incremental generation:
node generate.js --incremental

This will:

  • Validate the manifest (data integrity + model consistency)
  • Embed only the new phrases
  • Assign each new embedding to the nearest existing cluster using assignToCluster()
  • Update cluster centroids via incremental weighted averaging
  • Update the manifest for future incremental runs

Incremental mode requires a prior full generation (which creates a manifest file). If training data has been edited (not just appended), or the model/precision has changed, a full regeneration is required.
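The centroid update step can be sketched as a running weighted average: the new centroid is `(oldCentroid * n + newEmbedding) / (n + 1)`, so previously embedded phrases never need to be re-read. This is an illustration of the idea, not the embedding-utils code, and `updateCentroid` is a hypothetical name:

```javascript
// Incrementally fold one new embedding into a cluster centroid that
// currently averages `memberCount` embeddings.
function updateCentroid(centroid, memberCount, embedding) {
  return centroid.map(
    (c, i) => (c * memberCount + embedding[i]) / (memberCount + 1)
  );
}

// A 2-dimensional centroid built from 3 phrases absorbs a 4th embedding.
updateCentroid([1, 0], 3, [0, 1]);
// → [0.75, 0.25]
```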

Running Analysis

node run-demo.js

The test runner provides an interactive interface to:

  1. Choose logging verbosity
  2. Optionally show matched sentences if verbose logging is disabled
  3. Select a test message file to analyze

You can also specify a test message directly:

node run-demo.js 1
node run-demo.js message-1.txt

Command-line options for run-demo.js:

  • --verbose, -v: Enable verbose logging
  • --quiet, -q: Disable verbose logging
  • --show-matches, -s: Show matched sentences
  • --hide-matches, -h: Hide matched sentences
  • --help: Show help message

Configuration preferences (last used file, verbosity, etc.) are automatically saved in run-demo-config.json.

🚨 First Run Model Download

The first time a model is used (e.g. by generate.js or run-demo.js), it is downloaded and cached to the directory specified in .env. All subsequent runs are fast because the model is loaded from the cache.

Output

The analysis will show:

  • Similarity scores between the test text and each topic cluster
  • Which specific cluster matched each sentence
  • Execution time
  • Total comparisons made
  • Number of matches found
  • Model information

File Structure

├── data/
│   ├── training_data.jsonl          # Training data
│   ├── incremental-manifest.json    # Incremental processing state
│   └── topic_embeddings/            # Generated embeddings
├── test-messages/                   # Test files
├── modules/
│   ├── embedding.js                 # Thin wrapper around embedding-utils provider
│   ├── manifest.js                  # Incremental manifest operations
│   └── utils.js                     # Utility functions (toBoolean)
├── test/
│   ├── cluster-test.js              # Unit tests for clustering
│   ├── v030-features-test.js        # Tests for EU v0.3.0 features (assignToCluster, silhouetteScore, HDBSCAN, Float32Array)
│   ├── manifest-test.js             # Unit tests for manifest module
│   ├── incremental-integration-test.js  # Integration test for incremental gen
│   └── incremental-edge-cases-test.js   # Edge case tests for incremental mode
├── generate.js                      # Embedding generator
├── run-demo.js                      # Interactive analysis demo
└── labels-config.js                 # Topic definitions

Customizing

Model Settings

Change the model settings in .env to use different embedding models and configurations:

# Model and precision
ONNX_EMBEDDING_MODEL="Xenova/all-MiniLM-L12-v2"
ONNX_EMBEDDING_MODEL_PRECISION=fp32

# Available Models and their configurations:
# | Model                                        | Precision      | Size                   | Requires Prefix | Data Prefix     | Search Prefix |
# | -------------------------------------------- | -------------- | ---------------------- | --------------- | --------------- | ------------- |
# | Xenova/all-MiniLM-L6-v2                      | fp32, fp16, q8 | 90 MB, 45 MB, 23 MB    | false           | null            | null          |
# | Xenova/all-MiniLM-L12-v2                     | fp32, fp16, q8 | 133 MB, 67 MB, 34 MB   | false           | null            | null          |
# | Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | 470 MB, 235 MB, 118 MB | false           | null            | null          |
# | nomic-ai/modernbert-embed-base               | fp32, fp16, q8 | 568 MB, 284 MB, 146 MB | true            | search_document | search_query  |

Clustering Configuration

Configure clustering behavior in .env:

| Variable                        | Description                                        | Default | Example                              |
| ------------------------------- | -------------------------------------------------- | ------- | ------------------------------------ |
| ENABLE_CLUSTERING               | Enable or disable clustering functionality         | true    | ENABLE_CLUSTERING=true               |
| CLUSTERING_ALGORITHM            | Clustering algorithm (default or hdbscan)          | default | CLUSTERING_ALGORITHM=hdbscan         |
| CLUSTERING_SIMILARITY_THRESHOLD | Threshold for considering embeddings similar (0-1) | 0.9     | CLUSTERING_SIMILARITY_THRESHOLD=0.85 |
| CLUSTERING_MIN_CLUSTER_SIZE     | Minimum number of phrases per cluster              | 5       | CLUSTERING_MIN_CLUSTER_SIZE=3        |
| CLUSTERING_MAX_CLUSTERS         | Maximum number of clusters per topic               | 5       | CLUSTERING_MAX_CLUSTERS=8            |

Example configuration:

# Clustering Configuration
ENABLE_CLUSTERING=true
CLUSTERING_ALGORITHM=default
CLUSTERING_SIMILARITY_THRESHOLD=0.9
CLUSTERING_MIN_CLUSTER_SIZE=5
CLUSTERING_MAX_CLUSTERS=5

Other Customizations

  • Adjust the per-topic thresholds in labels-config.js to change the similarity score that triggers a match.
  • Add more test messages to the test-messages directory to test against.
  • Add more training data to data/training_data.jsonl to improve the topic embeddings.

Task Instruction Prefixes

Some models require specific prefixes to optimize their performance for different tasks. When a model has Requires Prefix: true, you must use the appropriate prefix:

  • Data Prefix: Used when generating embeddings from training data
  • Search Prefix: Used when generating embeddings for search/query text

For example, nomic-ai/modernbert-embed-base requires:

  • search_document prefix for training data
  • search_query prefix for search queries

Models with Requires Prefix: false will ignore any prefix settings.
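Applying a prefix can be sketched as simple string preparation before embedding. This is illustrative only: `applyPrefix` and the model-config object are hypothetical, not part of this project's API, and the `"prefix: text"` separator follows the nomic convention but is an assumption here:

```javascript
// Prepend the task-appropriate prefix for models that require one.
// ASSUMPTION: helper and config shape are invented for illustration.
function applyPrefix(text, modelConfig, task) {
  if (!modelConfig.requiresPrefix) return text; // prefix-free models pass through
  const prefix =
    task === "data" ? modelConfig.dataPrefix : modelConfig.searchPrefix;
  return `${prefix}: ${text}`;
}

const modernbert = {
  requiresPrefix: true,
  dataPrefix: "search_document",
  searchPrefix: "search_query",
};

applyPrefix("jumping, ponds, tadpoles", modernbert, "data");
// → "search_document: jumping, ponds, tadpoles"
```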

Training Data

The training data is stored as a JSONL file. Each line is a JSON object with the following fields:

  • text: The text to be analyzed
  • label: The label of the topic
{"text": "amphibians, croaks, wetlands, camouflage, metamorphosis", "label": "frogs"}
{"text": "jumping, ponds, tadpoles, moist skin, diverse habitats", "label": "frogs"}
{"text": "waterfowl, quacking, ponds, waddling, migration", "label": "ducks"}
{"text": "feathers, webbed feet, lakes, nesting, foraging", "label": "ducks"}
{"text": "dabbling, flocks, wetlands, bills, swimming", "label": "ducks"}

The training data is used to generate the topic embeddings. The more training data you have, the better the topic embeddings will be. The labels to be used when generating the topic embeddings are defined in labels-config.js.
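Grouping the training phrases by label before embedding can be sketched as follows (a minimal illustration; the project's actual loader may differ, and `groupByLabel` is a hypothetical name):

```javascript
// Parse JSONL training data and group phrase texts by topic label.
function groupByLabel(jsonl) {
  const topics = {};
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const { text, label } = JSON.parse(line);
    (topics[label] ??= []).push(text);
  }
  return topics;
}

const sample = [
  '{"text": "amphibians, croaks, wetlands", "label": "frogs"}',
  '{"text": "waterfowl, quacking, ponds", "label": "ducks"}',
].join("\n");

groupByLabel(sample);
// → { frogs: ["amphibians, croaks, wetlands"], ducks: ["waterfowl, quacking, ponds"] }
```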

How Clustering Works

Two clustering algorithms are available, selectable via --algorithm or CLUSTERING_ALGORITHM:

Default Algorithm (Agglomerative)

Groups similar embeddings based on cosine similarity:

  1. Calculate embeddings for all phrases in a topic
  2. Initialize the first cluster with the first embedding
  3. For each remaining embedding:
    • Calculate average similarity to each existing cluster
    • If similarity exceeds the threshold, add to the most similar cluster
    • If no cluster is similar enough and we haven't reached max clusters, create a new cluster
    • If we've reached max clusters, add to the most similar cluster regardless of threshold
  4. Process clusters that are smaller than the minimum size:
    • If the combined small clusters are still smaller than the minimum and we have valid clusters, distribute them to the most similar valid clusters
    • Otherwise, create a new "miscellaneous" cluster containing all small cluster items
  5. Calculate the average embedding for each final cluster
  6. Calculate a cohesion score for each cluster (average similarity between all embeddings and the centroid)
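The threshold-based pass in steps 1-3 can be sketched as below. Small-cluster handling (step 4) and scoring (steps 5-6) are omitted, and this is an illustration of the approach, not the embedding-utils implementation:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function clusterEmbeddings(embeddings, threshold = 0.9, maxClusters = 5) {
  const clusters = [[embeddings[0]]]; // seed with the first embedding
  for (const e of embeddings.slice(1)) {
    // Average similarity of this embedding to each existing cluster.
    const sims = clusters.map(
      (c) => c.reduce((sum, m) => sum + cosine(e, m), 0) / c.length
    );
    const best = sims.indexOf(Math.max(...sims));
    if (sims[best] >= threshold || clusters.length >= maxClusters) {
      clusters[best].push(e); // join the most similar cluster
    } else {
      clusters.push([e]); // start a new cluster
    }
  }
  return clusters;
}

// Two tight groups on near-orthogonal axes split into two clusters.
clusterEmbeddings([[1, 0], [0.99, 0.01], [0, 1], [0.01, 0.99]]);
// → two clusters of two embeddings each
```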

HDBSCAN Algorithm

Density-based clustering that automatically determines the number of clusters:

  1. Calculate embeddings for all phrases in a topic
  2. Run HDBSCAN with the configured minClusterSize parameter
  3. Any noise points (not assigned to a cluster by HDBSCAN) are reassigned to the nearest cluster using assignToCluster()
  4. Centroids and cohesion scores are recomputed after noise absorption
  5. If HDBSCAN finds no clusters, falls back to a single cluster containing all embeddings

HDBSCAN is more conservative about cluster formation and works best with larger datasets. Use --algorithm hdbscan to enable it.
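The noise reassignment in step 3 can be sketched as a nearest-centroid lookup. This standalone version is illustrative; in the project that role is played by assignToCluster() from embedding-utils:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the index of the cluster whose centroid is most similar
// to a noise point left unassigned by HDBSCAN.
function nearestClusterIndex(point, centroids) {
  let best = 0;
  let bestSim = -Infinity;
  centroids.forEach((c, i) => {
    const sim = cosine(point, c);
    if (sim > bestSim) {
      bestSim = sim;
      best = i;
    }
  });
  return best;
}

// A point near the y-axis is absorbed by the second cluster.
nearestClusterIndex([0.1, 0.95], [[1, 0], [0, 1]]);
// → 1
```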

Quality Metrics

Cohesion Score (per-cluster): Measures how tightly grouped the embeddings are within a cluster. Calculated as the average cosine similarity between each embedding and the cluster's centroid. Higher values (closer to 1.0) indicate tighter clusters.

Silhouette Score (global): Measures how well-separated clusters are from each other. Ranges from -1 to +1, where higher values indicate better-defined clusters. Returns 0 for single-cluster topics. Both metrics are saved to the cluster JSON files and displayed during generation.

These scores are useful for:

  • Evaluating the quality of clusters
  • Comparing clustering algorithms and configurations
  • Identifying topics that might benefit from more training data
  • Understanding why certain matches might be less reliable than others
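The cohesion metric can be sketched directly from its definition: the average cosine similarity between each member embedding and the cluster centroid. An illustrative sketch, not the embedding-utils implementation:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Component-wise mean of a cluster's embeddings.
function centroidOf(cluster) {
  const sums = new Array(cluster[0].length).fill(0);
  for (const e of cluster) e.forEach((x, i) => (sums[i] += x));
  return sums.map((s) => s / cluster.length);
}

// Cohesion: average similarity of each member to the centroid.
function cohesion(cluster) {
  const centroid = centroidOf(cluster);
  return cluster.reduce((sum, e) => sum + cosine(e, centroid), 0) / cluster.length;
}

cohesion([[1, 0], [1, 0]]); // identical members → 1 (perfectly tight)
cohesion([[1, 0], [0, 1]]); // orthogonal members → ≈ 0.707 (looser)
```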
