Sagar-S-R/GraphBuilder-RAG

# GraphBuilder-RAG: Graph-Enhanced Retrieval Augmented Generation System

A production-grade, modular framework for building and querying knowledge graphs from heterogeneous documents with advanced RAG capabilities.

## 🎯 System Overview

GraphBuilder-RAG extracts structured knowledge from documents, validates facts, builds versioned knowledge graphs, and provides hybrid retrieval with hallucination detection.

### Key Features

- **Multi-format ingestion**: HTML, PDF, CSV, JSON APIs
- **Intelligent extraction**: rule-based + LLM-based triple extraction
- **Fact validation**: ontology rules + external verification
- **Versioned knowledge graph**: Neo4j with full provenance tracking
- **Hybrid retrieval**: FAISS semantic search + Neo4j graph traversal
- **Hallucination detection**: GraphVerify for claim validation
- **Self-healing agents**: auto-verification, conflict resolution, schema evolution
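The pipeline revolves around candidate triples carrying provenance from extraction through validation to fusion. A minimal sketch of what such a record might look like (field names here are illustrative assumptions; the repo's actual schemas are Pydantic models under `shared/models/`):

```python
from dataclasses import dataclass

@dataclass
class CandidateTriple:
    """Illustrative triple record with provenance (not the repo's exact schema)."""
    subject: str
    predicate: str
    obj: str
    confidence: float       # extractor's confidence score
    source_doc_id: str      # provenance: which document produced the triple
    extractor: str = "llm"  # "rules" or "llm"

t = CandidateTriple("Aspirin", "TREATS", "Headache", 0.92, "doc-123")
```

Keeping `source_doc_id` on every triple is what makes full provenance tracking in the graph possible later.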

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Ingestion     โ”‚ โ†’ MongoDB GridFS (raw docs)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Normalization   โ”‚ โ†’ MongoDB (normalized_docs)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Extraction    โ”‚ โ†’ MongoDB (candidate_triples)
โ”‚  DeepSeek 1.5B  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Validation    โ”‚ โ†’ MongoDB (validated_triples)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     Fusion      โ”‚ โ†’ Neo4j (knowledge graph)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     Query Pipeline              โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚  FAISS   โ”‚  โ”‚   Neo4j     โ”‚ โ”‚
โ”‚  โ”‚ Semantic โ”‚  โ”‚   Graph     โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜        โ”‚
โ”‚                โ†“                โ”‚
โ”‚         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”          โ”‚
โ”‚         โ”‚   Prompt   โ”‚          โ”‚
โ”‚         โ”‚  Builder   โ”‚          โ”‚
โ”‚         โ””โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜          โ”‚
โ”‚               โ†“                 โ”‚
โ”‚      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”‚
โ”‚      โ”‚ Groq Llama 70B โ”‚         โ”‚
โ”‚      โ”‚   Reasoning    โ”‚         โ”‚
โ”‚      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ”‚
โ”‚               โ†“                 โ”‚
โ”‚      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”‚
โ”‚      โ”‚ GraphVerify    โ”‚         โ”‚
โ”‚      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
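The final GraphVerify stage checks the generated answer's claims against the knowledge graph. A toy illustration of the idea, with a plain set of tuples standing in for Neo4j (the function name and claim format are hypothetical, not the repo's API):

```python
# Toy GraphVerify-style check: a claim is "supported" if its triple exists
# in the graph; anything else is flagged as a potential hallucination.
graph = {
    ("Aspirin", "TREATS", "Headache"),
    ("Aspirin", "HAS_SIDE_EFFECT", "Stomach upset"),
}

def verify_claims(claims):
    """Split claim triples into (supported, unsupported) against the graph."""
    supported = [c for c in claims if c in graph]
    unsupported = [c for c in claims if c not in graph]
    return supported, unsupported

claims = [
    ("Aspirin", "TREATS", "Headache"),   # present in the graph -> supported
    ("Aspirin", "TREATS", "Diabetes"),   # absent -> flagged
]
ok, flagged = verify_claims(claims)
```

In the real system the membership test would be a Neo4j lookup (with entity resolution first), but the supported/flagged split is the core idea.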

## 🧠 Models Used

- **Extraction**: DeepSeek-R1-Distill-Qwen-1.5B (`deepseek-r1:1.5b`) via Ollama (local)
- **Reasoning/QA**: Llama-3.3-70B-Versatile via the Groq Cloud API (fast inference)
- **Embeddings**: BGE-small (`BAAI/bge-small-en-v1.5`)

## 💾 Data Stores

- **MongoDB**: document storage, triples, metadata, audit logs
- **Neo4j**: canonical knowledge graph with versioning
- **FAISS**: vector similarity search (CPU-based)
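The configuration later in this README pairs a FAISS `IndexFlatIP` (inner product) index with 384-dimensional BGE embeddings; when vectors are L2-normalized before indexing, inner product equals cosine similarity. A NumPy-only sketch of that equivalence (FAISS itself is deliberately not imported here, and the data is random):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 384)).astype("float32")  # 5 stand-in doc embeddings
query = rng.normal(size=(384,)).astype("float32")

# L2-normalize, as one would before adding vectors to an IndexFlatIP index
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = docs @ query          # inner product == cosine similarity here
best = int(np.argmax(scores))  # index of the most similar document
```

With real FAISS, `docs` would go through `index.add(...)` and the top-k lookup through `index.search(...)`, but the scoring math is the same.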

๐Ÿ“ Project Structure

graphbuilder-rag/
โ”œโ”€โ”€ services/
โ”‚   โ”œโ”€โ”€ ingestion/          # Document ingestion
โ”‚   โ”œโ”€โ”€ normalization/      # Text extraction & cleaning
โ”‚   โ”œโ”€โ”€ extraction/         # Triple extraction (rules + LLM)
โ”‚   โ”œโ”€โ”€ embedding/          # BGE embeddings + FAISS
โ”‚   โ”œโ”€โ”€ entity_resolution/  # Entity linking & deduplication
โ”‚   โ”œโ”€โ”€ validation/         # Fact validation engine
โ”‚   โ”œโ”€โ”€ fusion/             # Neo4j graph fusion
โ”‚   โ”œโ”€โ”€ retrieval/          # Hybrid retrieval
โ”‚   โ”œโ”€โ”€ query/              # QA service with GraphVerify
โ”‚   โ””โ”€โ”€ agents/             # Self-healing agents
โ”œโ”€โ”€ shared/
โ”‚   โ”œโ”€โ”€ config/             # Configuration management
โ”‚   โ”œโ”€โ”€ database/           # DB connectors
โ”‚   โ”œโ”€โ”€ models/             # Pydantic schemas
โ”‚   โ”œโ”€โ”€ prompts/            # LLM prompt templates
โ”‚   โ””โ”€โ”€ utils/              # Shared utilities
โ”œโ”€โ”€ workers/                # Celery task workers
โ”œโ”€โ”€ api/                    # FastAPI endpoints
โ”œโ”€โ”€ tests/                  # Unit & integration tests
โ”œโ”€โ”€ docker/                 # Docker configs
โ””โ”€โ”€ deployment/             # K8s/compose configs

## 🚀 Quick Start

### 1. Install Services

**macOS:**

```bash
brew install mongodb-community neo4j redis ollama tesseract poppler
```

**Linux:**

```bash
# See SETUP.md for detailed Linux installation
```

### 2. Start Services

```bash
# macOS
brew services start mongodb-community
brew services start neo4j
brew services start redis
ollama serve &

# Pull the Ollama model (used for extraction only)
ollama pull deepseek-r1:1.5b

# Get a Groq API key for Q&A (free tier available)
# Visit: https://console.groq.com/keys
```

### 3. Set Up the Project

```bash
# Clone and set up
git clone <repository-url>
cd graphbuilder-rag
chmod +x setup.sh
./setup.sh
```

### 4. Run the Application

**Option A: separate terminals**

```bash
# Terminal 1: API
python -m api.main

# Terminal 2: Worker
celery -A workers.tasks worker --loglevel=info --concurrency=4

# Terminal 3: Beat
celery -A workers.tasks beat --loglevel=info

# Terminal 4: Agents (optional)
python -m agents.agents
```

**Option B: tmux (all-in-one)**

```bash
chmod +x run.sh
./run.sh
```

### 5. Test the API

Ingest a document:

```bash
curl -X POST http://localhost:8000/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "source_type": "HTML",
    "metadata": {"topic": "AI"}
  }'
```

Query the system:

```bash
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{ "question": "What are the side effects of aspirin?", "max_chunks": 5, "graph_depth": 2 }'
```
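The same query can also be issued from Python with only the standard library; a sketch using the endpoint and payload from the curl command above (the response shape is an assumption):

```python
import json
from urllib import request

payload = {
    "question": "What are the side effects of aspirin?",
    "max_chunks": 5,
    "graph_depth": 2,
}
body = json.dumps(payload).encode()

req = request.Request(
    "http://localhost:8000/api/v1/query",
    data=body,  # presence of a body makes this a POST
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:      # uncomment with the API running
#     print(json.load(resp))              # response fields are an assumption
```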


## ๐Ÿ”ง Configuration

Edit `config/config.yaml`:

```yaml
mongodb:
  uri: mongodb://localhost:27017
  database: graphbuilder_rag

neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  password: password

ollama:
  base_url: http://localhost:11434
  extraction_model: deepseek-r1:1.5b  # For entity/relationship extraction

groq:
  api_key: your-groq-api-key-here  # Get from https://console.groq.com/keys
  model: llama-3.3-70b-versatile  # For fast Q&A reasoning

faiss:
  index_type: IndexFlatIP
  embedding_dim: 384

agents:
  reverify_interval: 86400  # 24 hours
  conflict_check_interval: 3600  # 1 hour
```
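Files in this format can be read with PyYAML (`pip install pyyaml`); a sketch, parsing an inline fragment of the config above for illustration (whether the repo's `shared/config/` loader works exactly this way is an assumption):

```python
import yaml  # PyYAML

# Inline fragment of config/config.yaml; the real loader would read the file.
raw = """
neo4j:
  uri: bolt://localhost:7687
  user: neo4j
faiss:
  index_type: IndexFlatIP
  embedding_dim: 384
"""
cfg = yaml.safe_load(raw)          # nested dict mirroring the YAML structure
dim = cfg["faiss"]["embedding_dim"]  # -> 384
```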

## 📊 Monitoring

Access metrics at:

- **API health**: http://localhost:8000/health
- **Metrics**: http://localhost:8000/metrics
- **Neo4j Browser**: http://localhost:7474
- **MongoDB Compass**: mongodb://localhost:27017

## 🧪 Testing

```bash
# Run all tests
pytest tests/

# Run specific service tests
pytest tests/services/extraction/

# Run integration tests
pytest tests/integration/
```
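The rule-based half of the extraction service lends itself to small, table-style pytest tests. A self-contained sketch of the pattern (the toy extractor below is illustrative only, not one of the repo's actual rules):

```python
import re

def extract_located_in(sentence):
    """Toy rule: '<X> is located in <Y>' -> (X, 'LOCATED_IN', Y), else None."""
    m = re.match(r"(.+?) is located in (.+?)\.?$", sentence)
    return (m.group(1), "LOCATED_IN", m.group(2)) if m else None

def test_extract_located_in():
    assert extract_located_in("Paris is located in France.") == (
        "Paris", "LOCATED_IN", "France",
    )
    assert extract_located_in("Unrelated sentence") is None

test_extract_located_in()  # pytest would collect this; here we run it directly
```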

## 📖 Documentation

- Setup & Installation
- Architecture & Design
- Usage & Testing
- Advanced Topics

๐Ÿค Contributing

See CONTRIBUTING.md

๐Ÿ“„ License

MIT License - see LICENSE
