📊 Data Health Monitor

A lightweight data monitoring system that detects data quality issues and anomalies using Python and Streamlit.

🚀 Overview

This project simulates a real-world data pipeline and provides:

Data ingestion from multiple CSV files
Data validation (null checks, spike detection)
Anomaly detection using rolling statistics (Z-score)
Interactive dashboard for monitoring data health

📸 Demo

Dashboard Overview

Anomaly Detection

View raw data

🧩 Features

📥 Load multiple CSV files automatically
⚠️ Detect null values and sudden spikes
🚨 Identify anomalies in revenue trends
📊 Visualize metrics with Streamlit
🧠 Compute a simple data health score

🏗️ Project Structure

data-health-monitor/
│
├── data/
│   └── raw/              # Generated CSV files
│
├── data_generator.py     # Generates sample data
├── validator.py          # Validation + anomaly logic
├── app.py                # Streamlit dashboard
│
├── requirements.txt
└── README.md

⚙️ Installation

git clone <your-repo-url>
cd data-health-monitor
pip install -r requirements.txt

▶️ How to Run

1. Generate Sample Data

python data_generator.py

Run this multiple times to simulate incoming data.

2. Start Dashboard

streamlit run app.py

How It Works

Data Loading

Loads all CSV files from data/raw/
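
The loading step can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses only the standard library, and the function name `load_csv_dir` and the extra `source_file` key are hypothetical.

```python
import csv
from pathlib import Path


def load_csv_dir(directory):
    """Read every *.csv file in `directory` into one list of row dicts."""
    rows = []
    # sorted() gives a deterministic file order regardless of the OS
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name  # hypothetical: remember the origin file
                rows.append(row)
    return rows  # empty list when the directory contains no CSV files
```

Returning an empty list for a directory without CSV files lets the caller show a "no data yet" message instead of crashing.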

Validation Checks

Detects null values in revenue
Identifies abnormal spikes in user counts
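
A rough sketch of these two checks is shown below. The column names `revenue` and `users`, the function names, and the spike rule (a value above three times the median) are all assumptions made for illustration; the README does not state how a spike is defined.

```python
from statistics import median


def find_null_revenue(rows):
    """Return indices of rows whose revenue is missing or blank."""
    return [i for i, row in enumerate(rows)
            if row.get("revenue") in (None, "")]


def find_user_spikes(rows, factor=3.0):
    """Return indices of rows whose user count exceeds factor * median."""
    # Keep only rows that actually carry a user count
    counts = [(i, float(row["users"])) for i, row in enumerate(rows)
              if row.get("users") not in (None, "")]
    if not counts:
        return []
    baseline = median(value for _, value in counts)
    # Assumed rule: a spike is any value above `factor` times the median
    return [i for i, value in counts if value > factor * baseline]
```

The median is used as the baseline here because a single large spike barely moves it, whereas it would pull a mean upward.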

Anomaly Detection

Uses rolling mean and standard deviation
Flags values with high Z-scores
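
One way to express the rolling Z-score idea is sketched below, using only the standard library. The window size, the threshold, the choice to compare each point against the window that precedes it, and the function name are assumptions, not values taken from the project.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=7, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the `window` points before index i
        mu = mean(history)
        sigma = stdev(history)          # sample standard deviation
        if sigma == 0:
            # A flat window has no spread, so the Z-score is undefined; skip it
            continue
        z = (values[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies
```

Skipping windows with zero standard deviation avoids a division by zero, at the cost of not flagging a jump that follows a perfectly constant run.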

Health Score

Starts from 100
Penalizes data issues and detected anomalies
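
A scoring rule of this shape might look like the sketch below. The penalty weights and the clamp at zero are hypothetical; the README only says the score starts at 100 and is reduced by issues and anomalies.

```python
def health_score(issue_count, anomaly_count, issue_penalty=5, anomaly_penalty=10):
    """Start at 100 and subtract a fixed penalty per issue and per anomaly."""
    # Hypothetical weights: the project's real penalties are not documented
    score = 100 - issue_count * issue_penalty - anomaly_count * anomaly_penalty
    return max(0, score)  # assumed floor so the score never goes negative
```

With these assumed weights, two data issues and one anomaly give a score of 80.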

📈 Example Workflow

Generate raw data using the data generator
The system ingests and aggregates the CSV files
Validation checks identify missing or inconsistent data
Anomaly detection flags unusual trends
Dashboard displays health score and insights

🧠 Design Decisions

Why CSV-based ingestion?
Simulates batch data pipelines commonly used in analytics workflows.

Why Z-score for anomaly detection?
Chosen for simplicity and interpretability in early-stage monitoring systems.

Why focus on data validation instead of modeling?
In real-world systems, ensuring data quality is critical before any downstream analysis or ML.

Why a simple health score?
Provides a quick, interpretable summary for non-technical stakeholders.

🧪 Edge Cases Handled

Empty data directory (graceful failure)
Missing values in critical columns
Division-by-zero in anomaly calculations
Sudden spikes in user activity

⚠️ Limitations

Assumes structured input data with a predefined schema
Uses simple statistical methods (may not detect complex anomalies)
Batch-based processing (no real-time streaming support)
Limited scalability for large datasets

Future Improvements

Support for user-uploaded datasets
Dynamic schema detection
Alerts (email/Slack)
Scheduling with Airflow
Database integration

💬 What I Learned

Importance of data validation before analysis
Handling edge cases in data pipelines
Designing simple but effective monitoring systems
Building user-friendly dashboards for technical insights

📜 License

MIT License
