1- Data Mining / Main Repository

Institution: Pontifical Catholic University of São Paulo (PUC-SP)

School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics

Important

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.

🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix

Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

📺 For better resolution, watch the video on YouTube.

Tip

If you’d like to explore the Full Statistics Materials from the 1st year (not only the review), you can visit the complete repository Here.

Course Overview

This course introduces data mining techniques with a focus on unsupervised learning methods, including:

Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
Principal Component Analysis (PCA)
Dictionary Learning
Novelty and outlier detection

Students will work on practical projects inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in open repositories and made available to the broader community, schools, libraries, and non-profits.

Objectives

Enable students to plan, conduct, and complete a research project applying key data mining concepts, algorithms, and methodologies.

Syllabus

Fundamentals of Data Mining
Data cleaning and preparation
Predictive analysis
Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
Principal Component Analysis (PCA)
Dictionary Learning
Novelty and outlier detection
Application of concepts to real-world consulting scenarios

Statistical Review - Stats Measures - Mean - Median - Mode - Variance

https://github.com/Quantum-Software-Development/7-DataMining-Regression-Techniques-Data-Integration

Weekly Schedule

Week	Repos	Methodology	Tools
1	Course introduction	Active methodology	–
2	Statistical Review - Stats Measures - Mean - Median - Mode - Variance	Active methodology	Python
3	Statistical Review - Variation Measures and Standard Deviation	Active methodology	Python
4	Data Mining - Concepts - Exploratory Analysis	Active methodology	Python - R
5	Data Cleaning - Preparation - Anomalies (Outliers)	Active methodology	Python
6	Data Mining - Pre Processing	Active methodology	Python
7	Regression Techniques with Data Integration	Active methodology	Python
8	Predictive K-Means Clustering Data and Figures Analysis	Active methodology	Python
9	* Project 1 – K-Means Clustering Repository Presentation	Active methodology	Python
10	Clustering Mean Shift	Active methodology	Python
11	Affinity Propagation	Active methodology	Python
12	* Project 2 – Clustering Algorithms Exploration and Comparison - K-Means - Mean Shift - Affinity Propagation	Active methodology	Python
13	Principal Component Analysis (PCA) and Isolation Forest Algorithms	Active methodology	Python
14	DBSCAN and Spectral Clustering	Active methodology	Python
15	* Project 3 – Clustering Algorithms Exploration and Comparison - K-Means - Mean Shift - DBSCAN	Active methodology	Python
16	Dictionary-Based Feature Grouping for LLM/AI Pipelines	Active methodology	Python
17	P2 Exam	Written (Individual)	–
18	P3 Exam & Grade Closure	Written (Individual)	–
19	Final grade submission	–	–

Tools and Technologies

Programming Language: Python
Libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
Environment: Jupyter Notebook or other Python IDEs

Installation and Setup

Follow these steps to set up your local environment for the course projects:

1. Clone the repository

git clone https://github.com/<username>/<repository-name>.git
cd <repository-name>

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows

3. Install dependencies

Make sure pip is updated:


pip install --upgrade pip

Then install the required packages:


pip install -r requirements.txt

(If requirements.txt is not provided, install manually:)


pip install numpy pandas scikit-learn matplotlib seaborn jupyter

4. Run Jupyter Notebook

jupyter notebook

5. Open course notebooks and start practicing.
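
Before opening the notebooks, it can help to confirm that the packages from step 3 are visible to the active interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the course materials.

```python
from importlib.util import find_spec

# Import names of the packages installed in step 3
# (scikit-learn is imported under the name "sklearn")
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the import names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All course packages are available.")
```

If the script reports missing packages, check that the virtual environment from step 2 is activated before running `pip install` again.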

I - Introduction and Assessment

Exam	Date	Format	Weight
P1	01/10/2025	Written – Individual	Arithmetic mean
P2	19/11/2025	Written – Individual	Arithmetic mean
P3	Substitution exam	Written – Individual	Replaces lowest score

Final Grade: Arithmetic mean of assessments.

II - class_2- Introduction - Data Mining With Python

☞ Access Booklet

Example 1

The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.

Data:

20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38

Step 1: Determine Range and Number of Classes

Minimum value: 2
Maximum value: 120
Range: 120 - 2 = 118
Number of classes ($k$): 8 (given)

Step 2: Calculate Class Width

$$ \huge w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = 15 $$
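
The same calculation can be sketched in plain Python. The variable names are illustrative; the values come from Step 1.

```python
import math

# Sample bounds and number of classes from Step 1
minimum, maximum, k = 2, 120, 8

# Class width: range divided by the number of classes, rounded up
# so that k classes are guaranteed to reach the maximum value
width = math.ceil((maximum - minimum) / k)

print(width)  # 15
```

Rounding up matters here: 118 / 8 = 14.75, and a width of 14 would leave the maximum value 120 outside the last class.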

Step 3: Construct Class Intervals (from minimum value)

Class Interval	Explanation
2 - 16	Starts from minimum 2
17 - 31	16 + 1 to 31
32 - 46	Next range
47 - 61	Next range
62 - 76	Next range
77 - 91	Next range
92 - 106	Next range
107 - 121	Covers maximum 120
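
The class limits in the table above follow a simple pattern, which a short sketch can generate. It assumes integer data and inclusive limits, as in this example; the variable names are illustrative.

```python
# Values from Steps 1 and 2
minimum, k, width = 2, 8, 15

# Each class starts `width` units after the previous one; its upper
# limit is one less than the lower limit of the next class
intervals = [
    (minimum + i * width, minimum + (i + 1) * width - 1)
    for i in range(k)
]

for lower, upper in intervals:
    print(f"{lower} - {upper}")
```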

Step 4: Frequency Distribution Table

Class Interval	Frequency
2 - 16	5
17 - 31	14
32 - 46	8
47 - 61	11
62 - 76	4
77 - 91	7
92 - 106	6
107 - 121	5

Step 5: Calculate Midpoints for Each Class

$$ \Huge x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2} $$

Class Interval	Midpoint ($x_i$)
2 - 16	9
17 - 31	24
32 - 46	39
47 - 61	54
62 - 76	69
77 - 91	84
92 - 106	99
107 - 121	114

Step 6: Calculate Mean Using Frequency and Midpoints

The mean ($\bar{x}$) is calculated by:

$$ \Huge \bar{x} = \frac{\sum f_i x_i}{\sum f_i} $$

Where: $f_i$ = frequency, $x_i$ = midpoint.

Calculate each product:

Class Interval	$f_i$	$x_i$	$f_i \times x_i$
2 - 16	5	9	45
17 - 31	14	24	336
32 - 46	8	39	312
47 - 61	11	54	594
62 - 76	4	69	276
77 - 91	7	84	588
92 - 106	6	99	594
107 - 121	5	114	570

Sum of frequencies: $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5$ = 60 (matching the 60 users in the sample)

Sum of products: $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570$ = 3315

Calculate mean:

$$ \huge \bar{x} = \frac{3315}{60} = 55.25 $$
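
As a cross-check on Steps 4 to 6, the tally and the grouped mean can be recomputed directly from the raw sample. This is a plain-Python sketch with illustrative variable names, not the course notebook.

```python
# Raw sample from Example 1 (minutes watched by 60 cable TV users)
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

minimum, k, width = 2, 8, 15
intervals = [(minimum + i * width, minimum + (i + 1) * width - 1) for i in range(k)]

# Tally: integer division maps each value to the index of its class
freq = [0] * k
for value in data:
    freq[(value - minimum) // width] += 1

# Midpoint of each class, then the frequency-weighted mean
midpoints = [(lower + upper) / 2 for lower, upper in intervals]
total = sum(freq)
weighted_sum = sum(f * x for f, x in zip(freq, midpoints))
grouped_mean = weighted_sum / total

for (lower, upper), f in zip(intervals, freq):
    print(f"{lower} - {upper}\t{f}")
print("Sum of frequencies:", total)
print("Sum of products:", weighted_sum)
print("Grouped mean:", grouped_mean)
```

The sum of frequencies should equal the sample size; if it does not, a value was counted twice or dropped during the tally.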

Step 7: Histogram, Bar Plot and Time Series Frequency Distribution Over Time

Construct a histogram, a bar plot, and a time series chart. In the histogram and bar plot, class intervals go on the x-axis and frequencies on the y-axis.
Each bar height corresponds to the frequency of the class.

☞ Access Code

☞ Access Dataset

☞ Access Plots

Frequency Analysis and Time Series Visualization

This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.

1. Install and Import Libraries

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Load Dataset

# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('path/to/your_dataset.csv', sep=';')

# Select only the "day" column
df1 = df['day']

3. Calculate Frequencies

# Calculate absolute frequency (ascending order)
freq_abs = pd.Series(df1).value_counts(ascending=True)

# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = pd.Series(df1).value_counts(normalize=True).round(3)

# Create a DataFrame with both measures
df_freq = pd.DataFrame({
    'Absolute Frequency': freq_abs,
    'Relative Frequency': freq_rel
})

# Display the frequency table
display(df_freq)

4. Histogram (Dark Theme)

# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)

# Customize labels and ticks
plt.xlabel("Values", color='white')      # white text so labels stay visible on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()

5. Bar Plot (Dark Theme)

# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)

# Customize labels and ticks
plt.xlabel("Values", color='white')      # white text so labels stay visible on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')

# Show plot
plt.show()

6. Time Series Preparation

# Inspect available columns
print(df.columns)

# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()

# Add dummy year (if year column is missing)
df_time_series['year'] = 2022

# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)

# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')

# Set "date" as index
df_time_series = df_time_series.set_index('date')

# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()

# Display first rows
display(daily_counts.head())
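
The date format used in this step can be tried without a dataset. The sketch below relies only on the standard library and a few made-up (day, month) pairs; it assumes English month abbreviations, which is what `%b` expects under an English or C locale.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (day, month) pairs standing in for the dataset's columns
rows = [(5, "may"), (5, "may"), (6, "may"), (12, "jun")]
DUMMY_YEAR = 2022  # same placeholder year as in the step above

# Build "day-month-year" strings and parse them with the same format string
dates = [
    datetime.strptime(f"{day}-{month}-{DUMMY_YEAR}", "%d-%b-%Y").date()
    for day, month in rows
]

# Count occurrences per calendar day
daily_counts = Counter(dates)
for day_value, count in sorted(daily_counts.items()):
    print(day_value.isoformat(), count)
```

If the month column uses full names or another language, the format string (or the locale) would need to change accordingly.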

7. Time Series Plot (Dark Theme)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot time series
plt.plot(daily_counts, color='turquoise')

# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()

Summary

Dummy Year: 2022 was used when year column was missing.

Visualizations: Histograms, bar plots, and time series chart.

III - class_3- Stats Review

Tip

Access Class_3

IV - class_4- Data Mining - Concepts - Exploratory Analysis

Tip

Access Class_4

V - class_5- Data Cleaning - Preparation - Anomalies(Outliers)

Tip

Access Class_5

VI - class_6- Data Mining - Pre Processing

Tip

Access Class_6

VII - class_7- Normalization

Tip

Access Class_7

⚠️ Coming Soon

VIII - class_8 - KMeans_NonHierarchical_Clustering

Tip

Access Class_8 - KMeans_NonHierarchical_Clustering

⚠️ Coming Soon

IX - class_8 - KMeans_NonHierarchical_Clustering

Tip

Access Class_8

⚠️ Coming Soon

Bibliography

1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence – A Machine Learning Approach. 2nd Ed. LTC.

3. Larson & Farber (2015). Applied Statistics. Pearson.

Complementary Bibliography

THOMAS, C. Data Mining. IntechOpen, 2018.
HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.

💌 Let the data flow... Ping Me !

🛸๋ My Contacts Hub

────────────── 🔭⋆ ──────────────

➣➢➤ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
