Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4
For better resolution, watch the video on YouTube.
Tip
If you'd like to explore the full statistics materials from the 1st year (not only the review), visit the complete repository here.
- Course Overview
- Objectives
- Syllabus
- Weekly Schedule
- Tools and Technologies
- Installation and Setup
- Assessment
- Bibliography
- Notes
This course introduces data mining techniques with a focus on unsupervised learning methods, including:
- Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
Students will work on practical projects inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in open repositories and made available to the broader community, schools, libraries, and non-profits.
Enable students to plan, conduct, and complete a research project applying key data mining concepts, algorithms, and methodologies.
- Fundamentals of Data Mining
- Data cleaning and preparation
- Predictive analysis
- Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
- Application of concepts to real-world consulting scenarios
Statistics Review: Statistical Measures (Mean, Median, Mode, Variance)
https://github.com/Quantum-Software-Development/7-DataMining-Regression-Techniques-Data-Integration
- Programming Language: Python
- Libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
- Environment: Jupyter Notebook or other Python IDEs
Follow these steps to set up your local environment for the course projects:
1. Clone the repository
git clone https://github.com/<username>/<repository-name>.git
cd <repository-name>
2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Mac/Linux
venv\Scripts\activate  # Windows
3. Install dependencies
Make sure pip is updated:
pip install --upgrade pip
Then install the required packages:
pip install -r requirements.txt
(If requirements.txt is not provided, install manually:)
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
4. Run Jupyter Notebook
jupyter notebook
5. Open course notebooks and start practicing.
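After installation, a quick check like the sketch below can confirm which of the course libraries are importable in the active environment. It relies only on the standard library; the helper name `check_packages` is illustrative and not part of the course materials, and the package list simply mirrors the libraries named in this README.

```python
from importlib.util import find_spec

# Import names of the libraries listed in this README.
# Note: scikit-learn is imported as "sklearn".
COURSE_PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_packages(COURSE_PACKAGES).items():
        print(f"{name:<12} {'OK' if available else 'MISSING'}")
```

Any package reported as MISSING can be installed with the `pip install` command shown in step 3.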
| Exam | Date | Format | Weight |
|---|---|---|---|
| P1 | 01/10/2025 | Written - Individual | Arithmetic mean |
| P2 | 19/11/2025 | Written - Individual | Arithmetic mean |
| P3 | Substitution exam | Written - Individual | Replaces lowest score |
Final Grade: Arithmetic mean of assessments.
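The grading rule can be sketched as a small function. This is an illustration only: the function name, the example scores, and the assumption that P3 replaces the lowest of P1 and P2 only when it is higher are not stated in the syllabus.

```python
def final_grade(p1, p2, p3=None):
    """Arithmetic mean of P1 and P2, with an optional substitution exam P3.

    Assumption (not stated in the syllabus): P3 replaces the lowest of
    P1/P2 only when it is higher than that score.
    """
    scores = [p1, p2]
    if p3 is not None:
        lowest = min(scores)
        if p3 > lowest:
            scores[scores.index(lowest)] = p3
    return sum(scores) / len(scores)


print(final_grade(6.0, 8.0))       # 7.0
print(final_grade(4.0, 8.0, 7.0))  # 7.5 (P3 replaces the 4.0)
```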
Access Booklet
The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.
Data:
20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38
Step 1: Determine Range and Number of Classes
- Minimum value: 2
- Maximum value: 120
- Number of classes ($k$): 8 (given)
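Step 2: Calculate Class Width
$$h = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = \lceil 14.75 \rceil = 15$$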
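A common convention, and the one consistent with the intervals in Step 3, is to divide the range by the number of classes and round up to the next integer. A minimal sketch, with illustrative variable names:

```python
import math

minimum, maximum = 2, 120  # from Step 1
k = 8                      # number of classes (given)

data_range = maximum - minimum            # 120 - 2
class_width = math.ceil(data_range / k)   # 14.75 rounded up

print(data_range, class_width)  # 118 15
```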
Step 3: Construct Class Intervals (from minimum value)
| Class Interval | Explanation |
|---|---|
| 2 - 16 | Starts from minimum 2 |
| 17 - 31 | 16 + 1 to 31 |
| 32 - 46 | Next range |
| 47 - 61 | Next range |
| 62 - 76 | Next range |
| 77 - 91 | Next range |
| 92 - 106 | Next range |
| 107 - 121 | Covers maximum 120 |
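The same intervals can be generated programmatically. The sketch below assumes integer data, so each class covers 15 consecutive integers starting at the minimum; the variable names are illustrative.

```python
minimum, class_width, k = 2, 15, 8

# Each class spans `class_width` consecutive integers:
# [lower, lower + class_width - 1].
intervals = [
    (minimum + i * class_width, minimum + (i + 1) * class_width - 1)
    for i in range(k)
]

for lower, upper in intervals:
    print(f"{lower} - {upper}")
```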
Step 4: Frequency Distribution Table
| Class Interval | Frequency |
|---|---|
| 2 - 16 | 5 |
| 17 - 31 | 14 |
| 32 - 46 | 8 |
| 47 - 61 | 11 |
| 62 - 76 | 4 |
| 77 - 91 | 7 |
| 92 - 106 | 6 |
| 107 - 121 | 5 |
Step 5: Calculate Midpoints for Each Class
| Class Interval | Midpoint ($x_i$) |
|---|---|
| 2 - 16 | 9 |
| 17 - 31 | 24 |
| 32 - 46 | 39 |
| 47 - 61 | 54 |
| 62 - 76 | 69 |
| 77 - 91 | 84 |
| 92 - 106 | 99 |
| 107 - 121 | 114 |
Step 6: Calculate Mean Using Frequency and Midpoints
The mean ($\bar{x}$) is calculated by:
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
| Class Interval | Frequency ($f_i$) | Midpoint ($x_i$) | $f_i \cdot x_i$ |
|---|---|---|---|
| 2 - 16 | 5 | 9 | 45 |
| 17 - 31 | 14 | 24 | 336 |
| 32 - 46 | 8 | 39 | 312 |
| 47 - 61 | 11 | 54 | 594 |
| 62 - 76 | 4 | 69 | 276 |
| 77 - 91 | 7 | 84 | 588 |
| 92 - 106 | 6 | 99 | 594 |
| 107 - 121 | 5 | 114 | 570 |
Sum of frequencies: $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5 = 60$, which matches the sample size.
Sum of products: $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570 = 3315$
Mean: $\bar{x} = 3315 / 60 = 55.25$ minutes
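The tally and the grouped mean can be cross-checked directly from the raw sample. The sketch below uses only the Python standard library; the variable names are illustrative, and the intervals are rebuilt from the minimum (2) and the class width (15) used in Step 3.

```python
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# Eight inclusive integer classes: 2-16, 17-31, ..., 107-121.
intervals = [(2 + 15 * i, 16 + 15 * i) for i in range(8)]

# Count how many observations fall inside each class.
frequencies = [sum(lower <= x <= upper for x in data) for lower, upper in intervals]

# Midpoint of each class, used as its representative value.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

for (lower, upper), f, m in zip(intervals, frequencies, midpoints):
    print(f"{lower:>3} - {upper:<3}  f = {f:>2}  midpoint = {m:g}")
print(f"n = {n}, grouped mean = {grouped_mean:.2f}")
```

Because every observation is replaced by its class midpoint, the grouped mean is an approximation of the mean of the raw values.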
Step 7: Histogram, Bar Plot, and Time Series of the Frequency Distribution
- Construct a histogram and a bar plot with class intervals on the x-axis and frequencies on the y-axis, plus a time series plot of frequency over time.
- Each bar height corresponds to the frequency of the class.
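A minimal matplotlib sketch of the histogram for this exercise, assuming bin edges placed halfway between adjacent integer classes (1.5, 16.5, ..., 121.5) so that each bar corresponds to exactly one class interval. The axis labels and figure size are illustrative choices.

```python
import matplotlib.pyplot as plt

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# Nine edges define eight bins; each edge sits between two integer classes.
edges = [1.5 + 15 * i for i in range(9)]

fig, ax = plt.subplots(figsize=(10, 4))
counts, _, _ = ax.hist(data, bins=edges, edgecolor="black")

ax.set_xticks(edges)
ax.set_xlabel("Minutes watched")
ax.set_ylabel("Frequency")
ax.set_title("Frequency distribution (8 classes)")
```

Call `plt.show()` in a notebook, or `fig.savefig("histogram.png")` in a script, to render the figure.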
Access Code
Access Dataset
Access Plots
### Frequency Analysis and Time Series Visualization
This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.
1. Install and Import Libraries
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Load Dataset
# Load CSV file (semicolon-separated)
df = pd.read_csv('choose your dataset', sep=';')  # placeholder: replace with the path to your CSV file
# Select only the "day" column
df1 = df['day']
3. Calculate Frequencies
# Calculate absolute frequency (ascending order)
freq_abs = pd.Series(df1).value_counts(ascending=True)
# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = pd.Series(df1).value_counts(normalize=True).round(3)
# Create a DataFrame with both measures
df_freq = pd.DataFrame({
'Absolute Frequency': freq_abs,
'Relative Frequency': freq_rel
})
# Display the frequency table
display(df_freq)
4. Histogram (Dark Theme)
# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')  # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
5. Bar Plot (Dark Theme)
# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')  # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')
# Show plot
plt.show()
6. Time Series Preparation
# Inspect available columns
print(df.columns)
# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()
# Add dummy year (if year column is missing)
df_time_series['year'] = 2022
# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)
# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')
# Set "date" as index
df_time_series = df_time_series.set_index('date')
# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()
# Display first rows
display(daily_counts.head())
7. Time Series Plot (Dark Theme)
# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot time series
plt.plot(daily_counts, color='turquoise')
# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
Dummy Year: 2022 is used when the year column is missing.
Visualizations: Histograms, bar plots, and time series chart.
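For readers without the dataset at hand, the frequency and time series steps can be tried on a toy DataFrame. The values below are made up purely for illustration; only the column names `day` and `month` and the dummy year 2022 follow the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
df = pd.DataFrame({
    "day": [5, 5, 6, 6, 6, 7],
    "month": ["May", "May", "May", "May", "May", "Jun"],
})

# Absolute and relative frequency of each day value, sorted by day.
freq_abs = df["day"].value_counts().sort_index()
freq_rel = df["day"].value_counts(normalize=True).sort_index().round(3)
df_freq = pd.DataFrame({
    "Absolute Frequency": freq_abs,
    "Relative Frequency": freq_rel,
})

# Build a date from day + month + dummy year, then count rows per date.
date_text = df["day"].astype(str) + "-" + df["month"] + "-2022"
dates = pd.to_datetime(date_text, format="%d-%b-%Y")
daily_counts = df.groupby(dates).size()

print(df_freq)
print(daily_counts)
```

The `%b` directive expects abbreviated English month names such as `May` or `Jun`; a dataset that stores months differently would need a different format string.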
III - class_3- Stats Review
Tip
Access Class_3
Tip
Access Class_4
Tip
Access Class_5
Tip
Access Class_6
VII - class_7- Normalization
Tip
Access Class_8 - KMeans_NonHierarchical_Clustering
Coming Soon
1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd Ed. LTC.
3. Larson & Farber (2015). Applied Statistics. Pearson.
- THOMAS, C. Data Mining. IntechOpen, 2018.
- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
- NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
- RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
- SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.
Copyright 2025 Quantum Software Development. Code released under the MIT License.


