Synkoc Data Science Internship · Week 1 · Lesson 2 of 11
Statistics for Data Science
The mathematics that powers every ML algorithm — Mean, Variance, Probability & Correlation explained clearly with real AI applications.
📊 Mean & Median
📏 Variance
🎲 Probability
🔗 Correlation
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏱ ~55 minutes
🟢 Beginner Friendly
Why Statistics Powers ML
Before a model can learn, we must understand the data it will train on. Statistics gives us the tools to describe, summarise, and find patterns. Every ML algorithm is built on statistical foundations.
📊
Mean
Centre of data. Used in normalisation, loss functions, and model evaluation.
ML: feature scaling, loss
📏
Variance & Std Dev
How spread out data is. The bias-variance tradeoff in ML is built on this concept.
ML: overfitting detection
🎲
Probability
Likelihood of events. Every classification model outputs a probability between 0 and 1.
ML: classification output
🔗
Correlation
How two variables move together. Correlation analysis is a core tool for feature selection.
ML: feature selection
Chapter 1 of 4
01
Mean, Median & Mode
Three ways to measure the centre of data. The foundation of every summary statistic you will compute on a real dataset.
Mean, Median & Mode
Three measures of central tendency — each answers "what is the typical value?" in a different way. Know all three and when to use each one.
Mean (Average)
Add all values, divide by count. Most common measure. Sensitive to outliers — one extreme value can make it misleading.
scores = [70, 80, 90, 60, 85]
mean = 385 / 5 = 77.0
⚡ ML: np.mean(), loss calculation, normalisation
📍
Median (Middle Value)
Sort, pick the middle value. Not affected by outliers — reliable for skewed data like salaries and house prices.
sorted: [60, 70, 80, 85, 90]
median = 80 (middle value)
⚡ ML: robust imputation, skewed features
🏆
Mode (Most Frequent)
The value that appears most often. The only measure that works on categorical data — labels, colours, categories.
labels = ["A","B","A","C","A"]
mode = "A" (appears 3 times)
⚡ ML: class imbalance, majority baseline
💡
When to use which?
Use mean for symmetric numeric data. Use median when outliers exist. Use mode for categories and class labels.
House prices → MEDIAN
Exam scores → MEAN
Eye colour → MODE
⚡ Always check for outliers first
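The decision framework above can be checked in a few lines with Python's statistics module (the sample values are illustrative):

```python
import statistics

exam_scores  = [70, 80, 90, 60, 85]           # roughly symmetric → mean
house_prices = [50, 55, 60, 65, 70, 75, 500]  # outlier → median
eye_colours  = ["brown", "blue", "brown", "green", "brown"]  # categorical → mode

print(sum(exam_scores) / len(exam_scores))  # 77.0
print(statistics.median(house_prices))      # 65
print(statistics.mode(eye_colours))         # brown
```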
Median & Mode — When Mean Fails
The mean is not always the best measure. Outliers can destroy it. Median and Mode are the alternatives used in real data science work.
💡 Real Life Analogy — The Salary Problem
Imagine 9 employees earning ₹30,000 per month and 1 CEO earning ₹10,00,000. The mean salary is ₹1,27,000 — but 9 out of 10 people earn far less than that. The mean is misleading because one extreme value — the outlier — pulled it up. The median, which is the middle value when sorted, gives ₹30,000 — which actually represents what a typical employee earns. This is why news reports about income inequality use median salary, not mean salary.
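The salary example above, worked through in plain Python:

```python
# 9 employees at ₹30,000 and 1 CEO at ₹10,00,000
salaries = [30_000] * 9 + [10_00_000]

mean_salary = sum(salaries) / len(salaries)
sorted_s = sorted(salaries)
median_salary = (sorted_s[4] + sorted_s[5]) / 2  # even count: average the middle two

print(mean_salary)    # 127000.0 — pulled up by one outlier
print(median_salary)  # 30000.0 — what a typical employee earns
```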
📈
Median — The Middle Value
Sort all values. Take the middle one. If even count, average the two middle values. Not affected by outliers. Use for skewed data like house prices, salaries, income.
scores = [45, 60, 62, 70, 75,
          78, 80, 82, 95, 99]
# sorted, 10 values (even count)
# median = (75 + 78) / 2 = 76.5

import statistics
print(statistics.median(scores))  # 76.5
🏆
Mode — The Most Frequent Value
The value that appears most often. Use for categorical data — most common label, most frequent category, most purchased product.
grades = ["A", "B", "A", "C", "A",
          "B", "A", "C", "B", "A"]
# A appears 5 times — most frequent

import statistics
print(statistics.mode(grades))  # A

# ML: most frequent class label
# used in majority-class classifier
ML Connection: When you call df["salary"].describe() in Pandas you see the mean and the 50th percentile (the median). For skewed columns, missing values are typically imputed with the median: df.fillna(df.median()). This is one of the most common imputation strategies in real ML projects.
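In code, the median-imputation pattern looks roughly like this (a minimal sketch assuming pandas is installed; the salary values are made up):

```python
import pandas as pd

# A skewed salary column with one missing value and one outlier
df = pd.DataFrame({"salary": [30_000, 32_000, None, 35_000, 10_00_000]})

print(df["salary"].describe())  # count, mean, 50% (the median), ...

# Impute the missing value with the median — robust to the outlier
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df["salary"].tolist())    # NaN replaced by 33500.0
```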
Mean vs Median vs Mode — How to Choose
Choosing the wrong measure gives you the wrong insight. Use this decision framework for every dataset you encounter.
⚖️
Use Mean when...
Data is roughly symmetric with no extreme outliers. Works perfectly for exam scores, temperature readings, model accuracy values.
accuracies = [0.87, 0.89, 0.91,
              0.88, 0.90, 0.92]
# Symmetric — no outliers
# Mean is appropriate here
mean_acc = sum(accuracies) / len(accuracies)
print(f"Mean accuracy: {mean_acc:.3f}")
⚠️
Use Median when...
Data is skewed or has outliers. House prices, salaries, time spent on a website, transaction amounts — all have outliers. Always use median.
house_prices = [50, 55, 60, 65,
                70, 75, 500]  # outlier!
# Mean = 125 (misleading!)
# Median = 65 (representative)
import statistics
print(statistics.median(house_prices))  # 65
central_tendency_ml.py
# In ML: always check for skewness before choosing an imputation strategy
import statistics as stats

ages = [22, 25, 27, 28, 30, 32, 85]  # 85 is an outlier
print(f"Mean: {stats.mean(ages):.1f}")     # 35.6 — pulled up by the outlier
print(f"Median: {stats.median(ages):.1f}") # 28.0 — representative

# Rule: if mean >> median → data is right-skewed → use the median for imputation
# Rule: if mean ≈ median → data is symmetric → the mean is fine
Chapter 2 of 4
02
Variance & Standard Deviation
Measure how spread out data is. The foundation of the bias-variance tradeoff — the most important concept in ML engineering.
Variance & Standard Deviation
Mean tells you the centre. Variance tells you the spread. Two datasets can have the same mean but completely different variance — changing everything in ML.
📏
Variance = Average Squared Distance from Mean
For each value: subtract the mean and square the result. Average all those squared differences. Square root of variance = standard deviation — in the same units as your original data.
scores = [70, 80, 90, 60, 85]
mean = 77.0
diffs = [-7, 3, 13, -17, 8]
sq_diffs = [49, 9, 169, 289, 64]
variance = (49+9+169+289+64) / 5 = 116.0
std_dev = sqrt(116) ≈ 10.77
ML: High model variance = overfitting. High bias = underfitting. The bias-variance tradeoff is the #1 engineering challenge in machine learning.
What Spread Looks Like
Two classes with almost the same mean score (≈75) — but their spreads are completely different, and this changes how a model treats them.
Class A — Low Variance
σ² ≈ 4.6 · Consistent results
Scores: 72, 74, 76, 77, 78
✅ Easy for ML — predictable
Class B — High Variance
σ² = 580 · Highly inconsistent
Scores: 40, 60, 75, 90, 110
⚠️ Harder for ML — noisy data
🤖
Why this matters in your ML projects
In Pandas (Week 2), a column with std dev near zero is useless — every row has almost the same value. A column with very high std dev may need normalisation before training. Always check variance before modelling.
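That pre-modelling check can be sketched in plain Python (the column names and values here are illustrative, not from a real dataset):

```python
import statistics

features = {
    "constant_flag": [1, 1, 1, 1, 1, 1],       # std ≈ 0 — carries no signal
    "age":           [22, 25, 31, 28, 45, 38],
    "income":        [20_000, 85_000, 30_000, 120_000, 55_000, 44_000],
}

for name, col in features.items():
    std = statistics.pstdev(col)               # population std dev
    verdict = "drop (near-zero std)" if std < 1e-9 else "keep"
    print(f"{name}: std={std:.1f} -> {verdict}")
```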
Standard Deviation — Spread in Original Units
Variance is in squared units — hard to interpret. Standard deviation is the square root of variance, giving spread in the original units. This is what you use in practice.
💡 Real Life Analogy — The Tailor
A tailor making school uniforms knows the average height of students is 155 cm. But she also needs to know how much heights vary. Variance would give her something in cm squared — useless for cutting fabric. Standard deviation gives her the answer in centimetres: heights typically vary by plus or minus 8 cm from the mean. She now knows to make sizes from 147 cm to 163 cm to cover most students. Standard deviation gives you the spread in the same units as your data — directly usable.
📏
std = sqrt(variance) — spread in original units
If exam scores have a mean of 70 and a standard deviation of 10, you immediately know: most students scored between 60 and 80. About 68% of all values in a normal distribution fall within one standard deviation of the mean. About 95% fall within two standard deviations.
scores = [60, 65, 68, 70, 72, 75, 78, 80, 85, 90]

mean = sum(scores) / len(scores)  # 74.3
var = sum((x - mean)**2 for x in scores) / len(scores)
std = var ** 0.5  # 8.7

print(f"Mean: {mean:.1f} Std: {std:.1f}")
print(f"Typical range: {mean-std:.1f} to {mean+std:.1f}")
ML Connection: StandardScaler in sklearn subtracts the mean and divides by standard deviation for every feature. This is called standardisation. It puts all features on the same scale so that a feature with large values like salary does not dominate a feature with small values like age.
Normal Distribution — The Bell Curve
The most important distribution in statistics and machine learning. Many natural phenomena follow this pattern — heights, exam scores, measurement errors, model prediction errors.
💡 The 68-95-99.7 Rule — Memorise This
In any normal distribution: 68% of values fall within 1 standard deviation of the mean. 95% fall within 2 standard deviations. 99.7% fall within 3 standard deviations. This means if exam scores have mean 70 and std 10: 68% of students scored between 60 and 80. 95% scored between 50 and 90. Only 0.3% scored below 40 or above 100 — the extreme outliers.
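You can verify the 68-95-99.7 rule directly with Python's statistics.NormalDist — no simulation needed:

```python
from statistics import NormalDist

nd = NormalDist(mu=70, sigma=10)  # exam scores: mean 70, std 10

for k in (1, 2, 3):
    # probability mass between (mean - k·std) and (mean + k·std)
    p = nd.cdf(70 + k * 10) - nd.cdf(70 - k * 10)
    print(f"within {k} std: {p:.1%}")
# within 1 std: 68.3%
# within 2 std: 95.4%
# within 3 std: 99.7%
```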
📈
Why Normal Distribution Matters in ML
Many ML algorithms assume features are normally distributed. Linear regression assumes normally distributed errors. Naive Bayes assumes normally distributed features. When your data is skewed, you transform it to make it more normal.
import statistics
scores = [55, 60, 65, 68, 70, 72,
          75, 78, 80, 85, 90, 95]
mean = statistics.mean(scores)
std = statistics.stdev(scores)
print(f"Mean: {mean:.1f}")
print(f"Std: {std:.1f}")
⚖️
Z-Score — How Far from Mean
A z-score tells you how many standard deviations a value is from the mean. Z-score of 0 = at the mean. Z-score of 2 = 2 std above mean (top 2.5%). Used for outlier detection.
def z_score(value, mean, std):
    return (value - mean) / std

# score of 95 in a class
# mean=70, std=10
z = z_score(95, 70, 10)
print(f"Z-score: {z}")  # 2.5
# 2.5 std above average — roughly the top 0.6%
ML Connection: StandardScaler converts every feature value to its z-score. After scaling, every feature has mean=0 and std=1. This is why standardised features make gradient descent converge much faster in neural networks — all features are on the same z-score scale.
mean_variance_demo.py · ● LIVE
scores = [60, 65, 70, 72, 75, 78, 80, 85, 90, 95]

# ── Mean ──────────────────────────────────────────
mean = sum(scores) / len(scores)
print(f"Mean: {mean}")  # 77.0

# ── Variance (avg squared deviation from mean) ────
variance = sum((x - mean)**2 for x in scores) / len(scores)
print(f"Variance: {variance:.1f}")  # 107.8

# ── Std Dev (sqrt of variance — same units as data)
std_dev = variance ** 0.5
print(f"Std Dev: {std_dev:.1f}")  # 10.4

# ── Z-score: how far is 95 from the mean? ─────────
z = (95 - mean) / std_dev
print(f"Z-score of 95: {z:.2f} std above mean")  # 1.73

# ── ML: StandardScaler does this for every feature ─
scaled = [(x - mean) / std_dev for x in scores]
print(f"Scaled[0]: {scaled[0]:.2f}")  # -1.64 (60 is below mean)
Mean → Variance → Std Dev → Z-score → StandardScaler. This is the exact sequence sklearn uses. StandardScaler().fit_transform(X) runs lines 4, 8, 12, and 20 for every feature column simultaneously. After scaling, every feature has mean=0 and std=1.
Chapter 3 of 4
03
Probability
The language of uncertainty. Every classification model output is a probability. Understanding this means understanding what your model is actually saying.
Understanding Probability
Probability measures how likely an event is — a number between 0 and 1. Zero means impossible. One means certain. Everything in between is uncertainty.
🎲
P(event) = favourable / total
Count how many ways an event can happen, divide by all possible outcomes. Result is always between 0 and 1. Multiply by 100 for a percentage.
P(pass) = students_who_passed / total
P(pass) = 80 / 100 = 0.80 = 80%

P(email is spam) = 0.95 → 95% likely spam
P(rain tomorrow) = 0.30 → 30% chance
ML Connection: When Logistic Regression outputs 0.87 for "spam", it means 87% probability this is spam. Every classifier outputs probabilities — not just yes or no. You choose a threshold to convert to a decision.
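Converting probabilities to decisions is one line (the probabilities below are made up for illustration; 0.5 is the conventional default threshold):

```python
probs = [0.95, 0.30, 0.87, 0.05, 0.62]  # P(spam) for five emails
threshold = 0.5

labels = ["spam" if p >= threshold else "not spam" for p in probs]
print(labels)  # ['spam', 'not spam', 'spam', 'not spam', 'spam']
```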
📋
Conditional Probability
P(A|B) = probability of A given B has happened. Foundation of Naive Bayes classifiers used for spam detection and text classification.
P(spam | contains "prize") = 0.92
P(pass | attendance > 80%) = 0.88
🔔
Normal Distribution
Much natural data follows a bell curve — values cluster near the mean. Many ML algorithms assume normally distributed features as input.
68% within 1 std dev
95% within 2 std dev
99.7% within 3 std dev
Probability Analogy
Synkoc Instructor Analogy
"Every morning you check the weather forecast. It says 70% chance of rain. That 70% is a probability — it does not guarantee rain, it tells you how confident the model is based on historical patterns. Machine learning classification works identically. Your spam detector does not say 'this is definitely spam' — it says 'I am 94% confident this is spam based on patterns from thousands of examples'. Probability is how your model expresses confidence. You set a threshold to convert that confidence into a decision."
🤖
In Every Model You Build at Synkoc
In Week 3, model.predict_proba(X_test) returns a probability for every prediction. You choose a threshold — typically 0.5 — above which you classify as positive. Raise it to 0.9 and you only flag high-confidence predictions. Lower it to 0.3 and you cast a wider net. Understanding probability means you can tune this intelligently.
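A quick sketch of how the threshold changes what gets flagged (the predicted probabilities are hypothetical, standing in for a predict_proba output):

```python
# Hypothetical P(positive) for eight predictions
probs = [0.97, 0.91, 0.85, 0.60, 0.55, 0.40, 0.20, 0.08]

for threshold in (0.3, 0.5, 0.9):
    flagged = sum(p >= threshold for p in probs)
    print(f"threshold {threshold}: {flagged}/{len(probs)} flagged")
# threshold 0.3: 6/8 flagged  (wider net)
# threshold 0.5: 5/8 flagged  (default)
# threshold 0.9: 2/8 flagged  (high confidence only)
```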
Types of Probability — Joint, Marginal, Conditional
Three types of probability appear constantly in machine learning. Understanding each one is essential for understanding how classifiers make decisions.
📊
Joint Probability P(A and B)
The probability that BOTH events happen together. In ML: the probability that a customer is both young AND buys premium.
# Out of 1000 customers:
# Young: 400   Premium: 200
# Young AND Premium: 80

p_young = 400 / 1000    # 0.40
p_premium = 200 / 1000  # 0.20
p_both = 80 / 1000      # 0.08

print(f"P(young and premium) = {p_both}")
🎯
Conditional Probability P(A|B)
The probability of A GIVEN THAT B has already happened. Read the | as "given". In ML: the probability of spam GIVEN the email contains "free money".
# P(premium | young) =
# P(young AND premium) / P(young)
# = 0.08 / 0.40 = 0.20

p_premium_given_young = p_both / p_young
print(f"P(premium|young) = {p_premium_given_young}")
# 20% of young customers buy premium
ML Connection: Every classifier outputs conditional probability. When a spam filter says 95% probability of spam, it means P(spam | this email's features) = 0.95. Naive Bayes classifier is built entirely on conditional probability. Logistic Regression outputs conditional probability using the sigmoid function.
Bayes Theorem — Update Beliefs with Evidence
The most important equation in machine learning after the chain rule. Bayes theorem tells you how to update your probability estimate when you get new evidence.
💡 Real Life Analogy — The Medical Test
You test positive for a rare disease that affects 1 in 1000 people. The test is 99% accurate. What is the probability you actually have the disease? Your intuition says 99%. Bayes theorem says only about 9%. Why? Because the disease is so rare that most positive tests are false positives. Bayes theorem lets you combine your prior knowledge — the disease is rare — with the new evidence — the test result — to get the correct updated probability. This is exactly what a Naive Bayes spam classifier does with every email it sees.
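The medical-test analogy, worked through with Bayes' theorem:

```python
p_disease = 1 / 1000          # prior: disease affects 1 in 1000 people
p_pos_given_disease = 0.99    # sensitivity: test is 99% accurate
p_pos_given_healthy = 0.01    # false positive rate

# Marginal: P(positive) = P(pos|disease)·P(disease) + P(pos|healthy)·P(healthy)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior: P(disease | positive) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # 0.090 — only ~9%!
```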
🧠
P(A|B) = P(B|A) × P(A) / P(B)
P(A) = Prior — your belief before seeing evidence. P(B|A) = Likelihood — probability of evidence given hypothesis. P(B) = Marginal — total probability of the evidence. P(A|B) = Posterior — updated belief after seeing evidence.
# Spam classifier using Bayes theorem
p_spam = 0.30 # Prior: 30% of emails are spam
p_free_spam = 0.80 # "free" appears in 80% of spam
p_free = 0.25 # "free" appears in 25% of all emails

# P(spam | email contains "free")
p_spam_given_free = (p_free_spam * p_spam) / p_free
print(f"P(spam|free) = {p_spam_given_free:.2f}") # 0.96
ML Connection: This is the exact calculation a Naive Bayes classifier runs for every word in every email. In Week 3 you will implement a spam classifier using sklearn's MultinomialNB which applies this exact formula to classify emails in milliseconds.
Chapter 4 of 4
04
Correlation
How do two variables move together? Correlation reveals feature relationships — the foundation of feature selection in machine learning.
Understanding Correlation
Correlation measures the strength and direction of the linear relationship between two variables. Pearson r ranges from -1 to +1.
🔗
r = -1 to 0 to +1
r = +1: perfect positive — as X increases, Y increases. r = -1: perfect negative — as X increases, Y decreases. r = 0: no linear relationship. Use |r| for strength regardless of direction.
r = +0.92 → Strong positive (study hours vs score)
r = -0.78 → Strong negative (absences vs grade)
r = +0.12 → Weak — likely noise, consider removing
r = 0.00 → No relationship at all
Feature Selection: Before training, compute r between every feature and the target. Keep features with |r| > 0.5. Remove features near 0 — they add noise, not signal. Also remove redundant features that are highly correlated with each other.
correlation_demo.py · ● LIVE
# Pearson Correlation Coefficient from scratch
study_hours = [2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 58, 65, 72, 78, 85, 91]

def pearson_r(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx)**2 for xi in x) ** 0.5
    sy = sum((yi - my)**2 for yi in y) ** 0.5
    return round(num / (sx * sy), 3)

r = pearson_r(study_hours, exam_scores)
print(f"Pearson r = {r}")  # 0.999 — very strong positive

# Interpretation
if r > 0.7:
    print("Strong positive correlation")
elif r > 0.4:
    print("Moderate correlation")
elif r > 0.0:
    print("Weak positive correlation")
else:
    print("Negative correlation")
This is EDA Step 1 in every ML project. In Week 2, df.corr() runs this formula for every feature pair simultaneously. Features with |r| > 0.7 against your target are strong predictors to keep. Features with |r| > 0.9 with each other are redundant — drop one. This one analysis guides your entire feature selection strategy.
Correlation ≠ Causation — The Most Important Warning in Data Science
Two variables can be highly correlated without one causing the other. Confusing correlation with causation is one of the most costly mistakes in data science.
💡 Famous Spurious Correlations
Ice cream sales and drowning deaths are highly correlated — but eating ice cream does not cause drowning. The hidden cause is summer heat, which increases both. Nicolas Cage movies released per year correlates 0.67 with swimming pool drownings. Countries with more televisions per household have lower birth rates — but giving people TVs does not reduce births. In each case, a third variable — called a confounding variable — causes both. In ML, if you build a model on spurious correlations it will fail catastrophically on new data.
🟢
True Causation
A causes B when changing A directly produces a change in B, all else being equal. Only established through controlled experiments — A/B tests in ML.
# Causal: study hours → exam score
# Experiment: randomly assign
# students to study 2h vs 4h
# Measure score difference
# → Controlled experiment proves causation

study_2h = [65, 68, 70, 72, 74]
study_4h = [78, 80, 82, 85, 88]
diff = sum(study_4h)/5 - sum(study_2h)/5
print(f"Effect of 2 extra hours: +{diff:.1f}")
🔴
Spurious Correlation
High correlation with no causal link. A third hidden variable causes both. Watch for this in any dataset with many features.
# Spurious: shoe size correlated
# with reading ability in children
# Real cause: AGE causes both
# Older children have bigger feet
# AND read better

# In ML: always ask WHY two
# features are correlated before
# using correlation to select features
correlation_ml.py · ● LIVE
# Correlation matrix — used in every EDA in data science
dataset = {
    "study_hours": [2, 3, 4, 5, 6, 7, 8],
    "exam_score":  [50, 58, 65, 72, 78, 85, 91],
    "sleep_hours": [8, 7, 7, 6, 6, 5, 5],
}

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    dx = sum((xi - mx)**2 for xi in x) ** 0.5
    dy = sum((yi - my)**2 for yi in y) ** 0.5
    return round(num / (dx * dy), 3)

r1 = pearson_r(dataset["study_hours"], dataset["exam_score"])
r2 = pearson_r(dataset["sleep_hours"], dataset["exam_score"])
print(f"study→score: r={r1}")  # strong positive
print(f"sleep→score: r={r2}")  # strong negative
This is EDA step 1 in every ML project. In Week 2 you will run df.corr(), which computes this entire correlation matrix in one line. Features with |r| > 0.8 against the target are strong predictors. Features with |r| > 0.9 with each other are redundant — keep only one.
All Concepts Together — Mini EDA Report
mini_eda_synkoc.py · Complete EDA
import math

hours = [2, 4, 6, 8, 10, 3, 7, 5, 9, 1]
scores = [35, 60, 72, 88, 97, 45, 82, 68, 93, 30]

def mean(d):
    return sum(d) / len(d)

def std(d):
    m = mean(d)
    return math.sqrt(mean([(x - m)**2 for x in d]))

print("=== EDA Report ===")
print(f"Hours — Mean:{mean(hours):.1f} | Std:{std(hours):.2f}")
print(f"Scores — Mean:{mean(scores):.1f} | Std:{std(scores):.2f}")

passing = [s for s in scores if s >= 60]
print(f"Pass rate: {len(passing)/len(scores)*100:.1f}%")
This mini EDA computes mean and std dev for both variables and calculates the pass rate as a probability. In Week 2, df.describe() in Pandas produces all these statistics in one line — but now you understand exactly what each number means.
Lesson Summary
You have completed Statistics for Data Science. Here is what you can now do:
📊
Mean, Median, Mode
Calculate all three and know when to use each. Mean for symmetric data, median when outliers exist, mode for categories and class labels.
📏
Variance & Std Dev
Calculate spread from scratch. Understand that the bias-variance tradeoff in ML — overfitting vs underfitting — is built on this exact concept.
🎲
Probability
Understand P(event) = favourable/total. Know that every classifier outputs a probability, and that you choose a threshold to convert it to a decision.
🔗
Correlation
Interpret r from -1 to +1. Use correlation for feature selection. Features with |r| near 0 are noise — remove them before training your ML model.
📊
Week 1 Theory Complete!
Both modules done. Open the Statistics Practical Lab to practise. Complete the lab and quiz, then move on to Week 2: NumPy, Pandas, Data Visualisation & EDA.
✅ Video — Done
✏️ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023