Synkoc Data Science Internship · Week 1 · Lesson 2 of 11
Statistics for Data Science
The mathematics that powers every ML algorithm — Mean, Variance, Probability & Correlation explained clearly with real AI applications.
📊 Mean & Median
📏 Variance
🎲 Probability
🔗 Correlation
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏱ ~55 minutes
🟢 Beginner Friendly
Why Statistics Powers ML
Before a model can learn, we must understand the data it will train on. Statistics gives us the tools to describe, summarise, and find patterns. Every ML algorithm is built on statistical foundations.
📊
Mean
Centre of data. Used in normalisation, loss functions, and model evaluation.
ML: feature scaling, loss
📏
Variance & Std Dev
How spread out data is. The bias-variance tradeoff in ML is built on this concept.
ML: overfitting detection
🎲
Probability
Likelihood of events. Every classification model outputs a probability between 0 and 1.
ML: classification output
🔗
Correlation
How two variables move together. Correlation analysis is a core tool for feature selection.
ML: feature selection
Chapter 1 of 4
01
Mean, Median & Mode
Three ways to measure the centre of data. The foundation of every summary statistic you will compute on a real dataset.
Mean, Median & Mode
Three measures of central tendency — each answers "what is the typical value?" in a different way. Know all three and when to use each one.
Mean (Average)
Add all values, divide by count. Most common measure. Sensitive to outliers — one extreme value can make it misleading.
scores = [70, 80, 90, 60, 85]
mean = 385 / 5 = 77.0
⚡ ML: np.mean(), loss calculation, normalisation
📍
Median (Middle Value)
Sort, pick the middle value. Not affected by outliers — reliable for skewed data like salaries and house prices.
sorted: [60, 70, 80, 85, 90]
median = 80 (middle value)
⚡ ML: robust imputation, skewed features
🏆
Mode (Most Frequent)
The value that appears most often. The only measure that works on categorical data — labels, colours, categories.
labels = ["A","B","A","C","A"]
mode = "A" (appears 3 times)
⚡ ML: class imbalance, majority baseline
💡
When to use which?
Use mean for symmetric numeric data. Use median when outliers exist. Use mode for categories and class labels.
House prices → MEDIAN
Exam scores → MEAN
Eye colour → MODE
⚡ Always check for outliers first
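The decision framework above can be checked in a few lines with Python's statistics module (the sample values are illustrative):

```python
import statistics

exam_scores  = [70, 80, 90, 60, 85]           # roughly symmetric → mean
house_prices = [50, 55, 60, 65, 70, 75, 500]  # outlier → median
eye_colours  = ["brown", "blue", "brown", "green", "brown"]  # categorical → mode

print(sum(exam_scores) / len(exam_scores))  # 77.0
print(statistics.median(house_prices))      # 65
print(statistics.mode(eye_colours))         # brown
```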
Median & Mode — When Mean Fails
The mean is not always the best measure. Outliers can destroy it. Median and Mode are the alternatives used in real data science work.
💡 Real Life Analogy — The Salary Problem
Imagine 9 employees earning ₹30,000 per month and 1 CEO earning ₹10,00,000. The mean salary is ₹1,27,000 — but 9 out of 10 people earn far less than that. The mean is misleading because one extreme value — the outlier — pulled it up. The median, which is the middle value when sorted, gives ₹30,000 — which actually represents what a typical employee earns. This is why news reports about income inequality use median salary, not mean salary.
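The salary example above, worked through in plain Python:

```python
# 9 employees at ₹30,000 and 1 CEO at ₹10,00,000
salaries = [30_000] * 9 + [10_00_000]

mean_salary = sum(salaries) / len(salaries)
sorted_s = sorted(salaries)
median_salary = (sorted_s[4] + sorted_s[5]) / 2  # even count: average the middle two

print(mean_salary)    # 127000.0 — pulled up by one outlier
print(median_salary)  # 30000.0 — what a typical employee earns
```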
📈
Median — The Middle Value
Sort all values. Take the middle one. If even count, average the two middle values. Not affected by outliers. Use for skewed data like house prices, salaries, income.
scores = [45, 60, 62, 70, 75,
          78, 80, 82, 95, 99]
# sorted, 10 values (even count)
# median = (75 + 78) / 2 = 76.5

import statistics
print(statistics.median(scores))  # 76.5
🏆
Mode — The Most Frequent Value
The value that appears most often. Use for categorical data — most common label, most frequent category, most purchased product.
grades = ["A", "B", "A", "C", "A",
          "B", "A", "C", "B", "A"]
# A appears 5 times — most frequent

import statistics
print(statistics.mode(grades))  # A

# ML: most frequent class label
# used in majority-class classifier
ML Connection: When you call df["salary"].describe() in Pandas you see the mean and the 50th percentile (the median). For skewed columns, missing values are typically imputed with the median: df.fillna(df.median()). This is one of the most common imputation strategies in real ML projects.
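In code, the median-imputation pattern looks roughly like this (a minimal sketch assuming pandas is installed; the salary values are made up):

```python
import pandas as pd

# A skewed salary column with one missing value and one outlier
df = pd.DataFrame({"salary": [30_000, 32_000, None, 35_000, 10_00_000]})

print(df["salary"].describe())  # count, mean, 50% (the median), ...

# Impute the missing value with the median — robust to the outlier
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df["salary"].tolist())    # NaN replaced by 33500.0
```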
Mean vs Median vs Mode — How to Choose
Choosing the wrong measure gives you the wrong insight. Use this decision framework for every dataset you encounter.
⚖️
Use Mean when...
Data is roughly symmetric with no extreme outliers. Works perfectly for exam scores, temperature readings, model accuracy values.
accuracies = [0.87, 0.89, 0.91,
              0.88, 0.90, 0.92]
# Symmetric — no outliers
# Mean is appropriate here
mean_acc = sum(accuracies) / len(accuracies)
print(f"Mean accuracy: {mean_acc:.3f}")
⚠️
Use Median when...
Data is skewed or has outliers. House prices, salaries, time spent on a website, transaction amounts — all have outliers. Always use median.
house_prices = [50, 55, 60, 65,
                70, 75, 500]  # outlier!
# Mean = 125 (misleading!)
# Median = 65 (representative)
import statistics
print(statistics.median(house_prices))  # 65
central_tendency_ml.py
# In ML: always check for skewness before choosing an imputation strategy
import statistics as stats

ages = [22, 25, 27, 28, 30, 32, 85]  # 85 is an outlier
print(f"Mean: {stats.mean(ages):.1f}")     # 35.6 — pulled up by the outlier
print(f"Median: {stats.median(ages):.1f}") # 28.0 — representative

# Rule: if mean >> median → data is right-skewed → use the median for imputation
# Rule: if mean ≈ median → data is symmetric → the mean is fine
Chapter 2 of 4
02
Variance & Standard Deviation
Measure how spread out data is. The foundation of the bias-variance tradeoff — the most important concept in ML engineering.
Variance & Standard Deviation
Mean tells you the centre. Variance tells you the spread. Two datasets can have the same mean but completely different variance — changing everything in ML.
📏
Variance = Average Squared Distance from Mean
For each value: subtract the mean and square the result. Average all those squared differences. Square root of variance = standard deviation — in the same units as your original data.
scores = [70, 80, 90, 60, 85]
mean = 77.0
diffs = [-7, 3, 13, -17, 8]
sq_diffs = [49, 9, 169, 289, 64]
variance = (49+9+169+289+64) / 5 = 116.0
std_dev = sqrt(116) ≈ 10.77
ML: High model variance = overfitting. High bias = underfitting. The bias-variance tradeoff is the #1 engineering challenge in machine learning.
What Spread Looks Like
Two classes with almost the same mean score (≈75) — but their spreads are completely different, and this changes how a model treats them.
Class A — Low Variance
σ² ≈ 4.6 · Consistent results
Scores: 72, 74, 76, 77, 78
✅ Easy for ML — predictable
Class B — High Variance
σ² = 580 · Highly inconsistent
Scores: 40, 60, 75, 90, 110
⚠️ Harder for ML — noisy data
🤖
Why this matters in your ML projects
In Pandas (Week 2), a column with std dev near zero is useless — every row has almost the same value. A column with very high std dev may need normalisation before training. Always check variance before modelling.
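That pre-modelling check can be sketched in plain Python (the column names and values here are illustrative, not from a real dataset):

```python
import statistics

features = {
    "constant_flag": [1, 1, 1, 1, 1, 1],       # std ≈ 0 — carries no signal
    "age":           [22, 25, 31, 28, 45, 38],
    "income":        [20_000, 85_000, 30_000, 120_000, 55_000, 44_000],
}

for name, col in features.items():
    std = statistics.pstdev(col)               # population std dev
    verdict = "drop (near-zero std)" if std < 1e-9 else "keep"
    print(f"{name}: std={std:.1f} -> {verdict}")
```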
Standard Deviation — Spread in Original Units
Variance is in squared units — hard to interpret. Standard deviation is the square root of variance, giving spread in the original units. This is what you use in practice.
💡 Real Life Analogy — The Tailor
A tailor making school uniforms knows the average height of students is 155 cm. But she also needs to know how much heights vary. Variance would give her something in cm squared — useless for cutting fabric. Standard deviation gives her the answer in centimetres: heights typically vary by plus or minus 8 cm from the mean. She now knows to make sizes from 147 cm to 163 cm to cover most students. Standard deviation gives you the spread in the same units as your data — directly usable.
📏
std = sqrt(variance) — spread in original units
If exam scores have a mean of 70 and a standard deviation of 10, you immediately know: most students scored between 60 and 80. About 68% of all values in a normal distribution fall within one standard deviation of the mean. About 95% fall within two standard deviations.
scores = [60, 65, 68, 70, 72, 75, 78, 80, 85, 90]

mean = sum(scores) / len(scores)  # 74.3
var = sum((x - mean)**2 for x in scores) / len(scores)
std = var ** 0.5  # 8.7

print(f"Mean: {mean:.1f} Std: {std:.1f}")
print(f"Typical range: {mean-std:.1f} to {mean+std:.1f}")
ML Connection: StandardScaler in sklearn subtracts the mean and divides by standard deviation for every feature. This is called standardisation. It puts all features on the same scale so that a feature with large values like salary does not dominate a feature with small values like age.
Normal Distribution — The Bell Curve
The most important distribution in statistics and machine learning. Many natural phenomena follow this pattern — heights, exam scores, measurement errors, model prediction errors.
💡 The 68-95-99.7 Rule — Memorise This
In any normal distribution: 68% of values fall within 1 standard deviation of the mean. 95% fall within 2 standard deviations. 99.7% fall within 3 standard deviations. This means if exam scores have mean 70 and std 10: 68% of students scored between 60 and 80. 95% scored between 50 and 90. Only 0.3% scored below 40 or above 100 — the extreme outliers.
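You can verify the 68-95-99.7 rule directly with Python's statistics.NormalDist — no simulation needed:

```python
from statistics import NormalDist

nd = NormalDist(mu=70, sigma=10)  # exam scores: mean 70, std 10

for k in (1, 2, 3):
    # probability mass between (mean - k·std) and (mean + k·std)
    p = nd.cdf(70 + k * 10) - nd.cdf(70 - k * 10)
    print(f"within {k} std: {p:.1%}")
# within 1 std: 68.3%
# within 2 std: 95.4%
# within 3 std: 99.7%
```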
📈
Why Normal Distribution Matters in ML
Many ML algorithms assume features are normally distributed. Linear regression assumes normally distributed errors. Naive Bayes assumes normally distributed features. When your data is skewed, you transform it to make it more normal.
import statistics
scores = [55, 60, 65, 68, 70, 72,
          75, 78, 80, 85, 90, 95]
mean = statistics.mean(scores)
std = statistics.stdev(scores)
print(f"Mean: {mean:.1f}")
print(f"Std: {std:.1f}")
⚖️
Z-Score — How Far from Mean
A z-score tells you how many standard deviations a value is from the mean. Z-score of 0 = at the mean. Z-score of 2 = 2 std above mean (top 2.5%). Used for outlier detection.
def z_score(value, mean, std):
    return (value - mean) / std

# score of 95 in a class
# mean=70, std=10
z = z_score(95, 70, 10)
print(f"Z-score: {z}")  # 2.5
# 2.5 std above average — roughly the top 0.6%
ML Connection: StandardScaler converts every feature value to its z-score. After scaling, every feature has mean=0 and std=1. This is why standardised features make gradient descent converge much faster in neural networks — all features are on the same z-score scale.
mean_variance_demo.py · ● LIVE
scores = [60, 65, 70, 72, 75, 78, 80, 85, 90, 95]

# ── Mean ──────────────────────────────────────────
mean = sum(scores) / len(scores)
print(f"Mean: {mean}")  # 77.0

# ── Variance (avg squared deviation from mean) ────
variance = sum((x - mean)**2 for x in scores) / len(scores)
print(f"Variance: {variance:.1f}")  # 107.8

# ── Std Dev (sqrt of variance — same units as data)
std_dev = variance ** 0.5
print(f"Std Dev: {std_dev:.1f}")  # 10.4

# ── Z-score: how far is 95 from the mean? ─────────
z = (95 - mean) / std_dev
print(f"Z-score of 95: {z:.2f} std above mean")  # 1.73

# ── ML: StandardScaler does this for every feature ─
scaled = [(x - mean) / std_dev for x in scores]
print(f"Scaled[0]: {scaled[0]:.2f}")  # -1.64 (60 is below mean)
Mean → Variance → Std Dev → Z-score → StandardScaler. This is the exact sequence sklearn uses. StandardScaler().fit_transform(X) runs lines 4, 8, 12, and 20 for every feature column simultaneously. After scaling, every feature has mean=0 and std=1.
Chapter 3 of 4
03
Probability
The language of uncertainty. Every classification model output is a probability. Understanding this means understanding what your model is actually saying.
Understanding Probability
Probability measures how likely an event is — a number between 0 and 1. Zero means impossible. One means certain. Everything in between is uncertainty.
🎲
P(event) = favourable / total
Count how many ways an event can happen, divide by all possible outcomes. Result is always between 0 and 1. Multiply by 100 for a percentage.
P(pass) = students_who_passed / total
P(pass) = 80 / 100 = 0.80 = 80%

P(email is spam) = 0.95 → 95% likely spam
P(rain tomorrow) = 0.30 → 30% chance
ML Connection: When Logistic Regression outputs 0.87 for "spam", it means 87% probability this is spam. Every classifier outputs probabilities — not just yes or no. You choose a threshold to convert to a decision.
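Converting probabilities to decisions is one line (the probabilities below are made up for illustration; 0.5 is the conventional default threshold):

```python
probs = [0.95, 0.30, 0.87, 0.05, 0.62]  # P(spam) for five emails
threshold = 0.5

labels = ["spam" if p >= threshold else "not spam" for p in probs]
print(labels)  # ['spam', 'not spam', 'spam', 'not spam', 'spam']
```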
📋
Conditional Probability
P(A|B) = probability of A given B has happened. Foundation of Naive Bayes classifiers used for spam detection and text classification.
P(spam | contains "prize") = 0.92
P(pass | attendance > 80%) = 0.88
🔔
Normal Distribution
Much natural data follows a bell curve — values cluster near the mean. Many ML algorithms assume normally distributed features as input.
68% within 1 std dev
95% within 2 std dev
99.7% within 3 std dev
Probability Analogy
Synkoc Instructor Analogy
"Every morning you check the weather forecast. It says 70% chance of rain. That 70% is a probability — it does not guarantee rain, it tells you how confident the model is based on historical patterns. Machine learning classification works identically. Your spam detector does not say 'this is definitely spam' — it says 'I am 94% confident this is spam based on patterns from thousands of examples'. Probability is how your model expresses confidence. You set a threshold to convert that confidence into a decision."
🤖
In Every Model You Build at Synkoc
In Week 3, model.predict_proba(X_test) returns a probability for every prediction. You choose a threshold — typically 0.5 — above which you classify as positive. Raise it to 0.9 and you only flag high-confidence predictions. Lower it to 0.3 and you cast a wider net. Understanding probability means you can tune this intelligently.
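A quick sketch of how the threshold changes what gets flagged (the predicted probabilities are hypothetical, standing in for a predict_proba output):

```python
# Hypothetical P(positive) for eight predictions
probs = [0.97, 0.91, 0.85, 0.60, 0.55, 0.40, 0.20, 0.08]

for threshold in (0.3, 0.5, 0.9):
    flagged = sum(p >= threshold for p in probs)
    print(f"threshold {threshold}: {flagged}/{len(probs)} flagged")
# threshold 0.3: 6/8 flagged  (wider net)
# threshold 0.5: 5/8 flagged  (default)
# threshold 0.9: 2/8 flagged  (high confidence only)
```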
Types of Probability — Joint, Marginal, Conditional
Three types of probability appear constantly in machine learning. Understanding each one is essential for understanding how classifiers make decisions.
📊
Joint Probability P(A and B)
The probability that BOTH events happen together. In ML: the probability that a customer is both young AND buys premium.
# Out of 1000 customers:
# Young: 400   Premium: 200
# Young AND Premium: 80

p_young = 400 / 1000    # 0.40
p_premium = 200 / 1000  # 0.20
p_both = 80 / 1000      # 0.08

print(f"P(young and premium) = {p_both}")
🎯
Conditional Probability P(A|B)
The probability of A GIVEN THAT B has already happened. Read the | as "given". In ML: the probability of spam GIVEN the email contains "free money".
# P(premium | young) =
# P(young AND premium) / P(young)
# = 0.08 / 0.40 = 0.20

p_premium_given_young = p_both / p_young
print(f"P(premium|young) = {p_premium_given_young}")
# 20% of young customers buy premium
ML Connection: Every classifier outputs conditional probability. When a spam filter says 95% probability of spam, it means P(spam | this email's features) = 0.95. Naive Bayes classifier is built entirely on conditional probability. Logistic Regression outputs conditional probability using the sigmoid function.
Bayes Theorem — Update Beliefs with Evidence
The most important equation in machine learning after the chain rule. Bayes theorem tells you how to update your probability estimate when you get new evidence.
💡 Real Life Analogy — The Medical Test
You test positive for a rare disease that affects 1 in 1000 people. The test is 99% accurate. What is the probability you actually have the disease? Your intuition says 99%. Bayes theorem says only about 9%. Why? Because the disease is so rare that most positive tests are false positives. Bayes theorem lets you combine your prior knowledge — the disease is rare — with the new evidence — the test result — to get the correct updated probability. This is exactly what a Naive Bayes spam classifier does with every email it sees.
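The medical-test analogy, worked through with Bayes' theorem:

```python
p_disease = 1 / 1000          # prior: disease affects 1 in 1000 people
p_pos_given_disease = 0.99    # sensitivity: test is 99% accurate
p_pos_given_healthy = 0.01    # false positive rate

# Marginal: P(positive) = P(pos|disease)·P(disease) + P(pos|healthy)·P(healthy)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior: P(disease | positive) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # 0.090 — only ~9%!
```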
🧠
P(A|B) = P(B|A) × P(A) / P(B)
P(A) = Prior — your belief before seeing evidence. P(B|A) = Likelihood — probability of evidence given hypothesis. P(B) = Marginal — total probability of the evidence. P(A|B) = Posterior — updated belief after seeing evidence.
# Spam classifier using Bayes theorem
p_spam = 0.30 # Prior: 30% of emails are spam
p_free_spam = 0.80 # "free" appears in 80% of spam
p_free = 0.25 # "free" appears in 25% of all emails

# P(spam | email contains "free")
p_spam_given_free = (p_free_spam * p_spam) / p_free
print(f"P(spam|free) = {p_spam_given_free:.2f}") # 0.96
ML Connection: This is the exact calculation a Naive Bayes classifier runs for every word in every email. In Week 3 you will implement a spam classifier using sklearn's MultinomialNB which applies this exact formula to classify emails in milliseconds.
Chapter 4 of 4
04
Correlation
How do two variables move together? Correlation reveals feature relationships — the foundation of feature selection in machine learning.
Understanding Correlation
Correlation measures the strength and direction of the linear relationship between two variables. Pearson r ranges from -1 to +1.
🔗
r = -1 to 0 to +1
r = +1: perfect positive — as X increases, Y increases. r = -1: perfect negative — as X increases, Y decreases. r = 0: no linear relationship. Use |r| for strength regardless of direction.
r = +0.92 → Strong positive (study hours vs score)
r = -0.78 → Strong negative (absences vs grade)
r = +0.12 → Weak — likely noise, consider removing
r = 0.00 → No relationship at all
Feature Selection: Before training, compute r between every feature and the target. Keep features with |r| > 0.5. Remove features near 0 — they add noise, not signal. Also remove redundant features that are highly correlated with each other.
correlation_demo.py · ● LIVE
# Pearson Correlation Coefficient from scratch
study_hours = [2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 58, 65, 72, 78, 85, 91]

def pearson_r(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx)**2 for xi in x) ** 0.5
    sy = sum((yi - my)**2 for yi in y) ** 0.5
    return round(num / (sx * sy), 3)

r = pearson_r(study_hours, exam_scores)
print(f"Pearson r = {r}")  # 0.999 — very strong positive

# Interpretation
if r > 0.7:
    print("Strong positive correlation")
elif r > 0.4:
    print("Moderate correlation")
elif r > 0.0:
    print("Weak positive correlation")
else:
    print("Negative correlation")
This is EDA Step 1 in every ML project. In Week 2, df.corr() runs this formula for every feature pair simultaneously. Features with |r| > 0.7 against your target are strong predictors to keep. Features with |r| > 0.9 with each other are redundant — drop one. This one analysis guides your entire feature selection strategy.
Correlation ≠ Causation — The Most Important Warning in Data Science
Two variables can be highly correlated without one causing the other. Confusing correlation with causation is one of the most costly mistakes in data science.
💡 Famous Spurious Correlations
Ice cream sales and drowning deaths are highly correlated — but eating ice cream does not cause drowning. The hidden cause is summer heat, which increases both. Nicolas Cage movies released per year correlates 0.67 with swimming pool drownings. Countries with more televisions per household have lower birth rates — but giving people TVs does not reduce births. In each case, a third variable — called a confounding variable — causes both. In ML, if you build a model on spurious correlations it will fail catastrophically on new data.
🟢
True Causation
A causes B when changing A directly produces a change in B, all else being equal. Only established through controlled experiments — A/B tests in ML.
# Causal: study hours → exam score
# Experiment: randomly assign
# students to study 2h vs 4h
# Measure score difference
# → Controlled experiment proves causation

study_2h = [65, 68, 70, 72, 74]
study_4h = [78, 80, 82, 85, 88]
diff = sum(study_4h)/5 - sum(study_2h)/5
print(f"Effect of 2 extra hours: +{diff:.1f}")
🔴
Spurious Correlation
High correlation with no causal link. A third hidden variable causes both. Watch for this in any dataset with many features.
# Spurious: shoe size correlated
# with reading ability in children
# Real cause: AGE causes both
# Older children have bigger feet
# AND read better

# In ML: always ask WHY two
# features are correlated before
# using correlation to select features
correlation_ml.py · ● LIVE
# Correlation matrix — used in every EDA in data science
dataset = {
    "study_hours": [2, 3, 4, 5, 6, 7, 8],
    "exam_score":  [50, 58, 65, 72, 78, 85, 91],
    "sleep_hours": [8, 7, 7, 6, 6, 5, 5],
}

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    dx = sum((xi - mx)**2 for xi in x) ** 0.5
    dy = sum((yi - my)**2 for yi in y) ** 0.5
    return round(num / (dx * dy), 3)

r1 = pearson_r(dataset["study_hours"], dataset["exam_score"])
r2 = pearson_r(dataset["sleep_hours"], dataset["exam_score"])
print(f"study→score: r={r1}")  # strong positive
print(f"sleep→score: r={r2}")  # strong negative
This is EDA step 1 in every ML project. In Week 2 you will run df.corr(), which computes this entire correlation matrix in one line. Features with |r| > 0.8 against the target are strong predictors. Features with |r| > 0.9 with each other are redundant — keep only one.
All Concepts Together — Mini EDA Report
mini_eda_synkoc.py · Complete EDA
import math

hours = [2, 4, 6, 8, 10, 3, 7, 5, 9, 1]
scores = [35, 60, 72, 88, 97, 45, 82, 68, 93, 30]

def mean(d):
    return sum(d) / len(d)

def std(d):
    m = mean(d)
    return math.sqrt(mean([(x - m)**2 for x in d]))

print("=== EDA Report ===")
print(f"Hours — Mean:{mean(hours):.1f} | Std:{std(hours):.2f}")
print(f"Scores — Mean:{mean(scores):.1f} | Std:{std(scores):.2f}")

passing = [s for s in scores if s >= 60]
print(f"Pass rate: {len(passing)/len(scores)*100:.1f}%")
This mini EDA computes mean and std dev for both variables and calculates the pass rate as a probability. In Week 2, df.describe() in Pandas produces all these statistics in one line — but now you understand exactly what each number means.
Lesson Summary
You have completed Statistics for Data Science. Here is what you can now do:
📊
Mean, Median, Mode
Calculate all three and know when to use each. Mean for symmetric data, median when outliers exist, mode for categories and class labels.
📏
Variance & Std Dev
Calculate spread from scratch. Understand that the bias-variance tradeoff in ML — overfitting vs underfitting — is built on this exact concept.
🎲
Probability
Understand P(event) = favourable/total. Know that every classifier outputs a probability, and that you choose a threshold to convert it to a decision.
🔗
Correlation
Interpret r from -1 to +1. Use correlation for feature selection. Features with |r| near 0 are noise — remove them before training your ML model.
📊
Week 1 Theory Complete!
Both modules done. Open the Statistics Practical Lab to practise. Complete the lab and quiz, then move on to Week 2: NumPy, Pandas, Data Visualisation & EDA.
✅ Video — Done
✏️ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023