Synkoc Data Science Internship · Week 1 · Lesson 2 of 11
Statistics
& Probability
Welcome to Lesson 2 of the Synkoc Data Science Internship: Statistics and Probability. If Python is the tool of a data scientist, statistics is the language. Every machine learning algorithm is, under the hood, a statistical procedure.
Descriptive Stats
Distributions
Hypothesis Testing
Correlation
🧑💻
Synkoc Instructor
Data Science Internship · Bangalore
⏱ ~65:00
📗 Lesson 2 of 11
This lesson covers five essential areas of statistics that appear directly in your day-to-day work as a data scientist.
Descriptive statistics for summarising and understanding data — mean, median, mode, variance, standard deviation. Probability distributions — normal, binomial, Poisson — which model the randomness in your data. Hypothesis testing — the framework for deciding if a result is real or just noise — used in A/B testing at every technology company. Correlation — measuring relationships between variables, which drives feature selection. And statistical pitfalls — the ways statistics can mislead and how to avoid them.
Chapter 1 of 5
01
Descriptive Stats
Chapter 1: Descriptive Statistics. The first question about any dataset is: what does it look like? Descriptive statistics answer this question numerically. The mean is the arithmetic average of the values — the sum divided by the count.
Here is the critical insight about mean versus median in data science
Imagine salaries at a company: 30000, 35000, 40000, 42000, 45000, 50000, 60000, and one executive earning 2000000. The mean salary is approximately 287,000 — which does not represent a single actual employee. The median salary is 43,500 — which accurately represents a typical employee. This difference appears constantly in real data: house prices, transaction amounts, website visit durations, income data. When you describe a dataset to a stakeholder, always report both mean and median. A large gap between them signals skewness and outliers.
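The salary example above can be checked in a few lines with Python's standard library:

```python
# The salary example from the lesson, using only the standard library.
import statistics

salaries = [30_000, 35_000, 40_000, 42_000, 45_000, 50_000, 60_000, 2_000_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the one executive outlier
median_salary = statistics.median(salaries)  # robust to the outlier

print(f"mean:   {mean_salary:,.0f}")    # 287,750
print(f"median: {median_salary:,.0f}")  # 43,500
```

The mean (287,750) describes no actual employee, while the median (43,500) describes a typical one — exactly the gap that signals skewness.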
Variance measures how spread out values are from the mean — it is the average of the squared deviations from the mean.
Standard deviation is the square root of variance — it brings the measurement back to the original units of the data. The z-score transforms any value to units of standard deviations from the mean: z equals x minus mean divided by standard deviation. Z-scores allow comparison across datasets with different scales. A z-score of 2 means the value is 2 standard deviations above the mean — occurring in approximately 2.3 percent of a normal distribution. Z-scores above 3 are typically considered outliers.
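A minimal z-score sketch of the formula above (the sample values are invented for illustration):

```python
# z = (x - mean) / standard deviation, applied to every value in a list.
import statistics

values = [12, 15, 14, 13, 16, 14, 15, 90]  # 90 is an obvious outlier

mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation

z_scores = [(x - mu) / sigma for x in values]

# Note: in a sample this small, the extreme value itself inflates sigma,
# shrinking its own z-score — so a looser threshold (|z| > 2) is used here.
outliers = [x for x, z in zip(values, z_scores) if abs(z) > 2]
print(outliers)  # [90]
```

With large datasets the usual threshold of |z| > 3 from the lesson applies; the masking effect shown in the comment is why small samples need care.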
Chapter 2 of 5
02
Distributions
Chapter 2: Probability Distributions. A probability distribution describes the likelihood of every possible value of a random variable. Understanding distributions is essential for choosing models and statistical tests that match your data.
The 68-95-99.7 rule is one of the most useful facts in all of statistics
In a normal distribution, 68 percent of values fall within one standard deviation of the mean, 95 percent within two standard deviations, and 99.7 percent within three standard deviations. This means that a value more than three standard deviations from the mean occurs in only 0.3 percent of cases — this is the basis for z-score outlier detection. In practice: if you compute a z-score for every value in a column, any value with an absolute z-score greater than three is a statistical outlier worth investigating.
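The 68-95-99.7 rule can be verified empirically by sampling from a standard normal distribution (seeded here so the numbers are reproducible):

```python
# Empirical check of the 68-95-99.7 rule using the standard library.
import random

random.seed(0)
n = 100_000
samples = [random.gauss(0, 1) for _ in range(n)]

def fraction_within(k):
    """Fraction of samples within k standard deviations of the mean (0)."""
    return sum(abs(x) < k for x in samples) / n

print(fraction_within(1))  # ≈ 0.68
print(fraction_within(2))  # ≈ 0.95
print(fraction_within(3))  # ≈ 0.997
```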
The Central Limit Theorem is the foundation of inferential statistics
It states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the underlying population distribution. This is profound: even if your data follows a completely non-normal distribution — like transaction amounts which are heavily right-skewed — the mean of many samples from that data will still be normally distributed. This is why so many statistical tests that assume normality still work correctly in practice with large enough samples.
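The Central Limit Theorem can be demonstrated with a simulation: draw many samples from a heavily right-skewed distribution (exponential, standing in for transaction amounts) and look at the distribution of their means.

```python
# CLT sketch: sample means from a skewed distribution are approximately normal.
import random
import statistics

random.seed(42)

def sample_mean(n):
    # expovariate(1) draws from an exponential distribution: mean 1, std 1,
    # heavily right-skewed — nothing like a normal distribution.
    return statistics.mean(random.expovariate(1) for _ in range(n))

means = [sample_mean(50) for _ in range(2000)]

# The 2000 sample means cluster around the population mean (1.0) with a
# spread close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.141.
print(statistics.mean(means), statistics.stdev(means))
```

Plotting a histogram of `means` would show the familiar bell shape, even though the underlying exponential data is nothing like a bell.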
Chapter 3 of 5
03
Hypothesis Testing
Chapter 3: Hypothesis Testing. Hypothesis testing is the framework for making data-driven decisions under uncertainty. The core question: is what I observe in the data real, or could it have arisen by chance?
The p-value is one of the most misunderstood concepts in all of science
A p-value of 0.05 does NOT mean there is a 5 percent probability that the null hypothesis is true. It means: if the null hypothesis were true, you would see data this extreme in only 5 percent of random experiments. If p is below your significance threshold — typically 0.05 — you reject the null hypothesis and conclude the effect is statistically significant. In industry, A/B testing at companies like Flipkart, Swiggy, and every major technology company is hypothesis testing at scale — running controlled experiments on millions of users to decide whether a new feature truly improves the product.
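An A/B-test p-value can be sketched with a permutation test, which directly implements the definition above: assume the null hypothesis (group labels don't matter), reshuffle the labels many times, and count how often a difference as extreme as the observed one arises by chance. The conversion counts here are invented for illustration.

```python
# Permutation-test sketch of an A/B test, standard library only.
import random

random.seed(1)

# Group A: 200 users, 30 conversions. Group B: 200 users, 46 conversions.
a = [1] * 30 + [0] * 170
b = [1] * 46 + [0] * 154
observed_diff = sum(b) / len(b) - sum(a) / len(a)  # 0.23 - 0.15 = 0.08

# Under the null hypothesis the A/B labels are arbitrary, so shuffle the
# pooled outcomes and re-split them, many times.
pooled = a + b
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[200:]) / 200 - sum(pooled[:200]) / 200
    if abs(diff) >= observed_diff:
        extreme += 1

p_value = extreme / trials
print(p_value)  # roughly 0.04: unlikely under the null, so B looks genuinely better
```

In practice a library test (e.g. a two-proportion z-test) gives the same answer analytically, but the permutation version makes the meaning of the p-value concrete.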
Chapter 4 of 5
04
Correlation
Chapter 4: Correlation and Relationships. Correlation measures the linear relationship between two variables. The Pearson correlation coefficient ranges from negative 1 to positive 1: values near positive 1 indicate a strong positive linear relationship, values near negative 1 a strong negative one, and values near 0 little or no linear relationship.
The most important warning in statistics: correlation is not causation
Ice cream sales and drowning rates are strongly correlated — both increase in summer. But eating ice cream does not cause drowning. Hot weather causes both. This is a confounding variable. In data science, finding a correlation between two features does not mean one causes the other. You may observe that customers who use the mobile app more frequently also spend more. But it could be that high-spending customers are more engaged generally, not that the app causes spending. Always think carefully about causal mechanisms before making business recommendations from correlations.
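The confounding pattern above can be reproduced in simulation: generate two series that are both driven by temperature but have no causal link to each other, and watch a strong correlation appear anyway. All data here is invented.

```python
# Confounding sketch: temperature drives both series, so they correlate
# with each other despite no causal relationship between them.
import random
import statistics

random.seed(7)

def pearson(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

temps = [random.uniform(10, 40) for _ in range(200)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temps]   # driven by temperature
drownings = [0.5 * t + random.gauss(0, 3) for t in temps]   # also driven by temperature

r = pearson(ice_cream, drownings)
print(round(r, 2))  # strongly positive, despite zero causal link
```

Removing the shared driver (e.g. correlating the two series within a single fixed temperature) would make the correlation vanish — the signature of a confounder.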
Statistical thinking in practice
When you receive a new dataset, statistics is the first tool you reach for. Compute the mean and median of every numeric column — a large gap reveals skewness. Compute the standard deviation — high std relative to the mean suggests high variability that may make modelling harder. Check for outliers using z-scores. Plot histograms to see the distribution shape. Compute the correlation matrix and rank features by their correlation with the target variable. This statistical exploration, which takes 20 to 30 minutes with Pandas and Seaborn, determines your entire modelling strategy: which features to keep, which to transform, and which to drop.
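The first-look routine described above can be sketched in Pandas. The DataFrame and column names here are placeholders standing in for whatever dataset you receive (plotting with Seaborn is omitted).

```python
# First-look statistical exploration of a dataset with Pandas.
import numpy as np
import pandas as pd

# Placeholder data: "income" is right-skewed (lognormal), like real income data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(10, 1, 500),
    "age": rng.normal(40, 12, 500),
    "target": rng.normal(0, 1, 500),
})

numeric = df.select_dtypes("number")

# 1. Mean vs median per column: a large gap flags skewness.
summary = numeric.agg(["mean", "median", "std"]).T

# 2. Z-score outliers per column (|z| > 3).
z = (numeric - numeric.mean()) / numeric.std()
outlier_counts = (z.abs() > 3).sum()

# 3. Features ranked by absolute correlation with the target.
corr_with_target = (
    numeric.corr()["target"].drop("target").abs().sort_values(ascending=False)
)

print(summary)
print(outlier_counts)
print(corr_with_target)
```

For the skewed `income` column, `summary` shows the mean well above the median — the same signature as the salary example from Chapter 1.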
Lesson 2 complete
Statistics is the foundation everything else in data science is built on. You now understand: descriptive statistics — mean, median, mode, variance, standard deviation. The normal distribution and the 68-95-99.7 rule for outlier detection. The Central Limit Theorem and why it matters for sampling and inference. Hypothesis testing, p-values, and the framework used for A/B testing at every major technology company. Correlation, how to measure it, and the essential warning that correlation is not causation. The practical lab has 6 hands-on exercises computing these statistics from scratch in Python.