Descriptive Statistics#
Data Representation Revision#
Theory#
Descriptive statistics builds upon data representation concepts, using organized data displays to calculate summary measures that describe the main features of a dataset. Understanding data organization is essential for accurate statistical calculations.
In a frequency table, \(x_i\) represents the values or categories and \(f_i\) represents their frequencies.
Application#
Examples#
Example 1: Calculating Statistics from Frequency Tables#
Let’s calculate basic summary statistics from grouped data.
Data: Test scores with frequencies: 60-69 (3), 70-79 (7), 80-89 (12), 90-99 (8)
Method 1: Find Modal Class and Total
\(\text{Total students} = 3 + 7 + 12 + 8 = 30 \quad \text{(sum all frequencies)}\)
\(\text{Modal class} = 80-89 \quad \text{(highest frequency of 12)}\)
\(\text{Relative frequency of modal class} = \frac{12}{30} = 0.4 \quad \text{(40\% of data)}\)
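The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the variable names are chosen here and are not part of the original example.

```python
# Class labels and frequencies from the test-score example
classes = ["60-69", "70-79", "80-89", "90-99"]
freqs = [3, 7, 12, 8]

total = sum(freqs)                      # sum of all frequencies
modal_index = freqs.index(max(freqs))   # position of the highest frequency
modal_class = classes[modal_index]      # class with the highest frequency
rel_freq = freqs[modal_index] / total   # share of the data in the modal class

print(total, modal_class, rel_freq)
```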
Descriptive Statistics#
Theory#
Foundational Definitions: Descriptive statistics summarize and describe the main features of a dataset through numerical measures and graphical representations. These statistics provide insight into the center, spread, and shape of data distributions.
Measures of Central Tendency:
Mean (Arithmetic Average): The sum of all values divided by the number of values
• Properties: Affected by extreme values (outliers)
• Uses all data values in calculation
• For grouped data: \(\bar{x} = \frac{\sum f_i x_i}{\sum f_i}\), where \(x_i\) is the class midpoint
• Weighted mean: \(\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}\), where \(w_i\) are the weights
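As a rough sketch, the grouped-data mean can be computed for the test-score frequency table (60-69, 70-79, 80-89, 90-99 with frequencies 3, 7, 12, 8). Taking each midpoint as the average of the stated class limits is an assumption made here for illustration.

```python
# Class limits and frequencies from the test-score table
limits = [(60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [3, 7, 12, 8]

# Assumption: midpoint = average of the stated lower and upper class limits
midpoints = [(lo + hi) / 2 for lo, hi in limits]

# Grouped mean: sum(f_i * x_i) / sum(f_i)
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

print(round(grouped_mean, 2))
```

The weighted mean has the same form: replace the frequencies with the weights \(w_i\).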
Median: The middle value when data is arranged in order
• For odd n: Median = value at position \(\frac{n+1}{2}\)
• For even n: Median = average of values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\)
• Properties: Not affected by extreme values
• For grouped data: \(\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times h\), where L = lower boundary of the median class, F = cumulative frequency before the median class, f = frequency of the median class, h = class width
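The grouped-data median formula can be sketched as follows. The function name `grouped_median` is hypothetical, and the class boundaries 59.5, 69.5, 79.5, 89.5 (width 10) are an assumption about how the test-score classes 60-69, ..., 90-99 would be made continuous.

```python
def grouped_median(lower_bounds, freqs, width):
    """Estimate the median as L + ((n/2 - F) / f) * h."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # F: cumulative frequency before the current class
    for L, f in zip(lower_bounds, freqs):
        if cumulative + f >= n / 2:   # first class that reaches n/2
            return L + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")

# Assumed class boundaries for the test-score table
median_estimate = grouped_median([59.5, 69.5, 79.5, 89.5], [3, 7, 12, 8], 10)
print(round(median_estimate, 2))
```

Here the median class is 80-89, since the cumulative frequency first reaches n/2 = 15 in that class.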
Mode: The most frequently occurring value(s)
• Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)
• For grouped data: the modal class is the class with the highest frequency
• Estimated mode: \(\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times h\), where L = lower boundary of the modal class, \(d_1\) = modal class frequency minus the previous class frequency, \(d_2\) = modal class frequency minus the next class frequency, h = class width
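A minimal sketch of the estimated-mode formula follows. The function name `grouped_mode` is hypothetical, the class boundaries are assumed as 59.5, 69.5, 79.5, 89.5 for the test-score table, and treating a missing neighbouring class as frequency 0 is a choice made here, not something stated in the text.

```python
def grouped_mode(lower_bounds, freqs, width):
    """Estimate the mode as L + d1 / (d1 + d2) * h."""
    i = freqs.index(max(freqs))                      # index of the modal class
    prev_f = freqs[i - 1] if i > 0 else 0            # assumption: 0 if no previous class
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - prev_f   # modal frequency minus previous class frequency
    d2 = freqs[i] - next_f   # modal frequency minus next class frequency
    return lower_bounds[i] + d1 / (d1 + d2) * width

mode_estimate = grouped_mode([59.5, 69.5, 79.5, 89.5], [3, 7, 12, 8], 10)
print(round(mode_estimate, 2))
```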
Measures of Spread (Dispersion):
Range: The difference between maximum and minimum values
• Simple but affected by extreme values
• Interquartile range (IQR): \(IQR = Q_3 - Q_1\) (more robust)
Variance: Average squared deviation from the mean
Population variance: \[\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}\]
Sample variance: \[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
• Alternative formula: \(\sigma^2 = \frac{\sum x_i^2}{n} - \mu^2\)
• For grouped data: \(\sigma^2 = \frac{\sum f_i(x_i - \bar{x})^2}{\sum f_i}\)
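The difference between the population and sample divisors, and the grouped-data form, can be sketched like this. The helper names `variance` and `grouped_variance` are hypothetical; the raw data is the five-value set 4, 7, 9, 10, 14 from Example 2, and the grouped data is the test-score table with midpoints assumed to be 64.5, 74.5, 84.5, 94.5.

```python
def variance(data, sample=False):
    """Average squared deviation; divide by n - 1 for a sample, n for a population."""
    n = len(data)
    mean = sum(data) / n
    squared_devs = sum((x - mean) ** 2 for x in data)
    return squared_devs / (n - 1) if sample else squared_devs / n

def grouped_variance(midpoints, freqs):
    """Population variance from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    return sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / n

data = [4, 7, 9, 10, 14]
pop_var = variance(data)                 # divides by n
samp_var = variance(data, sample=True)   # divides by n - 1
grp_var = grouped_variance([64.5, 74.5, 84.5, 94.5], [3, 7, 12, 8])

print(round(pop_var, 2), round(samp_var, 2), round(grp_var, 2))
```

The sample variance is always the larger of the two for the same data, because the same sum of squared deviations is divided by a smaller number.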
Standard Deviation: Square root of variance
• Properties: Same units as the original data
• For approximately normal data, about 68% of values lie within 1 standard deviation of the mean
• About 95% lie within 2 standard deviations
• About 99.7% lie within 3 standard deviations
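The 68%, 95%, and 99.7% figures can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the dictionary name `shares` is chosen here for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share, 4))
```

The rounded values are approximately 0.6827, 0.9545, and 0.9973, which is where the rule-of-thumb percentages come from.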
Measures of Position:
Quartiles: Divide ordered data into four equal parts
• \(Q_1\) (First quartile): 25th percentile
• \(Q_2\) (Second quartile): 50th percentile (median)
• \(Q_3\) (Third quartile): 75th percentile
• Position formula: \(Q_k\) position = \(\frac{k(n+1)}{4}\) in the ordered data (one common convention; software packages may use others)
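One way to apply the position formula, with linear interpolation when the position falls between two values, is sketched below. The function name `quartile` is hypothetical, and the data is the ordered ten-value set from Example 1 of this section.

```python
def quartile(sorted_data, k):
    """Q_k using position k(n+1)/4 (1-based) with linear interpolation."""
    n = len(sorted_data)
    pos = k * (n + 1) / 4
    lower = int(pos)        # whole part of the position
    frac = pos - lower      # fractional part used for interpolation
    if lower < 1:           # position falls before the first value
        return sorted_data[0]
    if lower >= n:          # position falls at or after the last value
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

ordered = [12, 15, 15, 15, 18, 18, 19, 20, 22, 25]
q1, q2, q3 = (quartile(ordered, k) for k in (1, 2, 3))
print(q1, q2, q3)
```

Clamping to the first or last value when the position runs off the ends is a design choice in this sketch; it only matters for very small datasets.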
Percentiles: Divide data into 100 equal parts
Box Plots (Box-and-Whisker Plots): Visual summary of five-number summary
• Five-number summary: Minimum, \(Q_1\), Median, \(Q_3\), Maximum
• Shows outliers: values below \(Q_1 - 1.5 \times IQR\) or above \(Q_3 + 1.5 \times IQR\)
• Indicates skewness through asymmetry of the box and whiskers
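The 1.5×IQR rule can be sketched as follows, using the thirteen values and the quartiles \(Q_1 = 29\), \(Q_3 = 46.5\) from Example 3 of this section. The helper `find_outliers` and the extra value 90 are hypothetical, added only to show a value being flagged.

```python
data = [23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55]
q1, q3 = 29, 46.5            # quartiles taken from Example 3

iqr = q3 - q1                # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values lying outside the two fences."""
    return [v for v in values if v < lower_fence or v > upper_fence]

print(lower_fence, upper_fence)
print(find_outliers(data))           # original data
print(find_outliers(data + [90]))    # with a hypothetical extreme value
```

With fences at 2.75 and 72.75, none of the original values is flagged, while the hypothetical 90 is.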
Coefficient of Variation: Relative measure of variability
• Formula: \(CV = \frac{\sigma}{\mu} \times 100\%\) (use \(s\) and \(\bar{x}\) for a sample)
• Useful for comparing variability between datasets with different units or scales
• Dimensionless measure
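A short sketch of why the coefficient of variation is scale-free, using the standard definition (standard deviation divided by the mean). The function name `coefficient_of_variation` is hypothetical; the data is the five-value set from Example 2, and the factor of 100 stands in for a hypothetical change of units.

```python
import statistics

def coefficient_of_variation(data):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(data) / statistics.mean(data)

data = [4, 7, 9, 10, 14]
scaled = [x * 100 for x in data]   # hypothetical change of units

cv = coefficient_of_variation(data)
cv_scaled = coefficient_of_variation(scaled)

print(f"{cv:.1%}", f"{cv_scaled:.1%}")
```

Both datasets give the same percentage, even though their standard deviations differ by a factor of 100.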
Application#
Examples#
Example 1: Calculating Mean, Median, and Mode#
Solve: Find the mean, median, and mode for the dataset: 12, 15, 18, 15, 22, 15, 19, 20, 18, 25
Method 1: Direct Calculation
\(\text{Ordered data: } 12, 15, 15, 15, 18, 18, 19, 20, 22, 25 \quad \text{(arrange in order)}\)
\(\bar{x} = \frac{12 + 15 + 15 + 15 + 18 + 18 + 19 + 20 + 22 + 25}{10} = \frac{179}{10} = 17.9 \quad \text{(mean)}\)
\(\text{Median} = \frac{18 + 18}{2} = 18 \quad \text{(average of 5th and 6th values)}\)
\(\text{Mode} = 15 \quad \text{(appears 3 times, most frequent)}\)
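The hand calculation can be cross-checked with Python's standard `statistics` module. This is only a verification sketch; `multimode` is used so that ties, if any, would be reported as a list.

```python
import statistics

data = [12, 15, 18, 15, 22, 15, 19, 20, 18, 25]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # average of the two middle values (even n)
modes = statistics.multimode(data)  # list of the most frequent value(s)

print(mean, median, modes)
```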
Example 2: Computing Standard Deviation#
Solve: Calculate the population standard deviation for: 4, 7, 9, 10, 14
Method 1: Definition Formula
\(\mu = \frac{4 + 7 + 9 + 10 + 14}{5} = \frac{44}{5} = 8.8 \quad \text{(calculate mean first)}\)
\(\sum(x_i - \mu)^2 = (4-8.8)^2 + (7-8.8)^2 + (9-8.8)^2 + (10-8.8)^2 + (14-8.8)^2\)
\(= 23.04 + 3.24 + 0.04 + 1.44 + 27.04 = 54.8 \quad \text{(sum of squared deviations)}\)
\(\sigma = \sqrt{\frac{54.8}{5}} = \sqrt{10.96} \approx 3.31 \quad \text{(population standard deviation)}\)
Method 2: Alternative Formula
\(\sum x_i^2 = 16 + 49 + 81 + 100 + 196 = 442 \quad \text{(sum of squares)}\)
\(\sigma^2 = \frac{442}{5} - (8.8)^2 = 88.4 - 77.44 = 10.96 \quad \text{(variance)}\)
\(\sigma = \sqrt{10.96} \approx 3.31 \quad \text{(confirms our result)}\)
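Both methods can be reproduced in Python and compared with `statistics.pstdev` from the standard library. The variable names are chosen here for illustration.

```python
import math
import statistics

data = [4, 7, 9, 10, 14]
n = len(data)
mu = sum(data) / n

# Method 1: definition formula
sigma_def = math.sqrt(sum((x - mu) ** 2 for x in data) / n)

# Method 2: alternative formula (mean of squares minus square of mean)
sigma_alt = math.sqrt(sum(x * x for x in data) / n - mu ** 2)

# Library cross-check (population standard deviation)
sigma_lib = statistics.pstdev(data)

print(round(sigma_def, 2), round(sigma_alt, 2), round(sigma_lib, 2))
```

In floating-point arithmetic the alternative formula subtracts two nearly equal numbers when the mean is large relative to the spread, so the definition formula is usually the safer choice in code.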
Example 3: Finding Quartiles and Creating Box Plot#
Solve: Find the five-number summary for: 23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55
Method 1: Position Formula
\(n = 13 \quad \text{(count data points)}\)
\(Q_1 \text{ position} = \frac{1(13+1)}{4} = 3.5 \quad \text{(between 3rd and 4th values)}\)
\(Q_1 = \frac{28 + 30}{2} = 29 \quad \text{(interpolate)}\)
\(Q_2 \text{ (median) position} = \frac{2(13+1)}{4} = 7 \quad \text{(7th value)}\)
\(Q_2 = 38 \quad \text{(middle value)}\)
\(Q_3 \text{ position} = \frac{3(13+1)}{4} = 10.5 \quad \text{(between 10th and 11th values)}\)
\(Q_3 = \frac{45 + 48}{2} = 46.5 \quad \text{(interpolate)}\)
\(\text{Five-number summary: Min}=23, Q_1=29, \text{Median}=38, Q_3=46.5, \text{Max}=55\)
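As a cross-check, `statistics.quantiles` in the Python standard library (Python 3.8+) with its default `method="exclusive"` interpolates at positions based on \(n + 1\), which agrees with the position formula used here. Other tools, and the `"inclusive"` method, may return slightly different quartiles for the same data.

```python
import statistics

data = [23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55]

# Three cut points dividing the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)
```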
Key Takeaways#
Important
Mean, median, and mode: Each measures center differently; choose based on the shape of the distribution and the presence of outliers
Standard deviation: Measures typical distance from the mean; use with mean for symmetric data
Quartiles and IQR: Robust measures that are largely unaffected by outliers; use with median for skewed data
Variance uses squared units: Take the square root to get the standard deviation, which is in the same units as the data
Coefficient of variation: Enables comparison of variability across different scales or units
Box plots: Visualize five-number summary and identify outliers using 1.5×IQR rule
Sample vs population: Use n-1 for sample variance/standard deviation, n for population
Grouped data formulas: Use class midpoints and frequencies when raw data unavailable