Descriptive Statistics#

Data Representation Revision#

Theory#

Descriptive statistics builds upon data representation concepts, using organized data displays to calculate summary measures that describe the main features of a dataset. Understanding data organization is essential for accurate statistical calculations.

\[\text{Frequency Distribution} = \{(x_i, f_i) : i = 1, 2, ..., k\}\]

where $x_i$ represents values/categories and $f_i$ represents their frequencies.

\[\text{Total Frequency} = n = \sum_{i=1}^{k} f_i\]

Application#

Examples#

Example 1: Calculating Statistics from Frequency Tables#

Let’s calculate basic summary statistics from grouped data.

Data: Test scores with frequencies: 60-69 (3), 70-79 (7), 80-89 (12), 90-99 (8)

Method 1: Find Modal Class and Total

$\text{Total students} = 3 + 7 + 12 + 8 = 30 \quad \text{(sum all frequencies)}$

$\text{Modal class} = \text{80-89} \quad \text{(highest frequency of 12)}$

$\text{Relative frequency of modal class} = \frac{12}{30} = 0.4 \quad \text{(40\% of data)}$
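The same tallies can be reproduced in a few lines of Python. This is a minimal sketch; the dictionary `freq` and the other variable names are illustrative choices, not part of the original example.

```python
# Frequency table from the example: class label -> frequency
freq = {"60-69": 3, "70-79": 7, "80-89": 12, "90-99": 8}

n = sum(freq.values())                          # total frequency
modal_class = max(freq, key=freq.get)           # class with the highest frequency
relative = {c: f / n for c, f in freq.items()}  # relative frequency of each class

print(n, modal_class, relative[modal_class])
```

The relative frequencies sum to 1, which is a quick sanity check on any frequency table.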



Descriptive Statistics#

Theory#

Foundational Definitions: Descriptive statistics summarize and describe the main features of a dataset through numerical measures and graphical representations. These statistics provide insight into the center, spread, and shape of data distributions.

Measures of Central Tendency:

Mean (Arithmetic Average): The sum of all values divided by the number of values

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}\]

• Properties: Affected by extreme values (outliers)
• Uses all data values in the calculation
• For grouped data: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$ where $x_i$ is the class midpoint
• Weighted mean: $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$ where $w_i$ are weights
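The grouped-data mean is a weighted mean in which the class midpoints play the role of the values and the frequencies play the role of the weights. The sketch below illustrates this with the test-score classes 60-69, 70-79, 80-89, 90-99; the midpoints 64.5, 74.5, 84.5, 94.5 and the function names are assumptions made for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Grouped mean = weighted mean of class midpoints, weighted by frequency.
midpoints = [64.5, 74.5, 84.5, 94.5]   # assumed midpoints of the four classes
freqs = [3, 7, 12, 8]
grouped_mean = weighted_mean(midpoints, freqs)

print(round(grouped_mean, 2))
```

Because the midpoints stand in for the unknown raw scores, the result is an estimate of the mean rather than its exact value.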

Median: The middle value when data is arranged in order

• For odd n: Median = value at position $\frac{n+1}{2}$
• For even n: Median = average of values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$
• Properties: Not affected by extreme values
• For grouped data: $\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times h$ where L = lower boundary of median class, F = cumulative frequency before median class, f = frequency of median class, h = class width
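Both median rules can be sketched in Python. The function names are hypothetical, and the class boundaries 59.5, 69.5, 79.5, 89.5 used in the demonstration are an assumption (the test-score classes with a continuity correction); the grouped version also assumes equal class widths.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # 1-based position (n + 1) / 2
    return (s[mid - 1] + s[mid]) / 2   # 1-based positions n/2 and n/2 + 1


def grouped_median(lower_boundaries, freqs, width):
    """Estimate L + (n/2 - F) / f * h for equal-width classes."""
    n = sum(freqs)
    cumulative = 0                     # F: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cumulative + f >= n / 2:   # first class to reach n/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency table")


boundaries = [59.5, 69.5, 79.5, 89.5]  # assumed lower class boundaries
estimate = grouped_median(boundaries, [3, 7, 12, 8], 10)
print(round(estimate, 2))
```

Here the median class is 80-89, since the cumulative frequency first reaches n/2 = 15 inside it (F = 10, f = 12).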

Mode: The most frequently occurring value(s)

• Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)
• For grouped data: The modal class is the class with the highest frequency
• Estimated mode: $\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times h$ where L = lower boundary of the modal class, h = class width, $d_1$ = modal class frequency minus the frequency of the previous class, $d_2$ = modal class frequency minus the frequency of the next class
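A short sketch of both ideas follows. The function names are hypothetical, the class boundaries are the same assumed values 59.5, 69.5, 79.5, 89.5 for the test-score classes, and a missing neighbouring class is treated as having frequency 0, which is a convention chosen for this example.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def grouped_mode(lower_boundaries, freqs, width):
    """Estimate L + d1 / (d1 + d2) * h for the modal class."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    prev_f = freqs[m - 1] if m > 0 else 0                # assumed 0 outside the table
    next_f = freqs[m + 1] if m + 1 < len(freqs) else 0
    d1 = freqs[m] - prev_f
    d2 = freqs[m] - next_f
    return lower_boundaries[m] + d1 / (d1 + d2) * width


print(modes([1, 1, 2, 2, 3]))
print(round(grouped_mode([59.5, 69.5, 79.5, 89.5], [3, 7, 12, 8], 10), 2))
```

When every value occurs exactly once, `modes` returns all of them; a caller following the "no mode" convention would need to treat that case separately.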

Measures of Spread (Dispersion):

Range: The difference between maximum and minimum values

\[\text{Range} = x_{max} - x_{min}\]

• Simple but affected by extreme values
• Interquartile range (IQR): $IQR = Q_3 - Q_1$ (more robust)

Variance: Average squared deviation from the mean

Population variance:

\[\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}\]

Sample variance:

\[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]

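• Alternative formula (population): $\sigma^2 = \frac{\sum x_i^2}{n} - \mu^2$
• For grouped data (population form): $\sigma^2 = \frac{\sum f_i(x_i - \bar{x})^2}{\sum f_i}$ where $x_i$ is the class midpoint and $\bar{x}$ is the grouped mean; for a sample, divide by $\sum f_i - 1$ instead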
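The three variance formulas can be compared side by side. This is a minimal sketch: the function names are hypothetical and the dataset is made up for illustration (its mean is 5).

```python
def population_variance(values):
    """Mean squared deviation from the mean (divide by n)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Same sum of squared deviations, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def population_variance_shortcut(values):
    """Alternative formula: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
print(population_variance(data), population_variance_shortcut(data))
print(sample_variance(data))
```

The shortcut is convenient by hand, but in floating-point arithmetic it can lose precision when the values are large relative to their spread, so the definition formula is usually the safer choice in code.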

Standard Deviation: Square root of variance

\[\sigma = \sqrt{\sigma^2} \quad \text{(population)}\]

\[s = \sqrt{s^2} \quad \text{(sample)}\]

• Properties: Same units as original data
• Empirical rule (approximately normal data only): about 68% of values lie within 1 standard deviation of the mean
• About 95% lie within 2 standard deviations
• About 99.7% lie within 3 standard deviations
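The 68%, 95%, and 99.7% figures come from the normal distribution, where the probability of falling within k standard deviations of the mean equals erf(k/√2). The sketch below evaluates that expression with the standard library; `normal_within` is a hypothetical helper name.

```python
import math


def normal_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_within(k), 4))
```

For data that are clearly skewed or heavy-tailed these percentages need not hold.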

Measures of Position:

Quartiles: Divide ordered data into four equal parts

• $Q_1$ (First quartile): 25th percentile
• $Q_2$ (Second quartile): 50th percentile (median)
• $Q_3$ (Third quartile): 75th percentile
• Position formula: $Q_k$ position = $\frac{k(n+1)}{4}$

Percentiles: Divide data into 100 equal parts

\[P_k \text{ position} = \frac{k(n+1)}{100}\]
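The position formulas give a 1-based position that may be fractional, in which case the value is interpolated between its two neighbours. A possible sketch follows; the function names are hypothetical, the dataset is made up, and clamping positions that fall outside 1..n is an assumption of this example rather than part of the formula.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    n = len(sorted_values)
    position = min(max(position, 1), n)   # assumption: clamp to the range 1..n
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower == n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)


def percentile(values, k):
    """k-th percentile using the position formula k * (n + 1) / 100."""
    s = sorted(values)
    return value_at_position(s, k * (len(s) + 1) / 100)


data = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # hypothetical data, n = 9
print(percentile(data, 25), percentile(data, 50))
```

Other textbooks and software packages use slightly different position rules, so small differences in quartile values between sources are expected.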

Box Plots (Box-and-Whisker Plots): Visual summary of five-number summary

• Minimum, $Q_1$, Median, $Q_3$, Maximum
• Shows outliers: values below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$
• Indicates skewness through asymmetry
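The 1.5×IQR rule can be sketched as two small helpers. The function names, the quartile values, and the sample values below are all illustrative assumptions.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Values lying strictly outside the 1.5 * IQR fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Illustrative quartiles: Q1 = 29, Q3 = 46.5, so IQR = 17.5
print(outlier_fences(29, 46.5))
print(find_outliers([23, 55, 80, 1], 29, 46.5))
```

In a box plot, the whiskers typically extend to the most extreme data values that are still inside these fences, and the outliers are drawn as individual points.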

Coefficient of Variation: Relative measure of variability

\[CV = \frac{s}{\bar{x}} \times 100\%\]

• Useful for comparing variability between datasets with different units or scales
• Dimensionless measure
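As a sketch of how the coefficient of variation allows a comparison across units, the example below uses Python's `statistics` module on two made-up samples; the data and the helper name `coefficient_of_variation` are hypothetical.

```python
import statistics


def coefficient_of_variation(sample):
    """CV = s / x_bar * 100, using the sample standard deviation."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


heights_cm = [160, 165, 170, 175, 180]   # hypothetical heights
weights_kg = [55, 62, 70, 78, 85]        # hypothetical weights

print(round(coefficient_of_variation(heights_cm), 2))
print(round(coefficient_of_variation(weights_kg), 2))
```

The standard deviations (in cm and kg) cannot be compared directly, but the two percentages can: in this made-up data the weights vary more, relative to their mean, than the heights do.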


Application#

Examples#

Example 1: Calculating Mean, Median, and Mode#

Solve: Find the mean, median, and mode for the dataset: 12, 15, 18, 15, 22, 15, 19, 20, 18, 25

Method 1: Direct Calculation

$\text{Ordered data: } 12, 15, 15, 15, 18, 18, 19, 20, 22, 25 \quad \text{(arrange in order)}$

$\bar{x} = \frac{12 + 15 + 15 + 15 + 18 + 18 + 19 + 20 + 22 + 25}{10} = \frac{179}{10} = 17.9 \quad \text{(mean)}$

$\text{Median} = \frac{18 + 18}{2} = 18 \quad \text{(average of 5th and 6th values)}$

$\text{Mode} = 15 \quad \text{(appears 3 times, most frequent)}$
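The hand calculation can be cross-checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary.

```python
import statistics

data = [12, 15, 18, 15, 22, 15, 19, 20, 18, 25]

mean = statistics.mean(data)
median = statistics.median(data)      # averages the two middle values when n is even
modes = statistics.multimode(data)    # returns a list, so ties would all be reported

print(mean, median, modes)
```

The mean (17.9) sits just below the median (18), so the single large value 25 is not pulling the mean upward here.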

Example 2: Computing Standard Deviation#

Solve: Calculate the population standard deviation for: 4, 7, 9, 10, 14

Method 1: Definition Formula

$\mu = \frac{4 + 7 + 9 + 10 + 14}{5} = \frac{44}{5} = 8.8 \quad \text{(calculate mean first)}$

$\sum(x_i - \mu)^2 = (4-8.8)^2 + (7-8.8)^2 + (9-8.8)^2 + (10-8.8)^2 + (14-8.8)^2$

$= 23.04 + 3.24 + 0.04 + 1.44 + 27.04 = 54.8 \quad \text{(sum of squared deviations)}$

$\sigma = \sqrt{\frac{54.8}{5}} = \sqrt{10.96} \approx 3.31 \quad \text{(population standard deviation)}$

Method 2: Alternative Formula

$\sum x_i^2 = 16 + 49 + 81 + 100 + 196 = 442 \quad \text{(sum of squares)}$

$\sigma^2 = \frac{442}{5} - (8.8)^2 = 88.4 - 77.44 = 10.96 \quad \text{(variance)}$

$\sigma = \sqrt{10.96} \approx 3.31 \quad \text{(confirms our result)}$
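As a cross-check, the standard `statistics` module provides both the population and the sample versions. The sketch below also computes the sample standard deviation, which the example does not ask for, to show how much the divisor matters for a dataset this small.

```python
import statistics

data = [4, 7, 9, 10, 14]

variance = statistics.pvariance(data)   # population variance, divides by n
sigma = statistics.pstdev(data)         # population standard deviation
s = statistics.stdev(data)              # sample standard deviation, divides by n - 1

print(round(variance, 2), round(sigma, 2), round(s, 2))
```

With only five values, dividing by n - 1 = 4 instead of n = 5 raises the standard deviation noticeably (from about 3.31 to about 3.70).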

Example 3: Finding Quartiles and Creating Box Plot#

Solve: Find the five-number summary for: 23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55

Method 1: Position Formula

$n = 13 \quad \text{(count data points)}$

$Q_1 \text{ position} = \frac{1(13+1)}{4} = 3.5 \quad \text{(between 3rd and 4th values)}$

$Q_1 = \frac{28 + 30}{2} = 29 \quad \text{(interpolate)}$

$Q_2 \text{ (median) position} = \frac{2(13+1)}{4} = 7 \quad \text{(7th value)}$

$Q_2 = 38 \quad \text{(middle value)}$

$Q_3 \text{ position} = \frac{3(13+1)}{4} = 10.5 \quad \text{(between 10th and 11th values)}$

$Q_3 = \frac{45 + 48}{2} = 46.5 \quad \text{(interpolate)}$

$\text{Five-number summary: Min}=23, Q_1=29, \text{Median}=38, Q_3=46.5, \text{Max}=55$
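The result can be checked with `statistics.quantiles`. The sketch assumes that its `method="exclusive"` option corresponds to the $(n+1)$ position formula used above, which is how the module's documentation describes it; the variable names are arbitrary.

```python
import statistics

data = [23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

print(five_number, iqr)
```

With IQR = 17.5, the 1.5×IQR fences fall at 2.75 and 72.75, so this dataset has no outliers and the box-plot whiskers would extend to the minimum and maximum.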


Key Takeaways#

Important

Mean, median, and mode: Each measures center differently - choose based on data distribution and outliers
Standard deviation: Measures typical distance from the mean; use with mean for symmetric data
Quartiles and IQR: Robust measures that are largely unaffected by outliers; use with median for skewed data
Variance uses squared units: Always take square root for standard deviation to match data units
Coefficient of variation: Enables comparison of variability across different scales or units
Box plots: Visualize five-number summary and identify outliers using 1.5×IQR rule
Sample vs population: Use n-1 for sample variance/standard deviation, n for population
Grouped data formulas: Use class midpoints and frequencies when raw data unavailable