Descriptive Statistics#

Data Representation Revision#

Theory#

Descriptive statistics builds upon data representation concepts, using organized data displays to calculate summary measures that describe the main features of a dataset. Understanding data organization is essential for accurate statistical calculations.

\[\text{Frequency Distribution} = \{(x_i, f_i) : i = 1, 2, ..., k\}\]

where \(x_i\) represents values/categories and \(f_i\) represents their frequencies.

\[\text{Total Frequency} = n = \sum_{i=1}^{k} f_i\]

Application#

Examples#

Example 1: Calculating Statistics from Frequency Tables#

Let’s calculate basic summary statistics from grouped data.

Data: Test scores with frequencies: 60-69 (3), 70-79 (7), 80-89 (12), 90-99 (8)

Method 1: Find Modal Class and Total

\(\text{Total students} = 3 + 7 + 12 + 8 = 30 \quad \text{(sum all frequencies)}\)

\(\text{Modal class} = \text{80-89} \quad \text{(highest frequency of 12)}\)

\(\text{Relative frequency of modal class} = \frac{12}{30} = 0.4 \quad \text{(40\% of data)}\)
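The same totals can be reproduced in a few lines of Python. This is only a minimal sketch: the dictionary name `freq` is illustrative, and the class labels are simply the score ranges listed in the data above.

```python
# Class labels and frequencies copied from the test-score data above.
freq = {"60-69": 3, "70-79": 7, "80-89": 12, "90-99": 8}

total = sum(freq.values())               # n = sum of all f_i
modal_class = max(freq, key=freq.get)    # class with the highest frequency
rel_freq = freq[modal_class] / total     # share of observations in that class

print(total, modal_class, rel_freq)      # 30 80-89 0.4
```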


Descriptive Statistics#

Theory#

Foundational Definitions: Descriptive statistics summarize and describe the main features of a dataset through numerical measures and graphical representations. These statistics provide insight into the center, spread, and shape of data distributions.

Measures of Central Tendency:

Mean (Arithmetic Average): The sum of all values divided by the number of values

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}\]

• Properties: Affected by extreme values (outliers)
• Uses all data values in calculation
• For grouped data: \(\bar{x} = \frac{\sum f_i x_i}{\sum f_i}\) where \(x_i\) is the class midpoint
• Weighted mean: \(\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}\) where \(w_i\) are weights
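The mean formulas above can be sketched in Python as follows. The function names and the small raw dataset are hypothetical. The grouped example reuses the test-score frequencies from the earlier frequency table and assumes class midpoints of 64.5, 74.5, 84.5 and 94.5.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical raw data: a single extreme value pulls the mean upward.
print(mean([10, 12, 14]))          # 12.0
print(mean([10, 12, 14, 100]))     # 34.0

# Grouped data: class midpoints act as the values, frequencies as the weights.
midpoints = [64.5, 74.5, 84.5, 94.5]   # assumed midpoints of 60-69, ..., 90-99
freqs = [3, 7, 12, 8]
grouped_mean = weighted_mean(midpoints, freqs)
print(round(grouped_mean, 2))      # 82.83
```

The grouped mean is just a weighted mean with frequencies as weights, which is why one helper covers both formulas.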

Median: The middle value when data is arranged in order

• For odd n: Median = value at position \(\frac{n+1}{2}\)
• For even n: Median = average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\)
• Properties: Not affected by extreme values
• For grouped data: \(\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times h\) where L = lower boundary of the median class, F = cumulative frequency before the median class, f = frequency of the median class, h = class width
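A minimal sketch of both median rules is shown below. The function names are hypothetical. The grouped example assumes the test-score classes from the earlier frequency table, with lower class boundaries of 59.5, 69.5, 79.5 and 89.5 and a class width of 10.

```python
def median(values):
    """Median of raw data using the odd/even position rules."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2   # positions n/2 and n/2 + 1


def grouped_median(lower_bounds, freqs, width):
    """Median = L + ((n/2 - F) / f) * h, located via cumulative frequency."""
    n = sum(freqs)
    cumulative = 0                                 # F: total before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= n / 2:      # first class reaching n/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency table is empty")


print(median([7, 1, 5]))       # 5
print(median([7, 1, 5, 3]))    # 4.0

est = grouped_median([59.5, 69.5, 79.5, 89.5], [3, 7, 12, 8], 10)
print(round(est, 2))           # 83.67
```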

Mode: The most frequently occurring value(s)

• Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)
• For grouped data: The modal class is the class with the highest frequency
• Estimated mode: \(\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times h\) where L = lower boundary of the modal class, \(d_1\) = modal class frequency minus the frequency of the previous class, \(d_2\) = modal class frequency minus the frequency of the next class, h = class width
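The estimated-mode formula can be sketched as below. The function name is hypothetical, and two assumptions are made that the text does not state: a missing neighbouring class is treated as having frequency 0, and the table has a single clear peak (if the modal class and both neighbours have equal frequencies, the denominator is zero). The example reuses the test-score table with assumed lower boundaries 59.5, 69.5, 79.5 and 89.5.

```python
def grouped_mode(lower_bounds, freqs, width):
    """Estimate the mode as L + d1 / (d1 + d2) * h for the modal class."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])        # index of modal class
    prev_f = freqs[m - 1] if m > 0 else 0                     # assumption: 0 if no class before
    next_f = freqs[m + 1] if m < len(freqs) - 1 else 0        # assumption: 0 if no class after
    d1 = freqs[m] - prev_f
    d2 = freqs[m] - next_f
    return lower_bounds[m] + d1 / (d1 + d2) * width


estimated_mode = grouped_mode([59.5, 69.5, 79.5, 89.5], [3, 7, 12, 8], 10)
print(round(estimated_mode, 2))   # 85.06
```

Here \(d_1 = 12 - 7 = 5\) and \(d_2 = 12 - 8 = 4\), so the estimate sits slightly above the middle of the 80-89 class because the class after it is fuller than the class before it.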

Measures of Spread (Dispersion):

Range: The difference between maximum and minimum values

\[\text{Range} = x_{max} - x_{min}\]

• Simple but affected by extreme values
• Interquartile range (IQR): \(IQR = Q_3 - Q_1\) (more robust)

Variance: Average squared deviation from the mean

Population variance:

\[\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}\]

Sample variance:

\[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]

• Alternative formula: \(\sigma^2 = \frac{\sum x_i^2}{n} - \mu^2\)
• For grouped data: \(\sigma^2 = \frac{\sum f_i(x_i - \bar{x})^2}{\sum f_i}\)
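To make the difference between the divisors \(n\) and \(n-1\) concrete, here is a small sketch. The function names and the dataset are illustrative and not taken from the text.

```python
def population_variance(values):
    """Average squared deviation from the mean (divide by n)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Same sum of squared deviations, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def population_variance_shortcut(values):
    """Alternative formula: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]                 # illustrative data, mean = 5
print(population_variance(data))                # 4.0
print(population_variance_shortcut(data))       # 4.0
print(round(sample_variance(data), 3))          # 4.571
```

The shortcut form is convenient by hand, but with floating-point numbers it can lose precision when the mean is large relative to the spread, so the definition form is usually the safer choice in code.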

Standard Deviation: Square root of variance

\[\sigma = \sqrt{\sigma^2} \quad \text{(population)}\]
\[s = \sqrt{s^2} \quad \text{(sample)}\]

• Properties: Same units as original data
• For normally distributed data, approximately 68% of values lie within 1 standard deviation of the mean
• Approximately 95% lie within 2 standard deviations
• Approximately 99.7% lie within 3 standard deviations
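The 68%, 95% and 99.7% figures can be checked against the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sd: {p:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

These percentages apply only when the data are approximately normal; skewed or heavy-tailed data can deviate noticeably from them.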

Measures of Position:

Quartiles: Divide ordered data into four equal parts

• \(Q_1\) (First quartile): 25th percentile
• \(Q_2\) (Second quartile): 50th percentile (median)
• \(Q_3\) (Third quartile): 75th percentile
• Position formula: \(Q_k\) position = \(\frac{k(n+1)}{4}\)

Percentiles: Divide data into 100 equal parts

\[P_k \text{ position} = \frac{k(n+1)}{100}\]
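The position formula usually gives a fractional position, so an interpolation step is needed. The sketch below assumes linear interpolation between the two neighbouring ordered values and clamps the position to the range 1 to n; both choices, the function name, and the dataset are assumptions made for illustration.

```python
def percentile(values, k):
    """k-th percentile via the position rule k(n + 1)/100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 100            # 1-based position, may be fractional
    position = min(max(position, 1), n)     # assumption: clamp to the data range
    lower = int(position)                   # whole part of the position
    fraction = position - lower             # how far to move toward the next value
    if lower == n:                          # already at the largest value
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]     # illustrative data, n = 9
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
# 6.0 12.0 16.0
```

Since \(Q_k\) is the \(25k\)-th percentile, the quartile position formula \(\frac{k(n+1)}{4}\) is the same rule with \(k\) rescaled.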

Box Plots (Box-and-Whisker Plots): Visual summary of five-number summary

• Five-number summary: Minimum, \(Q_1\), Median, \(Q_3\), Maximum
• Shows outliers: values below \(Q_1 - 1.5 \times IQR\) or above \(Q_3 + 1.5 \times IQR\)
• Indicates skewness through asymmetry
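The 1.5 × IQR rule can be sketched as follows. The code relies on `statistics.quantiles` from the Python standard library, whose default `"exclusive"` method interpolates at position \(\frac{k(n+1)}{4}\); the function name and the dataset are hypothetical.

```python
from statistics import quantiles


def outlier_fences(values):
    """Return the lower and upper fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4)   # default method="exclusive"
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [10, 12, 13, 14, 15, 16, 18, 40]        # hypothetical data with one extreme value
low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]

print(low, high, outliers)                     # 4.375 25.375 [40]
```

In a box plot, the whiskers would typically extend to the most extreme values inside these fences, with 40 drawn as a separate point.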

Coefficient of Variation: Relative measure of variability

\[CV = \frac{s}{\bar{x}} \times 100\%\]

• Useful for comparing variability between datasets with different units or scales
• Dimensionless measure
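As an illustration of comparing variability across units, the sketch below computes the CV for two made-up samples measured in centimetres and kilograms. The data and names are hypothetical, and the sample standard deviation is used, as in the formula above.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) * 100, expressed in percent."""
    return stdev(values) / mean(values) * 100


heights_cm = [160, 165, 170, 175, 180]   # hypothetical heights
weights_kg = [55, 60, 70, 80, 95]        # hypothetical weights

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"{cv_height:.2f}%  {cv_weight:.2f}%")   # 4.65%  22.29%
```

The standard deviations (about 7.9 cm and 16.0 kg) cannot be compared directly, but the CVs suggest the weights vary far more relative to their mean. The CV is only meaningful for data measured on a ratio scale with a positive mean.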


Application#

Examples#

Example 1: Calculating Mean, Median, and Mode#

Solve: Find the mean, median, and mode for the dataset: 12, 15, 18, 15, 22, 15, 19, 20, 18, 25

Method 1: Direct Calculation

\(\text{Ordered data: } 12, 15, 15, 15, 18, 18, 19, 20, 22, 25 \quad \text{(arrange in order)}\)

\(\bar{x} = \frac{12 + 15 + 15 + 15 + 18 + 18 + 19 + 20 + 22 + 25}{10} = \frac{179}{10} = 17.9 \quad \text{(mean)}\)

\(\text{Median} = \frac{18 + 18}{2} = 18 \quad \text{(average of 5th and 6th values)}\)

\(\text{Mode} = 15 \quad \text{(appears 3 times, most frequent)}\)
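The hand calculation can be cross-checked with Python's standard `statistics` module. This is a quick verification sketch rather than part of the worked method.

```python
import statistics

data = [12, 15, 18, 15, 22, 15, 19, 20, 18, 25]

mean_value = statistics.mean(data)
median_value = statistics.median(data)
modes = statistics.multimode(data)    # list of all most-frequent values

print(mean_value, median_value, modes)   # 17.9 18.0 [15]
```

`multimode` returns every value tied for the highest count, so a single-element list confirms the data are unimodal.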

Example 2: Computing Standard Deviation#

Solve: Calculate the population standard deviation for: 4, 7, 9, 10, 14

Method 1: Definition Formula

\(\mu = \frac{4 + 7 + 9 + 10 + 14}{5} = \frac{44}{5} = 8.8 \quad \text{(calculate mean first)}\)

\(\sum(x_i - \mu)^2 = (4-8.8)^2 + (7-8.8)^2 + (9-8.8)^2 + (10-8.8)^2 + (14-8.8)^2\)

\(= 23.04 + 3.24 + 0.04 + 1.44 + 27.04 = 54.8 \quad \text{(sum of squared deviations)}\)

\(\sigma = \sqrt{\frac{54.8}{5}} = \sqrt{10.96} \approx 3.31 \quad \text{(population standard deviation)}\)

Method 2: Alternative Formula

\(\sum x_i^2 = 16 + 49 + 81 + 100 + 196 = 442 \quad \text{(sum of squares)}\)

\(\sigma^2 = \frac{442}{5} - (8.8)^2 = 88.4 - 77.44 = 10.96 \quad \text{(variance)}\)

\(\sigma = \sqrt{10.96} \approx 3.31 \quad \text{(confirms our result)}\)
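As a check, the standard library's `statistics.pstdev` (population) and `statistics.stdev` (sample) can be applied to the same five values. The sample figure is shown only to illustrate how much the \(n-1\) divisor changes the answer for a small dataset.

```python
import statistics

data = [4, 7, 9, 10, 14]

sigma = statistics.pstdev(data)   # divides by n
s = statistics.stdev(data)        # divides by n - 1

print(f"{sigma:.2f} {s:.2f}")     # 3.31 3.70
```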

Example 3: Finding Quartiles and Creating Box Plot#

Solve: Find the five-number summary for: 23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55

Method 1: Position Formula

\(n = 13 \quad \text{(count data points)}\)

\(Q_1 \text{ position} = \frac{1(13+1)}{4} = 3.5 \quad \text{(between 3rd and 4th values)}\)

\(Q_1 = \frac{28 + 30}{2} = 29 \quad \text{(interpolate)}\)

\(Q_2 \text{ (median) position} = \frac{2(13+1)}{4} = 7 \quad \text{(7th value)}\)

\(Q_2 = 38 \quad \text{(middle value)}\)

\(Q_3 \text{ position} = \frac{3(13+1)}{4} = 10.5 \quad \text{(between 10th and 11th values)}\)

\(Q_3 = \frac{45 + 48}{2} = 46.5 \quad \text{(interpolate)}\)

\(\text{Five-number summary: Min}=23, Q_1=29, \text{Median}=38, Q_3=46.5, \text{Max}=55\)
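The five-number summary can be reproduced with `statistics.quantiles` from the Python standard library. Its `"exclusive"` method uses the same \(\frac{k(n+1)}{4}\) position rule as the calculation above. The `"inclusive"` method is included only to show that other quartile conventions can give different values for the same data.

```python
from statistics import quantiles

data = [23, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 52, 55]

q1, q2, q3 = quantiles(data, n=4, method="exclusive")
five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)            # (23, 29.0, 38.0, 46.5, 55)

# A different convention yields different quartiles for the same data.
inclusive = quantiles(data, n=4, method="inclusive")
print(inclusive)                      # [30.0, 38.0, 45.0]
```

When comparing results across textbooks, calculators, or software, it helps to confirm which quartile convention each one uses.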

Key Takeaways#

Important

  1. Mean, median, and mode: Each measures center differently - choose based on data distribution and outliers

  2. Standard deviation: Measures typical distance from the mean; use with mean for symmetric data

  3. Quartiles and IQR: Robust measures that resist the influence of outliers; use with median for skewed data

  4. Variance uses squared units: Always take square root for standard deviation to match data units

  5. Coefficient of variation: Enables comparison of variability across different scales or units

  6. Box plots: Visualize five-number summary and identify outliers using 1.5×IQR rule

  7. Sample vs population: Use n-1 for sample variance/standard deviation, n for population

  8. Grouped data formulas: Use class midpoints and frequencies when raw data unavailable