Descriptive statistics
[MG1:Chp2, p7-p18]
 
 
Summaries of sample data (statistics) are denoted by Roman letters (e.g. x-bar for the sample mean, s for the sample SD)
Summaries of population data (parameters) are denoted by Greek letters (e.g. mu for the mean, sigma squared for the variance)
 
 
 
Central tendency = The extent to which observations cluster around a central value
Degree of dispersion = The spread of the observations about a central location
 
Measures of central tendency
    - Mode = The most common value
 
    - Median = The middle value
 
    - (Arithmetic) Mean = The average value
 
 
Degree of dispersion
    - Range = Difference between the maximum and minimum value
 
    - Percentiles = Ranked observations divided into 100 equal parts
    
    * Median = 50th percentile
    
    * Interquartile range = 25th to 75th percentile 
    - Sample variance = Sum of squares divided by degrees of freedom
    
    * Sum of squares = sum of the squared differences between each observation and the mean
    
    * Degrees of freedom = number of observations minus 1 
    - Population variance = Sum of squares divided by number of observations
 
    - Standard deviation = Square root of variance
 
    - Coefficient of variation (CV) = SD / mean x 100%
 
NB:
    - Degrees of freedom are used when calculating the variance of a sample
    
    * Because each observation is free to vary except for the last one, which must take a defined value in order for the mean to match the fixed sample mean 
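
The measures above can be sketched in a few lines of Python. This is only an illustration using the standard library; the sample values are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

# Measures of central tendency
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Degree of dispersion
value_range = max(sample) - min(sample)

# Sum of squares: squared differences between each observation and the mean
sum_of_squares = sum((x - mean) ** 2 for x in sample)

sample_variance = sum_of_squares / (n - 1)   # divided by degrees of freedom
population_variance = sum_of_squares / n     # divided by number of observations

sample_sd = math.sqrt(sample_variance)
cv_percent = sample_sd / mean * 100          # coefficient of variation

print(mean, median, mode, value_range)
print(sample_variance, population_variance, sample_sd, cv_percent)
```

Dividing by n - 1 rather than n always gives the larger of the two variances, and the difference shrinks as the sample grows.
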
Sources of variability
    - Biological variability
 
    - Measurement imprecision
    
    --> Resulting in random error 
    - Mistakes or biases in measurement
    
    --> Systematic error 
Standard error (SE)
[MG1:p9]
    - Standard error (SE)
    
    = aka standard error of the mean 
    - SE = SD / square root of n
 
    - SE is NOT meant to be used to describe variability of sample data
 
    - SE is a measure of precision, i.e. of how well the sample mean estimates the population mean (a population parameter)
    
    * Used to calculate confidence interval
    
    * Often derived from one sample
    
    * Reliability of sample mean in predicting population mean [Chris Flynn] 
    - SE is the standard deviation of the distribution of sample means (the means that repeated samples of the same size would give)
 
    - Increasing sample size can be a way of reducing SE
    
    * But the sample size needs to increase 4-fold to halve the SE (because SE is proportional to 1 / square root of n) 
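
A minimal sketch of the SE formula and of the effect of sample size. The SD and sample sizes are hypothetical, and the function name is made up for this illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 10.0                          # hypothetical sample SD
se_25 = standard_error(sd, 25)     # n = 25
se_100 = standard_error(sd, 100)   # n = 100: four times the sample size

print(se_25, se_100)               # the second SE is half the first
```
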
Confidence interval
    - Derived from SE
 
    - 95% confidence interval of the mean = sample mean +/- (1.96 x SE)
 
    - 99% confidence interval = sample mean +/- (2.58 x SE)
 
    - Definition of 95% CI
    
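    = The range within which the true population mean is expected to lie with 95% confidence (i.e. about 95% of intervals calculated this way from repeated samples would contain the true population mean) 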
NB:
    - In a normal distribution, 95% of the observations lie within 1.96 standard deviations of the mean
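
The confidence interval formulas can be sketched as follows. The mean, SD and n are hypothetical, the function name is made up for this illustration, and the sketch assumes the sample is large enough for the normal multipliers (1.96 and 2.58) to be appropriate.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Return (lower, upper) limits: sample mean +/- z x SE."""
    se = sd / math.sqrt(n)          # standard error of the mean
    return mean - z * se, mean + z * se

# Hypothetical sample summary: mean 120, SD 15, n = 100, so SE = 1.5
low95, high95 = confidence_interval(120, 15, 100)           # z = 1.96
low99, high99 = confidence_interval(120, 15, 100, z=2.58)   # wider interval

print(low95, high95)
print(low99, high99)
```
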
 
Frequency distributions
    - Kurtosis describes how peaked the distribution is
    
    * Excess kurtosis of a normal distribution = 0 (kurtosis without the correction = 3) 
    - Median is a better measurement of central tendency in a skewed distribution
    
    * With a skew to the right (positive skew), the median will be smaller than the mean 
    - Bimodal distribution = Distribution with two peaks
    
    --> Suggests that the sample is not homogeneous and may represent two different populations 
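
A small illustration of why the median is preferred in a skewed distribution. The data are hypothetical, with one large value creating a skew to the right.

```python
import statistics

# Hypothetical right-skewed sample: one extreme value in the upper tail
right_skewed = [1, 2, 2, 3, 3, 4, 20]

mean = statistics.mean(right_skewed)      # pulled upwards by the extreme value
median = statistics.median(right_skewed)  # stays with the bulk of the data

print(mean, median)
```
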
Normal distribution
    - Sometimes referred to as a Gaussian distribution
 
    - Two parameters define the curve, mu (the mean), and sigma (the standard deviation)
 
    - Mode = median = mean
 
    - Formula at [MG1:p13]
 
NB:
    - Mean +/- 1 SD includes 68% of total area 
 
    - Mean +/- 1.96 SD includes 95% of total area
 
    - Mean +/- 2 SD includes 95.4% of total area
 
    - Mean +/- 3 SD includes 99.7% of total area
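
These areas can be checked with the standard library's NormalDist class. The helper name below is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"mean +/- {k} SD: {area_within(k):.1%}")
```
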
 
Z distribution
In a STANDARD normal distribution
    - Mean = 0
 
    - Standard deviation = 1
 
    - aka the z distribution
 
    - A z transformation converts any normal distribution curve (with different mean and SD) to a standard normal distribution curve (mean = 0, SD = 1)
    
    * z = (x - mu)/SD 
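
A minimal sketch of the z transformation. The observation, mean and SD are hypothetical, and the function name is made up for this illustration.

```python
def z_score(x, mu, sd):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sd

# Hypothetical variable with mean 100 and SD 15
z = z_score(130, mu=100, sd=15)   # 30 above the mean = 2 SD
print(z)
```
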
Central limit theorem
[MG1:p14]
    - As the number of observations increases (n>100)
    
    --> The shape of the sampling distribution of the mean will approximate a normal distribution curve
    
    * Even if the distribution of the variable is not normal 
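
The theorem can be illustrated with a small simulation. This is a sketch only: the exponential distribution (strongly skewed, mean 1, SD 1), the seed and the number of repeats are arbitrary choices made for the illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) distribution."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

n = 100          # observations per sample
repeats = 2000   # number of repeated samples
means = [sample_mean(n) for _ in range(repeats)]

mean_of_means = statistics.fmean(means)   # close to the population mean (1)
sd_of_means = statistics.stdev(means)     # close to SD / sqrt(n) = 0.1

# Proportion of sample means within 1.96 SD of their own mean
within = sum(
    abs(m - mean_of_means) <= 1.96 * sd_of_means for m in means
) / repeats

print(mean_of_means, sd_of_means, within)
```

Although the individual observations are far from normal, the sample means behave approximately normally: their SD is close to the SE formula (SD / square root of n) and about 95% of them fall within 1.96 SD of their mean.
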
Binomial distribution
[MG1:p14-p15]
Formula at [MG1:p15]
A binomial distribution exists if a population contains items which belong to one of two mutually exclusive categories
* e.g. gender, complication
Conditions include:
    - Fixed number of observations (trials)
 
    - Only two outcomes are possible
 
    - Trials are independent
 
    - Constant probability for occurrence of each event
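
Under these conditions, the probability of exactly k events in n trials can be sketched with the standard binomial probability formula (written here from the general definition, not copied from [MG1:p15]). The complication rate and number of patients are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials,
    each with the same event probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical: 20 patients, each with a 10% chance of a complication
p_exactly_two = binomial_pmf(2, n=20, p=0.1)

# The probabilities over all possible outcomes (0 to n events) sum to 1
total = sum(binomial_pmf(k, 20, 0.1) for k in range(21))

print(p_exactly_two, total)
```
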
 
Poisson distribution
    - A binomial distribution can be approximated by a Poisson distribution when
    
    * The number of observations is very large, AND
    
    * Probability of an event is small (<0.05) 
    - A single parameter (lambda) is both the mean and the variance
 
Conditions:
    - Events occur randomly
 
    - Events occur independently
 
    - Events occur uniformly (same probability) and singly
 
Example used in [MG1:p15]: calculating the probability of more than one late-night admission
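
A sketch of a "more than one event" calculation of this kind, together with a comparison against the binomial distribution. The value of lambda, n and p are hypothetical and are not the figures used in [MG1:p15].

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

lam = 2.0   # hypothetical mean number of events per period

# P(more than one event) = 1 - P(0 events) - P(1 event)
p_more_than_one = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_more_than_one, 3))

# Large n and small p: binomial and Poisson (lambda = n x p) nearly agree
n, p = 1000, 0.002
for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, n * p), 4))
```
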
Incidence and prevalence
    - Incidence = the number of individuals who develop a condition (i.e. new cases) in a given time period
    
    --> An estimation of probability of developing a disease in a specified time period 
    - Prevalence = the number of individuals with a condition at a point of time (i.e. total cases, pre-existing and new)
 
Presentation of data
[MG1:p17]
    - For a normal distribution, mean and standard deviation are the best statistics to describe data
    
    * But mean can be affected by extreme values 
    - A bimodal distribution is best described with mode
 
    - Ordinal data should be described with mode or median
 
Box and whisker plot
    - Used to depict median, interquartile range and range
 
    - Middle line = median
 
    - Box = 25th to 75th percentiles
 
    - Whiskers = minimum and maximum, or 5th and 95th percentiles
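
The values drawn in a box and whisker plot can be computed as below. The data are hypothetical, and statistics.quantiles is used with its default method; other software may use slightly different percentile conventions.

```python
import statistics

# Hypothetical sample
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Cut points dividing the ranked data into 4 equal parts
q1, q2, q3 = statistics.quantiles(data, n=4)

box = (q1, q3)                      # 25th to 75th percentiles
middle_line = q2                    # median (50th percentile)
whiskers = (min(data), max(data))   # minimum and maximum
iqr = q3 - q1                       # interquartile range

print(box, middle_line, whiskers, iqr)
```
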