Key Terms and Definitions
- Statistics is the science of collecting, analyzing, interpreting, and presenting data
- Descriptive statistics is the practice of organizing and summarizing data
- Inferential statistics is the use of probability to determine how confident we can be in conclusions drawn from data
- Populations refer to the entire group being studied
- Samples are subsets of the population being studied
- Statistics are numerical representations of a property of a sample
- Parameters are numerical characteristics of a population that can be estimated with a statistic
- Representative samples contain the same characteristics as the population
- Variables are characteristics or measurements that can be determined for each member of a population
  - Numerical variables take on values with units
  - Categorical variables place the member into a category
- Data are the variable's actual values
  - A datum is a single value
- Average is often used interchangeably with the arithmetic mean
Data, Sampling, and Variation
- Qualitative/categorical data describe categories or attributes of a population (e.g. blood type)
- Quantitative data are measured in numbers as the result of counting or measuring attributes (e.g. height)
  - Quantitative discrete data are measured in whole numbers (e.g. number of legs on an insect)
  - Quantitative continuous data can have fractions, decimals, etc. (e.g. weight of a person)
Visualizing Qualitative Data
- Pie charts use proportional wedges to represent categories
- Bar graphs use vertical or horizontal bars
- Pareto charts are bar graphs with the bars ordered from the largest category to the smallest
Percentages can add up to more than 100% if members of the population fall into more than one category.
Sampling Methods
- Simple random sample: n individuals are chosen so that each individual has the same chance of being chosen (e.g. drawing numbers from a hat)
- Stratified sample: the population is divided into homogeneous strata (groups) and a proportionate number is taken from each stratum by simple random sample
  - Cluster sampling divides the population into heterogeneous clusters and randomly selects whole clusters
- Systematic sample: a starting point is selected at random and every nth piece of data is taken
- Sampling with replacement: the chosen member is added back to the population before the next member is chosen
  - Replacement guarantees that the draws are independent of each other
  - 10% rule: when sampling without replacement, draws can be treated as independent if the sample is less than 10% of the population
- Convenience sampling: non-random sampling where the most readily available data are chosen (e.g. online surveys)
- Sampling errors are caused by the sampling process
- Non-sampling errors are caused by factors not related to the sampling process
- Sampling bias is created when some members of the population are more likely to be chosen than others
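The sampling methods above can be sketched in Python with the standard `random` module. This is only an illustration: the population of 100 numbered members, the two strata names, and the helper function names are all made up for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of being chosen (no replacement)."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick a random starting point, then take every `step`-th member."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, fraction, rng):
    """Take a simple random sample of the same fraction from each stratum."""
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def sample_with_replacement(population, n, rng):
    """Each chosen member goes back before the next draw, so repeats can occur."""
    return rng.choices(population, k=n)

rng = random.Random(0)  # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population of 100 members
strata = {  # hypothetical strata of unequal size
    "first_year": list(range(1, 61)),
    "second_year": list(range(61, 101)),
}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.10, rng)
with_replacement = sample_with_replacement(population, 10, rng)

print("simple random:   ", sorted(srs))
print("systematic:      ", systematic)
print("stratified:      ", sorted(stratified))
print("with replacement:", sorted(with_replacement))
```

Note that the stratified sample takes 6 members from the larger stratum and 4 from the smaller one, which is what "proportionate" means here.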
Variation in Data
- Data can vary due to measurement methods
Variation in Samples
- Samples can vary due to method, size
Frequency, Levels of Measurement
Levels of Measurement
- Nominal scales are qualitative, unordered: names, labels, categories, colors
- Ordinal scales are qualitative, ordered: top five ranks
- Interval scales are ordered and differences between values are meaningful, but there is no absolute starting point: temperature in °C or °F
- Ratio scales are ordered, have absolute starting points, can be compared using ratios: sports scores
Frequency
- Frequency is the number of times a value occurs
- Relative frequency is the ratio of the number of times a value occurs to the total number of values in the dataset
- Cumulative relative frequency is the running total of relative frequencies
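A frequency table with all three columns can be built in a few lines of Python. The dataset below is hypothetical and only serves to show how the columns relate to each other.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical dataset: hours of sleep reported by 10 students
data = [5, 6, 6, 7, 7, 7, 8, 8, 9, 9]

n = len(data)
counts = Counter(data)
values = sorted(counts)

frequencies = [counts[v] for v in values]          # times each value occurs
relative = [f / n for f in frequencies]            # frequency / total count
cumulative = [c / n for c in accumulate(frequencies)]  # running total of relative

print(f"{'value':>5} {'freq':>5} {'rel':>6} {'cum rel':>8}")
for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v:>5} {f:>5} {r:>6.2f} {c:>8.2f}")
```

The last cumulative relative frequency is always 1, which is a quick check that no value was missed.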
Experiment Design
- Explanatory variables cause change
  - Treatments are different values of the explanatory variable
  - Lurking variables affect the response variable but are not considered
  - Confounding variables are considered but their effects cannot be distinguished from one another
- Response variables change due to explanatory variables
- Experimental units are the single objects or individuals measured
- Control groups receive a placebo
  - Blinding is when the subject does not know what treatment they are receiving
  - Double blinding is when neither the experimenter nor the subject knows what treatment is being administered
- Participants should give informed consent
Sampling (Distributions) II
Continuous Random Variables
For a continuous random variable, the probability of any individual value is 0: the probability density function (PDF) gives a density, not a probability. We can only obtain the probability of a specified range within the distribution, using the cumulative distribution function (CDF).
Example
The amount of money Josh contributes to his savings account each month is normally distributed with a mean of $55.20 and standard deviation of $8.15.
What is the probability that Josh contributes more than $60 next month?
z-score: \( z = \frac{60 - 55.20}{8.15} \approx 0.589 \)
NormCD(60, 100000, 8.15, 55.20) >>> 0.2779
What amount of money would indicate the top 5% and bottom 5% of contributions?
InvNormCD(0.95, 8.15, 55.20) >>> 68.6055
InvNormCD(0.05, 8.15, 55.20) >>> 41.7944
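The calculator results in this example can be cross-checked with `statistics.NormalDist` from the Python standard library. This is a sketch for checking, not the method used in the notes; the variable names are chosen for the example.

```python
from statistics import NormalDist

# Josh's monthly contribution: mean $55.20, standard deviation $8.15
contributions = NormalDist(mu=55.20, sigma=8.15)

# z-score of a $60 contribution
z = (60 - contributions.mean) / contributions.stdev

# P(X > 60) is the area to the right of 60 under the curve
p_more_than_60 = 1 - contributions.cdf(60)

# Cutoffs for the top 5% and bottom 5% of contributions
top_5_cutoff = contributions.inv_cdf(0.95)
bottom_5_cutoff = contributions.inv_cdf(0.05)

print(f"z = {z:.3f}")
print(f"P(X > 60) = {p_more_than_60:.4f}")
print(f"top 5% starts at    ${top_5_cutoff:.2f}")
print(f"bottom 5% ends at   ${bottom_5_cutoff:.2f}")
```

Using `1 - cdf(60)` avoids the need for an artificial upper bound such as 100000.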
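Combining continuous normal distributions: if \(X\) and \(Y\) are independent and normally distributed, then \(X \pm Y\) is normally distributed with mean \(\mu_X \pm \mu_Y\) and standard deviation \(\sqrt{\sigma_X^2 + \sigma_Y^2}\).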
Sample Statistics
Sample statistics are point estimators of their corresponding population parameters: \(\bar{x}\) estimates \(\mu\), \(s\) estimates \(\sigma\), and \(\hat{p}\) estimates \(p\).
Sample statistics will never perfectly match the population. Sampling variability refers to how much estimates vary from sample to sample.
See: Chapter 7 - The Central Limit Theorem
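- If the mean of a statistic's sampling distribution equals the population parameter (e.g. \(E(\bar{x}) = \mu\)), then the statistic is an unbiased estimator
- A sample size of \(n \ge 30\) is the usual rule of thumb for the CLT to apply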
- True sampling distributions account for all possible sample statistics for all samples of a given size
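Sampling variability and the Central Limit Theorem can be seen in a small simulation. The sketch below assumes a made-up, strongly skewed population (exponential with mean 10) and approximates the sampling distribution of the mean with 2000 random samples rather than all possible samples.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with mean 10
population = [rng.expovariate(1 / 10) for _ in range(20_000)]
mu = statistics.fmean(population)      # population mean (parameter)
sigma = statistics.pstdev(population)  # population standard deviation

n = 40  # size of each sample
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(2_000)
]

center = statistics.fmean(sample_means)  # should sit close to mu
spread = statistics.stdev(sample_means)  # should sit close to sigma / sqrt(n)

print(f"population mean       = {mu:.3f}")
print(f"mean of sample means  = {center:.3f}")
print(f"sigma / sqrt(n)       = {sigma / n ** 0.5:.3f}")
print(f"sd of sample means    = {spread:.3f}")
```

Individual sample means differ from run to run of `rng.sample`, but their average stays close to the population mean, which is what being an unbiased estimator looks like in practice.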
Sampling Distribution Model
Samples must be random (unbiased), smaller than 10% of the population (so that independence can be assumed), and have at least 10 successes and 10 failures.
- Center: \( \mu_{\hat{p}} = p \)
- Spread: \( \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \)
- Normally distributed
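The conditions and the model for a sample proportion can be wrapped in one helper. This sketch assumes the standard model for \(\hat{p}\), with center \(p\) and spread \(\sqrt{p(1-p)/n}\); the function name, the population size of 50,000, and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_sampling_model(p, n, population_size=None):
    """Return the normal model for p-hat, or raise if a condition fails."""
    # Success/failure condition: at least 10 expected of each
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("need at least 10 expected successes and 10 failures")
    # 10% condition (only checked when the population size is known)
    if population_size is not None and n >= 0.10 * population_size:
        raise ValueError("sample should be less than 10% of the population")
    return NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))

# Hypothetical use: true proportion 0.80, samples of 75 from 50,000
model = proportion_sampling_model(p=0.80, n=75, population_size=50_000)
print(f"center = {model.mean:.4f}, spread = {model.stdev:.4f}")

# Probability that a sample proportion falls below 0.70 under this model
print(f"P(p-hat < 0.70) = {model.cdf(0.70):.4f}")
```

Randomness of the sample cannot be checked from the numbers alone, so the helper leaves that condition to the person collecting the data.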
Differences in Sample Proportions
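- Center: \( \mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2 \)
- Spread: \( \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \)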
- Normally distributed
Example
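80% of UBC students (\(p_1 = 0.80\)) and 76% of SFU students (\(p_2 = 0.76\)) passed their Calculus I finals.
Find the center and spread of the difference in sample proportions for a sample of 75 UBC students and a sample of 100 SFU students.
Center: \( 0.80 - 0.76 = 0.04 \)
Spread: \( \sqrt{\frac{(0.80)(0.20)}{75} + \frac{(0.76)(0.24)}{100}} \approx 0.0629 \)
Find the probability that a sample of 100 SFU students has a higher pass rate than a sample of 75 students from UBC.
NormCD(-100000, 0, 0.0629, 0.04) >>> 0.262411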
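The UBC/SFU example can be checked in Python. The sketch assumes the difference is taken as UBC minus SFU, so "SFU has the higher pass rate" corresponds to a difference below 0; the variable names are chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

p_ubc, n_ubc = 0.80, 75
p_sfu, n_sfu = 0.76, 100

# Center and spread of (p-hat UBC) - (p-hat SFU)
center = p_ubc - p_sfu
spread = sqrt(p_ubc * (1 - p_ubc) / n_ubc + p_sfu * (1 - p_sfu) / n_sfu)

difference = NormalDist(mu=center, sigma=spread)

# SFU beats UBC exactly when the difference is negative
p_sfu_higher = difference.cdf(0)

print(f"center = {center:.2f}")
print(f"spread = {spread:.4f}")
print(f"P(SFU sample has higher pass rate) = {p_sfu_higher:.4f}")
```

The variances add even though the proportions are subtracted, which is why the spread of the difference is larger than either sample's own spread.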
Differences in Sampling Distributions
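- Center: \( \mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2 \)
- Spread: \( \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \)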
Example
The weights of male orca whales (population 1) follow a normal distribution with a mean of 12000 lbs and a standard deviation of 800 lbs.
The weights of female orca whales (population 2) follow a normal distribution with a mean of 10000 lbs and a standard deviation of 900 lbs.
What is the probability that the mean weight of a sample of 15 male orcas is at least 3000 lbs greater than the mean weight of a sample of 10 female orcas?
Center: \( 12000 - 10000 = 2000 \)
Spread: \( \sqrt{\frac{800^2}{15} + \frac{900^2}{10}} \approx 351.663 \)
NormCD(3000, 100000, 351.663, 2000) >>> 0.00223
A sample statistic that would occur with low probability raises doubts about the assumed population parameter.
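The orca example can be checked the same way. This sketch assumes the two samples are independent and takes the difference as male sample mean minus female sample mean; the variable names are chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

mu_male, sigma_male, n_male = 12000, 800, 15
mu_female, sigma_female, n_female = 10000, 900, 10

# Center and spread of (male sample mean) - (female sample mean)
center = mu_male - mu_female
spread = sqrt(sigma_male ** 2 / n_male + sigma_female ** 2 / n_female)

difference = NormalDist(mu=center, sigma=spread)

# Probability that the difference in sample means is 3000 lbs or more
p_at_least_3000 = 1 - difference.cdf(3000)

print(f"center = {center}")
print(f"spread = {spread:.3f}")
print(f"P(difference >= 3000) = {p_at_least_3000:.5f}")
```

A probability this small means that observing such a difference would be a reason to question the stated population means or standard deviations.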