Key Terms and Definitions

  • Statistics is the science studying the collection, analysis, interpretation, and presentation of data

  • Descriptive statistics is the practice of organizing and summarizing data

  • Inferential statistics is the use of probability to determine how confident we can be in conclusions drawn from data

  • Populations refer to the entire group being studied

  • Samples are subsets of the population being studied

  • Statistics are numerical representations of a property of a sample

  • Parameters are numerical characteristics of a population that can be estimated with a statistic

  • Representative samples contain the same characteristics as the population

  • Variables are characteristics or measurements that can be determined for each member of a population

    • Numerical variables take on values with units
    • Categorical variables place the member into a category
  • Data are the variable’s actual values

    • Datum is a single value
  • Average is often used interchangeably with the arithmetic mean

Data, Sampling, and Variation

  • Qualitative/categorical data describe categories or attributes of a population (e.g. blood type)

  • Quantitative data are measured in numbers as the result of counting or measuring attributes (e.g. height)

    • Quantitative discrete data are measured in whole numbers (number of legs on an insect)
    • Quantitative continuous data can have fractions, decimals, etc. (weight of a person)

Visualizing Qualitative Data

  • Pie charts use proportional wedges to represent categories
  • Bar graphs use vertical or horizontal bars
  • Pareto charts use bars descending in categorical size

Percentages can add up to more than 100% if members of the population fall into more than one category.

Sampling Methods

  • Simple random sample: n individuals are chosen where each individual has the same chance of being chosen (e.g. drawing numbers from a hat)

  • Stratified sample: population is divided into homogeneous strata (groups) and a proportionate number is taken from each stratum by simple random sample

    • Cluster sampling divides the population into heterogeneous clusters, then randomly selects whole clusters
  • Systematic sample: a starting point is selected at random and every nth piece of data is taken

  • Sampling with replacement: the chosen member is added back to the population before the next member is chosen

    • Replacement guarantees that samples are independent of each other
    • 10% rule: the sample should be less than 10% of the population if sampling without replacement, so that selections can still be treated as independent
  • Convenience sampling: non-random sampling where the most available data is chosen (online surveys)

  • Sampling errors are caused by the sampling process

  • Non-sampling errors are caused by factors not related to the sampling process

  • Sampling bias is created when some members of the population are more likely to be chosen than others
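
The random sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of member IDs, the two strata, and the function names are all hypothetical, and the fixed seed is there just to make the run repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Choose n members so that every member is equally likely to be picked."""
    return rng.sample(population, n)

def stratified_sample(strata, fraction, rng):
    """Take the same fraction from each stratum by simple random sampling."""
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(population, step, rng):
    """Pick a random starting point, then take every step-th member."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(42)  # fixed seed (assumption) so results are repeatable
population = list(range(1, 101))  # hypothetical member IDs 1..100
strata = {"first_half": population[:50], "second_half": population[50:]}  # hypothetical strata

srs = simple_random_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)
systematic = systematic_sample(population, 10, rng)
```

Note that `rng.sample` draws without replacement; `rng.choices` would be the standard-library call for sampling with replacement.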

Variation in Data

  • Data can vary due to measurement methods

Variation in Samples

  • Samples can vary due to sampling method and sample size

Frequency, Levels of Measurement

Levels of Measurement

  • Nominal scales are qualitative, unordered: names, labels, categories, colors
  • Ordinal scales are qualitative, ordered: top five ranks
  • Interval scales are ordered with meaningful differences between values, but have no absolute starting point: temperature in °C or °F
  • Ratio scales are ordered, have absolute starting points, can be compared using ratios: sports scores

Frequency

  • Frequency is the number of times a value occurs
    • Relative frequency is the ratio of the number of times a value occurs to the total number of values in the dataset
    • Cumulative relative frequency is the running total of relative frequencies
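
As a small illustration of the three definitions above, the sketch below tabulates a made-up dataset. The data values and variable names are hypothetical.

```python
from collections import Counter
from itertools import accumulate

data = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5]  # hypothetical dataset

counts = Counter(data)          # frequency of each distinct value
values = sorted(counts)         # distinct values in ascending order
total = len(data)

frequencies = [counts[v] for v in values]
relative = [f / total for f in frequencies]   # frequency / total count
cumulative = list(accumulate(relative))       # running total of relative frequencies

for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v}: frequency={f}, relative={r:.2f}, cumulative={c:.2f}")
```

The last cumulative relative frequency should always come out to 1 (up to floating-point rounding), which is a quick check that no values were dropped.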

Experiment Design

  • Explanatory variables cause change

    • Treatments are different values of the explanatory variable
    • Lurking variables affect the response variable but are not considered
    • Confounding variables are considered but cannot be distinguished from one another
  • Response variables change due to explanatory variables

  • Experimental units are single objects or individuals measured

  • Control groups receive a placebo

    • Blinding is when the subject does not know what treatment they are receiving
    • Double blinding is when neither the experimenter nor the subject knows what treatment is being administered
  • Participants should give informed consent

Sampling (Distributions) II

Continuous Random Variables

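For a continuous random variable, the probability of any single exact value is zero, so the probability density function (PDF) cannot be read as a probability at a point. Probabilities are obtained only for a range of values, as the area under the PDF over that range, which is what the cumulative distribution function (CDF) is used to compute.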

Example

The amount of money Josh contributes to his savings account each month is normally distributed with a mean of $55.20 and standard deviation of $8.15.

What is the probability that Josh contributes more than $60 next month?

z-score: $z = \frac{x - \mu}{\sigma} = \frac{60 - 55.20}{8.15} \approx 0.589$

NormCD(60, 100000, 8.15, 55.20)
>>> 0.2779

What amount of money would indicate the top 5% and bottom 5% of contributions?

InvNormCD(0.95, 8.15, 55.20)
>>> 68.6055
InvNormCD(0.05, 8.15, 55.20)
>>> 41.7944
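
For readers without a calculator that provides NormCD and InvNormCD, the same quantities can be reproduced with Python's standard-library `statistics.NormalDist`. This is a sketch of an equivalent computation, not the calculator's own method; the variable names are made up for illustration.

```python
from statistics import NormalDist

# Josh's monthly contributions: mean $55.20, standard deviation $8.15
savings = NormalDist(mu=55.20, sigma=8.15)

# P(X > 60) is the area to the right of 60, i.e. 1 - CDF(60)
p_more_than_60 = 1 - savings.cdf(60)

# Cutoffs for the top 5% and bottom 5% come from the inverse CDF
top_5_cutoff = savings.inv_cdf(0.95)
bottom_5_cutoff = savings.inv_cdf(0.05)

print(round(p_more_than_60, 4))   # approximately 0.278
print(round(top_5_cutoff, 2))     # approximately 68.61
print(round(bottom_5_cutoff, 2))  # approximately 41.79
```

Using `1 - cdf(60)` avoids the need for a large stand-in upper bound such as 100000.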

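Combining continuous normal distributions: if $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, then $X \pm Y \sim N(\mu_X \pm \mu_Y, \sigma_X^2 + \sigma_Y^2)$. The variances add even when the variables are subtracted.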

Sample Statistics

Sample statistics are point estimators of their corresponding population parameters: $\bar{x}$ estimates $\mu$, $s$ estimates $\sigma$, and $\hat{p}$ estimates $p$.

Sample statistics will never perfectly match the population. Sampling variability refers to how much estimates vary from sample to sample.

See: Chapter 7 – The Central Limit Theorem

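  • If the mean of a statistic's sampling distribution equals the population parameter (e.g. $\mu_{\bar{x}} = \mu$), then the statistic is an unbiased estimator
  • Samples must be large enough (commonly $n \geq 30$) for the CLT to apply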
    • True sampling distributions account for all possible sample statistics for all samples of a given size
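
To make the idea of a true sampling distribution concrete, the sketch below enumerates every possible sample of size 2 (drawn without replacement) from a tiny, made-up population and checks that the sample means average out to the population mean. The population values are hypothetical and chosen only to keep the arithmetic small.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]  # hypothetical tiny population
sample_size = 2

# Every possible sample of the given size, and the mean of each one
sample_means = [mean(s) for s in combinations(population, sample_size)]

population_mean = mean(population)
mean_of_sample_means = mean(sample_means)

print(len(sample_means))                        # 10 possible samples
print(population_mean, mean_of_sample_means)    # both equal 6
```

Individual sample means range from 3 to 9, which is the sampling variability, while their average lands exactly on the population mean, which is what makes the sample mean an unbiased estimator here.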

Sampling Distribution Model

Samples must be random (unbiased), smaller than 10% of the population (so that independence can be assumed), and have at least 10 successes and 10 failures.

  • Center: $\mu_{\hat{p}} = p$
  • Spread: $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
  • Normally distributed
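
A minimal sketch of this model in Python follows, assuming the standard results that a sample proportion is centered at $p$ with standard deviation $\sqrt{p(1 - p)/n}$. The function name, its parameters, and the example numbers are hypothetical.

```python
from math import sqrt

def proportion_sampling_model(p, n, population_size):
    """Return (center, spread) for the sample proportion.

    Raises ValueError if the 10% condition or the
    10 successes/failures condition is not met.
    """
    if n >= 0.10 * population_size:
        raise ValueError("sample must be smaller than 10% of the population")
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("need at least 10 expected successes and 10 expected failures")
    center = p
    spread = sqrt(p * (1 - p) / n)
    return center, spread

# Hypothetical example: p = 0.5, sample of 100 from a population of 10,000
center, spread = proportion_sampling_model(p=0.5, n=100, population_size=10_000)
print(center, round(spread, 4))  # 0.5 0.05
```

Randomness of the sample cannot be checked from the numbers alone, so that condition is left to the study design.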

Differences in Sample Proportions

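  • Center: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
  • Spread: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$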
  • Normally distributed

Example

80% of UBC students ($p_{\text{UBC}} = 0.80$) and 76% of SFU students ($p_{\text{SFU}} = 0.76$) passed their Calculus I finals.

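Find the center and spread of the difference in sample proportions for a sample of 75 UBC students and a sample of 100 SFU students.

$\mu_{\hat{p}_{\text{UBC}} - \hat{p}_{\text{SFU}}} = 0.80 - 0.76 = 0.04$

$\sigma_{\hat{p}_{\text{UBC}} - \hat{p}_{\text{SFU}}} = \sqrt{\frac{(0.80)(0.20)}{75} + \frac{(0.76)(0.24)}{100}} \approx 0.0629$

Find the probability that the sample of 100 SFU students has a higher pass rate than the sample of 75 UBC students, i.e. that the difference (UBC minus SFU) is below 0.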

NormCD(-100000, 0, 0.0629, 0.04)
>>> 0.262411
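
The same calculation can be checked in Python with the standard-library `statistics.NormalDist`. This sketch assumes the two samples are independent and that the normal model applies; the variable names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

p_ubc, n_ubc = 0.80, 75    # UBC pass rate and sample size
p_sfu, n_sfu = 0.76, 100   # SFU pass rate and sample size

# Center and spread of the difference (UBC minus SFU) in sample proportions
center = p_ubc - p_sfu
spread = sqrt(p_ubc * (1 - p_ubc) / n_ubc + p_sfu * (1 - p_sfu) / n_sfu)

# SFU has the higher sample proportion when the difference is below 0
difference = NormalDist(mu=center, sigma=spread)
p_sfu_higher = difference.cdf(0)

print(round(center, 2), round(spread, 4))  # 0.04 0.0629
print(round(p_sfu_higher, 3))              # approximately 0.262
```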

Differences in Sampling Distributions

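  • Center: $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
  • Spread: $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$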

Example

The weights of male orca whales follow a normal distribution with a mean of 12000 lbs and a standard deviation of 800 lbs.

The weights of female orca whales follow a normal distribution with a mean of 10000 lbs and a standard deviation of 900 lbs.

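What is the probability that a sample of 15 male orcas will have a sample mean at least 3000 lbs greater than the sample mean of 10 female orcas?

$\mu_{\bar{x}_M - \bar{x}_F} = 12000 - 10000 = 2000$

$\sigma_{\bar{x}_M - \bar{x}_F} = \sqrt{\frac{800^2}{15} + \frac{900^2}{10}} \approx 351.663$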

NormCD(3000, 100000, 351.663, 2000)
>>> 0.002230
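
As a cross-check, the orca example can be sketched in Python. `statistics.NormalDist` supports subtracting one independent normal distribution from another, which adds the variances automatically. Independence of the two samples is an assumption here, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Sampling distributions of the two sample means (sigma / sqrt(n) for each)
male_mean = NormalDist(mu=12000, sigma=800 / sqrt(15))
female_mean = NormalDist(mu=10000, sigma=900 / sqrt(10))

# Difference of independent normals: means subtract, variances add
difference = male_mean - female_mean

# P(male sample mean exceeds female sample mean by 3000 lbs or more)
p_at_least_3000 = 1 - difference.cdf(3000)

print(round(difference.mean), round(difference.stdev, 3))  # 2000 351.663
print(p_at_least_3000)  # roughly 0.002
```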

A sample statistic that would occur with low probability raises doubts about the assumed population parameter.