Statistics for Data Science (I)
Data science rests on two pillars: computer science and statistics. Computer science provides the computing platform, while statistics provides the theories and models behind the analysis in every data science project.
Discrete Random Variables
- boolean algebra
- set notation
- probability
- variables
- discrete univariate
- discrete multivariate
- continuous univariate
- continuous multivariate
- probability distribution
- PMF (Probability Mass Function): a function that gives the probability that a discrete random variable is exactly equal to some value.
- CDF (Cumulative distribution Function): defines the probability that X will take a value less than or equal to x.
- properties of a distribution: mean, median, mode, variance, standard deviation, quantiles, entropy
- entropy defines the “level of uncertainty” of the outcome. When there is no randomness, entropy will be zero. When randomness of outcome is high, entropy will be high.
- linearity of expectations: when we say the expected value is linear in X, it means:
*E[aX] = aE[X]*, *E[X + Y] = E[X] + E[Y]*
- conditional probability: *P(A|B) = P(A and B)/P(B)*
- conditional expectations
- conditional distribution
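The identities above can be checked numerically. A minimal NumPy sketch (the die-roll setup, sample size, and event choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two independent six-sided dice (an illustrative assumption).
x = rng.integers(1, 7, size=100_000)
y = rng.integers(1, 7, size=100_000)

# Linearity of expectation: E[aX] = aE[X] and E[X + Y] = E[X] + E[Y].
print(np.mean(3 * x), 3 * np.mean(x))           # the two agree
print(np.mean(x + y), np.mean(x) + np.mean(y))  # the two agree

# Conditional probability: P(A|B) = P(A and B) / P(B),
# with A = "roll is even" and B = "roll > 3".
a = x % 2 == 0
b = x > 3
p_a_given_b = np.mean(a & b) / np.mean(b)
print(p_a_given_b)  # close to 2/3: of {4, 5, 6}, two values are even
```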
Well-Known Discrete Distributions
- Bernoulli distribution: A variable that is either 1 (with probability p) or 0 (with probability 1-p). The mean is also p.
- Binomial distribution: The sum of n Bernoullis. Parameters: p and n.
- Categorical distribution: Generalized Bernoulli to more than 2 values.
- Uniform distribution: A special case of categorical (sort of), in which p_i=1/K for all i=1,…,K.
- Multinomial distribution: Generalized binomial to more than 2 values per trial. Parameters: n, p_i for i=1,…,K.
- Geometric distribution: The number of attempts to achieve your first “success” in sequential trials. Parameter p is the probability of success.
- Poisson distribution: has only one parameter, the mean, typically called *λ*
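These distributions are all available in `scipy.stats`; a small sketch checking their stated means (the parameter values p = 0.3, n = 10, λ = 4 are illustrative choices):

```python
from scipy.stats import bernoulli, binom, geom, poisson

p, n, lam = 0.3, 10, 4.0  # illustrative parameter choices

# Bernoulli: P(X=1) = p, and the mean is also p.
print(bernoulli.pmf(1, p))  # 0.3
print(bernoulli.mean(p))    # 0.3

# Binomial: the sum of n Bernoulli(p) trials, so its mean is n*p.
print(binom.mean(n, p))     # 3.0

# Geometric: trials until the first success; mean is 1/p.
print(geom.mean(p))         # about 3.33

# Poisson: one parameter lambda, which is both the mean and the variance.
print(poisson.mean(lam), poisson.var(lam))  # 4.0 4.0
```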
Discrete Multivariate Distribution
Joint distribution:
- P(X=x, Y=y, Z=z)
- when we say random variables X and Y are independent of each other, it means:
*P(X, Y) = P(X)P(Y)*, *P(X|Y) = P(X)*, *E[XY] = E[X]E[Y]*
marginal distribution:
assume X and Y have a joint PMF P(x, y):
- P(X = x_i) is a marginal probability (a single number)
- P(X = x), viewed as a function of x, is the marginal distribution
properties of multivariate distribution
- Covariance
- Correlation
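Covariance and correlation can be computed directly with NumPy; a sketch using a constructed dependence (the coefficient 2 and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two variables with a built-in dependence (an illustrative construction):
x = rng.normal(size=50_000)
y = 2 * x + rng.normal(size=50_000)  # y depends on x

# Covariance is E[(X - E[X])(Y - E[Y])]; correlation rescales it to [-1, 1].
cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]
print(cov_xy)   # near 2.0 for this construction
print(corr_xy)  # near 2/sqrt(5), about 0.894

# For an independent variable z, Cov(X, Z) is near zero.
z = rng.normal(size=50_000)
print(np.cov(x, z)[0, 1])
```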
Bayes’ Theorem: P(A|B) = P(B|A) P(A) / P(B)
Law of total probability: *P(B) = sum_i P(B|A_i)P(A_i)*, where the events A_i partition the sample space
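Bayes’ theorem and the law of total probability combine naturally: the total probability supplies the denominator P(B). A sketch with a hypothetical screening test (all the numbers below are assumptions for illustration):

```python
# A hypothetical screening-test example (numbers are assumptions):
p_d = 0.01          # P(disease)
p_pos_d = 0.95      # P(positive | disease)
p_pos_not_d = 0.10  # P(positive | no disease)

# Law of total probability: P(pos) = P(pos|D)P(D) + P(pos|not D)P(not D)
p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)

# Bayes' theorem: P(D|pos) = P(pos|D)P(D) / P(pos)
p_d_pos = p_pos_d * p_d / p_pos
print(p_pos)    # about 0.1085
print(p_d_pos)  # about 0.0876: a positive result still leaves disease unlikely
```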
Continuous Random Variables
- PDF (Probability Density Function): defines the distribution of a continuous random variable.
- a PDF is not a probability; it is a density
- the probability of any single value is zero; probability can only be computed over an interval of values
- CDF (Cumulative distribution Function): defines the probability that X will take a value less than or equal to x.
- Expected values
- Expected values of functions
- joint PDF:
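The density-versus-probability distinction is easy to demonstrate with `scipy.stats` (the normal parameters here are illustrative choices):

```python
from scipy.stats import norm

# For a continuous variable, the PDF is a density, not a probability:
# it can even exceed 1, as for this narrow normal, and P(X = x) = 0.
narrow = norm(loc=0, scale=0.1)
print(narrow.pdf(0))  # about 3.99, a density value, not a probability

# Probability only makes sense over an interval ("value zone"),
# computed from the CDF: P(a < X <= b) = F(b) - F(a).
std = norm(loc=0, scale=1)
p = std.cdf(1) - std.cdf(-1)
print(p)  # about 0.6827: the familiar one-sigma probability
```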
Well-Known Continuous Distributions
- Uniform distribution: p(x) = 1/(b-a) for a<x<b
- Gaussian distribution (aka “Normal distribution”): Parameters: mean, standard deviation
- Exponential distribution
- Beta distribution
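The stated properties of these distributions can be verified via `scipy.stats`; note SciPy’s `loc`/`scale` parameterization (all parameter values below are illustrative assumptions):

```python
from scipy.stats import uniform, norm, expon, beta

# Uniform on (a, b): density 1/(b-a); here a=2, b=5 (illustrative).
u = uniform(loc=2, scale=3)  # scipy parameterizes as (loc, loc + scale)
print(u.pdf(3.0))            # 1/3 everywhere on (2, 5)

# Gaussian: fully described by its mean and standard deviation.
print(norm(loc=10, scale=2).std())  # 2.0

# Exponential: mean 1/rate; scipy uses scale = 1/rate.
print(expon(scale=2.0).mean())      # 2.0

# Beta(a, b): supported on (0, 1), with mean a/(a+b).
print(beta(2, 3).mean())            # 0.4
```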
Relevant R Functions
- The uniform distribution: dunif(), punif(), qunif(), runif()
- The normal (Gaussian) distribution: dnorm(), pnorm(), qnorm(), rnorm()
- The exponential distribution: dexp(), pexp(), qexp(), rexp()
Matrices
- Random vector
- multivariate Gaussian covariance matrix
- how many parameters?
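On the parameter-count question: a d-dimensional Gaussian needs d parameters for the mean vector plus d(d+1)/2 for the symmetric covariance matrix. A NumPy sketch (the dimension d = 3 and the sample data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# A random vector: three jointly distributed components (illustrative data).
data = rng.normal(size=(10_000, 3))

# The sample covariance matrix is symmetric and d x d.
cov = np.cov(data, rowvar=False)
print(cov.shape)  # (3, 3)

# Parameter count for a d-dimensional Gaussian:
# d for the mean vector plus d*(d+1)/2 for the symmetric covariance.
d = 3
n_params = d + d * (d + 1) // 2
print(n_params)  # 9
```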
Relevant Python Functions
- random numbers: randn(), rand(), randint()
- probability functions: binom.pmf()
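The listed functions come from `numpy.random` and `scipy.stats`; a brief usage sketch (the shapes and the k = 3, n = 10, p = 0.5 values are illustrative):

```python
import numpy as np
from scipy.stats import binom

np.random.seed(0)

# randn: standard-normal draws; rand: uniform on [0, 1); randint: integers.
print(np.random.randn(3).shape)    # (3,)
print(np.random.rand(2, 2).shape)  # (2, 2)
print(np.random.randint(0, 10, size=5))

# binom.pmf(k, n, p): probability of exactly k successes in n trials.
print(binom.pmf(3, 10, 0.5))  # 120/1024, about 0.1172
```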
Maximum Likelihood Estimation (MLE)
- MLE is NOT a probability distribution; it is a method that produces a point estimate of a parameter
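A small sketch of the point-estimate nature of MLE, using Bernoulli data (the true p = 0.3 and sample size are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Example: estimate p of a Bernoulli from observed coin flips.
flips = rng.binomial(1, 0.3, size=10_000)

# For Bernoulli data the likelihood p^k (1-p)^(n-k) is maximized
# at the sample mean, so the MLE has a closed form:
p_hat = flips.mean()
print(p_hat)  # a single number near 0.3, not a distribution
```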
Written on December 19, 2018