
Random variable

Definition of a random variable

A random variable is a function of $\omega$ (i.e., an outcome in the sample space $\Omega$) that returns a number $x$.

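A random variable can be viewed either as a variable or as a function. In probability theory, a random variable is usually defined as a mapping from outcomes in the sample space to values on the real line, so under this definition it is a function. In the definition above, for example, the random variable maps each outcome $\omega$ in the sample space $\Omega$ to a number $x$. In practice, however, a random variable can be treated as a variable that randomly takes different values.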

  • E.g., let X be the random variable defined by the roll of a fair die and denote x as the result of a single roll. The probability that the random variable is equal to five can be expressed as:

    $P(X=x)$ with $x=5$, or equivalently $P(X=5)$

Two classes of random variables

● Discrete random variable

  • Assigns a probability to each value in a set of distinct values, which may be finite or countably infinite.

  • When the set of values is infinite, the set must be countable.

● Continuous random variable

  • Produces values from an uncountable set.

An infinitely divisible sample space is uncountable.

Discrete random variables

Probability mass function (PMF)

  • The function returns the probability that a random variable takes a certain value.
    $$
    f_X(x)=\mathrm{P}(\mathrm{X}=x)
    $$
  • The value returned from PMF must be non-negative.
  • The sum of the PMF across all values in the support of a random variable must be one.

Cumulative distribution function (CDF)

  • Measures the total probability of observing a value less than or equal to the input ${x}$.

  • $F_X(x)={P}(X \leq x)$

    • $F_X(x)$ is a non-decreasing function: if $x_2>x_1$, then $F_X(x_2) \geq F_X(x_1)$.

    • $P(X>k)=1-F(k)$

    • $P\left(x_1<X \leq x_2\right)=F\left(x_2\right)-F\left(x_1\right)$

The relationship between PMF and CDF

  • The CDF can always be expressed as the sum of the PMF over all values in the support that are less than or equal to $x$: $F_X(x)=\sum_{t \leq x} f_X(t)$

Suppose $X$ is a random variable defined by the roll of a fair die and $x$ is the result of a single roll. Express the probability mass function and the cumulative distribution function of $X$.
Correct Answer:

  • The PMF of $X$ can be equivalently expressed using a list of values:

$$
f_X(x)= \begin{cases}1 / 6, & x \in\{1,2,3,4,5,6\} \\ 0, & \text{otherwise}\end{cases}
$$

  • The counterpart of the PMF is the cumulative distribution function:

$$
F_X(x)= \begin{cases}0, & x<1 \\ \lfloor x\rfloor / 6, & 1 \leq x<6 \\ 1, & x \geq 6\end{cases}
$$
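As a minimal sketch of the fair-die example, the PMF and CDF can be written in a few lines of Python. The function names `pmf` and `cdf` and the constant `SUPPORT` are illustrative choices, and exact fractions are used so that the sums are not affected by rounding.

```python
from fractions import Fraction

# Support of a fair six-sided die.
SUPPORT = range(1, 7)

def pmf(x):
    """P(X = x): 1/6 on the support, 0 elsewhere."""
    return Fraction(1, 6) if x in SUPPORT else Fraction(0)

def cdf(x):
    """P(X <= x): sum of the PMF over all support values t <= x."""
    return sum((pmf(t) for t in SUPPORT if t <= x), Fraction(0))

print(pmf(5))            # 1/6
print(cdf(3))            # 1/2
print(1 - cdf(4))        # P(X > 4) = 1/3
print(cdf(5) - cdf(2))   # P(2 < X <= 5) = 1/2
```

The last two lines use the CDF identities $P(X>k)=1-F(k)$ and $P(x_1<X \leq x_2)=F(x_2)-F(x_1)$.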


Expectations and moments

Mathematical expectation of random variable

  • The expected value of a random variable (denoted $E[X]$) is the probability-weighted average of its values and is defined as:
    $$
    E[X]=\sum P\left(\mathrm{X}=x_i\right) x_i
    $$
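The definition can be checked on the fair die. The sketch below assumes the PMF is stored as a dictionary `{value: probability}`; the helper name `expectation` is hypothetical.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum of P(X = x_i) * x_i for a PMF given as {x_i: P(X = x_i)}."""
    return sum(p * x for x, p in pmf.items())

# Fair die: each face has probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

print(expectation(die))  # 7/2
```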

Expectation operator

  • The expectation operator $E[\cdot]$ computes a probability-weighted average of the possible values of its argument.
  • Linear properties of the expected value:
    • If $b$ is a constant, $E[b]=b$; since $E[X]$ is itself a constant, $E[E[X]]=E[X]$.
    • If $a$ is a constant, $E[a X]=a E[X]$.
    • If $a, b$ and $c$ are constants, then $E[a X+b Y+c]=a E[X]+b E[Y]+c$.
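The linearity property can be verified numerically. The following sketch assumes $X$ and $Y$ are two independent fair dice (an assumption made only to build a concrete joint distribution; linearity itself does not require independence), and the constants `a`, `b`, `c` are arbitrary.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_joint = Fraction(1, 36)   # probability of each (x, y) pair for two independent fair dice
a, b, c = 2, -3, 5          # arbitrary constants chosen for illustration

# Left side: E[aX + bY + c] computed directly over the joint distribution.
lhs = sum(p_joint * (a * x + b * y + c) for x, y in product(faces, faces))

# Right side: a E[X] + b E[Y] + c, where E[X] = E[Y] = 7/2.
mean = sum(Fraction(1, 6) * x for x in faces)
rhs = a * mean + b * mean + c

print(lhs, rhs)  # 3/2 3/2
```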

Moment: As stated previously, moments are a set of commonly used descriptive measures that characterize important features of random variables.

  • Two types of moments

    • Central moment: $\mu_k=E\left[(X-E[X])^k\right],(k \geq 2)$
      • The second central moment is defined as the variance of $X$, or $\operatorname{Var}[X]$
    • Non-central moment: $\mu_k^{N C}=E\left[X^k\right],(k \geq 1)$
      • The first non-central moment is defined as the expected value of $X$, or $E[X]$
  • Relationships between Central moment and Non-central moment

$$
E\left[(X-E[X])^2\right]=E\left[X^2\right]-E[X]^2=\mu_2^{N C}-\left(\mu_1^{N C}\right)^2
$$

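This is the moment-reduction formula: it expresses the second central moment through lower-order non-central moments.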
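For the fair die, both sides of this identity can be computed exactly. The helper names `noncentral_moment` and `central_moment` below are illustrative, and the PMF is again assumed to be a `{value: probability}` dictionary.

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}

def noncentral_moment(pmf, k):
    """E[X^k] for a discrete PMF."""
    return sum(p * x**k for x, p in pmf.items())

def central_moment(pmf, k):
    """E[(X - E[X])^k] for a discrete PMF."""
    mu = noncentral_moment(pmf, 1)
    return sum(p * (x - mu)**k for x, p in pmf.items())

m1 = noncentral_moment(die, 1)   # 7/2
m2 = noncentral_moment(die, 2)   # 91/6

print(central_moment(die, 2))    # 35/12
print(m2 - m1**2)                # 35/12
```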

The four named moments

  • The first moment is the mean: $\mu(X)=E[X]$

The mean measures the central tendency of the data.

  • The second central moment is the variance: $\sigma^2(X)=E\left[(X-\mu)^2\right]$
    • Scaling by a constant $a$: $\sigma^2(a X)=E\left[(a X-a \mu)^2\right]=a^2 \sigma^2(X)$

    • The standard deviation is denoted by $\sigma$ and is defined as the square root of the variance (i.e., $\sqrt{\sigma^2}$ ).

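      • A more natural measure of dispersion

      • Directly comparable to the mean (same units)

      It measures the dispersion of the data: the more tightly the data cluster around the mean, the more concentrated they are and the smaller the standard deviation.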

  • The third named moment is the skewness, the standardized third central moment:

$$
\operatorname{skew}(X)=\frac{E\left[(X-\mu)^3\right]}{\sigma^3}=E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]
$$

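Skewness is the third central moment divided by $\sigma^3$. It measures whether the distribution is symmetric: the distribution is skewed toward the side with the longer tail.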

  • The fourth named moment is the kurtosis, the standardized fourth central moment:

$$
\operatorname{kurtosis}(X)=\frac{E\left[(X-\mu)^4\right]}{\sigma^4}=E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]
$$

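Kurtosis is the fourth central moment divided by $\sigma^4$. It measures the thickness of the tails, which indicates how likely extreme values are.

The larger the kurtosis, the fatter the tails. Tails should be compared at equal variance; dividing by $\sigma^4$ puts distributions on that common scale.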
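Skewness and kurtosis are both standardized moments, so a single helper covers them. This is a sketch for discrete PMFs stored as `{value: probability}`; the function name `standardized_moment` and the PMF `skewed` are hypothetical examples, not part of the notes.

```python
import math

def standardized_moment(pmf, k):
    """E[((X - mu) / sigma)^k] for a discrete PMF given as {value: probability}."""
    mu = sum(p * x for x, p in pmf.items())
    var = sum(p * (x - mu) ** 2 for x, p in pmf.items())
    sigma = math.sqrt(var)
    return sum(p * ((x - mu) / sigma) ** k for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}
skewed = {0: 0.6, 1: 0.3, 10: 0.1}   # hypothetical PMF with a long right tail

print(standardized_moment(die, 3))     # approximately 0: the die is symmetric
print(standardized_moment(skewed, 3))  # positive: the right tail is longer
print(standardized_moment(die, 4))     # about 1.73
```

Setting `k=3` gives the skewness and `k=4` the kurtosis, matching the two formulas above.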

The effect of changes in four named moments

[Figure: the effect of changes in the four named moments]

Standardization of a random variable: When $X$ has mean $\mu$ and variance $\sigma^2$, a standardized version of $X$ can be constructed as
$$
\frac{X-\mu}{\sigma}
$$

  • This variable has mean 0 and unit variance (and therefore unit standard deviation):

$$
\begin{aligned}
& E\left[\frac{X-\mu}{\sigma}\right]=0 \\
& \operatorname{Var}\left[\frac{X-\mu}{\sigma}\right]=1
\end{aligned}
$$
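A quick numerical check of these two properties, using the fair die as an assumed example. The helpers `mean` and `variance` are illustrative, and floating-point arithmetic means the results are only approximately 0 and 1.

```python
import math

die = {x: 1 / 6 for x in range(1, 7)}

def mean(pmf):
    """E[X] for a PMF given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())

def variance(pmf):
    """E[(X - E[X])^2] for a PMF given as {value: probability}."""
    mu = mean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())

mu, sigma = mean(die), math.sqrt(variance(die))

# PMF of Z = (X - mu) / sigma: same probabilities, shifted and rescaled values.
z = {(x - mu) / sigma: p for x, p in die.items()}

print(mean(z), variance(z))  # approximately 0 and 1, up to rounding error
```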

Continuous random variable

A probability density function (PDF) is used instead of a PMF

  • A PMF cannot be used for a continuous random variable because $P(X=x)=0$ for every single value $x$, even though $x$ can occur.

  • The PDF $f_X(x)$ returns a non-negative value for any input in the support of $X$ and is used to find interval probabilities.

    • $P\left(x_1<X<x_2\right)=\int_{x_1}^{x_2} f_X(x) d x$.

    • The total area under the curve $f(x)$ is 1 .

    • $P\left(x_1<X<x_2\right)$ is the area under the curve between $x_1$ and $x_2$.

    • $P\left(x_1 \leq X \leq x_2\right)=P\left(x_1<X \leq x_2\right)=P\left(x_1<X<x_2\right)$

  • The CDF $F_X(x)$ of a continuous random variable is defined in the same way as for a discrete random variable, $F_X(x)=P(X \leq x)$, with the sum over the PMF replaced by an integral of the PDF:

$$
F_X(x)=\int_{-\infty}^x f_X(z) d z
$$
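The interval and CDF formulas can be illustrated numerically. The sketch below assumes the density $f(x)=x/50$ on $[0,10]$, chosen because its CDF is the $F(x)=x^2/100$ used in the quantile examples of these notes; `integrate` is a simple midpoint-rule helper written for this example, not a library call.

```python
def pdf(x):
    """Illustrative density f(x) = x / 50 on [0, 10], zero elsewhere."""
    return x / 50 if 0 <= x <= 10 else 0.0

def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def cdf(x):
    """F(x): integral of the PDF from the lower end of the support up to x."""
    return integrate(pdf, 0, min(max(x, 0), 10))

print(integrate(pdf, 0, 10))   # total area, approximately 1
print(integrate(pdf, 2, 5))    # P(2 < X < 5), approximately 0.21
print(cdf(5) - cdf(2))         # the same probability from the CDF
```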


Quantiles and modes

Inverse Cumulative Distribution Function (CDF)

  • To find the probability that a random variable is less than or equal to 3, simply calculate $F(3)$.
  • To find the value $x$ at which the cumulative probability equals a given $p$, calculate $F^{-1}(p)$, where $F^{-1}$ is called the inverse cumulative distribution function.

Example:

  • The CDF is given below. Find the value of $x$ such that 25% of the distribution is less than or equal to $x$.
    $$
    F(x)=\frac{x^2}{100} \quad \text{s.t. } 0 \leq x \leq 10
    $$
  • Correct Answer: setting $F(x)=p$ gives $F^{-1}(p)=10 \sqrt{p}$; for $p=0.25$, $F^{-1}(0.25)=5$.

Quantiles can be used to construct an alternative set of descriptive measures of a random variable.

  • For a continuous or discrete random variable $X$, the $\alpha$-quantile of $X$ is the smallest number $q$ such that $\operatorname{Pr}(X \leq q) \geq \alpha$ (with equality when the CDF is continuous).
  • The way to calculate the $\alpha$-quantile is the same as finding the inverse cumulative distribution function.
    $$
    Q_X(\alpha)=F_X^{-1}(\alpha)
    $$

Example:

  • The CDF is given below. Find the 25%-quantile of the random variable $X$.
    $$
    F(x)=\frac{x^2}{100} \quad \text{s.t. } 0 \leq x \leq 10
    $$
  • Correct Answer: $\mathrm{Q}(0.25)=F^{-1}(0.25)=5$
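The quantile calculation can be sketched in two ways: with the closed-form inverse derived for this CDF, and with a generic bisection search that only needs the CDF to be non-decreasing. The function names are illustrative, and the bisection bounds assume the support $[0,10]$ from the example.

```python
import math

def cdf(x):
    """F(x) = x^2 / 100 on [0, 10], clamped outside the support."""
    x = min(max(x, 0.0), 10.0)
    return x * x / 100

def quantile_closed_form(alpha):
    """Q(alpha) = F^{-1}(alpha) = 10 * sqrt(alpha) for 0 <= alpha <= 1."""
    return 10 * math.sqrt(alpha)

def quantile_bisect(alpha, lo=0.0, hi=10.0, tol=1e-10):
    """Numerically invert the non-decreasing CDF by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < alpha:
            lo = mid      # the quantile lies to the right of mid
        else:
            hi = mid      # mid already reaches probability alpha
    return hi

print(quantile_closed_form(0.25))   # 5.0
print(quantile_bisect(0.25))        # approximately 5.0
```

The bisection version is useful when no closed-form inverse is available.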

Mean, Median and Mode

  • Mean: the average value of the random variable $X$, also referred to as the location of the distribution.

    • Very sensitive to large outliers.
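  • Median: the middle number (for discrete values) or the 50%-quantile (for a continuous random variable) of a random variable.

The median may or may not be affected by extreme values.

  • Mode: the value of the random variable that occurs most frequently.
    • A random variable may have one or more modes.

Extreme values have no effect on the most frequent value.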


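For a right-skewed distribution, mean > median > mode; for a left-skewed distribution the order is reversed. All three are first-order measures of location, and their ordering can be used to judge the sign of the third moment (skewness).

When the distribution is symmetric and unimodal, mean = median = mode.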
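These relationships can be illustrated with Python's standard `statistics` module. The data in `sample` are a made-up right-skewed sample, and `extreme` replaces its largest value with a much larger one to show which measures react to outliers.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is large.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 29]

# Same sample with the largest value made far more extreme.
extreme = sample[:-1] + [299]

for data in (sample, extreme):
    print(
        statistics.mean(data),       # 6, then 33: the mean moves with the outlier
        statistics.median(data),     # 3.0 both times
        statistics.multimode(data),  # [2] both times
    )
```

In the first sample the ordering is mean (6) > median (3) > mode (2), as expected for right skew.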
