STT Prelim Study Guide

Table of Contents

1 Linear Algebra

1.1 Basic Elements of Linear Algebra

Chapter 2 in Khuri

A vector space over \(\mathcal{R}\) is a set \(V\) of elements, which can be added or multiplied by scalars, in such a way that the sum of two elements of \(V\) is an element of \(V\), and the product of an element of \(V\) by a scalar is an element of \(V\). It also includes the zero element and negatives.

For example, the set of all \(n\) -tuples is a vector space.

Another example of a vector space is the set of all polynomials.

A vector subspace is basically a subset of a vector space that is itself a vector space. So for example, the set of all polynomials of degree \(k\) or less is a vector subspace of the set of all polynomials.

The intersection of two subspaces is also a subspace. This is not necessarily true for union.

The elements of a vector space \(V\) can be represented as linear combinations of a set of elements of \(V\) that form a basis of \(V\).

Linearly dependent means there is a linear combination of the elements that equal zero, where the coefficients in the linear combination are not all zero. The linear span of \(n\) elements in a vector space is the collection of all linear combinations of those elements. The linear span forms a vector subspace.

A basis of a vector space \(V\) is a set of \(n\) linearly independent elements of \(V\) whose linear span equals \(V\). The number \(n\) is called the dimension of \(V\).

A function that maps one vector space over \(\mathcal{R}\) to another and preserves addition and scalar multiplication, i.e. \(T(au + bv) = aT(u) + bT(v)\), is called a linear transformation.

1.2 Basic Concepts in Matrix Algebra

Chapter 3 in Khuri

For two matrices to be equal means that all of their elements are equal.

When we take the transpose of a matrix, the rows become columns and the columns become rows.

The trace of a matrix is the sum of its diagonal elements. We have \(tr(A) = tr(A^\prime)\).

Also \(tr(ABC) = tr(BCA) = tr(CAB)\)

\(tr(A - B) = tr(A) - tr(B)\)

For symmetric matrix \(B\), \(tr(B) = \sum \text{eigenvalues}(B)\)

The rank of a matrix is the number of linearly independent rows (or columns).

The inverse of a nonsingular matrix of order \(n \times n\), denoted by \(A^{-1}\), is an \(n \times n\) matrix that satisfies the condition \(AA^{-1} = A^{-1}A = I_n\). Such a matrix is unique.

Properties of inverses include:

  1. \((AB)^{-1} = B^{-1}A^{-1}\)
  2. \((A^{\prime})^{-1} = (A^{-1})^{\prime}\)
  3. \((A \otimes B)^{-1} = A^{-1} \otimes B^{-1}\)

A generalized inverse of an \(m \times n\) matrix \(A\) is any \(n \times m\) matrix \(B\) such that \(ABA = A\). We denote such a matrix by \(A^{-}\).

A square matrix, \(A\), is idempotent if \(A^2 = A\).

For idempotent matrix \(A\), \(\text{rank}(A) = tr(A)\).

A square matrix of order \(n \times n\) is orthogonal if \(A^{\prime}A = I_n\).

To show a matrix \(D \in \mathbb{R}^{k \times k}\) is positive definite, show that the quadratic form \(d^{\prime}Dd > 0\) for every \(d \neq 0\).

Woodbury Formula:

\begin{equation*} A \in \mathbb{R}^{n \times n}, B \in \mathbb{R}^{n \times r}, D \in \mathbb{R}^{r \times r}, C \in \mathbb{R}^{r \times n} \\ (A + BDC)^{-1} = A^{-1} - A^{-1} B (D^{-1} + CA^{-1}B)^{-1} C A^{-1} \end{equation*}
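A quick numerical check of the identity (a minimal NumPy sketch; the sizes and random matrices below are arbitrary choices, not from the notes):

```python
# Verify the Woodbury formula numerically for one random instance.
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # n x n, invertible w.h.p.
B = rng.standard_normal((n, r))                      # n x r
D = np.eye(r) + 0.1 * rng.standard_normal((r, r))    # r x r, invertible w.h.p.
C = rng.standard_normal((r, n))                      # r x n

lhs = np.linalg.inv(A + B @ D @ C)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))  # True
```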

1.3 Matrix Cookbook

Matrix Cookbook

For any square matrix \(X\), we have

\begin{equation*} \frac{\partial \log |X|}{\partial X} = (X^{\prime})^{-1} \end{equation*}

For any square matrix \(X\) and vectors \(a, b\), we have

\begin{equation*} \frac{\partial (a^{\prime} X^{-1} b)}{\partial X} = -(X^{\prime})^{-1} a b^{\prime} (X^{\prime})^{-1} \end{equation*}

2 STT 867 [100%]

2.1 DONE Least Squares, Gauss-Markov Theorem and extensions [8/8]

DONE Multiple linear regression model

\begin{equation*} Y_i = X_i ^\prime \beta + \varepsilon_i \end{equation*}

Linear means linear with respect to the coefficients \(\beta\)

Fixed design is when \(X \in \mathbb{R}^{n \times p}\) is assumed to be fixed or non-random

Response variable is assumed to be continuous

Simple linear regression \(y_i = \beta_0 + x_i\beta_1 + \varepsilon_i\)

Multivariate linear regression is when we have multiple responses for each set of covariates

DONE Error-in-variables linear regression

\begin{equation*} Y = \tilde{X}\beta + \varepsilon \\ X = \tilde{X} + \tilde{\varepsilon} \end{equation*}

We observe \(X\), not \(\tilde{X}\)

DONE Ordinary least squares

Sum of squares (Residuals) \(|| Y - X\beta ||_2^2\)

The \(L_2\) norm of a vector \(a\) is \(||a||_2 = \sqrt{\sum_{i = 1}^{k} a_i^2}\)

Ordinary least squares (OLS) \(\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} ||Y - X\beta ||_2^2\)

Compute gradient, set equal to zero, check Hessian matrix is positive definite \(\Rightarrow \hat{\beta} = (X^\prime X)^{-1} X^\prime Y\)

\(\hat{\beta}\) exists and is unique
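A minimal NumPy sketch of the closed-form solution on simulated data (it assumes \(X\) has full column rank and solves the normal equations rather than explicitly inverting \(X^{\prime}X\)):

```python
# OLS via the normal equations on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.standard_normal(n)

# beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

resid = Y - X @ beta_hat
sse = resid @ resid
mse = sse / (n - p)          # unbiased estimate of sigma^2
print(beta_hat, mse)
```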

DONE Properties of \(\hat{\beta}\)

Assuming \(\mathbb{E}(\varepsilon) = 0\) and \(\text{Cov}(\varepsilon) = \sigma^2 I_{n \times n}\), then \(E(Y) = X\beta\) and \(\text{Var}(Y) = \sigma^2 I_n\) and

  1. \(\hat{\beta}\) is unbiased for \(\beta\)
  2. \(\text{Var}(\hat{\beta}) = \sigma^2 (X^{\prime}X)^{-1}\)
  3. \(c^\prime \hat{\beta}\) is unbiased for \(c^\prime \beta\)

If we also assume normality of \(\varepsilon\), then we get

\begin{equation*} \hat{\beta} \sim N(\beta, \sigma^2 (X^{\prime}X)^{-1}) \end{equation*}

DONE Gauss-Markov Theorem

Among all the unbiased estimates of a linear function of the parameters which are expressible as linear combinations of the observations (elements of the response vector \(Y\)), the one produced by the least-squares procedure has the minimum variance.

This tells us that \(c^\prime \hat{\beta}\) is the best linear unbiased estimator (BLUE) for \(c^\prime \beta\), assuming \(E(\varepsilon) = 0\) and \(\text{Var}(\varepsilon) = \sigma^2 I_n\).

Here \(c^{\prime}\hat{\beta} = c^{\prime}(X^{\prime}X)^{-1} X^{\prime}Y\)

BLUE is unique if \(\sigma^2 > 0\)

DONE Method of Lagrange Multipliers

Use this to turn a constrained optimization problem into an unconstrained one.

We want to optimize a real valued function \(f(x_1, x_2, \ldots, x_n)\), where \(x_1, x_2, \ldots x_n\) are subject to \(m (< n)\) equality constraints of the form

\begin{equation*} g_1(x_1, x_2, \ldots, x_n) = 0, \\ g_2(x_1, x_2, \ldots, x_n) = 0, \\ \vdots \\ g_m(x_1, x_2, \ldots, x_n) = 0, \\ \end{equation*}

where \(g_1, g_2, \ldots, g_m\) are differentiable functions.

The determination of the stationary points in this constrained optimization problem is done by first considering the function

\begin{equation*} F(\mathbf{x}) = f(\mathbf{x}) + \sum_{j = 1}^{m} \lambda_j g_j(\mathbf{x}) \end{equation*}

where \(\mathbf{x} = (x_1, x_2, \ldots, x_n)^\prime\) and \(\lambda_1, \lambda_2, \ldots, \lambda_m\) are scalars called Lagrange multipliers. By differentiating with respect to \(x_1, x_2, \ldots, x_n\) and equating the partial derivatives to zero, we obtain

\begin{equation*} \frac{\partial F}{\partial x_i} = \frac{\partial f}{\partial x_i} + \sum_{j = 1}^m \lambda_j \frac{\partial g_j}{\partial x_i} = 0, \quad i = 1, 2, \ldots, n \end{equation*}

Together with

\begin{equation*} g_1(x_1, x_2, \ldots, x_n) = 0, \\ g_2(x_1, x_2, \ldots, x_n) = 0, \\ \vdots \\ g_m(x_1, x_2, \ldots, x_n) = 0, \\ \end{equation*}

we have \(m + n\) equations in \(m + n\) unknowns (namely, the \(x\)'s and \(\lambda\)'s). The solutions for \(x_1, x_2, \ldots, x_n\) determine the locations of the stationary points.

DONE Generalization of Gauss-Markov

This basically extends GM theorem to the case of \(C \beta\), where \(C \in \mathbb{R}^{q \times p}\) and \(\beta \in \mathbb{R}^p\)

\(C \hat{\beta}\) is the unique BLUE for \(C \beta\)

DONE Estimation of \(\sigma^2\)

We estimate \(\sigma^2\) using the residuals.

If the true errors were observable, \(\frac{1}{n}\sum_{i = 1}^{n} \varepsilon_i^2\) would be unbiased for \(\sigma^2\); since they are not, we use the residuals with a degrees-of-freedom correction.

\(\text{MSE} = \frac{1}{n - p} \text{SSE}\) is unbiased for \(\sigma^2\)

\(\text{SSE} = ||Y - X\hat{\beta}||_2^2\) "residual sum of squares"

2.2 DONE Hypothesis testing and confidence regions [5/5]

DONE Testing \(H_0 : A\beta = m\) vs \(H_1 : A\beta \neq m\)

Assume \(A\) is of rank \(r\) and \(\varepsilon \sim N(0, \sigma^2 I_n)\).

Under \(H_0\),

\begin{equation*} A\hat{\beta} \sim N(m, \sigma^2 A(X^{\prime}X)^{-1}A^{\prime}), \end{equation*}

Weighted \(L_2\) distance between \(A\hat{\beta}\) and \(m\) is

\begin{equation*} ||(\sigma^2 A(X^\prime X)^{-1} A^\prime)^{-1/2} (A\hat{\beta} - m)||_2^2 \sim \chi^2_r \end{equation*}

Or written differently,

\begin{equation*} \frac{1}{\sigma^2}(A\hat{\beta} - m)^{\prime}(A(X^{\prime}X)^{-1}A^{\prime})^{-1}(A\hat{\beta} - m) \sim \chi^2_r \end{equation*}

We don't know \(\sigma^2\), so replace with MSE.

Using SVD, we find out

\begin{equation*} \frac{(n-p)\text{MSE}}{\sigma^2} \sim \chi^2_{n - p} \end{equation*}

We then know that the ratio of two independent \(\chi^2\) random variables, each divided by its degrees of freedom, is \(F\) distributed. And because \(\hat{\beta} \perp \!\!\! \perp \text{MSE}\), we have

\begin{equation*} F = \frac{\chi^2_r / r}{\chi^2_{n-p} / (n-p)} = \frac{(A\hat{\beta} - m)^\prime (A(X^\prime X)^{-1} A^\prime \text{MSE})^{-1}(A\hat{\beta} - m)}{r} \sim F_{r, n-p} \end{equation*}

Reject when \(F > F_{r, n-p}(1 - \alpha)\). (A large value of the test statistic \(F\) leads to a rejection of \(H_0\).)
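A sketch of this \(F\) test on simulated data (the matrix \(A\) and vector \(m\) below are arbitrary illustrative choices):

```python
# F test for H0: A beta = m in the full-rank linear model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, -2.0, 0.0]) + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
mse = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

A = np.array([[0.0, 0.0, 1.0]])   # here: test H0: beta_3 = 0
m = np.array([0.0])
r = np.linalg.matrix_rank(A)

XtX_inv = np.linalg.inv(X.T @ X)
diff = A @ beta_hat - m
F = diff @ np.linalg.solve((A @ XtX_inv @ A.T) * mse, diff) / r
p_value = 1 - stats.f.cdf(F, r, n - p)
print(F, p_value)  # reject H0 when F > stats.f.ppf(1 - alpha, r, n - p)
```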

DONE Singular value decomposition

For any given matrix \(Z \in \mathbb{R}^{m \times n}\) with rank \(r\), \(r \leq \text{min}(m, n)\), there exists

\begin{equation*} Z = U\Sigma V^\prime \\ Z \in \mathbb{R}^{m \times n}, U \in \mathbb{R}^{m \times r}, \Sigma \in \mathbb{R}^{r \times r}, V^\prime \in \mathbb{R}^{r \times n} \end{equation*}

where \(U^\prime U = V^\prime V = I_{r \times r}\)

\(\Sigma\) is diagonal matrix of singular values.

Since \(U^\prime U = I_{r \times r}\), there exists \(\tilde{U} \in \mathbb{R}^{m \times (m - r)}\) such that \(\tilde{U}^\prime \tilde{U} = I_{(m - r) \times (m - r)}\) and \(U^\prime \tilde{U} = 0\). That is to say, the columns of \(U\) and \(\tilde{U}\) together form an orthonormal basis for \(\mathbb{R}^m\).

DONE Power of \(H_0 : A\beta = m\) vs \(H_1 : A\beta \neq m\)

Remember

\begin{equation*} \text{Power} = 1 - \text{Type-II Error} \end{equation*}

In other words, power is the probability of rejecting when the alternative is true. This is a good thing, we want high power.

So under the alternative,

\begin{equation*} \frac{(A\hat{\beta} - m)^\prime (A(X^\prime X)^{-1} A^\prime \text{MSE})^{-1}(A\hat{\beta} - m)}{r} \sim F_{r, n-p} (\lambda) \end{equation*}

which is non-central \(F\) distribution with

\begin{equation*} \lambda = (m_1 - m)^\prime (\sigma^2 A (X^\prime X)^{-1} A^\prime)^{-1} (m_1 - m) \\ A\beta = m_1 \neq m \end{equation*}

Then

\begin{equation*} \begin{split} \text{Power} & = 1 - \text{Type-II Error} \\ & = P(F > F_{r, n-p} (1 - \alpha) | \text{H}_1) \\ & = P(F_{r, n-p} (\lambda) > F_{r, n-p} (1 - \alpha)) \end{split} \end{equation*}
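A sketch of the power computation using the noncentral \(F\) distribution in SciPy (the design matrix, \(A\), \(m_1\), and \(\sigma^2\) below are illustrative choices):

```python
# Power of the F test for H0: A beta = m against A beta = m1.
import numpy as np
from scipy import stats

def f_test_power(X, A, m, m1, sigma2, alpha=0.05):
    n, p = X.shape
    r = np.linalg.matrix_rank(A)
    cov = sigma2 * A @ np.linalg.inv(X.T @ X) @ A.T
    lam = (m1 - m) @ np.linalg.solve(cov, m1 - m)       # noncentrality parameter
    f_crit = stats.f.ppf(1 - alpha, r, n - p)
    return 1 - stats.ncf.cdf(f_crit, r, n - p, lam)      # P(F_{r,n-p}(lam) > crit)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
A = np.array([[0.0, 0.0, 1.0]])
print(f_test_power(X, A=A, m=np.array([0.0]), m1=np.array([0.5]), sigma2=1.0))
```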

DONE Likelihood ratio test

Likelihood ratio test statistic is

\begin{equation*} T = \frac{\underset{A \beta \ = m}{\mathrm{max}} \underset{\sigma^2}{\mathrm{max}} L(Y, X ; \beta, \sigma^2)}{\underset{\beta}{\mathrm{max}} \underset{\sigma^2}{\mathrm{max}} L(Y, X ; \beta, \sigma^2)} \end{equation*}

The likelihood ratio test statistic and the \(F\) test statistic only differ by a constant.

DONE Confidence region for \(A\beta \in \mathbb{R}^r\) with coverage level \(1 - \alpha\)

A pivot is a function of the data and the target quantity whose distribution does not depend on any unknown parameters.

Idea is to get the distribution of the pivot, then solve inequality with \((1 - \alpha)\) level quantile of distribution to get the quantity of interest.

In \(r = 1\), we get an interval.

For \(r > 1\), we get an ellipsoid.

A \((1 - \alpha)100\%\) confidence region on \(A\beta\) is given by

\begin{equation*} (A\hat{\beta} - A\beta)^{\prime} (A(X^{\prime}X)^{-1}A^{\prime})^{-1} (A\hat{\beta} - A\beta) \leq r \text{MSE} F_{\alpha, r, n-p}. \end{equation*}

If instead we want a confidence interval for \(a^{\prime}\beta\), where \(a^{\prime}\) is a known vector of \(p\) elements, it would be more convenient to use the \(t\) -distribution to derive a \((1 - \alpha)100\%\) confidence interval on \(a^{\prime}\beta\) of the form

\begin{equation*} a^{\prime}\hat{\beta} \pm t_{\frac{\alpha}{2}, n-p} (a^{\prime}(X^{\prime}X)^{-1} a \text{MSE})^{1/2}. \end{equation*}

This is because

\begin{equation*} \frac{a^{\prime}\hat{\beta} - a^{\prime}\beta}{(a^{\prime}(X^{\prime}X)^{-1} a \text{MSE})^{1/2}} \sim t_{n-p} \end{equation*}
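A sketch of the \(t\)-based interval on simulated data (the vector \(a\) is an arbitrary choice):

```python
# (1 - alpha) confidence interval for a'beta via the t pivot.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
mse = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

a = np.array([1.0, 0.0, 0.0])
alpha = 0.05
se = np.sqrt(a @ np.linalg.solve(X.T @ X, a) * mse)   # (a'(X'X)^{-1} a MSE)^{1/2}
t_crit = stats.t.ppf(1 - alpha / 2, n - p)
ci = (a @ beta_hat - t_crit * se, a @ beta_hat + t_crit * se)
print(ci)
```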

2.3 DONE Simultaneous confidence intervals [2/2]

We have discussed a confidence interval with coverage probability \(1 - \alpha\) for a particular linear function, \(a^{\prime}\beta\), of \(\beta\). In some cases, it may be desired to have confidence intervals on all linear functions of \(\beta\) of the form \(l^{\prime}\beta\), where \(l \in \mathbb{R}^p\), the \(p\) -dimensional Euclidean space, with a joint coverage probability of \(1 - \alpha\).

DONE Lemma A

For \(Z \in \mathbb{R}^m\),

\begin{equation*} ||Z||_2 \leq c \Longleftrightarrow |b^\prime Z| \leq ||b||_2 c \quad \forall b \in \mathbb{R}^m \end{equation*}

The proof follows from Cauchy-Schwarz

DONE Scheffe's Simultaneous C.I. for \(l'\beta\)

\begin{equation*} l^\prime \hat{\beta} \pm \sqrt{p \cdot \text{MSE} \cdot F_{p, n-p}(1 - \alpha) \cdot l^\prime (X^\prime X)^{-1}l} \quad \forall l \in \mathbb{R}^p \end{equation*}

2.4 DONE Maximum Likelihood Estimation

Multivariate normal pdf for \(Y \sim N(\mu, \Sigma)\)

\begin{equation*} \frac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \text{exp} \left(\frac{-1}{2} (Y - \mu)^\prime \Sigma^{-1} (Y - \mu) \right) \end{equation*}

Review derivative theorems p. 53 Khuri

2.5 DONE Less-than-full-rank linear models [3/3]

In less-than-full rank setting, \(X^{\prime}X\) is no longer nonsingular so we can't have \(\hat{\beta}\) as we have calculated before (we can't estimate \(\beta\) with \(\hat{\beta}\)).

DONE One-way ANOVA example

\begin{equation*} y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n_i \end{equation*}

In the one-way ANOVA model, the model matrix \(X\) is not of full column rank: it has \(k + 1\) columns but \(\text{rank}(X) = k < k + 1\).

DONE Generalized inverse of a matrix

Given a matrix \(A \in \mathbb{R}^{m \times n}\), a generalized inverse of \(A\) is any \(B \in \mathbb{R}^{n \times m}\) such that \(ABA = A\).

We denote the generalized inverse by \(A^-\).

The generalized inverse exists for any matrix \(A \in \mathbb{R}^{m \times n}\)

The generalized inverse may not be unique.

If \(A\) is square and invertible, \(B = A^{-1}\).

DONE Properties under less-than-full-rank model

\(X \hat{\beta}\) is unique (recall \(\hat{\beta}\) is not unique).

\(X \hat{\beta}\) can be viewed as the projection of \(Y\) onto the column space of \(X\), i.e. the subspace \(\{ \lambda \in \mathbb{R}^n : \lambda = X\beta \text{ for some } \beta \in \mathbb{R}^p \}\).

\((X^\prime X)^-\) is not unique, but \(X(X^\prime X)^- X^\prime\) is invariant to the choice of generalized inverse.

In the full rank model case, \(X \hat{\beta} = X(X^\prime X)^{-1} X^\prime Y\).

In the less-than-full-rank model case, \(X \hat{\beta} = X(X^\prime X)^- X^\prime Y\).

\begin{equation*} \hat{\beta} = (X^\prime X)^- X^\prime Y \end{equation*}

In the less-than-full-rank model,

\begin{equation*} \text{MSE} = \frac{\text{SSE}}{n - r} \end{equation*}

is unbiased for \(\sigma^2\), where \(\text{rank}(X) = r < p, \quad X \in \mathbb{R}^{n \times p}\).
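A small numerical illustration of these facts in a one-way ANOVA design (a sketch; `np.linalg.pinv` supplies the Moore-Penrose generalized inverse, and a second generalized inverse is constructed by hand; the group sizes are arbitrary):

```python
# Less-than-full-rank model: (X'X)^- and beta_hat are not unique, but X beta_hat is.
import numpy as np

rng = np.random.default_rng(5)
k, m = 3, 4                                              # 3 groups, 4 obs each
X = np.column_stack([np.ones(k * m), np.kron(np.eye(k), np.ones(m))])  # rank k < k+1
Y = np.repeat([1.0, 2.0, 3.0], m) + rng.standard_normal(k * m)

M = X.T @ X
G1 = np.linalg.pinv(M)                    # Moore-Penrose g-inverse of X'X
v = np.array([1.0, -1.0, -1.0, -1.0])     # null vector: M @ v = 0
G2 = G1 + np.outer(v, np.ones(4))         # another (non-symmetric) g-inverse

b1, b2 = G1 @ X.T @ Y, G2 @ X.T @ Y       # two different OLS solutions
print(np.allclose(b1, b2))                # False: beta_hat is not unique
print(np.allclose(X @ b1, X @ b2))        # True:  X beta_hat is unique

r = np.linalg.matrix_rank(X)
mse = np.sum((Y - X @ b1) ** 2) / (k * m - r)  # unbiased for sigma^2
```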

2.6 DONE Estimable/Testable linear functions [10/10]

Under the less-than-full-rank linear model, the regression coefficient \(\beta\) is not estimable. This is because \(X\) is not of full column rank. Therefore, there are infinitely many distinct values \(\beta^\ast\) and \(\tilde{\beta}\) with \(X\beta^\ast = X\tilde{\beta}\), and both give the same distribution for our data \((X, Y)\).

Hence, not every linear function of \(\beta\) of the form \(a^{\prime}\beta\) can be estimated. There are, however, certain conditions on \(a\) under which \(a^{\prime}\beta\) can be estimated.

DONE Estimability

We want to know if a linear function \(a^\prime \beta\) is estimable. We have

\begin{align*} a^\prime \beta \text{ is estimable } & \Longleftrightarrow \text{ if } X\beta_1 = X\beta_2 \text{ then } a^\prime \beta_1 = a^\prime \beta_2 \\ & \Longleftrightarrow \text{ if } X(\beta_1 - \beta_2) = 0 \text{ then } a^\prime (\beta_1 - \beta_2) = 0 \\ & \Longleftrightarrow a \text{ belongs to the rowspace of } X \\ & \Longleftrightarrow a^\prime = b^\prime X \text{ for some } b \end{align*}

If \(a^{\prime}\beta\) is estimable, then \(a^{\prime}\hat{\beta} = a^{\prime}(X^{\prime}X)^{-} X^{\prime}Y\) is invariant to the choice of \((X^{\prime}X)^{-}\).

DONE Identifiability

If \(\beta\) is not estimable, then the model is not identifiable.

DONE Estimable linear functions \(a'\beta\)

If \(a^\prime \beta\) is estimable, then \(a^\prime \hat{\beta}\) is unique.

DONE Gauss-Markov theorem for \(a'\beta\)

Assume \(E(\varepsilon) = 0\), \(\text{Cov}(\varepsilon) = \sigma^2 I\).

Suppose \(a^\prime \beta\) is estimable.

Then \(a^\prime \hat{\beta}\) is the unique BLUE, where \(\hat{\beta}\) is any OLS solution.

The proof of this uses a clever transformation to turn the problem into the full-rank case.

DONE Testable hypothesis

\begin{equation*} \text{A hypothesis is testable} \Longleftrightarrow A\beta \text{ is estimable} \end{equation*} \begin{equation*} H_0: A\beta = m \quad H_1: A\beta \neq m \\ A \in \mathbb{R}^{s \times p} \end{equation*}

\(A\beta\) being estimable means each component \(a_i^\prime \beta\) is estimable.

  1. If \(H_0 : A\beta = m\) is testable, then \(\text{rank}(A) \leq r = \text{rank}(X)\).
  2. \(H_0 : A\beta = m\) is testable \(\Leftrightarrow \exists S \in \mathbb{R}^{s \times p}\) such that \(A = SX^\prime X\).

In less-than-full-rank linear model,

\begin{equation*} A\hat{\beta} \sim N(A\beta, \sigma^2 A(X^\prime X)^- A^\prime) \end{equation*} \begin{equation*} F = \frac{(A\hat{\beta} - m)^\prime (A(X^\prime X)^- A^\prime)^{-1} (A\hat{\beta} - m)}{\text{MSE} \cdot s} \sim F_{s, n-r} \quad \text{under } H_0 \end{equation*} \begin{equation*} \text{Rejection Region } R = \{F > F_{s, n-r}(1 - \alpha) \} \end{equation*} \begin{equation*} F \sim F_{s, n-r}(\lambda) \quad \text{ under } H_1 \\ \lambda = \frac{1}{\sigma^2} (m_1 - m)^\prime (A(X^\prime X)^- A^\prime)^{-1} (m_1 - m) \end{equation*}

This generalization of the \(F\) test statistic recovers what we have shown in the full-rank model case (i.e. when \(r = p\)).

DONE Confidence region for \(A\beta\)

We have \(H_0 : A\beta = m\) versus \(H_1 : A\beta \neq m\), and we assume \(A\beta\) is estimable and \(\varepsilon \sim N(0, \sigma^2 I_n)\). We have that

\begin{equation*} A\hat{\beta} \sim N(A\beta, A(X^{\prime}X)^- A^{\prime} \sigma^2) \end{equation*}

And under \(H_0\),

\begin{equation*} F = \frac{(A\hat{\beta} - m)^{\prime}(A(X^{\prime}X)^- A^{\prime})^{-1}(A\hat{\beta} - m)}{\text{rank}(A) \text{MSE}} \sim F_{\text{rank}(A), n-\text{rank}(X)} \end{equation*}

So the \((1 - \alpha)100\%\) confidence region on \(A\beta\) is given by

\begin{equation*} \frac{(A\hat{\beta} - m)^{\prime}(A(X^{\prime}X)^- A^{\prime})^{-1}(A\hat{\beta} - m)}{\text{rank}(A) \text{MSE}} \leq F_{\text{rank}(A), n - \text{rank}(X)}(1 - \alpha) \end{equation*}

In the special case when \(A = a^{\prime}\), a test statistic concerning \(H_0 : a^{\prime}\beta = m\) versus \(H_1 : a^{\prime}\beta \neq m\) is

\begin{equation*} t = \frac{a^{\prime}\hat{\beta} - m}{(a^{\prime}(X^{\prime}X)^- a \text{MSE})^{1/2}} \end{equation*}

which, under \(H_0\), has the \(t\) -distribution with \(n - \text{rank}(X)\) degrees of freedom. Accordingly, the \((1 - \alpha)100\%\) confidence interval on \(a^{\prime}\beta\) is given by

\begin{equation*} a^{\prime}\hat{\beta} \pm (a^{\prime}(X^{\prime}X)^- a \text{MSE})^{1/2} t_{\frac{\alpha}{2}, n - \text{rank}(X)}. \end{equation*}

DONE Two-way ANOVA model

Two-way cross classification without interaction model is

\begin{equation*} y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk} \end{equation*}

where \(\alpha_i\) and \(\beta_j\) are fixed unknown parameters.

See example 7.2 in Khuri.

DONE Simultaneous confidence interval for estimable linear functions

Simultaneous \((1 - \alpha)100\%\) confidence intervals on all estimable linear functions \(l^{\prime}\hat{\beta}\), where \(l^{\prime}\) belongs to the rowspace of \(A\), are given by

\begin{equation*} l^{\prime}\hat{\beta} \pm \sqrt{\text{rank}(A) \, \text{MSE} \, F_{\text{rank}(A), n - \text{rank}(X)}(1 - \alpha) \, l^{\prime}(X^{\prime}X)^- l} \end{equation*}

which are known as Scheffe's simultaneous confidence intervals. In particular, if \(\text{rank}(X) = \text{rank}(A)\), then the elements of \(A\beta\) form a basis for all estimable linear functions of \(\beta\), and we have

\begin{equation*} l^{\prime}\hat{\beta} \pm \sqrt{\text{rank}(A) \, \text{MSE} \, F_{\text{rank}(A), n - \text{rank}(A)}(1 - \alpha) \, l^{\prime}(X^{\prime}X)^- l} \end{equation*}

DONE Bonferroni's intervals

Bonferroni's Inequality is

\begin{equation*} \mathbb{P}(\bigcap_{j = 1}^m A_j) \geq 1 - \sum_{j = 1}^{m} \mathbb{P}(A_j^c) \end{equation*}

In some cases, simultaneous confidence intervals on only a fixed number, \(m\), of contrasts may be of interest. Let \(c_j = \sum_{i = 1}^{k} \lambda_{ij}\mu_i\ (j = 1, 2, \ldots, m)\) be such contrasts. An unbiased estimator of \(c_j\) is \(\hat{c}_j = \sum_{i = 1}^{k} \lambda_{ij} \bar{Y}_i\). Bonferroni's simultaneous intervals on \(c_1, \ldots, c_m\), with joint coverage probability at least \(1 - \alpha\), are then given by

\begin{equation*} \hat{c}_j \pm \left(\text{MSE} \sum_{i = 1}^{k} \frac{\lambda_{ij}^2}{n_i} \right)^{1/2} t_{\sum_{i = 1}^{k} n_i - k, \, 1 - \frac{\alpha}{2m}} \end{equation*}

DONE Sidak's Intervals

This is an alternative set of intervals for a fixed number of contrasts. These intervals make use of a theorem that states the following.

Let \(Z = (Z_1, Z_2, \ldots, Z_n)^{\prime} \sim N(0, \Sigma)\), and \(\psi\) be a positive random variable, independent of \(Z\). Then,

\begin{equation*} \mathbb{P} \left( \bigcap_{j = 1}^{m} \{ \frac{|Z_j|}{\psi} \leq \delta_j \} \right) \geq \prod_{j = 1}^m \mathbb{P} \left( \frac{|Z_j|}{\psi} \leq \delta_j \right) \end{equation*}

for any \(\delta_1, \delta_2, \ldots, \delta_m > 0\).

We then get intervals

\begin{equation*} \hat{c}_j \pm \left(\sum_{i = 1}^{k} \frac{\lambda_{ij}^2}{n_i} \text{MSE} \right)^{1/2} t_{\sum_{i = 1}^{k} n_i - k, \, 1 - \frac{\alpha}{2}} \end{equation*}

with a probability greater than or equal to \(1 - \alpha_s\), where \(\alpha_s\) is such that \(1 - \alpha_s = (1 - \alpha)^m\). We note that these intervals are of the same form as Bonferroni's intervals, except that in the Bonferroni case, the joint probability of coverage is greater than or equal to \(1 - \tilde{\alpha}\), with \(\tilde{\alpha} = m \alpha\), instead of \((1 - \alpha)^m\) in the Sidak case.

When we compare Sidak's intervals to Bonferroni's intervals, we find that Sidak's are slightly better (narrower intervals for the same joint coverage).
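A small sketch comparing the per-comparison levels behind the two sets of intervals (the values of \(\alpha\), \(m\), and the error degrees of freedom are illustrative):

```python
# Per-comparison significance levels that give joint coverage >= 1 - alpha.
from scipy import stats

alpha, m, df = 0.05, 5, 40
alpha_bonf = alpha / m                       # Bonferroni
alpha_sidak = 1 - (1 - alpha) ** (1 / m)     # Sidak (slightly larger)
print(alpha_bonf, alpha_sidak)
# A larger per-comparison alpha gives a smaller t critical value,
# hence slightly narrower intervals for Sidak.
print(stats.t.ppf(1 - alpha_bonf / 2, df), stats.t.ppf(1 - alpha_sidak / 2, df))
```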

2.7 DONE Distributional properties of quadratic forms [10/10]

Suppose \(Y \in \mathbb{R}^n \sim N(\mu, \Sigma), A \in \mathbb{R}^{n \times n}\) is symmetric with \(\text{rank}(A) = r\).

What is the distribution of \(Y^{\prime}AY\)?

Example 1: \(Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)\)

\begin{equation*} \frac{\sum_{i = 1}^n (Y_i - \bar{Y})^2}{\sigma^2} \sim \chi^2_{n-1}, \\ \sum_{i = 1}^n (Y_i - \bar{Y})^2 = Y^{\prime}\left(I_n - \frac{1}{n} J_n \right) Y, \\ J_n = \mathbf{1}_n \mathbf{1}_n^{\prime} \text{, the } n \times n \text{ matrix of ones} \end{equation*}

Example 2: Balanced one-way ANOVA

We can write SSA and SSE as quadratic forms. They are both distributed as independent \(\chi^2\), so their ratio is \(F\) -distributed.

DONE Lemma 1

Let \(Y \in \mathbb{R}^n \sim N(\mu, \Sigma), A \in \mathbb{R}^{n \times n}\) is symmetric with \(\text{rank}(A) = r\).

\(\lambda_1, \lambda_2, \ldots, \lambda_k, (k \leq r)\) are distinct non-zero eigenvalues of \(\Sigma^{1/2}A \Sigma^{1/2}\), with multiplicity \(v_1, \ldots, v_k\).

\(P_i \in \mathbb{R}^{n \times v_i}\) matrix whose columns are the orthonormal eigenvectors of \(\Sigma^{1/2}A \Sigma^{1/2}\) corresponding to \(\lambda_i, (1 \leq i \leq k)\).

Then \(Y^{\prime}A Y = \sum_{i = 1}^k \lambda_i w_i, w_i \sim \chi^2_{v_i}(||\mu^{\prime} \Sigma^{-1/2} P_i ||_2^2)\) and the \(w_i\) are independent.

This says that the distribution of \(Y^{\prime}AY\) in general is a linear combination of independent \(\chi^2\) random variables.

DONE Spectral Decomposition

(for symmetric matrix)

\begin{equation*} \Sigma^{1/2} A \Sigma^{1/2} = \begin{pmatrix} P_1 & P_2 & \cdots & P_k \end{pmatrix} \begin{pmatrix} \lambda_1 I_{v_1} & 0 & \cdots & 0 \\ 0 & \lambda_2 I_{v_2} & \cdots & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & \cdots & \lambda_k I_{v_k} \\ \end{pmatrix} \begin{pmatrix} P_1^{\prime} \\ P_2^{\prime} \\ \vdots \\ P_k^{\prime} \end{pmatrix} = P \Lambda P^{\prime} \\ P^{\prime}P = I \quad \text{(columns of } P \text{ are orthonormal)} \end{equation*}

DONE Theorem 1

Suppose \(Y \sim N(\mu, \Sigma), A \in \mathbb{R}^{n \times n}\) is symmetric.

Then \(Y^{\prime}AY \sim \chi^2_r(\mu^{\prime}A \mu)\) if and only if \(A\Sigma\) (equivalently \(\Sigma^{1/2}A\Sigma^{1/2}\)) is idempotent and \(\text{rank}(A) = r\).

Normal distribution assumption is required to get \(\chi^2\).

DONE Moment generating function

We use the moment generating function to show equivalence of distributions.

\begin{equation*} M_X(t) = \mathbb{E}(e^{tX}) \end{equation*}

DONE Corollary \(A \in \mathbb{R}^{n \times n}\)

  1. \(Y \sim N(0, \Sigma)\), then \(Y^{\prime}AY \sim \chi^2_r \Leftrightarrow A\Sigma\) is idempotent, \(\text{rank}(A) = r\).
  2. \(Y \sim N(\mu, \Sigma)\), then \(Y^{\prime} \Sigma^{-1}Y \sim \chi^2_n(\mu^{\prime}\Sigma^{-1}\mu)\).
  3. \(Y \sim N(\mu, \sigma^2I)\), then \(Y^{\prime} \frac{A}{\sigma^2}Y \sim \chi^2_r \left(\frac{\mu^{\prime} A \mu}{\sigma^2} \right) \Leftrightarrow A \text{ is idempotent }, \text{ rank}(A) = r\)

DONE Theorem 2 (Craig's Theorem)

Suppose \(Y \in \mathbb{R}^n \sim N(\mu, \Sigma), A,B \in \mathbb{R}^{n \times n}\) are two symmetric matrices. Under what conditions are \(Y^{\prime}AY\) and \(Y^{\prime}BY\) independent?

\begin{equation*} Y^{\prime}AY \perp Y^{\prime}BY \Leftrightarrow A\Sigma B = 0. \end{equation*}

Gaussianity is required for both directions.

DONE Theorem 3.9 Khuri

\(C, D \in \mathbb{R}^{n \times n}\) are symmetric.

\(C\) and \(D\) commute means \(CD = DC\).

If \(CD = DC\), then \(C = P \Lambda P^{\prime}\) and \(D = P \tilde{\Lambda} P^{\prime}\); that is, they can be diagonalized with the same orthogonal eigenvector matrix \(P\).

DONE Theorem 3

Suppose \(Y \in \mathbb{R}^n \sim N(\mu, \Sigma), A \in \mathbb{R}^{n \times n}\) is symmetric, \(B \in \mathbb{R}^{m \times n}\). Under what condition are \(Y^{\prime}AY\) and \(BY\) independent?

For example, you may want to prove MSE and \(\hat{\beta}\) are independent.

\begin{equation*} Y^{\prime}AY \perp BY \Leftrightarrow B\Sigma A = 0. \end{equation*}

DONE Partitioning the sum of squares [1/1]

\begin{equation*} Y_i = X_i^{\prime}\beta_1 + \beta_0 + \varepsilon_i, \quad i = 1, 2, \ldots, n, \\ \bar{Y} = \frac{1}{n} \sum_{i = 1}^n Y_i, \quad \hat{Y}_i = X_i^{\prime}\hat{\beta}_1 + \hat{\beta}_0. \end{equation*}

Then we can partition the total sum of squares as

\begin{equation*} \sum_{i = 1}^n (Y_i - \bar{Y})^2 = \sum_{i = 1}^n (Y_i - \hat{Y}_i)^2 + \sum_{i = 1}^n (\hat{Y}_i - \bar{Y})^2 \\ \text{SSTO} = \text{SSE} + \text{SSR} \end{equation*}

and since \(\sum_{i = 1}^n (Y_i - \bar{Y})^2 = \sum_{i = 1}^n Y_i^2 - n\bar{Y}^2\), we can write

\begin{equation*} \sum_{i = 1}^n Y_i^2 - n\bar{Y}^2 = \sum_{i = 1}^n (Y_i - \hat{Y}_i)^2 + \sum_{i = 1}^n (\hat{Y}_i - \bar{Y})^2 \\ \downarrow \\ \sum_{i = 1}^n Y_i^2 = n\bar{Y}^2 + \sum_{i = 1}^n (Y_i - \hat{Y}_i)^2 + \sum_{i = 1}^n (\hat{Y}_i - \bar{Y})^2 \\ Y^{\prime}I_n Y = Y^{\prime} \frac{1}{n}J_n Y + Y^{\prime} \left(I_n - X\left(X^{\prime}X\right)^{-}X^{\prime}\right)Y + Y^{\prime}\left(X\left(X^{\prime}X\right)^{-}X^{\prime} - \frac{1}{n} J_n\right)Y \\ X = \begin{pmatrix} 1 & x_1^{\prime} \\ 1 & x_2^{\prime} \\ 1 & x_3^{\prime} \\ \vdots & \vdots \\ 1 & x_n^{\prime} \\ \end{pmatrix} \in \mathbb{R}^{n \times p} \end{equation*}

Here we have

\begin{equation*} Y^{\prime}Y = Y^{\prime}A_1 Y + Y^{\prime} A_2 Y + Y^{\prime} A_3 Y, \quad A_1 + A_2 + A_3 = I_n \end{equation*}
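A quick numerical check of the partition SSTO = SSE + SSR (a minimal sketch on simulated data; the model includes an intercept, which the identity requires):

```python
# Verify SSTO = SSE + SSR for an OLS fit with an intercept.
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.standard_normal((n, 2))
X = np.column_stack([np.ones(n), x])                  # first column of ones
Y = 1.0 + x @ np.array([2.0, -1.0]) + rng.standard_normal(n)

Y_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y)
ssto = np.sum((Y - Y.mean()) ** 2)
sse = np.sum((Y - Y_hat) ** 2)
ssr = np.sum((Y_hat - Y.mean()) ** 2)
print(np.isclose(ssto, sse + ssr))                    # True
```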
☑ One-way ANOVA example
\begin{equation*} y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots m \\ \bar{Y}_i = \frac{1}{m} \sum_{j = 1}^m y_{ij}, \quad \bar{Y} = \frac{1}{km} \sum_i \sum_j y_{ij}, \\ Y = (y_{11}, y_{12}, \ldots, y_{1m}, \ldots, y_{k1}, \ldots, y_{km})^{\prime} \\ \sum_{i = 1}^k \sum_{j = 1}^m (y_{ij} - \bar{Y})^2 = \sum_{i = 1}^k \sum_{j = 1}^m (y_{ij} - \bar{Y}_i)^2 + \sum_{i = 1}^k \sum_{j = 1}^m (\bar{Y}_i - \bar{Y})^2 \\ \sum_{i = 1}^k \sum_{j = 1}^m (y_{ij} - \bar{Y})^2 = \sum_{i = 1}^k \sum_{j = 1}^m y_{ij}^2 - km \bar{Y}^2 \\ \downarrow \\ Y^{\prime}Y = Y^{\prime}\left(\frac{1}{km} J_{km} \right) Y + Y^{\prime}\left( I_{km} - \frac{1}{m} B\right) Y + Y^{\prime} \left( \frac{1}{m} B - \frac{1}{km} J_{km}\right) Y \end{equation*}

where \(B\) is a block diagonal matrix whose diagonal blocks are equal to \(J_m\), the \(m \times m\) matrix of ones.

DONE Cochran's theorem

Let \(Y \in \mathbb{R}^n \sim N(\mu, \sigma^2 I_n)\), and \(Y^{\prime}Y = \sum_{i = 1}^k Y^{\prime}A_i Y\), with \(\sum_{i = 1}^k A_i = I_n\), \(\text{rank}(A_i) = r_i, A_i \in \mathbb{R}^{n \times n}\) is symmetric, \(i = 1, 2, \ldots, k\).

(a) \(n = \sum_{i = 1}^k r_i\)

(b) \(\frac{1}{\sigma^2} Y^{\prime}A_i Y \sim \chi^2_{r_i} \left(\frac{1}{\sigma^2} \mu^{\prime} A_i \mu \right), \quad i = 1, 2, \ldots, k\)

(c) \(Y^{\prime}A_1 Y, \ldots, Y^{\prime}A_k Y\) are mutually independent

\begin{equation*} \text{(a)} \Leftrightarrow \text{(b)} \Leftrightarrow \text{(c)} \end{equation*}

2.8 DONE Model selection and prediction [10/10]

\begin{equation*} Y_i = X_i^{\prime} \beta + \varepsilon_i, i = 1, 2, \ldots, n \\ \beta = (\beta_1, \ldots, \beta_p)^{\prime} \end{equation*}

Some of the \(\beta_i\) 's are zero. That means some predictors are not related to the response \(Y\).

Identifying the \(\beta_i\) 's that are zero helps with interpretation and improving statistical performance (reducing noise).

DONE Expected prediction error

Write out the model in matrix form: \(Y = X\beta + \varepsilon\).

Let

\begin{equation*} \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, X = \begin{pmatrix} X_1 & X_2 \end{pmatrix}, \\ \beta_1 \in \mathbb{R}^{p_1}, \beta_2 \in \mathbb{R}^{p_2} \end{equation*}

Suppose there is a new data point \((a, Y_a)\) with covariate vector \(a \in \mathbb{R}^p\). We observe the new \(a\) and want to predict \(Y_a\).

Given an estimator \(\tilde{\beta}\),

\begin{equation*} \text{Prediction Error} = \mathbb{E}_{Y_a}(Y_a - a^{\prime}\tilde{\beta})^2 \\ \text{Expected Prediction Error} = \mathbb{E}(\text{PE}) = \mathbb{E}(Y_a - a^{\prime}\tilde{\beta})^2 \end{equation*}

Let \(a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}, a_1 \in \mathbb{R}^{p_1}, a_2 \in \mathbb{R}^{p_2}\).

\begin{align*} Y_a &= a^{\prime}\beta + \varepsilon_a \\ &= a_1^{\prime}\beta_1 + a_2^{\prime}\beta_2 + \varepsilon_a \end{align*}

DONE Prediction [2/2]

☑ Case 1: Using only \(p_1\) variables
\begin{align*} \mathbb{E}(\text{PE}) &= \mathbb{E}(Y_a - a_1^{\prime}(X_1^{\prime}X_1)^{-1}X_1^{\prime} Y)^2 \\ \vdots \\ &= \sigma^2 + \sigma^2 a_1^{\prime}(X_1^{\prime}X_1)^{-1}a_1 + [(a_1^{\prime}(X_1^{\prime}X_1)^{-1}X_1^{\prime} X_2 - a_2^{\prime}) \beta_2]^2 \end{align*}
☑ Case 2: Using all the variables
\begin{align*} \mathbb{E}(\text{PE}) &= \mathbb{E}(Y_a - a^{\prime}(X^{\prime}X)^{-1}X^{\prime} Y)^2 \\ &= \sigma^2 + \sigma^2 a^{\prime}(X^{\prime}X)^{-1}a \end{align*}

If we then partition the second term in case 2 in terms of \(a_1\) and \(a_2\), we find that the variance term in case 2 is at least as large as in case 1. So when \(\beta_2 = 0\) (the smaller model is correct), case 1 gives smaller prediction error.

DONE Bias-variance tradeoff

Given an estimator \(\tilde{\beta}\),

\begin{align*} \mathbb{E}(\text{PE}) &= \mathbb{E}(Y_a - a^{\prime}\tilde{\beta})^2 \\ &= \sigma^2 + (a^{\prime}\mathbb{E}(\tilde{\beta}) - a^{\prime}\beta)^2 + \text{Var}(a^{\prime}\tilde{\beta}) \\ &= \text{Irreducible error} + \text{Bias}^2 + \text{Variance} \end{align*}

DONE Goodness of Fit [2/2]

☑ Coefficient of Determination
\begin{equation*} R^2 = 1 - \frac{\sum_{i = 1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i = 1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\text{SSE}}{\text{SSTO}} \end{equation*}

Among all variation in the data, how much can be explained by your model?

\(R^2 \rightarrow\) the proportion of variation in \(Y\) that's explained by the model.

\(R^2\) is not a great measure for model selection because it never decreases when predictors are added, so you will always select the model with the most parameters.

☑ Adjusted \(R^2\)

Adjusts for model complexity

\begin{equation*} \bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}, \end{equation*}

where \(k\) is the number of predictors in the model, not counting the intercept (so with the intercept included there are \(k + 1\) coefficients).

  1. \(\bar{R}^2\) is not monotonic in \(k\).
  2. Adjusted \(R^2\) is typically smaller than \(R^2\).
  3. \(\bar{R}^2 = 1 - \frac{\sum_{i = 1}^n (Y_i - \hat{Y}_i)^2 / (n - k - 1)}{\sum_{i = 1}^n (Y_i - \bar{Y})^2 / (n - 1)}\), where the numerator is the MSE and the denominator is the sample variance of \(Y\). So when we compare models with adjusted \(R^2\), we are basically comparing them by MSE.
  4. \(\bar{R}^2\) can be motivated from the hypothesis testing problem. Basically testing if a subset of the \(\beta\) 's are zero.
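A small sketch of these two quantities (hypothetical helper functions; they assume \(\hat{Y}\) comes from a model with an intercept and \(k\) counts the predictors excluding the intercept):

```python
# R^2 and adjusted R^2 from observed Y and fitted values Y_hat.
import numpy as np

def r_squared(Y, Y_hat):
    sse = np.sum((Y - Y_hat) ** 2)
    ssto = np.sum((Y - np.mean(Y)) ** 2)
    return 1 - sse / ssto

def adj_r_squared(Y, Y_hat, k):
    n = len(Y)
    return 1 - (1 - r_squared(Y, Y_hat)) * (n - 1) / (n - k - 1)
```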

DONE Predicition error [2/2]

☑ Expected in-sample prediction error

\((Y, X) \Rightarrow Y = X\beta + \varepsilon, \mathbb{E}(\varepsilon) = 0, \text{Cov}(\varepsilon) = \sigma^2 I\).

Suppose we observe new data on the same \(X\) 's.

\((\tilde{Y}, \tilde{X}) \Rightarrow \tilde{Y} = X\beta + \tilde{\varepsilon}\).

\begin{equation*} \text{Expected in-sample prediction error} = \frac{1}{n} \mathbb{E}||\tilde{Y} - \hat{f}(Y)||_2^2 \end{equation*}

where \(\hat{f}(Y)\) is the predicted value based on data \((Y, X)\).

If our method is OLS using a submatrix of \(X\) denoted \(X_s\), then

\begin{align*} \text{EISPE} &= \frac{1}{n} \mathbb{E}||\tilde{Y} - X_s (X_s^{\prime} X_s)^{-1}X_s^{\prime} Y||_2^2 \\ \vdots \\ &= \sigma^2 + \frac{1}{n} \beta^{\prime}X^{\prime}(I - X_s(X_s^{\prime}X_s)^{-1}X_s^{\prime})X\beta + \frac{k+1}{n} \sigma^2. \end{align*}

which we can estimate.

A general formula for in-sample prediction error:

Consider \(Y_i = f(X_i) + \varepsilon_i, i = 1, \ldots, n, \mathbb{E}(\varepsilon) = 0, \text{Cov}(\varepsilon) = \sigma^2 I\). Let \(\hat{Y}_i\) be the prediction for the response corresponding to \(X_i, i = 1, \ldots, n\). New data is \(\tilde{Y}_i = f(X_i) + \tilde{\varepsilon}_i, i = 1, \ldots, n\).

\begin{align*} \text{Expected in-sample prediction error} &= \frac{1}{n}\mathbb{E}\left(\sum_{i = 1}^n (\hat{Y}_i - Y_i)^2\right) + \frac{2}{n} \sum_{i = 1}^n \text{Cov}(Y_i, \hat{Y}_i) \\ &= \text{expected training error} + \text{complexity of the model}. \end{align*}
☑ Mallows' Cp
\begin{equation*} Cp = \frac{\text{SSE}_0}{\text{MSE}} + 2(k+1) - n, \\ \text{SSE}_0 = ||Y - X\hat{\beta}_s||_2^2, \text{MSE} = \frac{1}{n - p - 1}||Y - X\hat{\beta}||_2^2 \end{equation*}

where \(\hat{\beta}_s\) is OLS based on a subset \(s\) of predictors.

\(k\) is a measure of model complexity.

Remarks:

  1. If the bias is small (close to zero), \(\mathbb{E}(Cp) \approx k + 1\). Otherwise it is much larger than \(k + 1\) if the bias is large.
  2. A good strategy is to identify models whose \(Cp\) is close to \(k + 1\), and to pick the one having the smallest \(Cp\).
  3. Approximately the smallest \(Cp - (k+1)\) corresponds to the largest \(\bar{R}^2\).

DONE Degrees of freedom of given prediction method

This is a measure of complexity of the model.

\begin{equation*} df(\hat{Y}) = \frac{\sum_{i = 1}^n \text{Cov}(Y_i, \hat{Y}_i)}{\sigma^2} \end{equation*}

DONE Cross validation

See The Elements of Statistical Learning, Ch. 7.10, for cross validation. Chapter 7 is relevant for model selection.

Training data is \((x_i, y_i), i = 1, \ldots, n\).

If we have a new dataset available: \((\tilde{x}_i, \tilde{y}_i), i = 1, \ldots, m\), we can estimate the prediction error by \(\frac{1}{m} \sum_{i = 1}^m (\tilde{y}_i - \hat{f}(\tilde{x}_i))^2\), where \(\hat{f}(\cdot)\) is any prediction method using the training set.

Usually a new dataset is not available, so we split the training set into two parts. We don't want either part to be too small.

A better solution is cross validation, where we randomly split the training set into \(k\) roughly equal parts, fit the model using \(k-1\) of the parts, and calculate the prediction error on the left out part. We then do this for all \(k\) parts. This is called \(k\) -fold cross validation.

Let \(g(i), 1 \leq i \leq n\), index which part the \(i^{\text{th}}\) data point is assigned to, and let \(\hat{f}^{-j}(\cdot)\) denote the method computed with the \(j^{\text{th}}\) part removed.

Then the cross-validation estimate of (Expected) prediction error is

\begin{equation*} CV(\hat{f}) = \frac{1}{n} \sum_{i = 1}^n (y_i - \hat{f}^{-g(i)}(x_i))^2 \end{equation*}

Large \(k\): low bias, high variance

Small \(k\): high bias, low variance

For \(k = n\), it is called leave-one-out cross validation.

For model selection, we choose the model with \(CV(\hat{f}(\cdot))\) minimized.
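A minimal sketch of \(k\)-fold cross validation for OLS (the fold assignment \(g(i)\) is the `folds` array; it assumes each training fold's design matrix has full column rank):

```python
# k-fold cross-validation estimate of prediction error for OLS.
import numpy as np

def kfold_cv_ols(X, Y, k=5, seed=0):
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % k   # g(i): fold labels
    errs = np.empty(n)
    for j in range(k):
        test = folds == j
        Xtr, Ytr = X[~test], Y[~test]
        beta_j = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ Ytr)    # fit without fold j
        errs[test] = (Y[test] - X[test] @ beta_j) ** 2
    return errs.mean()                                        # CV(f_hat)
```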

DONE Information Criterion [6/6]

Suppose we have data \(Y_1, \ldots, Y_n\) independently drawn from some density \(f(Y)\). A model \(\mathcal{M}\) is a set of densities: \(\mathcal{M} = \{ p(Y;\theta) : \theta \in \Theta\}\). \(f(Y)\) may or may not belong to \(\mathcal{M}\).

☑ Akaike Information Criterion

For a given model \(\mathcal{M}_j\), the corresponding AIC is

\begin{equation*} \text{AIC}(\mathcal{M}_j) = 2 \sum_{i = 1}^n \log p(y_i ; \hat{\theta}_j) - 2 \text{dim}(\mathcal{M}_j) \end{equation*}

We then pick the model with the greatest AIC.

☑ Kullback - Leibler divergence

How to measure the discrepancy between \(p(Y;\theta)\) and \(f(Y)\)?

\begin{align*} KL(f(Y), p(Y;\theta)) &= \int f(Y) \log{\frac{f(Y)}{p(Y;\theta)}} dY \\ &= \mathbb{E}_f\left[ \log{\frac{f(Y)}{p(Y;\theta)}}\right] \end{align*}

\(KL\) is non-negative, and \(KL = 0\) if and only if the two distributions are equal (almost everywhere).

\(KL\) is generally asymmetric.

☑ Takeuchi's Information Criterion

What if we don't assume \(f(y) \in \mathcal{M}_j\)? (Model may not be correctly specified).

\begin{equation*} \text{TIC}(\mathcal{M}_j) = 2 \sum_{i = 1}^n \log p(y_i ; \hat{\theta}_j) - 2 tr(\hat{J}^{-1}\hat{V}), \\ \hat{J} = -\frac{1}{n}\sum_{i = 1}^n H(Y_i ; \hat{\theta}_j), \hat{V} = \frac{1}{n} \sum_{i = 1}^n G(Y_i ; \hat{\theta}_j) G^{\prime}(Y_i ;\hat{\theta}), \\ H(Y_i, \theta) = \frac{\partial^2 \log p(Y_i;\theta)}{\partial \theta^2}, G(Y_i, \theta) = \frac{\partial \log p(Y_i;\theta)}{\partial \theta} \end{equation*}
☑ AIC For Linear Regression Model

Suppose \(\mathcal{M}_j = \{ Y \sim N(X\beta, \sigma^2 I_n), \beta \in \mathbb{R}^{k + 1} \}\).

If \(\sigma^2\) is unknown, then (up to constants)

\begin{equation*} \text{AIC}(\mathcal{M}_j) = -n \log \frac{\text{SSE}}{n} - 2k. \end{equation*}

If \(\sigma^2\) is assumed known, then

\begin{equation*} \text{AIC}(\mathcal{M}_j) \approx -\frac{\text{SSE}}{\text{MSE}} - 2k. \end{equation*}
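A sketch of the unknown-\(\sigma^2\) version above, using the notes' convention that larger AIC is better and dropping additive constants (it assumes the candidate design matrix `X_sub` contains an intercept column, and \(k\) counts the remaining predictors):

```python
# AIC (up to constants) for a candidate linear regression model.
import numpy as np

def aic_linear(Y, X_sub):
    n, k = X_sub.shape[0], X_sub.shape[1] - 1      # k excludes the intercept column
    beta_hat = np.linalg.solve(X_sub.T @ X_sub, X_sub.T @ Y)
    sse = np.sum((Y - X_sub @ beta_hat) ** 2)
    return -n * np.log(sse / n) - 2 * k            # pick the model with the largest value
```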
☑ Bayesian Information Criterion

Bayesian perspective: specify prior probability and distributions for \(\mathcal{M}_j\) and \(\theta_j, j = 1, 2, \ldots, k\). We then want the model that maximizes the posterior probability. We get

\begin{equation*} \text{BIC}(\mathcal{M}_j) = 2 \sum_{i = 1}^n \log p(y_i ; \hat{\theta}_j) - \text{dim}(\theta_j) \log n \end{equation*}

Want to pick model with largest BIC or AIC.

☑ Laplace's Approximation

We used this to approximate the posterior probability for BIC. It approximates integrals of a certain form by expanding the integrand around its mode (a Gaussian-type approximation).

\begin{equation*} (\theta \in \mathbb{R}^p) \\ \int_\Theta e^{n h(\theta)} g(\theta) d\theta \approx \left( \frac{2\pi}{n} \right)^{\frac{p}{2}} e^{n h(\theta_0)} g(\theta_0) |J(\theta_0)|^{\frac{-1}{2}}, \\ \theta_0 = \arg \max_\theta h(\theta), \quad -J(\theta_0) = \frac{\partial^2 h}{\partial \theta^2} |_{\theta = \theta_0} \end{equation*}

DONE Comparison of model selection criteria for linear regression model

Consider linear regression model: \(Y = X\beta + \varepsilon, Y \in \mathbb{R}^n, X \in \mathbb{R}^{n \times p}, \varepsilon \sim N(0, \sigma^2 I_n)\), and assume some of the \(\beta_j\) are zero.

Let the true model be \(\mathcal{M}_0 = \{ j : \beta_j \neq 0 \}\) and \(J = \{ \mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_k \}\).

Partition \(J\) into two groups:

\begin{equation*} J_1 = \{ \mathcal{M} \in J : \mathcal{M}_0 \nsubseteq \mathcal{M} \} \\ J_2 = \{ \mathcal{M} \in J : \mathcal{M}_0 \subseteq \mathcal{M} \} \\ \end{equation*}

Let \(\hat{\mathcal{M}}\) be the model selected by optimizing a model selection criterion (e.g. AIC, BIC, etc.) and \(R_n = \mathbb{E}||X\beta - X_{\hat{\mathcal{M}}} \hat{\beta}_{\hat{\mathcal{M}}}||_2^2\), where \(X_{\hat{\mathcal{M}}} \hat{\beta}_{\hat{\mathcal{M}}}\) is the OLS prediction from the selected model. Ideally, if we select \(\mathcal{M}_0\), \(R_n = \sigma^2 |\mathcal{M}_0 |\), where \(|\mathcal{M}_0 |\) is the dimension of model \(\mathcal{M}_0\).

Under mild regularity conditions, for Cp, AIC, LOOCV:

  1. For given \(\mathcal{M} \in J_1\), and any \(h>0\), \(\mathbb{P}(\hat{\mathcal{M}} = \mathcal{M}) = o(n^{-h})\) as \(n \rightarrow \infty\).
  2. \(\lim_{n \rightarrow \infty} \sum_{\mathcal{M} \in J_2, \mathcal{M} \neq \mathcal{M}_0} \mathbb{P}(\hat{\mathcal{M}} = \mathcal{M}) > 0\) if \(\mathcal{M}_0\) is not the full model.
  3. \(\lim_{n \rightarrow \infty} R_n > \sigma^2 |\mathcal{M}_0 |\) if \(\mathcal{M}_0\) is not the full model.

For BIC:

  1. For all \(\mathcal{M} \in J_1\) and \(h > 0\), \(\mathbb{P}(\hat{\mathcal{M}} = \mathcal{M}) = o(n^{-h})\) as \(n \rightarrow \infty\).
  2. \(\lim_{n \rightarrow \infty} \sum_{\mathcal{M} \in J_2, \mathcal{M} \neq \mathcal{M}_0} \mathbb{P}(\hat{\mathcal{M}} = \mathcal{M}) = 0\)
  3. \(\lim_{n \rightarrow \infty} R_n = \sigma^2 |\mathcal{M}_0 |\)

DONE Subset selection

Consider linear regression model with \(p\) many predictors.

Best subset selection is when we try all \(2^p\) subsets and pick the one optimizing a model selection criterion. This is computationally expensive.

Forward selection is when we start with the null model and repeatedly add the single variable that most improves the model selection criterion, stopping when no addition improves it. It is greedy, so it can miss the best subset, but it requires far fewer fits than best subset selection.

Backward elimination starts with the full model and selects single variables to remove to optimize a criterion.

Stepwise selection basically combines forward and backward selection.

2.9 DONE Shrinkage methods [3/3]

DONE James-Stein estimator

Consider \(Y \in \mathbb{R}^n \sim N(\mu, I_n)\). How can we estimate \(\mu \in \mathbb{R}^n\)?

\(Y\) is unbiased and UMVUE for \(\mu\), but let's consider some biased estimator.

Specifically, consider \(\tilde{\mu} = cY \text{ for } 0 < c < 1\), which is biased.

The James-Stein Estimator is

\begin{equation*} \hat{\mu}_{JS} = \left( 1 - \frac{n-2}{||Y||_2^2} \right) Y \end{equation*}

When \(n \geq 3\), \(\mathbb{E}||\hat{\mu}_{JS} - \mu||_2^2 < \mathbb{E}||\hat{\mu}_{MLE} - \mu||_2^2\) for every choice of \(\mu \in \mathbb{R}^n\).

For example, in linear regression with \(\sigma^2 = 1, X^{\prime}X = I_p \Rightarrow \hat{\beta} \sim N(\beta, I_p)\), and we have

\begin{equation*} \hat{\beta}_{JS} = \left( 1 - \frac{p-2}{||\hat{\beta}||_2^2} \right) \hat{\beta} \end{equation*}
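A small Monte Carlo sketch comparing the risk of the James-Stein estimator with that of the MLE \(Y\) (the choices of \(n\), \(\mu\), and the number of replications are arbitrary):

```python
# Monte Carlo risk comparison: James-Stein vs. MLE under Y ~ N(mu, I_n).
import numpy as np

rng = np.random.default_rng(7)
n, reps = 10, 5000
mu = np.full(n, 0.5)

Y = mu + rng.standard_normal((reps, n))
shrink = 1 - (n - 2) / np.sum(Y ** 2, axis=1, keepdims=True)
mu_js = shrink * Y

risk_mle = np.mean(np.sum((Y - mu) ** 2, axis=1))       # approximately n
risk_js = np.mean(np.sum((mu_js - mu) ** 2, axis=1))    # smaller for n >= 3
print(risk_mle, risk_js)
```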

DONE Ridge Regression

Consider the regression model: \(Y_i = \beta_0 + X_i^{\prime}\beta_1 + \varepsilon_i, \quad i = 1, \ldots, n, \quad \beta_0 \in \mathbb{R}, \beta_1 \in \mathbb{R}^p\).

\begin{equation*} \hat{\beta}^\text{ridge} = \arg \min_{\beta_0, \beta_1} \sum_{i = 1}^n (Y_i - \beta_0 - X_i^{\prime}\beta_1)^2 + \lambda ||\beta_1||_2^2 \end{equation*}

Here, \(\lambda ||\beta_1||_2^2\) is the penalty term with \(\lambda\) as the tuning parameter. Larger \(\lambda\) means more shrinkage, while \(\lambda = 0\) gives OLS. As \(\lambda \rightarrow \infty\), \(\hat{\beta}_1 \rightarrow 0\).

An equivalent form for \(\hat{\beta}^\text{ridge}\) is

\begin{equation*} \hat{\beta}^\text{ridge} = \arg \min_{\beta_0, \beta_1} \sum_{i = 1}^n (Y_i - \beta_0 - X_i^{\prime}\beta_1)^2 \text{ subject to } ||\beta_1 ||_2^2 \leq t \end{equation*}

where without the constraint, we get OLS, and smaller \(t\) gives larger shrinkage.

We usually standardize the predictors before fitting ridge regression, because the ridge solution is not invariant to the scaling of the predictors.

\begin{equation*} \tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{\sqrt{\sum_{i = 1}^n (X_{ij} - \bar{X}_j)^2}}. \end{equation*}

After data standardization, \(\hat{\beta}_0^\text{ridge} = \hat{\beta}_0^\text{OLS}\).

Shrinkage can possibly improve estimation/prediction.

See Chapter 3 in Elements of Statistical Learning.
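A minimal sketch of the ridge solution with standardized predictors and an unpenalized intercept (`lam` is the tuning parameter \(\lambda\); the returned slope coefficients are on the standardized scale):

```python
# Closed-form ridge regression after standardizing the predictors.
import numpy as np

def ridge(X, Y, lam):
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt(np.sum(Xc ** 2, axis=0))         # standardization as above
    beta0 = Y.mean()                                   # intercept is not penalized
    p = Xs.shape[1]
    beta1 = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ (Y - beta0))
    return beta0, beta1                                # lam = 0 recovers OLS on Xs
```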

DONE The Lasso

The lasso is the same as ridge regression except we replace \(||\beta_1||_2^2\) with the \(L_1\) norm \(||\beta_1||_1 = \sum_j |\beta_{1j}|\), which makes the solution nonlinear in \(Y\). There is no closed-form solution in general.

3 STT 868 [100%]

Often, data have a clustered structure. Classical statistics assumes that observations are independent and identically distributed (iid). Applied to clustered data, this assumption may lead to false results. Mixed effects models address this issue.

3.1 DONE Model Identifiability [10/10]

DONE One-way ANOVA model

One-way ANOVA is an example of a fixed effects model.

\begin{equation*} y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \\ i = 1 \ldots k, j = 1, \dots n_i, \\ \mu_i = \mu + \alpha_i \end{equation*}

\(\mu_i\) are fixed effects. They are estimable.

DONE Variance components models

This is a linear mixed effects (LME) model.

Same model as one-way ANOVA, but \(\mu_i \overset{\text{iid}}{\sim} N(\mu, \sigma_a^2) \Leftrightarrow \alpha_i \overset{\text{iid}}{\sim} N(0, \sigma_a^2)\), where the random-effect variance \(\sigma_a^2\) is distinct from the error variance \(\sigma^2\).

The \(\mu_i\) are random effects

DONE Properties of random effects

  1. induce dependency structure
  2. bring fewer parameters to the model
  3. different estimation and inference

Deciding between fixed and random effects

To decide whether a set of effects is fixed or random, the context of the data, the manner in which they were gathered, and the environment from which they came are the determining factors. The important question is: are the levels of the factor going to be considered a random sample from a population of values which have a distribution? If yes, then the effects should be considered random effects.

Other questions that can inform our decision are:

Do the levels of a factor come from a probability distribution? If yes, random effects.

Is there enough information about a factor to decide that the levels of it in the data are like a random sample? If yes, random effects.

DONE Random intercept model

This may occur in the setting of a longitudinal study of \(N\) patients, where each patient has their own covariates, such as age, gender, weight, diet, and physical performance. A key assumption is that the relationship between the covariates and the response does not vary from patient to patient (no random coefficients). However, when the study begins, each patient may have a different baseline response value. A random intercept model allows us to handle these different baselines.

\begin{equation*} Y_{ij} = \mu_i + X_{ij} \beta + \varepsilon_{ij} \\ i = 1, \ldots N, j = 1, \ldots n_i \\ \mu_i \overset{\text{iid}}{\sim} N(\mu, \sigma_a^2) \end{equation*}

DONE Linear mixed model

\begin{equation*} Y_i = X_i \beta + Z_ib_i + \varepsilon_i \\ i = 1, \ldots N \end{equation*}
\(Y_i \in \mathcal{R}^{n_i}\) the response of the \(i^\text{th}\) group/cluster/subject
\(X_i \in \mathcal{R}^{n_i \times m}\) the design matrix for the \(i^\text{th}\) group/cluster/subject
\(\beta \in \mathcal{R}^m\) fixed effects coefficients
\(Z_i \in \mathcal{R}^{n_i \times k}\) design matrix of random effects for the \(i^\text{th}\) group/cluster/subject
\(b_i \in \mathcal{R}^k\) random effects (random)
\(\varepsilon_i \in \mathcal{R}^{n_i}\) error term (random)
☑ Basic assumptions
  1. Random effects and errors are mutually independent.
  2. \(E(\varepsilon_i) = 0\) and \(\text{Cov}(\varepsilon_i) = \sigma^2 I_{n_i}\)
  3. \(E(b_i) = 0\) and \(\text{Cov}(b_i) = \sigma^2 D\)

We know \(\{Y_i, X_i, Z_i, i = 1, \ldots, N\}\)

Unknown \(\{\sigma^2, \beta, b_i, D \}\)

☑ Stacked Form

\(N\) equations given above can be compressed into one as

\begin{equation*} Y = X\beta + Zb + \varepsilon \end{equation*}

after stacking the data in a single vector and matrix form as follows:

\begin{equation*} Y = \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{bmatrix}, X = \begin{bmatrix} X_1\\ X_2\\ \vdots\\ X_N \end{bmatrix}, Z = \begin{bmatrix} Z_1 \quad 0 \quad 0\\ 0 \quad \ddots \quad 0\\ 0 \quad 0 \quad Z_N \end{bmatrix}, b = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_N \end{bmatrix}, \varepsilon = \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{bmatrix} \end{equation*}

with \(\text{Cov}(b) = \sigma^2 (I \otimes D)\) and

\(\text{Cov}(Y) = \text{diag}(\text{Cov}(Y_i))\)

\(\text{Cov}(Y_i) = \text{Cov}(X_i \beta + Z_i b_i + \varepsilon_i) = \text{Cov}(Z_i b_i) + \text{Cov}(\varepsilon_i) = \sigma^2 Z_i D Z_i^{\prime} + \sigma^2 I_{n_i} = \sigma^2 (Z_i D Z_i^{\prime} + I_{n_i})\)

We can think of \(Zb + \varepsilon\) as \(\tilde{\varepsilon}\)

DONE Identifiability definition

A statistical model for data \(Y\) is defined by a family of distributions of \(Y\): \(\{P_\theta : \theta \in \Theta \}\)

\(\theta\) is a parameter and \(\Theta\) is the parameter space.

A model is identifiable on \(\Theta\) if \(P_{\theta_1} = P_{\theta_2} \Rightarrow \theta_1 = \theta_2\) for \(\theta_1, \theta_2 \in \Theta\).

We must have one-to-one mapping from parameter space to model space.

Identifiability is important; otherwise different parameter values could produce the same distribution for the data, and no amount of data could distinguish between them.

It may be the case that only some parameter values are identifiable.

DONE Identifiability theorem for LMM

LMM is identifiable if

  1. \(X = (X_1, X_2, \ldots, X_N)^\prime\) has full column rank
  2. At least one \(Z_i\) has full column rank
  3. \(\sum_{i = 1}^N (n_i - k) > 0\) (\(Z\) must be a tall matrix)

The statement may not hold in the opposite direction.

To prove this, we assumed two models with different parameter values and showed that if they have the same distribution, then the parameter values are the same. We can do this by equating expectations and blocks of covariance matrices.

DONE Kronecker Product

Just remember that \(A \otimes B\) means each element of \(A\) gets multiplied by matrix \(B\).

\((A \otimes B)^{\prime} = A^{\prime} \otimes B^{\prime}\).

DONE Vectorization

Just stack up columns into a vector, starting with first column.

\begin{equation*} \text{Vec}(ABC) = (C^\prime \otimes A) \text{Vec}(B) \end{equation*}

Any \(k^2\) -dimensional vector can be written as a vectorization of a \(k \times k\) matrix.

\(\text{Vec}^{\prime}(A) \text{Vec}(B) = tr(AB^{\prime}) = tr(A^{\prime}B)\).

\(\text{Vec}^{\prime}(A) \text{Vec}(A) = ||A||_F^2\).

\(||A||_F^2\) is sum of squares of all elements in \(A\).

\(\text{Vec}(ACB) = (B^{\prime} \otimes A) \text{Vec}(C)\).

\(\sum_{i = 1}^n \text{Vec}(A_i) = \text{Vec}(\sum_{i = 1}^n A_i)\)
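A quick numerical check of \(\text{Vec}(ACB) = (B^{\prime} \otimes A)\text{Vec}(C)\) (note that NumPy flattens row-major by default, so `order='F'` is needed for the column-stacking Vec used here):

```python
# Verify the vectorization/Kronecker identity numerically.
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

lhs = (A @ C @ B).flatten(order='F')            # Vec(ACB), column stacking
rhs = np.kron(B.T, A) @ C.flatten(order='F')    # (B' kron A) Vec(C)
print(np.allclose(lhs, rhs))                    # True
```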

DONE Schur Complement Condition

\begin{equation*} \begin{pmatrix} A & B\\ B^\prime & C \end{pmatrix} \succ 0 \Leftrightarrow A \succ 0 \text{ and } C - B^\prime A^{-1} B \succ 0 \end{equation*}

3.2 DONE MLE, ANOVA estimation, MINQUE, RMLE [7/7]

DONE Log-likelihood of LMM

Under normality assumptions,

\begin{equation*} Y_i \sim N(X_i \beta, \sigma^2 (I + Z_i D Z_i^\prime)) \\ i = 1, \ldots, N \end{equation*} \begin{equation*} \mathbb{P}(Y_i) = (2 \pi)^{-n_i / 2} | \sigma^2 (Z_i D Z_i^{\prime} + I_{n_i}) |^\frac{-1}{2} e^ {\frac{-1}{2} (Y_i - X_i \beta)^{\prime} (\sigma^2 (Z_i D Z_i^{\prime} + I_{n_i}))^{-1} (Y_i - X_i \beta)} \end{equation*}

The log-likelihood function is given by

\begin{equation*} l(\theta) = \sum_{i = 1}^{N} \left[ \frac{-n_i}{2} \log \sigma^2 - \frac{1}{2} \log |Z_i D Z_i^\prime + I_{n_i} | - \frac{1}{2\sigma^2} (Y_i - X_i \beta)^\prime (Z_i D Z_i^\prime + I_{n_i})^{-1} (Y_i - X_i \beta) \right] \end{equation*}

The MLE is the solution to \(\text{argmax} l(\theta)\) over \(\theta \in \Theta\).

DONE MLE \(\hat{\beta}\) [2/2]

\begin{equation*} \hat{\beta} = \left[ \sum_{i = 1}^{N} X_i^\prime (Z_i \hat{D} Z_i^\prime + I_{n_i})^{-1} X_i \right]^{-1} \left[ \sum_{i = 1}^{N} X_i^\prime (Z_i \hat{D} Z_i^\prime + I_{n_i})^{-1} Y_i \right] \end{equation*}
☑ when \(\hat{D} = 0\)
\begin{equation*} \hat{\beta} = \text{OLS} \end{equation*}
☑ when \(\hat{D} = d I\) and \(d \to \infty\)

Do the SVD of \(Z_i\): \(Z_i = U_i \Sigma_i V_i^\prime\)

\begin{equation*} \underset{\beta}{\mathrm{argmin}} \sum_{i = 1}^{N} (Y_i - X_i \beta)^\prime (I_{n_i} - U_i U_i^\prime) (Y_i - X_i \beta) = \underset{\beta}{\mathrm{argmin}} \; \underset{b_1, \ldots, b_N}{\mathrm{min}} \sum_{i = 1}^{N} ||Y_i - X_i \beta - Z_i b_i||_2^2 \end{equation*}

Here we are basically treating the \(b_i\) as fixed effects. Then we get a linear model, and we claim that \(\hat{\beta}\) in this scenario is equivalent to the LMM \(\hat{\beta}\) as \(d \rightarrow \infty\).
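A sketch of the \(\hat{\beta}\) formula above for a given value of \(\hat{D}\) (hypothetical helper; `Xs`, `Zs`, `Ys` are lists of the per-group design matrices, random-effect design matrices, and responses):

```python
# GLS-type estimate of beta in the LMM for a given D_hat (k x k).
import numpy as np

def beta_hat_lmm(Xs, Zs, Ys, D_hat):
    lhs = 0.0
    rhs = 0.0
    for X_i, Z_i, Y_i in zip(Xs, Zs, Ys):
        V_i = Z_i @ D_hat @ Z_i.T + np.eye(len(Y_i))   # Z_i D Z_i' + I_{n_i}
        lhs = lhs + X_i.T @ np.linalg.solve(V_i, X_i)
        rhs = rhs + X_i.T @ np.linalg.solve(V_i, Y_i)
    return np.linalg.solve(lhs, rhs)
```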

DONE MLE \(\hat{\sigma}^2\)

Use profile log-likelihood

DONE Derivatives of log-likelihood function

We can have derivatives for \(\beta\), \(\sigma^2\) or \(D\). These are called the maximum likelihood equations.

\begin{equation*} \frac{\partial l}{\partial \beta} = \sigma^{-2} \sum_{i= 1}^{N} [X_i^{\prime}(Z_i D Z_i^{\prime} + I)^{-1} (Y_i - X_i \beta)] \end{equation*} \begin{equation*} \frac{\partial l}{\partial \sigma^2} = \frac{-1}{2} (\sum_{i = 1}^{N} n_i) \sigma^{-2} + \frac{1}{2} \sigma^{-4} \sum_{i= 1}^{N} [(Y_i - X_i \beta)^{\prime} (Z_i D Z_i^{\prime} + I)^{-1} (Y_i - X_i \beta)] \end{equation*} \begin{equation*} \frac{\partial l}{\partial D} = \frac{-1}{2} \sum_{i= 1}^{N} [Z_i^{\prime}(Z_i D Z_i^{\prime} + I)^{-1} Z_i - \sigma^{-2} Z_i^{\prime}(Z_i D Z_i^{\prime} + I)^{-1} (Y_i - X_i \beta) (Y_i - X_i \beta)^{\prime} (Z_i D Z_i^{\prime} + I)^{-1} Z_i] \end{equation*}

Can use helpful identities from the Matrix Cookbook

DONE Balanced random coefficient model [3/3]

See section 2.3 in Demidenko 1987.

Balanced means each group has the same number of data points.

There is fixed part and random part.

BRCM is LMM with \(Z_i = X_i = Z\), \(i = 1, 2, \ldots, N\), \(n_1 = n_2 = \ldots = n_N = n\).

The model can be written as

\begin{equation*} Y_i = Z\beta_i + \varepsilon_i, \quad i = 1, \ldots, N \\ \beta_i = \beta + b_i \\ \text{where } \beta = \mathbb{E}(\beta_i), \mathbb{E}(b_i) = 0 \end{equation*}

BRCM is attractive because it allows us to obtain maximum likelihood estimates in closed forms.

An example is a longitudinal study where we measure babies at different time points. If we measure all the babies at the same times, then it is balanced.

☑ MLE for \(\beta\)
\begin{equation*} \bar{Y} = \frac{1}{N} \sum_{i = 1}^{N} Y_i, \\ \hat{\beta}_{ML} = (Z^{\prime}Z)^{-1} Z^{\prime} \bar{Y} = \hat{\beta}_{OLS} \end{equation*}
☑ MLE for \(\sigma^2\)

Solving the maximum likelihood equation, we get something like SSR.

☑ MLE for \(D\)

Solving the maximum likelihood equation, we get a solution.

The solutions for \(\sigma^2\) and \(D\) obtained from the likelihood equations may not be the actual MLEs, since the parameter space imposes constraints (\(D \succeq 0\) and \(\sigma^2 > 0\)).

If the global minimizer/maximizer occurs at the boundary of the constraint set, then the gradient may not be zero at that point.

DONE Existence of MLE

The failure of an algorithm does not mean that the MLE does not exist.

The MLE in the LME model exists with probability \(1\) if the total number of observations is sufficiently large, specifically if \(\sum_i (n_i - k) - m > 0\).

If the MLE exists, the estimate of \(\sigma^2\) must be strictly positive.

The MLE may exist but make no sense for matrix \(D\). The existence of the MLE does not guarantee that matrix \(\hat{D}\) is positive definite.

\begin{equation*} \text{MLE exists } \Leftrightarrow S_\text{min} \triangleq \underset{\beta, b_1, \ldots, b_N}{\mathrm{min}} \sum_{i = 1}^{N} ||Y_i - X_i\beta - Z_i b_i ||_2^2 > 0 \end{equation*}

See also section 2.5 in Demidenko 1987.

Theorem: Suppose the model identifiability conditions hold.

The MLE over \(\Theta = \{ \beta \in \mathbb{R}^m, \sigma^2 > 0, D \succeq 0 \} \text{ exists } \Leftrightarrow S_\text{min} \triangleq \underset{\beta, b_1, \ldots, b_N}{\mathrm{min}} \sum_{i = 1}^{N} || Y_i - X_i \beta - Z_i b_i ||_2^2 > 0\)

DONE LMM with random intercept [3/3]

See section 2.4 Demidenko 1987.

\begin{equation*} y_{ij} = u_{ij}^\prime \gamma + \alpha + b_i + \varepsilon_{ij} \\ i = 1, \ldots, N \\ j = 1, \ldots, n_i \\ b_i \sim N(0, \sigma^2 d), \varepsilon_{ij} \sim N(0, \sigma^2) \end{equation*}

\(b_i\)'s and \(\varepsilon_{ij}\)'s are mutually independent.

\(\gamma\) and \(\alpha\) are regression parameters. \(u_{ij}\)'s are covariates.

☑ MLE for \(\beta\)

Ends up being OLS.

☑ MLE for \(\sigma^2\)

Get two cases.

☑ MLE for \(d\)

Get two cases.

3.3 DONE Computation and optimization [9/9]

MLE in LMM may not have closed form.

DONE Gradient methods [4/4]

Iterative descent: start with an initial guess and successively generate new iterates such that each step improves the objective value (for maximum likelihood we can equivalently minimize the negative log-likelihood).

☑ Steepest descent
\begin{equation*} x^{k+1} = x^{k} - \alpha^k \nabla f(x^k) \\ \alpha^k > 0 \text{ is the step size} \\ \nabla f(x^k) \text{ is the direction chosen} \end{equation*}

So we are basically moving in the direction of the steepest decrease based on the gradient.
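A minimal Python sketch of this update (my own illustration, not from the notes; `grad_f`, the fixed step size, and the quadratic example are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Steepest descent: x^{k+1} = x^k - alpha * grad f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stop near a stationary point
            break
        x = x - step * g
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0])
```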

☑ Newton's method

Gradient descent is guided by local information. It may be improved if we use MORE local information, such as the Hessian.

\begin{equation*} x^{k+1} = x^{k} - \alpha^k \left[\nabla^2 f(x^k)\right]^{-1} \nabla f(x^k) \\ \nabla^2 f(x^k) \text{ is the Hessian} \end{equation*}
☑ Pure Newton iteration

This is just Newton's method with \(\alpha^k = 1\).

☑ Armijo rule

This rule backtracks the step size until a sufficient-decrease condition holds, which guarantees that we make progress at each iteration once the step is small enough.
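A minimal sketch of Armijo backtracking along the steepest-descent direction (my own illustration; the constants `c` and `shrink` are conventional but arbitrary choices):

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, c=1e-4, shrink=0.5, max_backtracks=50):
    """Backtracking line search along -grad f(x) with the Armijo sufficient-decrease test."""
    g = grad_f(x)
    alpha = alpha0
    fx = f(x)
    for _ in range(max_backtracks):
        if f(x - alpha * g) <= fx - c * alpha * np.dot(g, g):
            break                 # sufficient decrease achieved
        alpha *= shrink           # otherwise shrink the step
    return x - alpha * g

# One Armijo step on f(x) = ||x||^2 / 2
x_new = armijo_step(lambda x: 0.5 * np.dot(x, x), lambda x: x, np.array([3.0, -2.0]))
```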

DONE Convergence issues

We may stop at a stationary point instead of the global minimizer.

The sequence may also not converge given a bad initialization.

DONE Rate of Convergence

| Method | Rate of convergence |
|---|---|
| Gradient descent | Linear |
| Newton's method | Quadratic |

There is a trade-off between convergence rate and computation time.

DONE Expectation-Maximization algorithm

E step: calculate the conditional expectation of the complete-data log-likelihood, given the data and the parameter values from the previous iteration (the latent variables are integrated out against their conditional distribution).

M step: maximize this expected complete-data log-likelihood with respect to the parameters.

In the LMM, the distribution of the latent variables (random effects) given the data and parameters is Gaussian.

See also section 2.12 of Demidenko 1987.

DONE Fisher Scoring Algorithm

We use this to maximize the likelihood \(P(Y; \theta)\) with respect to \(\theta\).

\begin{equation*} I(\theta^k) = E\left[-\nabla^2 \log P(Y;\theta^k) \right] \\ \theta^{k+1} = \theta^k + \alpha^k I^{-1}(\theta^k) \nabla \log P(Y ; \theta^k) \end{equation*}

\(I(\theta^k)\) is the Fisher information matrix.

DONE RMLE in LMM

RMLE (restricted maximum likelihood, often written REML) is the MLE of transformed data. We transform the data to get rid of nuisance parameters. RMLE can reduce the bias of MLE.

For example, MLE of \(\sigma^2\) is \(\text{SSE} / n\), which is biased. RMLE of \(\sigma^2\) is MSE, which is unbiased.

The data transformation is of the form \(\tilde{Y} = A^\prime Y\). RMLE will not depend on the choice of A.

DONE MINQUE in LMM

This is similar to idea of BLUE (best linear unbiased estimator).

Now we are interested in the minimum norm quadratic unbiased estimator. We use this because \(\hat{\sigma}^2 = \text{MSE}\) is not a linear estimator, but a quadratic estimator. It takes the form

\begin{equation*} Y^\prime AY \\ A \succeq 0 \end{equation*}

Which quadratic forms are unbiased for \(\sigma^2\)? In the LM setting \(Y = X\beta + \varepsilon\), we have three conditions to make sure \(Y^\prime AY\) is unbiased.

\begin{equation*} X^\prime A X = 0 \\ \text{tr}(A) = 1 \\ A \succeq 0 \end{equation*}

Then we find the MINQUE by solving \(\text{argmin} ||A||_\text{F}^2\) under these conditions.

In the Gaussian case, solving for the MINQUE actually amounts to minimizing the variance of the quadratic form.

In LMM case, we have different conditions for \(Y^\prime A Y\) to be unbiased.

\begin{equation*} X^\prime A X = 0 \\ \text{tr}(A) = 1 \\ Z^\prime A Z = 0 \\ A \succeq 0 \end{equation*}

In this case, MINQUE is like MSE under linear model with \(Zb\) treated as fixed effects.

Also see 1.5.2 in Jiang 2007.

DONE ANOVA Estimation

The basic idea comes from the method of moments. Use expectations of different sums of squares and solve for the parameters of interest. The hardest part is coming up with the relevant sums of squares.

Suppose there are \(q\) variance components involved in a linear mixed model. Let \(Q\) be a \(q\) -dimensional vector whose components are quadratic functions of the data. The ANOVA estimators of the variance components are obtained by solving the system of equations \(E(Q) = Q\). Then the problem is just choosing \(Q\). See 1.5.1 in Jiang 2007.

DONE A general method of moments

Construct sample moments, match to population moments, and solve the system of equations.

3.4 DONE Hypothesis testing and confidence intervals [6/6]

DONE F test

The testing problem is: \(H_0: D = 0 \quad H_1 : D \neq 0\).

We are basically testing the presence of random effects.

SSE under full model,

\begin{equation*} S_\text{min} = \min_{\beta, \{b_i\}} \sum_{i = 1}^N ||Y_i - X_i \beta - Z_i b_i ||_2^2 \end{equation*}

SSE under null,

\begin{equation*} S_\text{OLS} = \min_{\beta} \sum_{i = 1}^N ||Y_i - X_i \beta||_2^2 \end{equation*}

We then use quadratic form theory to show that both SSE's are chi-squared and independent, and then we can construct the \(F\) test statistic as

\begin{equation*} F = \frac{(S_{\text{OLS}} - S_\min) / \tilde{r}}{S_\min / r} \sim F_{\tilde{r}, r} \quad \text{under } H_0 \\ \tilde{r} = \text{rank}(W) - \text{rank}(X), \quad W = (X \vert Z) \\ r = \sum_{i = 1}^N n_i - \text{rank}(W) \end{equation*}

We reject \(H_0\) if \(F > F_{\tilde{r}, r} (1 - \alpha)\)
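A small numpy/scipy sketch of this test (my own illustration; the helper name and the use of least squares to obtain the two SSEs are my choices, not from the notes):

```python
import numpy as np
from scipy.stats import f as f_dist
from scipy.linalg import block_diag

def lmm_f_test(Y_list, X_list, Z_list, alpha=0.05):
    """F test of H0: D = 0 by comparing SSE under the null (X only) and full design W = (X | Z)."""
    Y = np.concatenate(Y_list)
    X = np.vstack(X_list)
    W = np.hstack([X, block_diag(*Z_list)])              # X stacked, Z block-diagonal
    S_ols = np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    S_min = np.sum((Y - W @ np.linalg.lstsq(W, Y, rcond=None)[0]) ** 2)
    r_tilde = np.linalg.matrix_rank(W) - np.linalg.matrix_rank(X)
    r = len(Y) - np.linalg.matrix_rank(W)
    F = ((S_ols - S_min) / r_tilde) / (S_min / r)
    return F, F > f_dist.ppf(1 - alpha, r_tilde, r)      # statistic, reject H0?
```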

In general in LMM, we don't have a nice \(F\) test for \(A\beta\).

DONE Likelihood ratio test

Let \(X_1, \ldots, X_n\) be a random sample from density \(P(X ; \theta)\) with parameter \(\theta\) and parameter space \(\Theta\). Consider the test

\begin{equation*} H_0 : \theta \in \Theta_0 \\ (\Theta_0 \subseteq \Theta \text{ is a subset of the parameter space}) \end{equation*}

The likelihood ratio statistic is then:

\begin{equation*} R = \frac{\sup_{\theta \in \Theta_0} \prod_{i = 1}^{n} P(X_i ; \theta)}{\sup_{\theta \in \Theta} \prod_{i = 1}^{n} P(X_i ; \theta)} \end{equation*}

We reject \(H_0\) when \(R\) is small, where \(0 \leq R \leq 1\).

We don't actually work directly with \(R\), instead we work with \(-2 \log R\), where

\begin{equation*} -2 \log R \rightarrow \chi^2_r \text{ as } n \rightarrow \infty \text{ under } H_0, \\ r = \text{dim}(\Theta) - \text{dim}(\Theta_0) \end{equation*}

Therefore, we can only asymptotically control the type I error, with rejection region

\begin{equation*} \{ -2 \log R > c\}, c = \chi^2_r(1 - \alpha) \end{equation*}

DONE C.I. for \(\sigma^2\) in LMM

\(S_\min = \min_{\beta, \{b_i\}} \sum_{i = 1}^N ||Y_i - X_i \beta - Z_i b_i ||_2^2\).

\begin{equation*} W = \begin{pmatrix} X_1 & Z_1 & 0 & 0 \\ \vdots & 0 & \ddots & 0 \\ X_N & 0 & 0 & Z_N \end{pmatrix} \end{equation*}

Our pivot is then

\begin{equation*} \frac{S_\min}{\sigma^2} \sim \chi^2_r, \quad r = \sum_{i = 1}^{N} n_i - \text{rank}(W) \end{equation*} \begin{equation*} \mathbb{P}\left(\chi_r^2 (\frac{\alpha}{2}) \leq \frac{S_\min}{\sigma^2} \leq \chi^2_r (1 - \frac{\alpha}{2})\right) = 1 - \alpha \end{equation*}

Solving for \(\sigma^2\) gives a \(1 - \alpha\) level confidence interval. We get

\begin{equation*} \left( \frac{S_\min}{\chi^2_r (1 - \frac{\alpha}{2})}, \frac{S_\min}{\chi^2_r (\frac{\alpha}{2})} \right) \end{equation*}
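A short scipy sketch of this interval (illustration only; it assumes \(S_\min\) and the degrees of freedom \(r\) were computed as above):

```python
from scipy.stats import chi2

def sigma2_ci(S_min, r, alpha=0.05):
    """Exact (1 - alpha) C.I. for sigma^2 from the chi-square pivot S_min / sigma^2."""
    lower = S_min / chi2.ppf(1 - alpha / 2, df=r)
    upper = S_min / chi2.ppf(alpha / 2, df=r)
    return lower, upper
```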

DONE Confidence region for \(D \in \mathbb{R}^{k \times k}\)

An exact C.I. for \(D\) may be obtained in some special cases. For example,

\begin{equation*} y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, \ldots, N, \quad j = 1, \ldots, n, \\ \alpha_i \sim N(0, \sigma^2_a) \\ \varepsilon_{ij} \sim N(0, \sigma^2) \end{equation*}

So here, \(D = \frac{\sigma^2_a}{\sigma^2}\)

We then know

\begin{equation*} \text{SSA} = \sum_{i = 1}^N \sum_{j = 1}^n (\bar{Y_i} - \bar{Y})^2, \quad \bar{Y_i} = \frac{1}{n} \sum_{j = 1}^n y_{ij} \\ \text{SSE} = \sum_{i = 1}^N \sum_{j = 1}^n (y_{ij} - \bar{Y_i})^2, \quad \bar{Y} = \frac{1}{Nn} \sum_{i = 1}^N \sum_{j = 1}^n y_{ij} = \frac{1}{N} \sum_{i = 1}^N \bar{Y_i} \end{equation*}

with

\begin{equation*} \frac{\text{SSE}}{\sigma^2} \sim \chi^2_{Nn-N}, \\ \frac{\text{SSA}}{n \sigma^2_a + \sigma^2} \sim \chi^2_{N-1}, \\ \text{SSA} \perp \text{SSE} \end{equation*}

Therefore, we can construct an \(F\) distribution to get a \(1 - \alpha\) level C.I. for \(D = \frac{\sigma^2_a}{\sigma^2}\) as

\begin{equation*} \left[ \frac{1}{n} \left( \frac{\text{MSA} / \text{MSE}}{F_{N - 1, N(n-1)}(1 - \frac{\alpha}{2})} - 1 \right), \frac{1}{n} \left( \frac{\text{MSA} / \text{MSE}}{F_{N - 1, N(n-1)}(\frac{\alpha}{2})} - 1 \right) \right], \\ \text{MSA} = \frac{\text{SSA}}{N - 1}, \quad \text{MSE} = \frac{\text{SSE}}{N(n-1)} \end{equation*}

DONE C.I. for functions of \(\sigma^2\) and \(D\)

An exact C.I. may exist in some special cases, such as the one we just considered above.

For example, we may want a C.I. for \(\sigma^2_a + \sigma^2\).

If we look at the data \(y_{11}, y_{21}, \ldots, y_{n1}\), we know that they are iid \(N(\mu, \sigma^2_a + \sigma^2)\). This can be called the first group \((i = 1)\). The idea is to get a summary statistic for each group, denoted \(Q_1, Q_2, \ldots, Q_N\), in the form

\begin{equation*} Q_i = \sum_{j = 1}^{n} c_j y_{ij} \sim N(\mu, \sigma^2_a + \sigma^2) \end{equation*}

with two constraints to match the mean and variance, respectively,

\begin{equation*} \sum_{j = 1}^n c_j = 1, \\ \sum_{j = 1}^n c_j^2 = 1 \end{equation*}

This way we can use all the data instead of just one group.

DONE Approximate C.I. for variance components [2/2]

Continuing the example from above, suppose we want a C.I. for some linear function

\begin{equation*} \xi = c_1 \mathbb{E}(\text{MSA}) + c_2 \mathbb{E}(\text{MSE}), \quad c_1, c_2 > 0 \\ \mathbb{E}(\text{MSA}) = n\sigma^2_a + \sigma^2, \quad \mathbb{E}(\text{MSE}) = \sigma^2 \end{equation*}

e.g. \(\sigma^2_a + \sigma^2 = \text{Var}(y_{ij}) \Rightarrow c_1 = \frac{1}{n}, \quad c_2 = 1 - \frac{1}{n}\)

We then have that \(\hat{\xi} = c_1 \text{MSA} + c_2 \text{MSE}\) is unbiased for \(\xi\).

And we can approximate \(\hat{\xi}\) by a \(\chi^2\) distribution.

☑ Satterthwaite's Approximation

We match the first and second moments as

\begin{equation*} \frac{\hat{\xi}}{\xi} \sim \frac{\chi^2_d}{d}, \\ \text{Var}(\frac{\hat{\xi}}{\xi}) = \text{Var}(\frac{\chi^2_d}{d}) \end{equation*}
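Carrying out this moment matching and plugging in MSA and MSE for their expectations gives the usual Satterthwaite degrees of freedom (stated here from memory as a supplement, so verify against the notes):

\begin{equation*} \hat{d} = \frac{(c_1 \text{MSA} + c_2 \text{MSE})^2}{\dfrac{(c_1 \text{MSA})^2}{N - 1} + \dfrac{(c_2 \text{MSE})^2}{N(n-1)}}, \quad \frac{\hat{d}\, \hat{\xi}}{\xi} \approx \chi^2_{\hat{d}} \end{equation*}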
☑ Large-sample approximation

We have

\begin{equation*} \frac{\hat{\xi} - \xi}{\sqrt{\text{Var}(\hat{\xi})}} \rightarrow N(0, 1), \quad N \rightarrow \infty \end{equation*}

3.5 DONE Generalized linear models [6/6]

DONE GLM Setup

Recall classical linear models: \(Y_i = X_i^{\prime} \beta + \varepsilon_i, \quad i = 1, \ldots, n\) with \(\varepsilon_i \sim N(0, \sigma^2)\).

  1. the random component: \(Y_i \sim N(\mu_i, \sigma^2)\) with all the \(Y_i\) independent
  2. the systematic component: a linear predictor \(\eta_i = X_i^{\prime} \beta, \quad i = 1, \ldots, n\)
  3. the link between the random and systematic component: \(\mu_i = \eta_i, \quad i = 1, \ldots, n\)

GLM can be viewed as an extension of linear models, where

a. The distribution in the random component may come from an exponential family

b. The link between \(\mu_i\) and \(\eta_i\) may be through general monotonic differentiable functions: \(\eta_i = g(\mu_i)\).

DONE Exponential family distribution [4/4]

The pdf/pmf can be expressed in the form:

\begin{equation*} f(y ; \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(\phi, y) \right] \end{equation*}

where \(a(\cdot), b(\cdot), \text{ and } c(\cdot)\) are some specific functions.

\(\theta\): canonical parameter

\(\phi\): dispersion parameter (either known or unknown).

\begin{equation*} \mathbb{E}(y) = b^{\prime}(\theta) \triangleq \mu \\ \text{Var}(y) = b^{\prime\prime}(\theta) a(\phi) \\ \theta = (b^{\prime})^{-1} (\mu) \text{ (the inverse of } b^{\prime} \text{ applied to } \mu) \\ \text{"Variance function" } = b^{\prime\prime}((b^{\prime})^{-1}(\mu)) = V(\mu) \end{equation*}

\(a(\phi)\) is positive in general.

Denote the log-likelihood function for the exponential family:

\begin{equation*} l(\theta, \phi) = \frac{y \theta - b(\theta)}{a(\phi)} + c(\phi, y) \end{equation*}

We know

\begin{equation*} \text{Score: } \mathbb{E} \left(\frac{\partial l}{\partial \theta} \right) = 0 \\ \text{Information: } \text{Var}\left(\frac{\partial l}{\partial \theta}\right) = - \mathbb{E}\left(\frac{\partial^2 l}{\partial \theta^2}\right) \end{equation*}
☑ Gaussian
\begin{equation*} y \sim N(\mu, \sigma^2), \\ f(y) = \exp \left[ \frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{1}{2} \left( \frac{y^2}{\sigma^2} + \log (2 \pi \sigma^2) \right) \right] \\ \theta = \mu, \phi = \sigma^2, b(\theta) = \frac{\theta^2}{2}, a(\phi) = \phi, c(\phi, y) = \frac{-1}{2} \left( \frac{y^2}{\sigma^2} + \log (2 \pi \sigma^2) \right) \end{equation*}
☑ Poisson
\begin{equation*} \mathbb{P}(y = k) = e^{-\lambda} \frac{\lambda^k}{k!}, k = 0, 1, \ldots, \quad \lambda > 0 \end{equation*}
☑ Binomial
\begin{equation*} \mathbb{P}(y = k) = {n \choose k} p^k (1-p)^{n - k}, k = 0, 1, \ldots, n, \quad p \in (0, 1) \end{equation*}
☑ Gamma
\begin{equation*} f(y) = \frac{\beta^\alpha}{\Gamma(\alpha)} y^{\alpha - 1} e^{-\beta y}, y > 0, \alpha, \beta > 0 \end{equation*}
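For instance, writing the Poisson pmf in the exponential-family form above (a standard calculation, added here for concreteness):

\begin{equation*} \mathbb{P}(y = k) = \exp \left[ k \log \lambda - \lambda - \log k! \right], \\ \theta = \log \lambda, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(\phi, y) = -\log y!, \\ \mathbb{E}(y) = b^{\prime}(\theta) = \lambda, \quad \text{Var}(y) = b^{\prime\prime}(\theta) a(\phi) = \lambda \end{equation*}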

DONE Link Functions [4/4]

\begin{equation*} \mu = \mathbb{E}(Y) \\ g(\mu) = \eta = X^{\prime}\beta \end{equation*}
  1. In classical linear model, \(g(\mu) = \mu\).
  2. When \(Y\) is not continuous over \((-\infty, \infty)\), the identity link is less attractive.

The link function links together the mean of \(y_i\) and the linear predictors.

Some popular choices for binary data, which map \((0, 1) \rightarrow \mathbb{R}\):

☑ Logit

"Logistic regression"

\begin{equation*} g(\mu) = \log \left( \frac{\mu}{1 - \mu} \right) \end{equation*}
☑ Probit
\begin{equation*} g(\mu) = \Phi^{-1} (\mu) \end{equation*}

where \(\Phi(\cdot)\) is cdf of \(N(0, 1)\)

☑ Complementary log-log
\begin{equation*} g(\mu) = \log [ -\log (1 - \mu)] \end{equation*}

Deciding which link function to use is a model selection problem.

☑ Canonical links

\(\theta\) is the canonical parameter in the exponential family, with \(\theta = (b^{\prime})^{-1}(\mu)\). So, the canonical link is just \((b^{\prime})^{-1}(\cdot)\).

| Distribution | \(b(\theta)\) | Canonical Link |
|---|---|---|
| Normal | \(\frac{\theta^2}{2}\) | \(g(\mu) = \mu\) |
| Poisson | \(e^{\theta}\) | \(g(\mu) = \log \mu\) |
| Bernoulli | \(\log(1 + e^\theta)\) | \(g(\mu) = \log \frac{\mu}{1 - \mu}\) |
| Gamma | \(- \log (-\theta)\) | \(g(\mu) = \frac{-1}{\mu}\) |

DONE log-likelihood for GLM

\begin{equation*} l(\beta, \phi) = \sum_{i = 1}^n \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(\phi, y_i) \right] \end{equation*}

We can replace \(\theta_i\) with function of \(X_i^{\prime} \beta\) since \(\mu_i = b^{\prime}(\theta_i)\) and \(g(\mu_i) = X_i^{\prime} \beta\).

DONE Fisher scoring algorithm

We want to maximize \(l(\beta, \phi)\) for \(\beta\) and \(\phi\).

We assume without loss of generality that \(a(\phi) > 0\).

\begin{equation*} \hat{\beta}_\text{ML} = \arg\max_{\substack{\beta \in \mathbb{R}^p}} \sum_{i = 1}^n (y_i \theta_i - b(\theta_i)) \triangleq \arg\max_{\substack{\beta \in \mathbb{R}^p}} \tilde{l}(\beta) \end{equation*}

We use the Fisher Scoring Algorithm (iteratively reweighted least squares) to minimize \(-\tilde{l} (\beta)\)

\begin{equation*} \beta^{t+1} = \beta^t + \alpha^t \left[ \sum_{i = 1}^n \frac{X_iX_i^{\prime}}{V(\mu_i) (g^{\prime}(\mu_i))^2} \right]^{-1} \left[ \sum_{i = 1}^n \frac{(y_i - \mu_i)}{V(\mu_i) g^{\prime}(\mu_i)} X_i \right] \\ \mu_i = g^{-1} (X_i^{\prime}\beta^t) \end{equation*}

where \(\alpha^t\) is the step size.

We can also write it as

\begin{equation*} Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}, \Delta = \begin{pmatrix} g^{\prime}(\mu_1) & \ldots & 0 \\ 0 & \ddots & 0 \\ 0 & \ldots & g^{\prime}(\mu_n) \end{pmatrix}, \\ X = \begin{pmatrix} X_1^{\prime} \\ \vdots \\ X_n^{\prime} \end{pmatrix}, W = \begin{pmatrix} \frac{1}{V(\mu_1)(g^{\prime}(\mu_1))^2} & \ldots & 0 \\ 0 & \ddots & 0 \\ 0 & \ldots & \frac{1}{V(\mu_n)(g^{\prime}(\mu_n))^2} \end{pmatrix} \end{equation*}

So \(\beta^{t+1} = \beta^t + \alpha^t (X^{\prime}WX)^{-1} X^{\prime}W\Delta(Y - \mu)\).

We commonly choose \(\alpha^t = 1\), so we get

\begin{equation*} \beta^{t+1} = (X^{\prime}WX)^{-1} X^{\prime}W [X\beta^t + \Delta(Y - \mu)] \end{equation*}

which we recognize as weighted OLS.

In the Gaussian case, we get OLS in one step.

When using a canonical link, you always get a convex optimization problem, and the Fisher scoring algorithm is equivalent to Newton's method.
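As a concrete sketch (my own illustration, not from the notes): for the canonical logit link, \(V(\mu) = \mu(1-\mu)\) and \(g^{\prime}(\mu) = 1/(\mu(1-\mu))\), so \(W\) reduces to \(\text{diag}(\mu_i(1-\mu_i))\) and the update above becomes plain IRLS:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fisher scoring / IRLS for logistic regression (canonical logit link, alpha^t = 1)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))       # mu_i = g^{-1}(x_i' beta)
        W = mu * (1.0 - mu)                        # diagonal of W = 1 / (V(mu) g'(mu)^2)
        z = X @ beta + (y - mu) / W                # working response X beta + Delta (Y - mu)
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

For a non-canonical link, the only change is the formulas used for \(W\) and the working response.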

After getting \(\hat{\beta}\), we plug it back into \(l(\beta, \phi)\) and solve the one-dimensional problem for \(\phi\).

DONE Inference in GLM [4/4]

The \(\{ y_i \}_{i = 1}^n\) are independent.

Each \(y_i\) has pmf/pdf

\begin{equation*} f(y ; \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(\phi, y) \right] \\ \mathbb{E}(y_i) = b^{\prime}(\theta_i) = \mu_i, \quad g(\mu_i) = X_i^{\prime} \beta. \end{equation*}

The parameters are \(\theta = (\beta, \phi)\).

The log-likelihood function is \(l_n(\theta)\).

The score function is \(S_n(\theta) = \frac{\partial l_n(\theta)}{\partial \theta}\).

The information matrix is \(F_n(\theta) = \text{Cov} (S_n(\theta)) = \mathbb{E} \left[ - \frac{\partial^2 l_n(\theta)}{\partial \theta^2} \right]\).

Generally speaking, no exact tests such as \(F\) -tests are available. We rely on large-sample tests.

☑ Likelihood ratio test
\begin{equation*} H_0: \theta \in \Theta_0, \quad H_1: \theta \in \Theta \backslash \Theta_0 \end{equation*}

where \(\Theta\) is the parameter space of \(\theta\).

The likelihood ratio test statistic is

\begin{equation*} T_R = -2 \log \frac{\max_{\theta \in \Theta_0} e^{l_n(\theta)}}{\max_{\theta \in \Theta} e^{l_n(\theta)}} = 2 \left[ \sup_{\theta \in \Theta} l_n(\theta) - \sup_{\theta \in \Theta_0} l_n(\theta) \right], \\ T_R \rightarrow \chi_r^2 \text{ as } n \rightarrow \infty \text{ under } H_0, \\ r = \text{dim}(\Theta) - \text{dim}(\Theta_0) \end{equation*}

Reject \(H_0\) if \(T_R\) is large.

\begin{equation*} T_R > \chi^2_r (1 - \alpha) \end{equation*}

Type 1 error \(\rightarrow \alpha\) as \(n \rightarrow \infty\).

☑ Wald Test

The Wald test uses the large-sample normality of the ML estimator to form a test.

\begin{equation*} H_0: h(\theta) = 0, \quad \Theta_0 = \{\theta : h(\theta) = 0 \}, \quad h: \mathbb{R}^p \rightarrow \mathbb{R}^q, \\ h(\theta) = \begin{pmatrix} h_1(\theta) \\ \vdots \\ h_q(\theta) \end{pmatrix} \end{equation*}

Assume \(h\) is continuously differentiable.

The Jacobian \(Dh(\theta) \in \mathbb{R}^{q \times p}\) has entries \(\left[ Dh(\theta) \right]_{ij} = \frac{\partial h_i(\theta)}{\partial \theta_j}\).

\(Dh(\theta)\) is full rank under \(H_0\).

Denote the MLE \(\hat{\theta}_n = \arg\max_{\substack{\theta \in \Theta}} l_n(\theta)\).

\begin{equation*} \hat{\theta}_n \approx N(\theta, F_n^{-1}(\theta)) \end{equation*}

and using the Delta method to get asymptotic distribution of a function of the MLE, we have

\begin{equation*} h(\hat{\theta}_n) \approx N(h(\theta), Dh(\theta) F_n^{-1}(\theta) D^{\prime}h(\theta)) \\ \Downarrow \\ (h(\hat{\theta}) - 0)^{\prime} [Dh(\theta) F_n^{-1}(\theta) D^{\prime}h(\theta)]^{-1} (h(\hat{\theta}) - 0) \approx \chi^2_q \end{equation*}

and since we don't know \(\theta\), we replace by \(\hat{\theta}\) to get

\begin{equation*} T_W \triangleq (h(\hat{\theta}) - 0)^{\prime} [Dh(\hat{\theta}) F_n^{-1}(\hat{\theta}) D^{\prime}h(\hat{\theta})]^{-1} (h(\hat{\theta}) - 0) \approx \chi^2_q \end{equation*}

Reject if \(T_W \geq \chi^2_q (1 - \alpha)\).

The Wald Test and the LRT for the same \(H_0\) give the same asymptotic distribution, \(\chi^2_q = \chi^2_r, \quad q = r\).
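A small numpy sketch of the Wald statistic for a linear hypothesis \(h(\theta) = C\theta\) (my own illustration; for linear \(h\) the Jacobian is just \(C\), and `fisher_info` is assumed to be \(F_n(\hat{\theta})\)):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, fisher_info, C, alpha=0.05):
    """Wald test of H0: C theta = 0 using the estimated Fisher information F_n(theta_hat)."""
    h = C @ theta_hat                              # h(theta_hat), with Dh(theta) = C
    cov_h = C @ np.linalg.solve(fisher_info, C.T)  # C F_n^{-1} C'
    T_W = h @ np.linalg.solve(cov_h, h)
    q = C.shape[0]
    return T_W, T_W >= chi2.ppf(1 - alpha, df=q)   # statistic, reject H0?
```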

☑ Score Test
\begin{equation*} H_0: \theta \in \Theta_0, \quad H_1: \theta \in \Theta \backslash \Theta_0, \\ \tilde{\theta}_n = \arg\max_{\substack{\theta \in \Theta_0}} l_n(\theta) \quad \text{(the restricted MLE)} \end{equation*}

Under regularity conditions,

\begin{equation*} T_S \triangleq S_n^{\prime}(\tilde{\theta}_n) F_n^{-1}(\tilde{\theta}_n) S_n(\tilde{\theta}_n) \rightarrow \chi^2_r \text{ as } n \rightarrow \infty \text{ under } H_0 \\ r = \text{dim}(\Theta) - \text{dim}(\Theta_0) \end{equation*}

We reject \(H_0\) if \(T_S > \chi^2_r (1 - \alpha)\).

☑ Confidence Intervals

Exact C.I.s are not available in general, so we construct large-sample C.I.s.

  1. C.I. for \(\xi = h(\theta)\) (\(h: \mathbb{R}^p \rightarrow \mathbb{R}^q\)) based on the LRT statistic.

Approximate pivot:

\begin{equation*} P_R = 2 \left[ \sup_{\theta \in \Theta} l_n(\theta) - \sup_{\theta \in \Theta_0} l_n(\theta) \right], \\ \Theta_0 = \{\theta \in \Theta : h(\theta) = \xi \} \\ P_R \rightarrow \chi^2_q \text{ as } n \rightarrow \infty \end{equation*}
  2. C.I. for \(\xi = h(\theta)\) based on the Wald statistic.

Approximate pivot:

\begin{equation*} P_W = (h(\hat{\theta}) - \xi)^{\prime} [Dh(\hat{\theta}) F_n^{-1}(\hat{\theta}) D^{\prime}h(\hat{\theta})]^{-1} (h(\hat{\theta}) - \xi), \\ P_W \rightarrow \chi^2_q \text{ as } n \rightarrow \infty \end{equation*}
  3. C.I. for \(\theta\) based on the Score statistic.

Approximate pivot:

\begin{equation*} P_S = S_n^{\prime}(\theta) F_n^{-1}(\theta) S_n(\theta) \rightarrow \chi^2_p \text{ as } n \rightarrow \infty \end{equation*}

If instead we want a C.I. for \(\xi = h(\theta)\), we use

\begin{equation*} P_S = S_n^{\prime}(\tilde{\theta}_\xi) F_n^{-1}(\tilde{\theta}_\xi) S_n(\tilde{\theta}_\xi) \rightarrow \chi^2_q \text{ as } n \rightarrow \infty \\ \tilde{\theta}_\xi = \arg \max_{\theta : h(\theta) = \xi} l_n(\theta) \end{equation*}

3.6 DONE Quasi-likelihood estimation [3/3]

In many problems of statistical estimation, we know some details of the distribution governing the data, but may be unwilling to specify it exactly. Maximum likelihood requires exact specification of the distribution in order to construct the likelihood. Quasi-likelihood is a method of estimation that requires only a model for the mean of the data and the relationship between the mean and the variance.

The idea is to use normality-based estimators even if the data are not really normal.

DONE Review of MLE

Important characteristics of likelihood:

\begin{equation*} \mathbb{E}\left( \frac{\partial \log p(Y ; \theta^*)}{\partial \theta} \right) = 0, \\ \text{Var}\left( \frac{\partial \log p(Y ; \theta^*)}{\partial \theta} \right) = -\mathbb{E}\left( \frac{\partial^2 \log p(Y ; \theta^*)}{\partial \theta^2} \right) \end{equation*}

Suppose \(Y_1, Y_2, \ldots, Y_n \sim \mathbb{P}(Y ; \theta^*)\)

\begin{equation*} l_n(\theta) = \sum_{i = 1}^n \log p(Y_i ; \theta), \quad S_n(\theta) = \sum_{i = 1}^n \frac{\partial \log p(Y_i ; \theta)}{\partial \theta}, \quad \hat{\theta}_n = \arg \max_\theta l_n (\theta) \end{equation*} \begin{equation*} S_n (\hat{\theta}_n) = 0, \quad \mathbb{E}(S_n(\theta^*)) = 0, \quad S_n(\theta) \approx \mathbb{E}(S_n(\theta)) \\ \Downarrow \\ \hat{\theta}_n \approx \theta^* \end{equation*}

And using a first order Taylor expansion, we have that

\begin{equation*} \sqrt{n} (\hat{\theta}_n - \theta^*) \approx N(0, I^{-1}(\theta^*)), \\ I(\theta^*) = \text{Cov}\left( \frac{\partial \log p(Y ; \theta^*)}{\partial \theta} \right) \end{equation*}

DONE Construction of Quasi-likelihood function

For a random variable \(Y \in \mathbb{R}\), let \(\mathbb{E}(Y) = \mu, \text{Var}(Y) = \sigma^2 \cdot V(\mu)\).

Define the function: \(h(Y;\mu) = \frac{Y - \mu}{\sigma^2 V(\mu)}\).

Then

\begin{equation*} \mathbb{E}(h(Y ; \mu)) = 0 \\ \text{Var}(h(Y ; \mu)) = -\mathbb{E}\left( \frac{\partial h(Y;\mu)}{\partial \mu} \right) = \frac{1}{\sigma^2 V(\mu)} \end{equation*}

The log quasi-likelihood function:

\begin{equation*} Q(Y ;\mu) = \int_y^\mu \frac{y - t}{\sigma^2 V(t)} dt. \end{equation*}

DONE Maximum quasi-likelihood estimation

Back to our regression setting \(\{(x_i, y_i)\}_{i = 1}^n\).

The log quasi-likelihood function is:

\begin{equation*} L(\beta) = \sum_{i = 1}^n Q(Y_i ; \mu_i), \text{ where } g(\mu_i) = X_i^{\prime}\beta \end{equation*}

The quasi-likelihood estimator \(\hat{\beta}\) satisfies

\begin{equation*} \sum_{i = 1}^n \frac{(Y_i - \hat{\mu}_i)X_i^{\prime}}{V(\hat{\mu}_i)g^{\prime}(\hat{\mu}_i)} = 0 \end{equation*}

where \(g(\hat{\mu}_i) = X_i^{\prime}\hat{\beta}\).

Let

\begin{equation*} W = \begin{pmatrix} \frac{1}{V(\mu_1)g^{\prime}(\mu_1)^2} & & \\ & \ddots & \\ & & \frac{1}{V(\mu_n)g^{\prime}(\mu_n)^2} \end{pmatrix}, \quad \Delta = \begin{pmatrix} g^{\prime}(\mu_1) & & \\ & \ddots & \\ & & g^{\prime}(\mu_n) \end{pmatrix}, \\ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}, \quad X = \begin{pmatrix} X_1^{\prime} \\ \vdots \\ X_n^{\prime} \end{pmatrix} \end{equation*}

Under regularity conditions, \(\hat{\beta} \approx N(\beta, \sigma^2 (X^{\prime}WX)^{-1})\).

We compute the quasi-likelihood estimator using the Fisher scoring algorithm as:

\begin{align*} \beta^{t + 1} &= \beta^t + (X^{\prime}WX)^{-1}X^{\prime}W\Delta(Y - \mu) \\ &= (X^{\prime}WX)^{-1}X^{\prime}W(X\beta^t + \Delta(Y - \mu)) \end{align*}

where \(W, \Delta, \mu\) all depend on \(\beta^t\).

Estimation of \(\sigma^2\):

\begin{equation*} \hat{\sigma}^2 = \frac{1}{n-p} \sum_{i = 1}^n \frac{(Y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}, \end{equation*}

where \(g(\hat{\mu}_i) = X_i^{\prime}\hat{\beta}\), \(\hat{\beta} \in \mathbb{R}^p\).

See more about Quasi-likelihood in Ch. 9 of McCullagh and Nelder.

3.7 DONE Generalized linear mixed models [10/10]

DONE General setup

\(Y_i | b \sim \mathbb{P}(Y_i | b) = \exp \left[ \frac{Y_i \theta_i - b(\theta_i)}{a(\phi)} + c(\phi, Y_i) \right], \quad i = 1, \ldots, n\).

\(\mu_i = b^{\prime}(\theta_i) = \mathbb{E}(Y_i | b)\).

\(g(\mu_i) = X_i^{\prime}\beta + Z_i^{\prime} b\), \(g\) is a link function.

\(b \sim f_\gamma(b)\), \(\gamma\) is the parameter in the random effect distribution.

This formulation is for one group/cluster data.

Like in LMM, \(b\) induces dependency among \(\{ Y_i \}_{i = 1}^n\) (data within group).

GLMM extends LMM to deal with more types of data (such as binary or count data).

DONE Mean

\begin{align*} \mathbb{E}(Y_i) &= \mathbb{E}[\mathbb{E}(Y_i | b)] \\ &= \mathbb{E}(\mu_i) \\ &= \mathbb{E}[g^{-1}(X_i^{\prime}\beta + Z_i^{\prime}b)] \end{align*}

which can't be further simplified in general since \(g^{-1}(\cdot)\) is nonlinear.

DONE Variance

\(\text{Var}(Y_i | b) = a(\phi)\, b^{\prime\prime}(\theta_i)\), with \(b^{\prime\prime}(\theta_i) = V(\mu_i)\).

\begin{align*} \text{Var}(Y_i) &= \mathbb{E}[\text{Var}(Y_i | b)] + \text{Var}[\mathbb{E}(Y_i | b)] \\ &= \mathbb{E}[a(\phi)V(g^{-1}(X_i^{\prime}\beta + Z_i^{\prime}b))] + \text{Var}[g^{-1}(X_i^{\prime}\beta + Z_i^{\prime}b)] \end{align*}

DONE Covariance

\begin{align*} \text{Cov}(Y_i, Y_j) &= \text{Cov}(\mathbb{E}(Y_i | b), \mathbb{E}(Y_j | b)) + \mathbb{E}[\text{Cov}(Y_i, Y_j | b)] \\ &= \text{Cov}(g^{-1}(X_i^{\prime}\beta + Z_i^{\prime}b), g^{-1}(X_j^{\prime}\beta + Z_j^{\prime}b)) + 0 \end{align*}

DONE Likelihood function

In many situations, there is no closed form for the likelihood function.

\begin{equation*} L = \int \exp \left[ \sum_{i = 1}^n \left[ \frac{Y_i \theta_i - b(\theta_i)}{a(\phi)} + c(\phi, Y_i) \right] \right] \cdot f_\gamma(b) db \end{equation*}

This is a form of "\(k\) -dimensional integration".

DONE log-likelihood function

Does not have closed form in general.

DONE Gauss-Hermite quadrature

Used to evaluate an integral of the form \(\int_{-\infty}^\infty h(x)e^{-x^2} dx\).

We have

\begin{equation*} \int_{-\infty}^\infty h(x)e^{-x^2} dx \approx \sum_{k = 1}^m h(x_k) w_k \end{equation*}

where \(x_k\) are evaluation points (nodes) and \(w_k\) are weights. The \(m\)-point rule is exact when \(h(x) = \sum_{j = 0}^n a_j x^j\) is a polynomial of degree \(n \leq 2m - 1\); the \(2m\) exactness conditions are just enough to determine the \(m\) nodes and \(m\) weights. Software provides the evaluation points and weights for a given \(m\).
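NumPy's `numpy.polynomial.hermite.hermgauss` returns these nodes and weights directly; a quick sketch (illustration only):

```python
import numpy as np

# Nodes x_k and weights w_k for m-point Gauss-Hermite quadrature (weight e^{-x^2})
m = 20
nodes, weights = np.polynomial.hermite.hermgauss(m)

# Example: integral of cos(x) e^{-x^2} over the real line equals sqrt(pi) * exp(-1/4)
approx = np.sum(np.cos(nodes) * weights)
exact = np.sqrt(np.pi) * np.exp(-0.25)   # approx agrees closely with exact
```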

DONE EM algorithm for GLMM

Step 1

Treat \(b\) as latent variable (missing data).

\(\begin{pmatrix} Y \\ b \end{pmatrix}\) is the complete data.

Step 2

Need log-likelihood of complete data.

\begin{align*} \log \mathbb{P}(Y, b) &= \log \mathbb{P}(Y | b) \mathbb{P}(b) \\ &= \sum_{i = 1}^n \left[ \frac{Y_i \theta_i - b(\theta_i)}{a(\phi)} + c(\phi, Y_i) \right] + \log f_\gamma(b) \\ &= \log \mathbb{P}(Y, b ; \beta, \phi, \gamma) \end{align*}

Step 3

E-step: conditional expectation of the log-likelihood of the complete data with respect to \(b|Y\).

\begin{align*} (\beta^{t + 1}, \phi^{t+1}) &= \arg \max_{\beta, \phi} \mathbb{E}_b \left[ \sum_{i = 1}^n \left( \frac{Y_i \theta_i - b(\theta_i)}{a(\phi)} + c(\phi, Y_i) \right) | Y \right] \text{ under } \beta = \beta^t, \phi = \phi^t \\ \gamma^{t + 1} &= \arg \max_\gamma \mathbb{E}_b[\log f_\gamma (b) | Y ] \text{ under } \gamma = \gamma^t \end{align*}

\(\mathbb{E}_b[\cdot|Y]\) is taken under the value \((\beta^t, \phi^t, \gamma^t)\). The conditional distribution \(b|Y\) is Gaussian under the Gaussian LMM, but not in general for a GLMM, so this expectation is hard to compute.

\begin{equation*} \mathbb{P}(b|Y) = \frac{\mathbb{P}(Y|b)\mathbb{P}(b)}{\mathbb{P}(Y)} \end{equation*}

\(\mathbb{P}(Y)\) is not readily available, so we use MCMC.

DONE Markov Chain Monte Carlo algorithm

Markov Chain: a sequence of random vectors in \(\mathbb{R}^k :X_1, X_2, X_3, \ldots\) form a (time homogeneous) Markov Chain if

\begin{equation*} \mathbb{P}(X_{n + 1} = x_{n+1} | X_1 = x_1, \ldots, X_n = x_n) = \mathbb{P}(X_{n + 1} = x_{n+1} | X_n = x_n) = Q(x_{n + 1} ; x_n) \text{ "transition kernel"} \end{equation*}

for all \(n \geq 1\) and all \(x_1, \ldots, x_n\) values.

"Conditional on the present, the future and the past are independent."

Stationary Distribution: Let \(\pi_n(\cdot)\) be the distribution for \(X_n\).

\begin{align*} \pi_{n+1}(a) = \mathbb{P}(X_{n + 1} = a) &= \int \mathbb{P}(X_{n+1} = a | X_n = y) \mathbb{P}(X_n = y)dy \\ &= \int Q(a ;y) \pi_n(y) dy \end{align*}

\(\pi(\cdot)\) is called a stationary distribution if and only if \(\pi(a) = \int Q(a;y)\pi(y)dy\) for all \(a\).

"If \(X_n\) has distribution \(\pi\), then \(X_{n + 1}\) will have distribution \(\pi\)."

Under regularity conditions on the Markov Chain, it holds that:

\begin{equation*} \pi_n(\cdot) \rightarrow \pi(\cdot) \text{ as } n \rightarrow \infty \end{equation*}

regardless of the initial distribution \(\pi_1(\cdot)\).

Related is the ergodic theorem, which states that for a function \(f: \mathbb{R}^k \rightarrow \mathbb{R}\), \(\frac{1}{n} \sum_{i = 1}^n f(X_i) \rightarrow \int f(x)\pi(x) dx = \mathbb{E}(f(X))\) as \(n \rightarrow \infty\), where \(X \sim \pi(\cdot)\).

DONE Metropolis-Hastings algorithm

We aim to approximate \(\mathbb{E}_b[h(b) | Y] = \int h(b) \mathbb{P}(b|Y)db\), where \(\mathbb{P}(b|Y)\) is the conditional density of \(b|Y\).

We construct a Markov Chain with stationary distribution \(\mathbb{P}(b|Y)\), with samples \(b_1, b_2, \ldots\). Then

\begin{equation*} \frac{1}{n} \sum_{i = 1}^n h(b_i) \rightarrow \mathbb{E}[h(b) | Y] \end{equation*}

Algorithm

Step 1 Initialization: choose an arbitrary point \(b_1\) as the first sample.

Step 2 for each iteration \(t\),

generate a candidate \(\tilde{b}\) from the proposal distribution \(g(\cdot | b_t)\)

calculate the acceptance ratio

\begin{equation*} \alpha = \min \left\{1, \frac{\mathbb{P}(\tilde{b}|Y)g(b_t | \tilde{b})}{\mathbb{P}(b_t|Y)g(\tilde{b} | b_t)} \right\} \end{equation*}

generate a uniform random number from \([0, 1]\), denoted by \(u\).

if \(u \leq \alpha\), accept the candidate: \(b_{t+1} = \tilde{b}\);

if \(u > \alpha\), reject the candidate: \(b_{t+1} = b_t\).
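A minimal random-walk Metropolis sketch in Python (my own illustration; with a symmetric Gaussian proposal the \(g\) terms cancel, and `log_target` stands in for \(\log \mathbb{P}(b|Y)\) up to a constant):

```python
import numpy as np

def metropolis_hastings(log_target, b0, n_samples=10000, prop_sd=1.0, rng=None):
    """Random-walk Metropolis: symmetric proposal, so alpha = min(1, pi(b~)/pi(b_t))."""
    rng = np.random.default_rng() if rng is None else rng
    b = np.atleast_1d(np.asarray(b0, dtype=float))
    samples = np.empty((n_samples, b.size))
    logp = log_target(b)
    for t in range(n_samples):
        b_prop = b + prop_sd * rng.standard_normal(b.size)   # candidate from g(.|b_t)
        logp_prop = log_target(b_prop)
        if np.log(rng.uniform()) <= logp_prop - logp:         # accept with probability alpha
            b, logp = b_prop, logp_prop
        samples[t] = b
    return samples

# Example target: standard normal; E[h(b)|Y] is then approximated by averaging h over the draws.
draws = metropolis_hastings(lambda b: -0.5 * np.sum(b ** 2), b0=0.0)
```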

3.8 DONE Generalized estimation equations [4/4]

DONE Estimating functions

\(Y \in \mathbb{R}^n\), \(\theta \in \mathbb{R}^k\); \(g(Y, \theta) \in \mathbb{R}^k\) is an estimating function.

If \(\mathbb{E}_\theta g(Y, \theta) = 0\) for all \(\theta \in \Theta\), solve \(g(Y, \theta) = 0\) for \(\theta\) to obtain \(\hat{\theta}\).

\(g(Y, \theta) = 0\) is the estimating equation.

Let \(\theta^*\) be the true parameter value that generates the data.

For example, when \(g(Y, \theta)\) is score function (derivative of log-likelihood), we have \(\mathbb{E}_{\theta^*}(g(Y, \theta)) = 0\) when \(\theta = \theta^*\).

DONE Class of linear estimating functions

Back to the regression setting, \(\mu = (\mu_1, \ldots, \mu_n)^{\prime}\).

\(\mathbb{E}(Y) = \mu(X, \theta), \text{Cov}(Y) = V, Y \in \mathbb{R}^n, \theta \in \mathbb{R}^k\).

Consider a class of linear estimating functions: \(g(Y, \theta) = H(X, \theta) (Y - \mu(X, \theta))\).

\(H(X, \theta) \in \mathbb{R}^{k \times n}\) depends on \(X\) and \(\theta\) only, not \(Y\).

DONE Optimal linear estimating equation

The choice \(H(X, \theta) = \dot{\mu}^{\prime} V^{-1}\) gives the "best" asymptotic covariance matrix for the \(\hat{\theta}\) from the estimating equation, where

\begin{equation*} \dot{\mu} = \begin{pmatrix} \frac{\partial \mu_1}{\partial \theta} \\ \vdots \\ \frac{\partial \mu_n}{\partial \theta} \end{pmatrix} \in \mathbb{R}^{n \times k} \text{ (Jacobian Matrix)} \end{equation*}

Quasi-likelihood solution is optimal linear estimating equation solution.

DONE Generalized estimating equation

These methods capture some of the benefits of quasi-likelihood estimation in the context of correlated data models, and are robust in the presence of misspecification of the variance-covariance structure of the data. These estimates are often easier to compute than ML estimates.

For clustered/grouped data: \(Y_i = (Y_{i1}, \ldots, Y_{in_i})^{\prime} \in \mathbb{R}^{n_i}, X_i = (X_{i1}, \ldots, X_{in_i})^{\prime} \in \mathbb{R}^{n_i \times m}, i = 1, 2, \ldots, N\).

Assume the \(Y_i\) are independent, with \(\mathbb{E}(Y_i) = \mu_i(X_i, \theta), \theta \in \mathbb{R}^k, \text{Cov}(Y_i) = V_i \in \mathbb{R}^{n_i \times n_i}\).

Use the theory of optimal linear estimating equations from above:

\begin{equation*} Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_N \end{pmatrix}, \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_N \end{pmatrix}, V = \begin{pmatrix} V_1 & \cdots & 0 \\ 0 & \ddots & 0\\ 0 & \cdots & V_N \end{pmatrix} \end{equation*}

The optimal linear estimating equation:

\begin{equation*} \sum_{i = 1}^N \dot{\mu}_i^{\prime} V_i^{-1} (Y_i - \mu_i) = 0 \\ \dot{\mu}_i = \begin{pmatrix} \frac{\partial \mu_{i1}}{\partial \theta} \\ \vdots \\ \frac{\partial \mu_{in_i}}{\partial \theta} \end{pmatrix} \in \mathbb{R}^{n_i \times k} \end{equation*}

The true \(V_i\) is typically unknown, so we usually estimate it or use a "working" covariance matrix.

4 STT 872 [100%]

4.1 DONE Preparations [9/9]

DONE Measure Theory [4/4]

☑ Definition 1.3
Keener

A collection \(\mathcal{A}\) of subsets of a set \(\mathcal{X}\) is a \(\sigma\) -field (or \(\sigma\) -algebra) if

  1. \(\mathcal{X} \in \mathcal{A}\) and \(\emptyset \in \mathcal{A}\).
  2. If \(A \in \mathcal{A}\), then \(A^c = \mathcal{X} - A \in \mathcal{A}\).
  3. If \(A_1, A_2, \ldots \in \mathcal{A}\), then \(\bigcup_{i = 1}^\infty A_i \in \mathcal{A}\)
☑ Definition 1.4
Keener

A function \(\mu\) on a \(\sigma\) -field \(\mathcal{A}\) of \(\mathcal{X}\) is a measure if

  1. For every \(A \in \mathcal{A}, 0 \leq \mu(A) \leq \infty\); that is, \(\mu : \mathcal{A} \rightarrow [0, \infty]\).
  2. If \(A_1, A_2, \ldots\) are disjoint elements of \(\mathcal{A}\), then
\begin{equation*} \mu \left(\bigcup_{i = 1}^\infty A_i \right) = \sum_{i = 1}^\infty \mu(A_i). \end{equation*}
☑ Example 1.2.1
Lehmann and Casella
☑ Example 1.2.2
Lehmann and Casella

DONE Integration [8/8]

☑ Definition 1.7
Keener

If \((\mathcal{X}, \mathcal{A})\) is a measurable space and \(f\) is a real-valued function on \(\mathcal{X}\), then \(f\) is measurable if

\begin{equation*} f^{-1}(B) \triangleq \{x \in \mathcal{X} : f(x) \in B \} \in \mathcal{A} \end{equation*}

for every Borel set \(B\).

☑ Theorem 1.8
Keener

We can take limit of non-negative simple functions to get any non-negative measurable function.

☑ Example 1.2.3
Lehmann and Casella
☑ Theorem 1.2.5
Lehmann and Casella (Dominated Convergence Theorem)

If \(f\) is the limit of measurable functions \(f_n\), and if an integrable function \(g\) satisfies \(g(x) \geq |f_n(x)|\) for all \(n\) and \(x\), then \(f\) and the \(f_n\) are integrable and the integral of \(f\) is the limit of the integrals of \(f_n\).

☑ Lemma 1.2.6
Lehmann and Casella (Fatou's Lemma)

For a sequence of non-negative functions, we cannot simply interchange liminf/limsup with the integral; in general we only get an inequality, e.g. \(\int \liminf f_n \, d\mu \leq \liminf \int f_n \, d\mu\).

☑ Section 1.4
Keener

Null sets have measure zero.

A statement holding almost everywhere means it holds everywhere except on a null set.

☑ Definition 1.9
Keener (Absolutely continuous)

Let \(P\) and \(\mu\) be measures on a \(\sigma\) -field \(\mathcal{A}\) of \(\mathcal{X}\). Then \(P\) is called absolutely continuous with respect to \(\mu\), written \(P \ll \mu\) if \(P(A) = 0\) whenever \(\mu(A) = 0\).

☑ Theorem 1.10
Keener (Radon-Nikodym Theorem)

If a finite measure \(P\) is absolutely continuous with respect to a \(\sigma\) -finite measure \(\mu\), then there exists a nonnegative measurable function \(f\) such that

\begin{equation*} P(A) = \int_A f d\mu \triangleq \int f 1_A d\mu. \end{equation*}

The function \(f\) in this theorem is called the Radon-Nikodym derivative of \(P\) with respect to \(\mu\), or the density of \(P\) with respect to \(\mu\), denoted

\begin{equation*} f = \frac{dP}{d\mu}. \end{equation*}

DONE Probability Spaces [11/11]

☑ Example 1.11
Keener
\begin{equation*} F_X(x) = P(X \leq x) = P_X((- \infty, x]) = \int_{- \infty}^x p(u) du \end{equation*}
☑ Example 1.12
Keener
\begin{equation*} P(X \in A) = P_X(A) = \int_A p d\mu = \sum_{x \in \mathcal{X}_0} p(x) 1_A(x) \end{equation*}
☑ Section 1.7
Keener
☑ Statistics
p. 16 Keener

Smoothing identity

\begin{equation*} Ef(X, Y) = EE[f(X, Y) | X] \end{equation*}

In particular, when \(f(X, Y) = Y\), then \(EY = EE(Y|X)\)

☑ Example 1.13
Keener (Expectation, Variance, Covariance)
\begin{equation*} Ef(X) = \int f(x)p(x)dx \\ Ef(X) = \sum_{x \in \mathcal{X}_0} f(x)p(x) \\ E(aX + bY) = aE(X) + bE(Y) \\ \text{Var}(X) = E(X - EX)^2 \\ \text{Cov}(X, Y) = E(X - EX)(Y - EY) \end{equation*}
☑ Section 1.9 intro
Keener
☑ Example 1.15
Keener

Product measure of Lebesgue measures is Lebesgue measure.

☑ Example 1.16
Keener

Product measure of counting measures is counting measure.

☑ Theorem 1.17
Keener (Fubini's Theorem)

Integration with respect to product measure can be taken in either order.

☑ Definition 1.18
Keener (Independence)
\begin{equation*} P(X \in A, Y \in B) = P(X \in A) P(Y \in B) \end{equation*}
☑ Proposition 1.19
Keener

Measurable functions taken on independent random vectors are still independent.

DONE Group families [5/5]

☑ Example 1.4.1
Lehmann and Casella

Can add a constant \(a\) to a random variable \(U\) and get another random variable with \(P(X \leq x) = F(x - a)\), where \(F\) is the distribution of \(U\). The totality of these distributions is said to constitute a location family.

A scale family is same thing but multiply a random variable \(U\) by a constant \(b\) and get \(P(X \leq x) = F(x/b)\).

Combine both to get location-scale family, where \(P(X \leq x) = F(\frac{x - a}{b})\).

Location-scale families include the normal, double exponential, Cauchy, logistic, exponential, and uniform distributions. See p. 18 Table 1.4.1 in Lehmann and Casella.

☑ Definition 1.4.2
Lehmann and Casella

A group must satisfy

  1. the product \(ab\) of any two elements is again in the group (closure)
  2. \((ab)c = a(bc)\) (associative law)
  3. \(ea = ae = a\), here \(e\) is called identity and it is in the group
  4. \(aa^{-1} = a^{-1}a = e\), where \(a^{-1}\) is called inverse
☑ Definition 1.4.3
Lehmann and Casella

A transformation group is closed under both composition and inversion.

☑ Example 1.4.4
Lehmann and Casella

Can extend location and scale families to vectors of random variables.

☑ Example 1.4.5
Lehmann and Casella

Multivariate normal distribution can be thought of as location-scale family for vector of standard normals.

DONE Exponential families [5/5]

☑ Section 2.1
intro Keener

Exponential family in canonical form:

\begin{equation*} p_\eta(x) = \text{exp}\left[\sum_{i = 1}^s \eta_i T_i(x) - A(\eta) \right] h(x), \\ x \in \mathcal{R}^n \\ A(\eta) < \infty\end{equation*}

\(\eta\) is called canonical parameter.

Exponential family in alternative form:

\begin{equation*} p_\theta(x) = \text{exp}\left[\sum_{i = 1}^s \eta_i(\theta) T_i(x) - B(\theta) \right] h(x), \\ x \in \mathcal{R}^n \end{equation*}
☑ Example 2.1
Keener
☑ Example 2.3
Keener

Normal distribution is exponential family.

Multivariate normal is exponential family.

In fact, joint density of any exponential family gives exponential family of specific form.

\begin{equation*} \text{exp}\left[\sum_{i = 1}^s \eta_i(\theta) \left(\sum_{j = 1}^n T_i(x_j)\right) - nB(\theta) \right] \prod_{j = 1}^n h(x_j) \end{equation*}
☑ Theorem 2.4
Keener

Exponential families are guaranteed to have well-behaved integrals for computing moments and expectations.

☑ Theorem 1.5.8
Lehmann and Casella

This theorem gives us a shortcut for calculating expected values in exponential families.

DONE Moment and cumulant generating functions [7/7]

Moment generating function for exponential family distribution is

\begin{equation*} M_T(u) = \frac{e^{A(\eta + u)}}{e^{A(\eta)}} = Ee^{uT} \end{equation*}

To get the \(k^\text{th}\) moment (about zero), take the \(k^\text{th}\) derivative of the MGF and evaluate at \(u = 0\).

☑ Section 2.4 intro
Keener
☑ Expectations of powers of \(T_1, T_2, \ldots, T_s\),
p. 28 Lehmann and Casella
☑ Lemma 2.7
Keener

The MGF determines the distribution of \(X\), at least if it is finite on some open interval containing the origin.

☑ Theorem 2.8
Keener

MGF has continuous derivatives of all orders at the origin, which allows us to calculate cumulants.

☑ Lemma 2.9
Keener

For independent \(X\) and \(Y\), if \(X\) and \(Y\) are both positive, or if \(E|X|\) and \(E|Y|\) are both finite, then

\begin{equation*} EXY = EX \times EY \end{equation*}
☑ Example 1.5.11
Lehmann and Casella

Binomial Moments

☑ Example 1.5.14
Lehmann and Casella

Gamma moments

Models, Estimators, and Risk Functions

This comes from Section 3.1 in Keener.

Inferential statistics is concerned with learning about an unknown parameter \(\theta\) from data \(X\), where \(X\) follows a distribution \(P_\theta\).

\begin{equation*} X \sim P_\theta \end{equation*}

A statistic is a function of the data X. In estimation, the goal is to find a statistic \(\delta\) so that \(\delta(X)\) is close to \(g(\theta)\). Here, \(\delta(X)\) is called an estimator of \(g(\theta)\). We could consider the case where \(g(\theta) = \theta\).

We need a criterion to judge the performance of different estimators. We use decision theory, which starts with a loss function \(L\) chosen so that \(L(\theta, d)\) is the loss associated with estimating \(g(\theta)\) by a value \(d\). We assume \(L(\theta, g(\theta)) = 0\) and \(L(\theta, d) \geq 0\) for all \(\theta\) and \(d\). Since \(X\) is random, the loss is random and may be large if we are unlucky, even if \(\delta\) is very good. So instead we judge by the average loss, or risk function.

\begin{equation*} R(\theta, \delta) = E_\theta L(\theta, \delta(X)). \end{equation*}

We take expectation when \(X \sim P_\theta\).

Sometimes the risk functions cross for different estimators, so it is not clear which estimator is better.

DONE Sufficient Statistics [22/22]

Basically, to find a sufficient statistic, calculate the joint density and use the factorization criterion.

☑ Section 1.10
Keener

Conditional probability in discrete case:

\begin{equation*} P(Y \in B | X = x) = \frac{P(Y \in B, X = x)}{P(X = x)} \end{equation*}

Smoothing identity

\begin{equation*} Ef(X, Y) = EE \left[ f(X, Y) | X \right] \end{equation*}

When \(f(X, Y) = Y\), we get

\begin{equation*} EY = EE(Y|X) \end{equation*}
☑ Section 3.2 intro
Keener

A sufficient statistic \(T\) provides just as much information about \(\theta\) as the data \(X_1, \ldots, X_n\), as we could construct fake data \(\tilde{X}_1, \ldots, \tilde{X}_n\) equivalent to \(X_1, \ldots, X_n\) using any convenient variable \(U\) that is uniformly distributed on \((0, 1)\).

Heuristic Definition. We say \(T\) is a sufficient statistic if the statistician who knows the value of \(T\) can do just as good a job of estimating the unknown parameter \(\theta\) as the statistician who knows the entire random sample.

Mathematical Definition Explanation. Let \(T = r(X_1, \ldots, X_n)\) be a sufficient statistic. Statistician A knows the entire random sample \(X_1, \ldots, X_n\), but statistician B only knows the value of \(T\), call it \(t\). Since the conditional distribution of \(X_1, \ldots, X_n\) given \(\theta\) and \(T\) does not depend on \(\theta\), statistician B knows this conditional distribution. So they can use their computer to generate a random sample \(X^\prime_1, \ldots, X^\prime_n\) which has this conditional distribution. But then this random sample has the same distribution as a random sample drawn from the population (with its unknown value of \(\theta\)). So statistician B can use their random sample \(X^\prime_1, \ldots, X^\prime_n\), and they will (on average) do as well as statistician A.

☑ Definition 3.2
Keener

\(X\) has distribution from a family \(\mathcal{P} = \{ P_\theta : \theta \in \Omega \}\).

\(T = T(X)\) is a sufficient statistic for \(\mathcal{P}\) if for every \(t\) and \(\theta\), the conditional distribution of \(X\) under \(P_\theta\) given \(T = t\) does not depend on \(\theta\).

☑ Theorem 3.3
Keener

Can have a randomized estimator based on sufficient statistic \(T\) that has the same risk function as any estimator \(\delta(X)\) of \(g(\theta)\).

☑ Example 1.6.2
Lehmann and Casella

For sufficient statistic, calculate joint distribution, condition on statistic \(T(X)\), then show it is independent of parameter of interest.

☑ Theorem 3.6
Keener (Factorization Theorem)

The definition of sufficient statistic is less useful for actually finding one, or for trying to determine whether a specific statistic is sufficient. Instead, we use the Factorization Theorem for cases where the distributions in the family are specified by densities. To do this, we simply look at the form of the densities.

Write density as

\begin{equation*} p_\theta(x) = g_\theta(T(x))h(x) \\ g_\theta \geq 0 \\ h \geq 0 \end{equation*}

to show that \(T\) is a sufficient statistic.

☑ Example 3.8
Keener

When \(\{x : p_\theta(x) > 0\}\) depends on \(\theta\), care is needed to ensure that formulas for \(p_\theta\) used in the factorization theorem work for all \(x\). To accomplish this, indicator functions are often used. For example, in the case of \(X_1, \ldots, X_n \sim U(\theta, \theta + 1)\).

☑ Definition 3.9
Keener (Minimal sufficient)

From the factorization theorem, if \(T\) is a sufficient statistic for a family of distributions \(\mathcal{P}\), and if \(T = f(\tilde{T})\), then \(\tilde{T}\) is also sufficient.

A statistic \(T\) is minimal sufficient if it is sufficient and for every sufficient statistic \(\tilde{T}\), there exists a function \(f\) such that \(T = f(\tilde{T})\).

☑ Example 3.10
Keener

In this example, we have a sample from \(N(\theta, 1)\), and the sum of the first half of the data and the sum of the second half of the data are together sufficient. However, they are not minimal sufficient. Minimal sufficient would be the full sum of all the data.

☑ Theorem 3.11
Keener

This shows that a statistic is minimal sufficient if there is a one-to-one relation between the statistic and the likelihood shape.

Let \(\mathcal{P} = \{ P_\theta : \theta \in \Omega \}\) with densities \(p_\theta (x) = g_\theta(T(x))h(x)\). If \(p_\theta(x) \propto_\theta p_\theta(y)\) implies \(T(x) = T(y)\), then \(T\) is minimal sufficient.

\(\propto_\theta\) means that the two expressions are proportional when viewed as functions of \(\theta\). So here it means that \(p_\theta(x) = c(x, y) p_\theta(y)\)

It basically says that the shape of the likelihood is minimal sufficient, and so a minimal sufficient "statistic" exists for dominated families.

☑ Example 3.12
Keener

By the factorization theorem, \(T\) is sufficient in an \(s\) -parameter exponential family with densities

\begin{equation*} p_\theta(x) = e^{\eta(\theta) \cdot T(x) - B(\theta)} h(x). \end{equation*}

And using Theorem 3.11, \(T\) is also minimal sufficient, provided that \(p_\theta(x) \varpropto_\theta p_\theta (y)\) implies \(T(x) = T(y)\).

☑ Example 3.13
Keener

Order statistics are \(X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}\). Get order statistics by listing \(X_1, \ldots, X_n\) in increasing order. Order statistics are sufficient by the factorization theorem. And since \(p_\theta(x) \varpropto_\theta p_\theta (y)\) forces \(x\) and \(y\) to have the same order statistics, they are minimal sufficient for the common marginal density

\begin{equation*} f_\theta(x) = \frac{1}{2} e^{- | x - \theta |}. \end{equation*}
☑ Definition 3.14
Keener

Completeness is a technical condition that strengthens sufficiency in a useful fashion.

A statistic \(T\) is complete for a family \(\mathcal{P}\) if

\begin{equation*} E_\theta f(T) = c, \text{ for all } \theta \end{equation*}

implies \(f(T) = c\).

We typically take \(c = 0\) for convenience.

☑ Example 3.16
Keener

To show completeness, get the sufficient statistic \(T\), figure out its distribution, calculate the expectation of \(f(T)\) for a generic \(f\) and set it equal to \(0\). Show that \(f\) must be \(0\).

☑ Theorem 3.17
Keener

If \(T\) is complete and sufficient, then \(T\) is minimal sufficient. Being complete is the best thing you can do.

☑ Definition 3.18
Keener

A full rank exponential family must satisfy two conditions: \(\eta(\Omega)\) has non-empty interior, and \(T_1, \ldots, T_s\) do not satisfy a linear constraint of the form \(v \cdot T = c\).

☑ Theorem 3.19
Keener

In an exponential family of full rank, \(T\) is complete (and is therefore also minimal).

Recall,

\begin{equation*} p_\theta(x) = \exp\{\eta(\theta) \cdot T(x) - B(\theta) \}h(x), \theta \in \Omega \end{equation*}
☑ Definition 3.20
Keener

A statistic \(V\) is called ancillary if its distribution does not depend on \(\theta\). So \(V\) by itself provides no information about \(\theta\).

For example, distances between order statistics are usually ancillary.

☑ Theorem 3.21
Keener (Basu)

If \(T\) is complete and \(V\) is ancillary, then \(T\) and \(V\) are independent under \(P_\theta\).

☑ Example 3.22
Keener

For an i.i.d. normal sample with the variance known, the sample mean is complete sufficient for the mean, and the sample variance is ancillary for the mean parameter. So the sample mean and sample variance are independent by Basu.

☑ Example 1.6.25 (i)
Lehmann and Casella

If you have a minimal sufficient statistic, but a function of it equals a constant and is not a constant function, then it is not complete. This is basically a way of generating a counterexample to show not complete. A statistic can therefore be minimal sufficient but not complete.

☑ Example 1.6.26
Lehmann and Casella

DONE Convexity [10/10]

Examples of convex functions are

| Function | Interval |
|---|---|
| \(\vert x \vert\) | \(-\infty < x < \infty\) |
| \(x^2\) | \(-\infty < x < \infty\) |
| \(x^p, p > 1\) | \(0 < x\) |
| \(1/x^p, p > 0\) | \(0 < x\) |
| \(e^x\) | \(-\infty < x < \infty\) |
| \(- \log x\) | \(0 < x < \infty\) |
☑ Definition 3.23
Keener (convex function)

For any two points \(x\) and \(y\), the straight line connecting \((x, f(x))\) and \((y, f(y))\) should lie above the graph of \(f\) between \(x\) and \(y\).

Convex functions are U-shaped.

☑ Theorem 1.7.2
Lehmann and Casella

For convex functions, the derivative should be non-decreasing. If the derivative is strictly increasing, then the function is strictly convex.

The second derivative of a convex function should be non-negative. If the second derivative is positive, the function is strictly convex.

☑ Theorem 3.24
Keener

We can always find a "support line" under any point \(x\) such that the function is above the line but the line touches \(f\) at \(x\).

☑ Theorem 3.25
Keener (Jensen's Inequality)

For convex function \(f\) with \(EX\) finite, we have

\begin{equation*} f(EX) \leq Ef(X) \end{equation*}

If \(f\) is strictly convex, the inequality is strict unless \(X\) is almost surely constant.

Jensen's inequality also holds in higher dimensions, with \(X\) a random vector.

☑ Example 3.27
Keener
☑ Example 1.7.7
Lehmann and Casella

KL distance between density functions \(f\) and \(g\) is

\begin{equation*} E_f \left[ \log(f(X) / g(X)) \right] = \int \log \left[f(x)/g(x) \right] f(x) dx \end{equation*}

KL distance is always \(\geq 0\).

☑ Theorem 3.28
Keener (Rao-Blackwell)

Recall that if \(\delta(X)\) is an estimator of \(g(\theta)\), then the risk of \(\delta\) for a loss function \(L(\theta, \delta)\) is \(R(\delta, \theta) = E_\theta L(\theta, \delta(X))\). Suppose \(T\) is a sufficient statistic. Then we know there is a randomized estimator based on \(T\) with the same risk as \(\delta\).

The Rao-Blackwell theorem shows that for convex loss functions, there is generally a non-randomized estimator based on \(T\) that has smaller risk than \(\delta\).

This result shows that with convex loss functions the only estimators worth considering, at least if estimators are judged solely by their risk, are functions of \(T\) rather than of the full data \(X\). It can also be used to show that any randomized estimator is worse than a corresponding nonrandomized estimator.

☑ Corollary 1.7.9
Lehmann and Casella

Randomized estimators may be useful in reducing maximum risk, but this can never be the case when the loss function is convex.

☑ Theorem 1.7.13
Lehmann and Casella

The Hessian matrix is positive semidefinite for a convex function on \(k\)-dimensional space.

Strict convexity follows when the Hessian matrix is positive definite.

☑ Theorem 1.7.21
Lehmann and Casella

For a higher-dimensional convex function, we can find a "support plane" rather than a support line. This allows us to generalize Jensen's inequality.

DONE Large sample theory [18/18]

We use this to study and compare the performance of various estimators in large samples.

☑ Definition 8.1
Keener

Convergence in probability, written as \(Y_n \overset{p}{\rightarrow} Y\), if for every \(\epsilon > 0\),

\begin{equation*} P(|Y_n - Y| \geq \epsilon) \rightarrow 0 \end{equation*}

as \(n \rightarrow \infty\).

☑ Theorem 8.2
Keener (Chebychev's Inequality)

For random variable \(X\) and constant \(a > 0\),

\begin{equation*} P(|X| \geq a) \leq \frac{EX^2}{a^2} \end{equation*}
☑ Definition 8.6
Keener

In statistics, there is a family of distributions of interest, indexed by a parameter \(\theta \in \Omega\), and the symbol \(\overset{P_\theta}{\rightarrow}\) is used to denote convergence in probability with \(P_\theta\) as the underlying probability measure.

A sequence of estimators \(\delta_n\), \(n \geq 1\), is consistent for \(g(\theta)\) if for any \(\theta \in \Omega\),

\begin{equation*} \delta_n \overset{P_\theta}{\rightarrow} g(\theta) \end{equation*}

as \(n \rightarrow \infty\).

☑ Theorem 1.8.2
Lehmann and Casella

If MSE goes to zero for all \(\theta\), then you have a consistent estimator.

Equivalently, if the bias and variance both go to zero asymptotically.

☑ Example 1.8.3
Lehmann and Casella (Weak Law of Large Numbers)

Sample mean is consistent for true mean as long as we have finite variance.

We can see this because variance of the sample mean has \(n\) in the denominator, so variance goes to zero as \(n \rightarrow \infty\).

☑ Proposition 8.5
Keener

For continuous function \(f\) at \(c\), if \(Y_n \overset{p}{\rightarrow} c\), then \(f(Y_n) \overset{p}{\rightarrow} f(c)\).

☑ Definition 8.7
Keener

Convergence in distribution (or law) means that the CDFs converge to the CDF of the target.

For notation we use \(\Rightarrow\).

☑ Example 8.8
Keener

Definition of convergence in distribution only requires pointwise convergence of the CDFs at continuity points of the target CDF.

☑ Theorem 8.9
Keener

\(Y_n \Rightarrow Y\) holds if and only if \(Ef(Y_n) \rightarrow Ef(Y)\) for all bounded continuous functions \(f\).

This generalizes easily to random vectors.

☑ Corollary 8.11
Keener

If \(g\) is a continuous function and \(Y_n \Rightarrow Y\), then

\begin{equation*} g(Y_n) \Rightarrow g(Y) \end{equation*}
☑ Theorem 8.12
Keener (Central Limit Theorem)

Suppose \(X_1, X_2, \ldots, X_n\) are i.i.d. with common mean \(\mu\) and variance \(\sigma^2\). Then

\begin{equation*} \sqrt{n} (\bar{X}_n - \mu) \Rightarrow N(0, \sigma^2) \end{equation*}
☑ Theorem 8.13
Keener (Slutsky's Theorem)

If \(Y_n \Rightarrow Y\), \(A_n \overset{p}{\rightarrow} a\), and \(B_n \overset{p}{\rightarrow} b\), then

\begin{equation*} A_n + B_n Y_n \Rightarrow a + bY \end{equation*}
☑ Proposition 8.14
Keener (Delta Method)

If \(f\) is differentiable at \(\mu\), then

\begin{equation*} \sqrt{n} (f(\bar{X}_n) - f(\mu)) \Rightarrow N(0, [f^\prime (\mu)]^2 \sigma^2) \end{equation*}
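For example (a standard application, added for concreteness), taking \(f(x) = \log x\) with \(\mu > 0\):

\begin{equation*} \sqrt{n}\left(\log \bar{X}_n - \log \mu \right) \Rightarrow N\left(0, \frac{\sigma^2}{\mu^2}\right), \quad \text{since } f^{\prime}(\mu) = \frac{1}{\mu} \end{equation*}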
☑ Theorem 1.8.14
Lehmann and Casella

If \(\sqrt{n} [T_n - \theta] \Rightarrow N(0, \tau^2)\) and if \(h^\prime(\theta) = 0\), then

\begin{equation*} n[h(T_n) - h(\theta)] \Rightarrow \frac{1}{2} \tau^2 h^{\prime \prime} (\theta) \chi_1^2 \end{equation*}
☑ Example 1.8.13
Lehmann and Casella
☑ Section 8.3
Keener

Maximum likelihood estimators may be biased.

☑ Theorem 1.8.22
Lehmann and Casella (Multivariate Delta Method)
☑ Theorem 1.8.21
Lehmann and Casella (Multivariate Central Limit Theorem)

The suitably centered and scaled sample mean vector converges in distribution to a multivariate normal, in direct analogy with the univariate CLT.

4.2 DONE Unbiasedness [5/5]

If we restrict attention to unbiased estimators, their risk functions need not cross, so we are better able to compare them. We may even be able to identify a best unbiased estimator.

DONE Minimum Variance Unbiased Estimator [9/9]

☑ Definition 2.1.1
Lehmann and Casella

An estimator \(\delta(X)\) of \(g(\theta)\) is unbiased if

\begin{equation*} E_\theta \left[ \delta(X) \right] = g(\theta) \text{ for all } \theta \in \Omega \end{equation*}

If there exists an unbiased estimator of \(g\), then the estimand \(g\) will be called U-estimable.

☑ Example 4.1
Keener

In this example, we derive conditions for an estimator to be unbiased directly from the definition of unbiasedness: set the expectation equal to the estimand and solve for the estimator.

If we cannot equate both sides, then the parameter (or function of parameter) is not U-estimable (see example 4.2 Keener).

☑ Lemma 2.1.4
Lehmann and Casella

In the search for an unbiased estimator with uniformly minimum risk, a natural approach is to minimize the risk for some particular value of \(\theta_0\) and then see whether the result is independent of \(\theta_0\).

This lemma gives us the totality of unbiased estimators, which are given by \(\delta = \delta_0 - U\) where \(U\) is any unbiased estimator of \(0\) and \(\delta_0\) is any unbiased estimator of \(g(\theta)\).

☑ Example 2.1.5
Lehmann and Casella
Keener p. 62
With squared error loss, the risk of an unbiased estimator is just its variance, so the goal would be to minimize the variance.
☑ Definition 2.1.6
Lehmann and Casella

An unbiased estimator is the uniform minimum variance unbiased (UMVU) estimator if its variance is \(\leq\) the variance of any other unbiased estimator, for all \(\theta\).

An unbiased estimator is the locally minimum variance unbiased (LMVU) estimator if its variance is \(\leq\) the variance of any other unbiased estimator, for \(\theta = \theta_0\).

UMVU estimators are unique.

☑ Theorem 2.1.7
Lehmann and Casella

Denote the class of estimators \(\delta\) with \(E_\theta \delta^2 < \infty\) for all \(\theta\) by \(\Delta\).

Let \(\delta\) be an estimator in \(\Delta\), and let \(\mathcal{U}\) denote the set of all unbiased estimators of zero which are in \(\Delta\). Then, a necessary and sufficient condition for \(\delta\) to be a UMVU estimator of its expectation \(g(\theta)\) is that

\begin{equation*} E_\theta(\delta U) = 0 \text{ for all } U \in \mathcal{U} \text{ and all } \theta \in \Omega \end{equation*}
☑ Theorem 4.4
Keener

In general, there is no reason to suspect that there will be a UMVU estimator.

However, if a family has a complete sufficient statistic, a UMVU will exist, at least when \(g\) is U-estimable.

Suppose \(g\) is U-estimable and \(T\) is complete sufficient. Then there is an essentially unique unbiased estimator based on \(T\) that is UMVU.

From the uniqueness assertion in this theorem, if \(T\) is complete sufficient and \(\eta(t)\) is unbiased, then \(\eta(T)\) must be UMVU.

☑ UMVU Method One: Solving for \(\delta\),
Lehmann and Casella p. 88

If \(T\) is a complete sufficient statistic, the UMVU estimator of any U-estimable function \(g(\theta)\) is uniquely determined by the set of equations

\begin{equation*} E_\theta \delta(T) = g(\theta) \text{ for all } \theta \in \Omega \end{equation*}

Basically, if we find a function of a complete sufficient statistic whose expectation is \(g(\theta)\), then it is the UMVU estimator.

See example 4.5 in Keener.

☑ UMVU Method Two: Conditioning,
Lehmann and Casella p. 89

If we take the conditional expectation of any unbiased estimator \(\delta(X)\) given a complete sufficient statistic \(T\), then it will be UMVUE.

It doesn't matter which unbiased estimator \(\delta\) is conditioned; we should choose whichever one makes the calculation easiest.

See example 4.6 in Keener.

Second thoughts about bias, Keener 4.2
It may not make sense to only consider unbiased estimators. Estimators with considerable bias may not be worth considering, but estimators with small bias may be quite reasonable.

DONE Continuous distributions and UMVU [4/4]

☑ Section 4.3
Keener

Let \(X \sim N(\mu, \sigma^2)\) and take \(Z = (X - \mu)/\sigma\).

Standard normal is \(Z \sim N(0, 1)\).

Moment generating function of \(Z\) is \(M_Z(u) = e^{u^2 / 2}\), \(u \in \mathbb{R}\).

More generally, for \(X \sim N(\mu, \sigma^2)\), \(aX + b \sim N(a\mu + b, a^2 \sigma^2)\)

Moment generating function of \(X\) is \(M_X(u) = e^{u\mu + u^2 \sigma^2 / 2}\), \(u \in \mathbb{R}\).

Sample mean is

\begin{equation*} \bar{X} = \frac{X_1 + \cdots + X_n}{n} \end{equation*}

Sample variance is

\begin{equation*} S^2 = \frac{1}{n-1} \sum_{i = 1}^{n} (X_i - \bar{X})^2 \end{equation*}

Since one-to-one relationships preserve sufficiency and completeness, \((\bar{X}, S^2)\) is a complete sufficient statistic for \((\mu, \sigma^2)\).

We have

\begin{equation*} \bar{X} \sim N(\mu, \sigma^2 / n) \end{equation*}

Derivation for distribution of \(S^2\) is more involved. We get

\begin{equation*} V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}. \end{equation*}

Sum of normals is normal with mean equal to sum of means and variance equal to sum of variances.

☑ Example 2.2.1
Lehmann and Casella
☑ Example 2.2.2
Lehmann and Casella
☑ Example 2.2.5
Lehmann and Casella
Normal One-Sample Problem – Estimation, 4.4 Keener

\(S^2\) is UMVU for \(\sigma^2\). Note that UMVU for \(\sigma\) is not \(S\), although \(S\) is a common and natural choice in practice.

\(\bar{X}\) is the UMVU estimator of \(\mu\). However, \(\bar{X}^2\) is biased for \(\mu^2\). If we subtract the bias, we find that \(\bar{X}^2 - S^2/n\) is UMVU for \(\mu^2\).
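A short check of the bias correction (standard computation, not quoted from the text): since \(\bar{X} \sim N(\mu, \sigma^2/n)\) and \(E S^2 = \sigma^2\),

\begin{equation*} E \bar{X}^2 = \mu^2 + \frac{\sigma^2}{n}, \qquad E\left[\bar{X}^2 - \frac{S^2}{n}\right] = \mu^2, \end{equation*}

and since \(\bar{X}^2 - S^2/n\) is a function of the complete sufficient statistic \((\bar{X}, S^2)\), it is UMVU by Theorem 4.4.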

DONE UMVU for discrete distributions [8/8]

☑ Example 2.3.1
Lehmann and Casella
☑ Power series distributions p. 104
Lehmann and Casella
\begin{equation*} P(X = x) = \frac{a(x) \theta^x}{C(\theta)} \\ x = 0, 1, \ldots , \\ \theta > 0 \end{equation*}

Binomial distribution is an example.

☑ Lemma 2.3.6
Lehmann and Casella

If \(X_1, \ldots, X_n\) are from a power series family, then \(X_1 + \cdots + X_n\) is sufficient for \(\theta\), and the distribution of \(T = X_1 + \cdots + X_n\) is the power series family

\begin{equation*} P(T = t) = \frac{A(t, n) \theta^t}{[C(\theta)]^n} \end{equation*}

where \(A(t, n)\) is the coefficient of \(\theta^t\) in the power series expansion of \([C(\theta)]^n\).

It follows that \(T\) is complete and that the UMVU estimator of \(\theta^r\) on the basis of a sample of \(n\) is

\begin{equation*} \frac{A(t - r, n)}{A(t, n)} \text{ if } t \geq r \\ 0 \text{ otherwise } \end{equation*}
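As a concrete instance (essentially the Poisson case treated in Example 2.3.7): for the Poisson(\(\theta\)) distribution, \(a(x) = 1/x!\) and \(C(\theta) = e^\theta\), so expanding \([C(\theta)]^n = e^{n\theta}\) gives \(A(t, n) = n^t/t!\), and the UMVU estimator of \(\theta^r\) is

\begin{equation*} \frac{A(t - r, n)}{A(t, n)} = \frac{t!}{(t - r)! \, n^r} = \frac{t(t - 1)\cdots(t - r + 1)}{n^r} \quad \text{for } t \geq r. \end{equation*}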
☑ Example 2.3.7
Lehmann and Casella
☑ Example 2.3.8
Lehmann and Casella
☑ Example 2.3.9
Lehmann and Casella
☑ Example 2.3.10
Lehmann and Casella
☑ Example 2.3.11
Lehmann and Casella

DONE Information inequality [12/12]

☑ Section 4.5
Keener
\begin{equation*} \text{Cov}(X, Y) = E(X - EX)(Y - EY) = EXY - (EX)(EY) \end{equation*}

So if either \(X\) or \(Y\) have mean \(0\), then \(\text{Cov}(X, Y) = EXY\).

Covariance inequality is

\begin{equation*} \text{Cov}^2(X, Y) \leq \text{Var}(X) \text{Var}(Y) \end{equation*}

Using the covariance inequality, if \(\delta\) is an unbiased estimator of \(g(\theta)\) and \(\psi\) is an arbitrary random variable, then

\begin{equation*} \text{Var}_\theta (\delta) \geq \frac{\text{Cov}^2_\theta (\delta, \psi)}{\text{Var}_\theta (\psi)}. \end{equation*}

The right hand side of this inequality involves \(\delta\), so this seems rather useless as a bound for the variance of \(\delta\). To make headway we need to choose \(\psi\) cleverly, so that \(\text{Cov}_\theta (\delta, \psi)\) is the same for all \(\delta\) that are unbiased for \(g(\theta)\).

☑ Equation 4.13
p. 72 Keener (Hammersley-Chapman-Robbins inequality)

This is a more general form of the Cramer Rao bound, which doesn't require differentiation under the integral sign.
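As I recall the statement (see Keener p. 72 for the precise conditions), the bound follows from the covariance inequality above with \(\psi = p_{\theta + \Delta}(X)/p_\theta(X) - 1\), assuming \(p_{\theta + \Delta}\) shares the support of \(p_\theta\): for any \(\Delta \neq 0\) with \(\theta + \Delta \in \Omega\) and \(\delta\) unbiased for \(g(\theta)\),

\begin{equation*} \text{Var}_\theta(\delta) \geq \frac{[g(\theta + \Delta) - g(\theta)]^2}{E_\theta \left[ \left( \frac{p_{\theta + \Delta}(X)}{p_\theta(X)} - 1 \right)^2 \right]}, \end{equation*}

since \(E_\theta \psi = 0\) and \(\text{Cov}_\theta(\delta, \psi) = E_{\theta + \Delta} \delta - E_\theta \delta = g(\theta + \Delta) - g(\theta)\). Letting \(\Delta \rightarrow 0\) recovers the Cramer Rao bound when differentiation under the integral sign is permitted.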

☑ Equation 4.14
p. 72 Keener (Fisher information)
\begin{equation*} I(\theta) = E_\theta \left(\frac{\partial \log p_\theta(X)}{\partial \theta} \right)^2 \end{equation*}
☑ Equation 4.17
p. 73 Keener (Cramer Rao Bound)

Let \(\delta\) have mean \(g(\theta) = E_\theta \delta\) and take \(\psi = \partial \log p_\theta / \partial \theta\). Then with sufficient regularity,

\begin{equation*} g^\prime(\theta) = E_\theta \delta \psi \end{equation*}
☑ Theorem 4.9
Keener (Cramer Rao Bound)

Let \(\mathcal{P} = \{P_\theta : \theta \in \Omega \}\) be a dominated family with \(\Omega\) an open set in \(\mathbb{R}\) and densities \(p_\theta\) differentiable with respect to \(\theta\). If \(E_\theta \psi = 0\), \(E_\theta \delta^2 < \infty\), and \(g^\prime (\theta) = E_\theta \delta \psi\) hold for all \(\theta \in \Omega\), then

\begin{equation*} \text{Var}_\theta (\delta) \geq \frac{\left[g^\prime (\theta) \right]^2}{I(\theta)} \\ \theta \in \Omega \end{equation*}
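A standard worked example (mine, not quoted from the text): for \(X_1, \ldots, X_n\) i.i.d. \(N(\theta, \sigma^2)\) with \(\sigma^2\) known,

\begin{equation*} \frac{\partial \log p_\theta(X)}{\partial \theta} = \frac{\sum_{i = 1}^n (X_i - \theta)}{\sigma^2}, \qquad I(\theta) = \frac{n}{\sigma^2}, \end{equation*}

so for \(g(\theta) = \theta\) the bound is \(\sigma^2/n = \text{Var}_\theta(\bar{X})\), and \(\bar{X}\) attains the Cramer Rao bound.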
☑ Equation 4.18
p. 74 Keener (Reparameterization)

You can reparameterize if you have a one-to-one mapping from \(\Xi\) to \(\Omega\).

☑ Example 4.12
Keener

Higher-dimensional example for the \(s\)-parameter exponential family.

☑ Example 2.5.5
Lehmann and Casella

If \(X\) is distributed according to an exponential family with \(s = 1\), then

\begin{equation*} I(E_\theta(T)) = \frac{1}{\text{Var}_\theta(T)} \end{equation*}
☑ Example 2.5.6
Lehmann and Casella
☑ Example 4.11
Keener

For location families, \(I(\theta)\) is constant and does not vary with \(\theta\).

☑ Theorem 2.5.12
Lehmann and Casella (Attainment)

Gives conditions under which exponential families attain the information bound.

☑ Example 2.5.13
Lehmann and Casella

DONE Multiparameter information inequality [5/5]

☑ Theorem 2.6.1
Lehmann and Casella

Information matrix is \(s \times s\) matrix

\begin{equation*} I(\theta) = ||I_{ij}(\theta)|| \end{equation*}

where

\begin{equation*} I_{ij}(\theta) = E_\theta \left[ \frac{\partial}{\partial \theta_i} \log p_\theta(X) \cdot \frac{\partial}{\partial \theta_j} \log p_\theta(X) \right] \end{equation*}
☑ Theorem 2.6.2
Lehmann and Casella
☑ Example 2.6.3
Lehmann and Casella
☑ Theorem 2.6.6
Lehmann and Casella
☑ Equation 2.6.27
p. 128 Lehmann and Casella

4.3 DONE Equivariance [1/1]

DONE First Examples [26/26]

☑ Example 3.1.1
Lehmann and Casella

This is an interesting example to motivate the equivariant estimation problem. If we have \(n\) draws from a binomial, and we are trying to estimate \(p\), we may use the estimator \(\frac{1}{n} \sum X_i\). If, however, we find out that \(p\) is actually the probability of failure, and we want to estimate \(1 - p\), then we can just use \(1 - \frac{1}{n} \sum X_i\). The estimation problem is invariant under the transformation \(d^\prime = 1 - d\).

☑ Definition 3.1.2
Lehmann and Casella

A family of densities is location invariant if adding the same constant to the data and to the parameter yields a problem equivalent to the original one.

☑ Definition 3.1.3
Lehmann and Casella

Location equivariant estimator is

\begin{equation*} \delta(X_1 + a, X_2 + a, \ldots, X_n + a) = \delta(X_1, X_2, \ldots, X_n) + a \text{ for all } a \end{equation*}
☑ Theorem 3.1.4
Lehmann and Casella

Bias, risk, and variance of a location equivariant estimator are constant (they do not depend on the location parameter \(\xi\)).

☑ Definition 3.1.5
Lehmann and Casella

Minimum risk equivariant (MRE) estimator is location equivariant estimator that minimizes the constant risk.

☑ Lemma 3.1.6
Lehmann and Casella

This gives a general representation for a location equivariant estimator, as a family made up of functions of a given equivariant estimator.

☑ Lemma 3.1.7
Lehmann and Casella

How do we come up with the function that generates the family of equivariant estimators? It needs to be a function of the differences \(x_i - x_n\), \(i = 1, \ldots, n - 1\) (for \(n \geq 2\)).

☑ Theorem 3.1.8
Lehmann and Casella

Characterization of equivariant estimators.

☑ Example 3.1.9
Lehmann and Casella

If \(n = 1\), then the only equivariant estimators are \(X + c\) for some constant \(c\).

☑ Theorem 3.1.10
Lehmann and Casella

This gives a way to determine the equivariant estimator with minimum risk.

☑ Corollary 3.1.11
Lehmann and Casella

Gives conditions for the MRE estimator to exist and be unique. They depend on the shape of the loss function: it needs to be convex and not monotone for existence, and strictly convex for uniqueness.

☑ Corollary 3.1.12
Lehmann and Casella

Characterizes the MRE estimator under squared error loss and absolute error loss.

☑ Example 3.1.15
Lehmann and Casella (MRE under 0-1 loss)

In this case, there are two MRE estimators, so the MRE estimator is not unique.

☑ Example 3.1.16
Lehmann and Casella

\(\bar{X}\) is MRE for estimating mean of normals with known variance.

☑ Theorem 3.1.17
Lehmann and Casella

"least favorable" property of the normal distribution

☑ Example 3.1.18
Lehmann and Casella

An MRE estimator need not be unbiased.

☑ Example 3.1.19
Lehmann and Casella
☑ Theorem 3.1.20
Lehmann and Casella

Under squared error loss, the MRE estimator is the Pitman estimator

\begin{equation*} \frac{\int_{-\infty}^{\infty} u f(x_1 - u, \ldots, x_n - u) du}{\int_{-\infty}^{\infty} f(x_1 - u, \ldots, x_n - u) du} \end{equation*}
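As a sanity check (this recovers Example 3.1.16): for \(X_1, \ldots, X_n\) i.i.d. \(N(\xi, \sigma^2)\) with \(\sigma^2\) known, \(f(x_1 - u, \ldots, x_n - u)\) is proportional, as a function of \(u\), to \(e^{-n(u - \bar{x})^2/(2\sigma^2)}\), so the Pitman estimator is the mean of a \(N(\bar{x}, \sigma^2/n)\) density in \(u\):

\begin{equation*} \frac{\int_{-\infty}^{\infty} u \, e^{-n(u - \bar{x})^2/(2\sigma^2)} du}{\int_{-\infty}^{\infty} e^{-n(u - \bar{x})^2/(2\sigma^2)} du} = \bar{x}. \end{equation*}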
☑ Example 3.1.21
Lehmann and Casella
☑ Example 3.1.22
Lehmann and Casella
☑ Randomized estimators for equivariant estimation
p. 155 Lehmann and Casella
☑ Lemma 3.1.23
Lehmann and Casella

Under squared error loss, we have

  1. When \(\delta(\mathbf{X})\) is any equivariant estimator with constant bias \(b\), then \(\delta(\mathbf{X}) - b\) is equivariant, unbiased, and has smaller risk than \(\delta(\mathbf{X})\).
  2. The unique MRE estimator is unbiased.
  3. If a UMVU estimator exists and is equivariant, it is MRE.
☑ Definition 3.1.24
Lehmann and Casella
☑ Example 3.1.25
Lehmann and Casella
☑ Example 3.1.26
Lehmann and Casella
☑ Theorem 3.1.27
Lehmann and Casella

If \(\delta\) is MRE for estimating \(\xi\) in a location invariant model with a loss function \(\rho(\delta - \xi)\), then it is risk-unbiased.

This means that, in terms of expected loss, \(\delta\) is on average at least as close to the true value \(\xi\) as to any false value \(\xi^{\prime}\); that is, \(\mathbb{E}_\xi \rho(\delta - \xi^{\prime}) \geq \mathbb{E}_\xi \rho(\delta - \xi)\) for all \(\xi^{\prime}\).

4.4 DONE Average risk optimality [6/6]

DONE Introduction [5/5]

We are concerned with minimizing the (weighted) average risk for some suitable non-negative weight function.

\begin{equation*} r(\Lambda, \delta) = \int R(\theta, \delta) d\Lambda(\theta) \end{equation*}

where

\begin{equation*} \int d \Lambda(\theta) = 1 \end{equation*}

An estimator \(\delta\) minimizing \(r(\Lambda, \delta)\) is called a Bayes estimator.

☑ Theorem 4.1.1
Lehmann and Casella

Let \(\Theta\) have distribution \(\Lambda\), and given \(\Theta = \theta\), let \(X\) have distribution \(P_\theta\). Suppose, in addition, the following assumptions hold for the problem of estimating \(g(\Theta)\) with non-negative loss function \(L(\theta, d)\).

  1. There exists an estimator \(\delta_0\) with finite risk.
  2. For almost all \(x\), there exists a value \(\delta_\Lambda(x)\) minimizing
\begin{equation*} E\{L[\Theta, \delta(x)]|X = x\}. \end{equation*}

Then, \(\delta_\Lambda(X)\) is a Bayes estimator.

☑ Corollary 4.1.2
Lehmann and Casella

This tells us what the Bayes estimator is in the case of squared error loss, weighted squared error loss, and absolute error, under the assumptions of the previous theorem.
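For reference, the forms I remember (check Corollary 4.1.2 for the exact conditions): under squared error loss the Bayes estimator is the posterior mean, under weighted squared error loss \(w(\theta)[d - g(\theta)]^2\) it is a weighted posterior mean, and under absolute error loss it is any median of the posterior distribution of \(g(\Theta)\):

\begin{equation*} \delta_\Lambda(x) = E[g(\Theta) | x], \qquad \delta_\Lambda(x) = \frac{E[w(\Theta) g(\Theta) | x]}{E[w(\Theta) | x]}, \qquad \delta_\Lambda(x) = \text{median}[g(\Theta) | x]. \end{equation*}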

☑ Example 4.1.3
Lehmann and Casella

Poisson example. We give a Gamma prior. This shows that the choice of loss function can have a large effect on the resulting Bayes estimator.

☑ Corollary 4.1.4
Lehmann and Casella

These are sufficient conditions for a Bayes solution to be unique. Basically, it requires a convex loss function and finite average risk.

☑ Example 4.1.5
Lehmann and Casella

Binomial example with Beta prior. The Bayes estimator for \(p\) is a weighted average between the UMVU \(X/n\) and the prior mean.
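A sketch of the computation (standard Beta-Binomial conjugacy): with \(X \sim \text{binomial}(n, p)\) and \(p \sim \text{Beta}(a, b)\), the posterior is \(\text{Beta}(a + X, b + n - X)\), so under squared error loss the Bayes estimator is the posterior mean

\begin{equation*} E(p | x) = \frac{a + x}{a + b + n} = \frac{n}{a + b + n} \cdot \frac{x}{n} + \frac{a + b}{a + b + n} \cdot \frac{a}{a + b}, \end{equation*}

a weighted average of the UMVU estimator \(X/n\) and the prior mean \(a/(a + b)\).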

DONE First examples [7/7]

We construct Bayes estimators as functions of the posterior density, and we decide which prior and loss function to use. These choices will affect the properties of the estimator.

☑ Example 4.2.2
Lehmann and Casella

Normal mean example with known variance. This is an example of Normal-Normal conjugacy.

☑ Theorem 4.2.3
Lehmann and Casella

Let \(\Theta\) have a distribution \(\Lambda\), and let \(P_\theta\) denote the conditional distribution of \(X\) given \(\theta\). Consider the estimation of \(g(\theta)\) when the loss function is squared error. Then, no unbiased estimator \(\delta(X)\) can be a Bayes solution unless

\begin{equation*} E[\delta(X) - g(\Theta)]^2 = 0 \end{equation*}

where the expectation is taken with respect to variation in both \(X\) and \(\Theta\).

☑ Example 4.2.4
Lehmann and Casella

Apply the last theorem to sample means.

☑ Example 4.2.5
Lehmann and Casella

Gamma is conjugate for normal precision parameter.

☑ Example 4.2.6
Lehmann and Casella

Normal variance with unknown mean.

☑ Example 4.2.8
Lehmann and Casella

Bayes estimator may still be defined even with improper prior. Improper prior has

\begin{equation*} \int d\Lambda(\theta) = \infty \end{equation*}
☑ Definition 4.2.10
Lehmann and Casella

A nonrandomized estimator \(\delta(x)\) is a limit of Bayes estimators if there exists a sequence of proper priors \(\pi_\nu\) and Bayes estimators \(\delta^{\pi_\nu}\) such that \(\delta^{\pi_\nu}(x) \rightarrow \delta(x)\) a.e. as \(\nu \rightarrow \infty\).

DONE Single Prior Bayes [3/3]

Denote the prior by the density \(\pi(\theta | \gamma)\).

We can then write a Bayes model in a general form as

\begin{equation*} X|\Theta \sim f(x|\theta), \end{equation*} \begin{equation*} \Theta | \gamma \sim \pi(\theta | \gamma). \end{equation*}

In this section, we assume that the functional form of the prior and the value of \(\gamma\) are known, so we have one completely specified prior.

Given a loss function \(L(\theta, d)\), we then look for the estimator that minimizes

\begin{equation*} \int L(\theta, d(x)) \pi(\theta | x, \gamma) d \theta, \end{equation*}

where \(\pi(\theta | x, \gamma) = f(x | \theta) \pi(\theta | \gamma) / \int f(x | \theta) \pi (\theta | \gamma) d \theta\).

☑ Example 4.3.1
Lehmann and Casella
☑ Conjugate exponential family
p. 244 Lehmann and Casella

In exponential families, there is a general expression for the conjugate prior distribution and use of this conjugate prior results in a simple expression for the posterior mean. For the density

\begin{equation*} p_\eta (x) = e^{\eta x - A(\eta)} h(x), \end{equation*}

the conjugate prior family is

\begin{equation*} \pi(\eta | \kappa, \mu) = c(\kappa, \mu) e^{\kappa \eta \mu - \kappa A(\eta)}, \end{equation*}

where \(\mu\) can be thought of as a prior mean and \(\kappa\) plays the role of a prior sample size (larger \(\kappa\) gives a more concentrated prior).

☑ Example 4.3.7
Lehmann and Casella

We can apply the conjugate prior for exponential families to the gamma distribution.

DONE Hierarchical Bayes [5/5]

☑ Section 4.5 intro
Lehmann and Casella

In a hierarchical Bayes model, rather than specifying the prior distribution as a single function, we specify it in a hierarchy. Thus, we place another level on the model and write

\begin{equation*} X|\theta \sim f(x | \theta), \\ \Theta | \gamma \sim \pi(\theta | \gamma), \\ \Gamma \sim \psi(\gamma), \end{equation*}

where we assume that \(\psi(\cdot)\) is known and not dependent on any other unknown hyperparameters (as parameters of a prior are sometimes called).

This allows us to model relatively complicated situations using a series of simpler steps; that is, both \(\pi(\theta | \gamma)\) and \(\psi(\gamma)\) may be of simple form (even conjugate), but \(\pi(\theta)\) may be more complex.

Given a loss function \(L(\theta, d)\), we would then determine the estimator that minimizes

\begin{equation*} \int L(\theta, d(x))\pi(\theta | x) d\theta \end{equation*}

where

\begin{equation*} \pi(\theta | x) = \frac{\int f(x | \theta) \pi(\theta | \gamma) \psi(\gamma) d\gamma}{\iint f(x | \theta) \pi(\theta | \gamma) \psi(\gamma) d\theta d\gamma} \end{equation*}
☑ Example 4.5.1 & 4.5.2
Lehmann and Casella

Conjugate Normal Hierarchy

\begin{align*} X_i | \theta &\sim N(\theta, \sigma^2), \sigma^2 \text{ known}, \quad i = 1, \ldots, n, \\ \theta|\tau &\sim N(0, \tau^2) \\ \frac{1}{\tau^2} &\sim \text{Gamma}(a, b), \quad a, b \text{ known}. \end{align*}

The hierarchical Bayes estimator of \(\theta\) under squared error loss is

\begin{equation*} \mathbb{E}(\Theta | x) = \mathbb{E}[\mathbb{E}(\Theta | x, \tau^2)] \end{equation*}
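A minimal sketch of the inner expectation, assuming the model above: conditional on \(\tau^2\), this is the usual conjugate normal posterior mean,

\begin{equation*} \mathbb{E}(\Theta | x, \tau^2) = \frac{n \tau^2}{\sigma^2 + n \tau^2} \bar{x}, \end{equation*}

and the outer expectation averages this shrinkage factor over the posterior distribution of \(\tau^2\) given \(x\).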

Even though at each stage of the model a conjugate prior was used, the resulting Bayes estimator is not from a conjugate prior and is not expressible in a simple form. Such an occurrence is commonplace in hierarchical Bayes analysis.

☑ Example 4.5.3
Lehmann and Casella

Beta-binomial hierarchy

\begin{align*} X|p &\sim \text{binomial}(p, n), \\ p|a, b &\sim \text{beta}(a, b), \\ (a, b) &\sim \psi(a, b), \end{align*}

leading to posterior mean

\begin{equation*} \mathbb{E}(p | x) = \frac{\int_0^1 p^{x + 1}(1-p)^{n - x} \pi(p) dp}{\int_0^1 p^{x}(1-p)^{n - x} \pi(p) dp} \end{equation*}

where

\begin{equation*} \pi(p) = \iint \frac{\Gamma (a + b)}{\Gamma (a) \Gamma (b)} p^{a - 1} (1-p)^{b - 1} \psi(a, b) da db. \end{equation*}

This posterior mean is difficult to calculate in closed form.

☑ Gibbs sampling
p. 256 Lehmann and Casella

Technically, these computations are not approximations, as they are exact in the limit. However, since they involve only a finite number of computations, we think of them as approximations, but realize that any order of precision can be achieved.

Suppose we are interested in calculating the posterior distribution \(\pi(\theta | x)\) (or \(\mathbb{E}(\Theta | x)\), or some other feature of the posterior distribution). We calculate the full conditionals

\begin{align*} \Theta|x, \gamma &\sim \pi(\theta | x, \gamma), \\ \Gamma |x, \theta &\sim \pi(\gamma | x, \theta), \end{align*}

which are the posterior distributions of each parameter conditional on all others.

If, for \(i = 1, 2, \ldots, M\), random variables are generated according to

\begin{align*} \Theta_i|x, \gamma_{i - 1} &\sim \pi(\theta | x, \gamma_{i - 1}), \\ \Gamma_i |x, \theta_i &\sim \pi(\gamma | x, \theta_i), \end{align*}

this defines a Markov chain \((\Theta_i, \Gamma_i)\). It follows from the rich theory of such chains that there exist distributions \(\pi(\theta | x)\) and \(\pi(\gamma | x)\) such that

\begin{align*} \Theta_i &\rightarrow \Theta \sim \pi(\theta | x), \\ \Gamma_i &\rightarrow \Gamma \sim \pi(\gamma | x) \end{align*}

as \(i \rightarrow \infty\), and

\begin{equation*} \frac{1}{M} \sum_{i = 1}^M h(\Theta_i) \rightarrow \mathbb{E}(h(\Theta)|x) = \int h(\theta)\pi(\theta|x)d\theta \end{equation*}

as \(M \rightarrow \infty\).

☑ Example 4.5.4
Lehmann and Casella

Poisson hierarchy with Gibbs sampling.

\begin{align*} X|\lambda &\sim \text{Poisson}(\lambda) \\ \Lambda|b &\sim \text{Gamma}(a, b), \quad a \text{ known} \\ \frac{1}{b} &\sim \text{Gamma}(k, \tau), \end{align*}

leading to the full conditionals

\begin{align*} \Lambda | x, b &\sim \text{Gamma}\left(a + x, \frac{b}{1 + b} \right) \\ \frac{1}{b} | x, \lambda &\sim \text{Gamma}\left(a + k, \frac{\tau}{1 + \lambda \tau} \right). \end{align*}

DONE Empirical Bayes [5/5]

☑ Section 4.6 intro
Lehmann and Casella

Empirical Bayes estimators tend to be more robust against misspecification of the prior distribution.

Treat \(\gamma\) as an unknown parameter of the model, which also needs to be estimated. We begin with the Bayes model

\begin{align*} X_i | \theta &\sim f(x|\theta), \quad i = 1, \ldots, p, \\ \Theta | \gamma &\sim \pi(\theta | \gamma). \end{align*}

and calculate the marginal distribution of \(X\), with density

\begin{equation*} m(x | \gamma) = \int \prod f(x_i | \theta) \pi(\theta | \gamma) d\theta. \end{equation*}

Based on \(m(x | \gamma)\), we obtain an estimate, \(\hat{\gamma}(x)\), of \(\gamma\). It is common to take \(\hat{\gamma}(x)\) to be the MLE of \(\gamma\). We now substitute \(\hat{\gamma}(x)\) for \(\gamma\) in \(\pi(\theta | \gamma)\) and determine the estimator that minimizes the empirical posterior loss

\begin{equation*} \int L(\theta, \delta(x)) \pi(\theta | x, \hat{\gamma}(x)) d\theta. \end{equation*}

This minimizing estimator is the empirical Bayes estimator.

☑ Example 4.6.1
Lehmann and Casella

Normal empirical Bayes

☑ Example 4.6.2
Lehmann and Casella

Empirical Bayes binomial

☑ Example 4.6.4
Lehmann and Casella

For estimating the natural parameter of an exponential family, the empirical Bayes estimator (using the marginal MLE) can be expressed in the same form as a formal Bayes estimator.

☑ Example 4.6.5
Lehmann and Casella

Hierarchical Bayes approximation

A hierarchical Bayes estimator averages over the hyperparameter, while an empirical Bayes estimator estimates the hyperparameter.

DONE Shrinkage Estimators [4/4]

☑ Example 4.7.1
Lehmann and Casella

The James-Stein estimator.

The James-Stein estimator is an empirical Bayes estimator for the mean of a multivariate normal distribution under sum-of-squared-errors loss.

When \(p \geq 3\), it has a smaller MSE than the MLE \(X\) for all \(\theta\), where \(X \sim N_p(\theta, \sigma^2 I)\).

However, the James-Stein estimator (or any empirical Bayes estimator) cannot attain as small a Bayes risk as the Bayes estimator.

☑ Lemma 11.1
Keener (Stein's Identity)

Suppose \(X \sim N(\mu, \sigma^2)\), \(h: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable (absolutely continuous is also sufficient), and

\begin{equation*} \mathbb{E}|h^{\prime}(X)| < \infty. \end{equation*}

Then

\begin{equation*} \mathbb{E}(X - \mu)h(X) = \sigma^2\mathbb{E}h^{\prime}(X) \end{equation*}
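A quick sanity check (mine): taking \(h(x) = x\) gives \(h^\prime \equiv 1\), and the identity reduces to

\begin{equation*} \mathbb{E}(X - \mu)X = \sigma^2, \end{equation*}

which is just \(\text{Var}(X) = \sigma^2\), since \(\mathbb{E}(X - \mu)\mu = 0\).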
☑ Corollary 4.7.2
Lehmann and Casella

This gives us an unbiased estimator of the risk.

☑ Example 4.7.3
Lehmann and Casella

Bayes risk of the James-Stein estimator.

It turns out that the James-Stein empirical Bayes estimator has a reasonable Bayes risk.

4.5 DONE Minimaxity and admissibility [3/3]

DONE Admissibility [5/5]

☑ Inadmissible definition
p. 213 Keener

A decision rule \(\delta\) is called inadmissible if a competing rule \(\delta^*\) has a better risk function, specifically if \(R(\theta, \delta^*) \leq R(\theta, \delta)\) for all \(\theta \in \Omega\) with \(R(\theta, \delta^*) < R(\theta, \delta)\) for some \(\theta \in \Omega\). If this happens, we say \(\delta^*\) dominates \(\delta\). All other rules are called admissible.

If your goal is to minimize risk, then you should never want to use an inadmissible rule.

☑ Theorem 11.6
Keener

Let \(\Lambda\) be a prior distribution. The decision rule \(\delta\) is called a Bayes rule for \(\Lambda\) if it minimizes the integrated risk. If a Bayes rule \(\delta\) for \(\Lambda\) is essentially unique, then \(\delta\) is admissible.

☑ Theorem 11.7
Keener

If risk functions for all decision rules are continuous in \(\theta\), if \(\delta\) is Bayes for \(\Lambda\) and has finite integrated risk \(R(\Lambda, \delta) < \infty\), and if the support of \(\Lambda\) is the whole parameter space \(\Omega\), then \(\delta\) is admissible.

☑ Theorem 11.9
Keener

In regular cases, any admissible rule will be a limit of Bayes rules. Unfortunately, some limits may give inadmissible rules. This result gives a sufficient condition for admissibility.

☑ Example 11.10
Keener

This is a Bayesian formulation of the one-sample problem.

DONE Minimax Estimation [12/12]

Choosing an estimator to minimize the maximum risk leads to minimax estimators.

☑ Definition 5.1.1
Lehmann and Casella

An estimator \(\delta^M\) of \(\theta\), which minimizes the maximum risk, that is, which satisfies

\begin{equation*} \inf_\delta \sup_\theta R(\theta, \delta) = \sup_\theta R(\theta, \delta^M), \end{equation*}

is called a minimax estimator.

It is difficult to determine minimax estimators for large classes of problems, so we will have to treat problems individually.

☑ Definition 5.1.3
Lehmann and Casella

Denote the average risk (Bayes risk) of the Bayes solution \(\delta_\Lambda\) by

\begin{equation*} r_\Lambda = r(\Lambda, \delta_\Lambda) = \int R(\theta, \delta_\Lambda) d\Lambda(\theta). \end{equation*}

A prior distribution \(\Lambda\) is least favorable if \(r_\Lambda \geq r_{\Lambda^{\prime}}\) for all prior distributions \(\Lambda^{\prime}\).

This is the prior distribution that causes the statistician the greatest average loss.

☑ Theorem 5.1.4
Lehmann and Casella

Provides a simple condition for a Bayes estimator \(\delta_\Lambda\) to be minimax.

Suppose that \(\Lambda\) is a distribution on \(\Theta\) such that

\begin{equation*} r(\Lambda, \delta_\Lambda) = \int R(\theta, \delta_\Lambda) d\Lambda(\theta) = \sup_\theta R(\theta, \delta_\Lambda). \end{equation*}

(The average of \(R(\theta, \delta_\Lambda)\) is equal to its maximum. This will be the case when the risk function is constant, or, more generally, when \(\Lambda\) assigns probability 1 to the set on which the risk function takes on its maximum values).

Then:

  1. \(\delta_\Lambda\) is minimax.
  2. If \(\delta_\Lambda\) is the unique Bayes solution with respect to \(\Lambda\), it is the unique minimax procedure.
  3. \(\Lambda\) is least favorable.
☑ Corollary 5.1.5
Lehmann and Casella

If a Bayes solution \(\delta_\Lambda\) has constant risk, then it is minimax.

☑ Example 5.1.7
Lehmann and Casella

Binomial example. This is a good example to review.
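From memory (worth checking the constants against Example 5.1.7): for \(X \sim \text{binomial}(n, p)\) under squared error loss, the Bayes estimator with respect to the \(\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)\) prior is

\begin{equation*} \delta(X) = \frac{X + \sqrt{n}/2}{n + \sqrt{n}}, \qquad R(p, \delta) = \frac{1}{4(1 + \sqrt{n})^2} \text{ for all } p, \end{equation*}

so it has constant risk and, being Bayes, is minimax by Corollary 5.1.5.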

☑ Example 5.1.8
Lehmann and Casella

Randomized minimax estimator example.

It is possible to reduce the maximum risk through randomization.

☑ Example 5.1.9
Lehmann and Casella

Difference of two binomials example.

☑ Lemma 5.1.10
Lehmann and Casella (helpful)

Let \(\delta\) be a Bayes (respectively, UMVU, minimax, admissible) estimator of \(g(\theta)\) for squared error loss. Then, \(a\delta + b\) is Bayes (respectively, UMVU, minimax, admissible) for \(ag(\theta) + b\).

☑ Definition 5.1.11
Lehmann and Casella

Theorem 5.1.4 requires the existence of a least favorable distribution; when such a distribution does not exist, Theorem 5.1.4 is not applicable. Consider, for example, the problem of estimating the mean \(\theta\) of a normal distribution with known variance. Since all possible values of \(\theta\) play a completely symmetrical role, in the sense that none is easier to estimate than any other, it is natural to conjecture that the least favorable distribution is "uniform" on the real line, that is, that the least favorable distribution is Lebesgue measure. This is the Jeffreys prior and, in this case, it is not a proper distribution.

There are two ways in which the approach of Theorem 5.1.4 can be generalized to include improper priors.

  1. It may turn out that the posterior distribution given \(x\) is a proper distribution. One can then compute the expectation \(\mathbb{E}(g(\Theta) | x)\) for this distribution, a generalized Bayes estimator.
  2. Alternatively, one can approximate the improper prior distribution with a sequence of proper distributions.

A sequence of prior distributions \(\{ \Lambda_n \}\) is least favorable if for every prior distribution \(\Lambda\) we have

\begin{equation*} r_\Lambda \leq r = \lim_{n \rightarrow \infty} r_{\Lambda_n}, \end{equation*}

where

\begin{equation*} r_{\Lambda_n} = \int R(\theta, \delta_n) d\Lambda_n(\theta) \end{equation*}

is the Bayes risk under \(\Lambda_n\).

☑ Theorem 5.1.12
Lehmann and Casella

Suppose that \(\{\Lambda_n\}\) is a sequence of prior distributions with Bayes risk \(r_n\) satisfying

\begin{equation*} r_n \leq r = \lim_{n \rightarrow \infty} r_n, \end{equation*}

and that \(\delta\) is an estimator for which

\begin{equation*} \sup_\theta R(\theta, \delta) = r. \end{equation*}

Then

  1. \(\delta\) is minimax and
  2. the sequence \(\{\Lambda_n\}\) is least favorable.
☑ Lemma 5.1.13
Lehmann and Casella

This lemma helps us evaluate \(r\) in the previous theorem, and hence the Bayes risk \(r_{\Lambda_n}\).

If \(\delta_\Lambda\) is the Bayes estimator of \(g(\theta)\) with respect to \(\Lambda\) and if

\begin{equation*} r_\Lambda = \mathbb{E} [\delta_\Lambda(X) - g(\Theta)]^2 \end{equation*}

is its Bayes risk, then

\begin{equation*} r_\Lambda = \int \text{Var}[g(\Theta) | x] dP(x) \end{equation*}

In particular, if the posterior variance of \(g(\Theta)|X\) is independent of \(x\), then

\begin{equation*} r_\Lambda = \text{Var} [g(\Theta)|x] \end{equation*}
☑ Example 5.1.14
Lehmann and Casella

Normal mean example. If we have known variance and a conjugate normal prior, then as the prior variance \(\rightarrow \infty\), the Bayes risk increases to \(\sigma^2/n\), and this shows that \(\bar{X}\) is minimax.
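A sketch of that limit (standard conjugate computation): with prior \(\Theta \sim N(0, b^2)\), the posterior variance does not depend on \(x\), so by Lemma 5.1.13 the Bayes risk is

\begin{equation*} r_{\Lambda_b} = \frac{(\sigma^2/n) b^2}{\sigma^2/n + b^2} \uparrow \frac{\sigma^2}{n} \quad \text{as } b^2 \rightarrow \infty, \end{equation*}

and since \(\sup_\theta R(\theta, \bar{X}) = \sigma^2/n\), Theorem 5.1.12 shows that \(\bar{X}\) is minimax.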

DONE Admissibility and minimaxity in exponential families [13/13]

A UMVU estimator \(\delta\) need not be admissible. If a biased estimator \(\delta^{\prime}\) has uniformly smaller risk, the choice between \(\delta\) and \(\delta^{\prime}\) is not clear cut: One must balance the advantage of unbiasedness against the drawback of larger risk. The situation is, however, different for minimax estimators. If \(\delta^{\prime}\) dominates a minimax estimator \(\delta\), then \(\delta^{\prime}\) is also minimax and, thus, definitely preferred. It is, therefore, particularly important to ascertain whether a proposed minimax estimator is admissible.

☑ Lemma 5.2.1
Lehmann and Casella

To prove inadmissibility of an estimator \(\delta\), it is sufficient to produce an estimator \(\delta^{\prime}\) which dominates it.

Let the range of the estimand \(g(\theta)\) be an interval with end-points \(a\) and \(b\), and suppose that the loss function \(L(\theta, d)\) is positive when \(d \neq g(\theta)\) and zero when \(d = g(\theta)\), and that for any fixed \(\theta\), \(L(\theta, d)\) is increasing as \(d\) moves away from \(g(\theta)\) in either direction. Then, any estimator \(\delta\) taking on values outside the closed interval \([a, b]\) with positive probability is inadmissible.

☑ Example 5.2.2
Lehmann and Casella

Randomized response example.

☑ Lemma 5.2.4
Lehmann and Casella

Any unique Bayes estimator is admissible.

☑ Example 5.2.5
Lehmann and Casella

Admissibility of linear estimators.

☑ Theorem 5.2.6
Lehmann and Casella

This is an inadmissibility result for linear estimators which is quite general and does not require the assumption of normality.

Let \(X\) be a random variable with mean \(\theta\) and variance \(\sigma^2\). Then \(aX + b\) is an inadmissible estimator of \(\theta\) under squared error loss whenever

  1. \(a > 1\), or
  2. \(a < 0\), or
  3. \(a = 1\) and \(b \neq 0\).
☑ Example 5.2.8
Lehmann and Casella

Two proofs for the admissibility of \(\bar{X}\) for estimating the mean of a normal distribution. The first proof uses the Limiting Bayes Method. The second proof uses the Information Inequality Method.

☑ Lemma 5.2.12
Lehmann and Casella

Let \(X\) and \(Y\) be independent (possibly vector-valued) with distributions \(F_\xi\) and \(G_\eta\), respectively, where \(\xi\) and \(\eta\) vary independently. Then, if \(\delta(X)\) is admissible for estimating \(\xi\) when \(Y\) is not present, it continues to be so in the presence of \(Y\).

☑ Example 5.2.13
Lehmann and Casella

Normal variance.

☑ Theorem 5.2.14
Lehmann and Casella (Karlin's Theorem)

Let X have probability density

\begin{equation*} p_\theta(x) = \beta(\theta) e^{\theta T(x)} \quad (\theta, T \text{ real valued}) \end{equation*}

with respect to \(\mu\) and let \(\Omega\) be the natural parameter space. Then, \(\Omega\) is an interval, with endpoints say \(\theta_1\) and \(\theta_2\) \((- \infty \leq \theta_1 \leq \theta_2 \leq \infty)\). For estimating \(\mathbb{E}_\theta (T)\), the estimator \(aT + b\) is inadmissible if \(a < 0\) or \(a > 1\) and is constant for \(a = 0\). To state Karlin's sufficient condition in the remaining cases, it is convenient to write the estimator as

\begin{equation*} \delta_{\lambda, \gamma}(x) = \frac{1}{1 + \lambda}T + \frac{\gamma \lambda}{1 + \lambda}, \end{equation*}

with \(0 \leq \lambda < \infty\) corresponding to \(0 < a \leq 1\).

A sufficient condition for the admissibility of \(\delta_{\lambda, \gamma}(x)\) for estimating \(g(\theta) = \mathbb{E}_\theta(T)\) with squared error loss is that the integral of \(e^{-\gamma \lambda \theta}[\beta(\theta)]^{-\lambda}\) diverges at \(\theta_1\) and \(\theta_2\); that is, that for some (and hence for all) \(\theta_1 < \theta_0 < \theta_2\), the two integrals

\begin{equation*} \int_{\theta_0}^{\theta^*} \frac{e^{-\gamma \lambda \theta}}{[\beta(\theta)]^\lambda}d \theta \quad \text{and} \quad \int_{\theta^*}^{\theta_0} \frac{e^{-\gamma \lambda \theta}}{[\beta(\theta)]^\lambda}d \theta \end{equation*}

tend to infinity as \(\theta^*\) tends to \(\theta_2\) and \(\theta_1\), respectively.

☑ Corollary 5.2.18
Lehmann and Casella

If the natural parameter space of \(p_\theta(x) = \beta(\theta) e^{\theta T(x)}\) is the whole real line, so that \(\theta_1 = -\infty, \theta_2 = \infty\), then \(T\) is admissible for estimating \(\mathbb{E}_\theta (T)\) with squared error loss.

☑ Lemma 5.2.19
Lehmann and Casella

If an estimator has constant risk and is admissible, it is minimax.

☑ Corollary 5.2.20
Lehmann and Casella

If the natural parameter space of \(p_\theta(x) = \beta(\theta) e^{\theta T(x)}\) is the whole real line, so that \(\theta_1 = -\infty, \theta_2 = \infty\), then \(T\) is the unique minimax estimator of \(g(\theta) = \mathbb{E}_\theta (T)\) for the loss function \([d - g(\theta)]^2/\text{Var}_\theta (T)\).

☑ Lemma 5.2.21
Lehmann and Casella

If an estimator is unique minimax, it is admissible.

4.6 DONE Uniformly most powerful tests [6/6]

DONE Stating the problem [1/1]

☑ Chapter 3 Section 1
Lehmann

We wish to decide whether or not some hypothesis that has been formulated is correct. The choice here lies between only two decisions: accepting or rejecting the hypothesis. A decision procedure for such a problem is called a test of the hypothesis in question.

The decision is made based on the value of a certain random variable \(X\), the distribution \(P_\theta\) of which is known to belong to a class \(\{P_\theta, \theta \in \Omega \}\). We classify these distributions into two groups, those for which the hypothesis is true and those for which it is false. These classes are mutually exclusive and denoted as \(H\) and \(K\), respectively, with corresponding subsets of \(\Omega\) being \(\Omega_H\) and \(\Omega_K\). \(K\) is then the class of alternatives.

There are two decisions to be made

\begin{equation*} d_0 \quad \text{accepting } H, \\ d_1 \quad \text{rejecting } H \end{equation*}

A nonrandomized test procedure assigns to each possible value \(x\) of \(X\) one of these two decisions, and thereby divides the sample space into two complementary regions \(S_0\) and \(S_1\). If \(X\) falls into \(S_0\) the hypothesis is accepted, otherwise it is rejected. \(S_0\) is called the region of acceptance, and \(S_1\) is the region of rejection, or critical region.

There are two types of errors to be made:

\begin{equation*} \text{Type I error: reject } H \text{ when it is true} \\ \text{Type II error: accept } H \text{ when it is false} \end{equation*}

When the number of observations is given, probabilities of both types of error cannot be controlled simultaneously. It is customary therefore to assign a bound to the probability of incorrectly rejecting \(H\) when it is true (Type I error), and to attempt to minimize the other probability (Type II error) subject to this condition. Thus, one selects a number \(\alpha\) between 0 and 1, called the level of significance, and imposes the condition that

\begin{equation*} \mathbb{P}_\theta \{\delta(X) = d_1\} = \mathbb{P}_\theta \{ X \in S_1 \} \leq \alpha \text{ for all } \theta \in \Omega_H. \end{equation*}

Subject to this condition, it is desired to maximize the power which is

\begin{equation*} \mathbb{P}_\theta \{\delta(X) = d_1 \} = \mathbb{P}_\theta \{X \in S_1 \} \text{ for all } \theta \in \Omega_K. \end{equation*}

It is a function of \(\theta\), and we call this function the power function, denoted by \(\beta(\theta)\).

The size of the test is then \(\sup_{\Omega_H} \mathbb{P}_\theta \{ X \in S_1 \}\).

A low significance level results in the hypothesis being rejected only for a small set of values of the observations whose total probability under the hypothesis is small, so that such values would be most unlikely to occur if \(H\) were true.

In applications, there is usually available a nested family of rejection regions, corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level \(\hat{\alpha} = \hat{\alpha}(x)\), the critical level, at which the hypothesis would be rejected for the given observation.

Considering the structure of a randomized test, for any value \(x\), such a test chooses among the two decisions, rejection or acceptance, with certain probabilities that depend on \(x\) and will be denoted by \(\phi(x)\) and \(1 - \phi(x)\), respectively. A randomized test is therefore completely characterized by a function \(\phi\), the critical function, with \(0 \leq \phi(x) \leq 1\) for all \(x\). If \(\phi\) takes on only the values 0 and 1, one is back in the case of a nonrandomized test. The set of points \(x\) for which \(\phi(x) = 1\) is then just the rejection region, so that in a nonrandomized test \(\phi\) is simply the indicator function of the critical region.

The probability of rejection is

\begin{equation*} \mathbb{E}_\theta \phi(X) = \int \phi(x) d\mathbb{P}_\theta (x), \end{equation*}

the conditional probability \(\phi(x)\) of rejection given \(x\), integrated with respect to the probability distribution of \(X\). The problem is to select \(\phi\) so as to maximize the power

\begin{equation*} \beta_\phi(\theta) = \mathbb{E}_\theta \phi(X) \text{ for all } \theta \in \Omega_K \end{equation*}

subject to the condition

\begin{equation*} \mathbb{E}_\theta \phi(X) \leq \alpha \text{ for all } \theta \in \Omega_H. \end{equation*}

A uniformly most powerful (UMP) test maximizes the power for all alternatives in \(K\) even when there is more than one.

DONE Neyman-Pearson Fundamental Lemma [4/4]

☑ Chapter 3 Section 2
Lehmann

A class of distributions is called simple if it contains only a single distribution and otherwise is said to be composite.

Let the distribution under a simple hypothesis \(H\) and alternative \(K\) be \(\mathbb{P}_0\) and \(\mathbb{P}_1\), and suppose that these distributions are discrete with \(\mathbb{P}_i \{X = x \} = \mathbb{P}_i(x)\) for \(i = 0, 1\). If we only consider nonrandomized tests, the optimum test is defined as the critical region \(S\) satisfying

\begin{equation*} \sum_{x \in S} \mathbb{P}_0 (x) \leq \alpha \end{equation*}

and

\begin{equation*} \sum_{x \in S} \mathbb{P}_1 (x) = \text{ maximum}. \end{equation*}

So we include points in \(S\) based on the idea of a "limited budget". The most valuable points \(x\) are those with the highest value of

\begin{equation*} r(x) = \frac{\mathbb{P}_1 (x)}{\mathbb{P}_0 (x)}. \end{equation*}

So points are rated according to the value of this ratio and selected for \(S\) in this order, as many as one can afford under \(\sum_{x \in S} \mathbb{P}_0 (x) \leq \alpha\).

It may happen that when a certain point is included, the value \(\alpha\) has not yet been reached but that it would be exceeded if the next point were also included. The exact value of \(\alpha\) can then either not be achieved at all, or it can be attained only by passing over the next desirable point and in its place taking one further down the list. The difficulty can be overcome by permitting randomization. This makes it possible to split the next point, including only a portion of it, and thereby to obtain the exact value \(\alpha\) without breaking the order of the preference that has been established for the various sampling points.

☑ Theorem 3.1
Lehmann (The fundamental lemma of Neyman and Pearson)

Let \(\mathbb{P}_0\) and \(\mathbb{P}_1\) be probability distributions possessing densities \(p_0\) and \(p_1\) respectively with respect to a measure \(\mu\).

i. Existence. For testing \(H: p_0\) against the alternative \(K: p_1\) there exists a test \(\phi\) and a constant \(k\) such that

\begin{equation*} \mathbb{E}_0 \phi(X) = \alpha \end{equation*}

and

\begin{equation*} \phi(x) = 1 \text{ when } p_1(x) > k p_0(x) \\ \phi(x) = 0 \text{ when } p_1(x) < k p_0(x) \end{equation*}

ii. Sufficient condition for most powerful test. If a test satisfies the two conditions in part (i) for some \(k\), then it is most powerful for testing \(p_0\) against \(p_1\) at level \(\alpha\).

iii. Necessary condition for a most powerful test. If \(\phi\) is most powerful at level \(\alpha\) for testing \(p_0\) against \(p_1\), then for some \(k\) it satisfies

\begin{equation*} \phi(x) = 1 \text{ when } p_1(x) > k p_0(x) \\ \phi(x) = 0 \text{ when } p_1(x) < k p_0(x) \end{equation*}

a.e. \(\mu\). It also satisfies \(\mathbb{E}_0 \phi(X) = \alpha\) unless there exists a test of size \(< \alpha\) and with power \(1\).
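A quick worked example (mine, not from the text): test \(H: X \sim N(0, 1)\) against \(K: X \sim N(1, 1)\) based on a single observation. The likelihood ratio

\begin{equation*} \frac{p_1(x)}{p_0(x)} = e^{x - 1/2} \end{equation*}

is increasing in \(x\), so the most powerful level-\(\alpha\) test rejects when \(X > z_{1 - \alpha}\), the \(1 - \alpha\) quantile of \(N(0, 1)\); no randomization is needed because the distribution of \(X\) is continuous.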

☑ Corollary 3.1
Lehmann

Let \(\beta\) denote the power of the most powerful level- \(\alpha\) test \((0 < \alpha < 1)\) for testing \(\mathbb{P}_0\) against \(\mathbb{P}_1\). Then \(\alpha < \beta\) unless \(\mathbb{P}_0 = \mathbb{P}_1\).

☑ Example 12.6
Keener

DONE Distributions with monotone likelihood ratio [8/8]

Besides testing simple hypotheses, we often have distributions depending on a single real-valued parameter \(\theta\), and the hypothesis is one-sided, say \(H_0: \theta \leq \theta_0\). In general, the most powerful test of \(H\) against an alternative \(\theta_1 > \theta_0\) depends on \(\theta_1\) and is then not UMP. However, a UMP test does exist if an additional assumption is satisfied. The real-parameter family of densities \(p_\theta(x)\) is said to have monotone likelihood ratio if there exists a real-valued function \(T(x)\) such that for any \(\theta < \theta^{\prime}\) the distributions \(\mathbb{P}_\theta\) and \(\mathbb{P}_{\theta^{\prime}}\) are distinct, and the ratio \(p_{\theta^{\prime}}(x) / p_\theta(x)\) is a nondecreasing function of \(T(x)\).

☑ Theorem 3.2
Lehmann

Let \(\theta\) be a real parameter, and let the random variable \(X\) have probability density \(p_\theta(x)\) with monotone likelihood ratio in \(T(x)\).

(i) For testing \(H: \theta \leq \theta_0\) against \(K: \theta > \theta_0\), there exists a UMP test, which is given by

\begin{align*} \phi(x) &= 1 \text{ when } T(x) > C, \\ \phi(x) &= \gamma \text{ when } T(x) = C, \\ \phi(x) &= 0 \text{ when } T(x) < C, \end{align*}

where \(C\) and \(\gamma\) are determined by

\begin{equation*} \mathbb{E}_{\theta_0}\phi(X) = \alpha. \end{equation*}

(ii) The power function

\begin{equation*} \beta(\theta) = \mathbb{E}_\theta \phi(X) \end{equation*}

of this test is strictly increasing for all points \(\theta\) for which \(0 < \beta(\theta) < 1\).

(iii) For all \(\theta^{\prime}\), the test determined in step (i) is UMP for testing \(H^{\prime}: \theta \leq \theta^{\prime}\) against \(K^{\prime}: \theta > \theta^{\prime}\) at level \(\alpha^{\prime} = \beta(\theta^{\prime})\).

(iv) For any \(\theta < \theta_0\) the test minimizes \(\beta(\theta)\) (the probability of error of the first kind) among all tests satisfying \(\mathbb{E}_{\theta_0}\phi(X) = \alpha\).

☑ Example 3.1
Lehmann

The hypergeometric distribution satisfies the assumption of monotone likelihood ratio with \(T(x) = x\). Therefore, there exists a UMP test of the one-sided hypothesis, which rejects when \(X\) is too large (or too small, depending on the direction of the test).

☑ Corollary 3.2
Lehmann

One-parameter exponential families satisfy the monotone likelihood ratio assumption.

Let \(\theta\) be a real parameter, and let \(X\) have probability density (with respect to some measure \(\mu\))

\begin{equation*} p_\theta (x) = C(\theta) e^{Q(\theta)T(x)} h(x), \end{equation*}

where \(Q\) is strictly monotone. Then there exists a UMP test \(\phi\) for testing \(H: \theta \leq \theta_0\) against \(K: \theta > \theta_0\). If \(Q\) is increasing,

\begin{equation*} \phi(x) = 1, \gamma, 0 \quad \text{as} \quad T(x) >, =, < C, \end{equation*}

where \(C\) and \(\gamma\) are determined by \(\mathbb{E}_{\theta_0}\phi(X) = \alpha\). If \(Q\) is decreasing, the inequalities are reversed.

☑ Example 3.2
Lehmann

Binomial example of one-parameter exponential family.

☑ Example 3.3
Lehmann

Poisson example of one-parameter exponential family.

☑ Example 12.10
Keener

Monotone likelihood ratios for joint densities of uniform distributions.

☑ Lemma 3.1
Lehmann

Let \(F_0\) and \(F_1\) be two cumulative distribution functions on the real line. Then \(F_1(x) \leq F_0(x)\) for all \(x\) if and only if there exist two nondecreasing functions \(f_0\) and \(f_1\), and a random variable \(V\), such that

(a) \(f_0(v) \leq f_1(v)\) for all \(v\), and

(b) the distributions of \(f_0(V)\) and \(f_1(V)\) are \(F_0\) and \(F_1\), respectively.

☑ Lemma 3.2
Lehmann

Basic properties of families with monotone likelihood ratios.

Let \(p_\theta(x)\) be a family of densities on the real line with monotone likelihood ratio in \(x\).

(i) If \(\psi\) is a nondecreasing function of \(x\), then \(\mathbb{E}_\theta \psi(X)\) is a nondecreasing function of \(\theta\); if \(X_1, \ldots, X_n\) are independently distributed with density \(p_\theta\) and \(\psi^{\prime}\) is a function of \(x_1, \ldots, x_n\) which is nondecreasing in each of its arguments, then \(\mathbb{E}_\theta \psi^{\prime}(X_1, \ldots, X_n)\) is a nondecreasing function of \(\theta\).

(ii) For any \(\theta < \theta^{\prime}\), the cumulative distribution functions of \(X\) under \(\theta\) and \(\theta^{\prime}\) satisfy

\begin{equation*} F_{\theta^{\prime}}(x) \leq F_\theta(x) \quad \text{for all } x. \end{equation*}

(iii) Let \(\psi\) be a function with a single change of sign. More specifically, suppose there exists a value \(x_0\) such that \(\psi(x) \leq 0\) for \(x < x_0\) and \(\psi(x) \geq 0\) for \(x \geq x_0\). Then there exists \(\theta_0\) such that \(\mathbb{E}_\theta \psi(X) \leq 0\) for \(\theta < \theta_0\) and \(\mathbb{E}_\theta \psi(X) \geq 0\) for \(\theta > \theta_0\), unless \(\mathbb{E}_\theta \psi(X)\) is either positive for all \(\theta\) or negative for all \(\theta\).

(iv) Suppose that \(p_\theta(x)\) is positive for all \(\theta\) and all \(x\), that \(p_{\theta^{\prime}}(x) / p_\theta(x)\) is strictly increasing in \(x\) for \(\theta < \theta^{\prime}\), and that \(\psi(x)\) is as in (iii) and is \(\neq 0\) with positive probability. If \(\mathbb{E}_{\theta_0} \psi(X) = 0\), then \(\mathbb{E}_\theta \psi(X) < 0\) for \(\theta < \theta_0\) and \(> 0\) for \(\theta > \theta_0\).

DONE Confidence bounds [6/6]

☑ Chapter 3 Section 5
Lehmann

The theory of UMP one-sided tests can be applied to the problem of obtaining a lower or upper bound for a real-valued parameter \(\theta\). The discussion of lower and upper bounds is completely parallel, and it is therefore enough to consider the case of a lower bound, say \(\underline{\theta}\).

One selects a number \(1 - \alpha\), the confidence level, and restricts attention to bounds \(\underline{\theta}\) satisfying

\begin{equation*} \mathbb{P}_\theta(\underline{\theta}(X) \leq \theta) \geq 1-\alpha \quad \text{for all } \theta. \end{equation*}

A function \(\underline{\theta}\) for which

\begin{equation*} \mathbb{P}_\theta(\underline{\theta}(X) \leq \theta^{\prime}) = \text{minimum} \end{equation*}

for all \(\theta^{\prime} < \theta\) subject to the above condition is a uniformly most accurate lower confidence bound for \(\theta\) at confidence level \(1 - \alpha\).

☑ Theorem 3.4
Lehmann

(i) For each \(\theta_0 \in \Omega\) let \(A(\theta_0)\) be the acceptance region of a level- \(\alpha\) test for testing \(H(\theta_0): \theta = \theta_0\), and for each sample point \(x\) let \(S(x)\) denote the set of parameter values

\begin{equation*} S(x) = \{\theta : x \in A(\theta), \theta \in \Omega \}. \end{equation*}

Then \(S(x)\) is a family of confidence sets for \(\theta\) at confidence level \(1 - \alpha\).

(ii) If for all \(\theta_0, A(\theta_0)\) is UMP for testing \(H(\theta_0)\) at level \(\alpha\) against the alternatives \(K(\theta_0)\), then for each \(\theta_0\) in \(\Omega, S(X)\) minimizes the probability

\begin{equation*} \mathbb{P}_\theta(\theta_0 \in S(X)) \text{ for all } \theta \in K(\theta_0) \end{equation*}

among all level- \((1-\alpha)\) families of confidence sets for \(\theta\).

☑ Corollary 3.3
Lehmann

Let the family of densities \(p_\theta(x), \theta \in \Omega\), have monotone likelihood ratio in \(T(x)\), and suppose that the cumulative distribution function \(F_\theta(t)\) of \(T = T(X)\) is a continuous function in each of the variables \(t\) and \(\theta\) when the other is fixed.

(i) There exists a uniformly most accurate confidence bound \(\underline{\theta}\) for \(\theta\) at each confidence level \(1 - \alpha\).

(ii) If \(x\) denotes the observed values of \(X\) and \(t = T(x)\), and if the equation

\begin{equation*} F_\theta(t) = 1 - \alpha \end{equation*}

has a solution \(\theta = \hat{\theta}\) in \(\Omega\), then this solution is unique and \(\underline{\theta}(x) = \hat{\theta}\).

☑ Example 3.6
Lehmann

Exponential waiting times

☑ Discrete Distributions
p. 93 Lehmann

If the variables \(X\) or \(T\) are discrete, Corollary 3.3 cannot be applied directly, since the distribution functions \(F_\theta(t)\) are not continuous, and for most values \(\theta_0\) the optimum tests of \(H: \theta = \theta_0\) are randomized. However, any randomized test based on \(X\) has a representation as a nonrandomized test depending on \(X\) and an independent random variable \(U\) distributed uniformly over \((0, 1)\).

☑ Example 3.7
Lehmann

Binomial example

DONE A generalization of the fundamental lemma [3/3]

☑ Theorem 3.5
Lehmann

This is a useful extension of Theorem 3.1 to the case of more than one side condition.

☑ Corollary 3.4
Lehmann

Let \(p_1, \ldots, p_m, p_{m+1}\) be probability densities with respect to a measure \(\mu\), and let \(0 < \alpha < 1\). Then there exists a test \(\phi\) such that \(\mathbb{E}_i \phi(X) = \alpha (i = 1, \ldots, m)\) and \(\mathbb{E}_{m + 1} \phi(X) > \alpha\), unless \(p_{m+1} = \sum_{i = 1}^m k_i p_i\) a.e. \(\mu\).

☑ Lemma 3.3
Lehmann

This is the method of undetermined multipliers.

Let \(F_1, \ldots, F_{m+1}\) be real-valued functions defined over a space \(U\), and consider the problem of maximizing \(F_{m+1}(u)\) subject to \(F_i(u) = c(i), (i = 1, \ldots, m)\). A sufficient condition for a point \(u^0\) satisfying the side conditions to be a solution of the given problem is that among all points of \(U\) it maximizes

\begin{equation*} F_{m+1}(u) - \sum_{i = 1}^m k_i F_i(u) \end{equation*}

for some \(k_1, \ldots, k_m\).

DONE Two-sided hypotheses [2/2]

Two-sided hypotheses have the form

\begin{equation*} H:\theta \leq \theta_1 \text{ or } \theta \geq \theta_2 \quad (\theta_1 < \theta_2). \end{equation*}
☑ Theorem 3.6
Lehmann

(i) For testing the hypothesis \(H: \theta \leq \theta_1 \text{ or } \theta \geq \theta_2\) \((\theta_1 < \theta_2)\) against the alternatives \(K: \theta_1 < \theta < \theta_2\) in the one-parameter exponential family, there exists a UMP test given by

\begin{align*} \phi(x) &= 1 \text{ when } C_1 < T(x) < C_2 \quad (C_1 < C_2), \\ \phi(x) &= \gamma_i \text{ when } T(x) = C_i, \quad i = 1, 2, \\ \phi(x) &= 0 \text{ when } T(x) < C_1 \text{ or } > C_2. \end{align*}

where the \(C\) 's and \(\gamma\) 's are determined by

\begin{equation*} \mathbb{E}_{\theta_1}\phi(X) = \mathbb{E}_{\theta_2} \phi(X) = \alpha. \end{equation*}

(ii) This test minimizes \(\mathbb{E}_\theta \phi(X)\) subject to \(\mathbb{E}_{\theta_1}\phi(X) = \mathbb{E}_{\theta_2} \phi(X) = \alpha\) for all \(\theta < \theta_1\) and \(> \theta_2\).

(iii) For \(0 < \alpha < 1\) the power function of this test has a maximum at a point \(\theta_0\) between \(\theta_1\) and \(\theta_2\) and decreases strictly as \(\theta\) tends away from \(\theta_0\) in either direction, unless there exist two values \(t_1, t_2\) such that \(\mathbb{P}_\theta(T(X) = t_1) + \mathbb{P}_\theta(T(X) = t_2) = 1\) for all \(\theta\).

☑ Lemma 3.4
Lehmann

Let \(p_\theta(x)\) be positive for all \(\theta\) and all \(x\), with \(p_{\theta^{\prime}}(x) / p_\theta (x)\) strictly increasing in \(x\) for \(\theta < \theta^{\prime}\).

(i) If \(\phi\) and \(\phi^*\) are two tests satisfying

\begin{align*} \phi(x) &= 1 \text{ when } C_1 < T(x) < C_2 \quad (C_1 < C_2), \\ \phi(x) &= \gamma_i \text{ when } T(x) = C_i, \quad i = 1, 2, \\ \phi(x) &= 0 \text{ when } T(x) < C_1 \text{ or } > C_2. \end{align*}

and \(\mathbb{E}_{\theta_1} \phi(T) = \mathbb{E}_{\theta_1} \phi^*(T)\), and if \(\phi^*\) is to the right of \(\phi\), then \(\beta(\theta) < \beta^*(\theta)\) for \(\theta > \theta_1\) and \(\beta(\theta) > \beta^*(\theta)\) for \(\theta < \theta_1\).

(ii) If \(\phi\) and \(\phi^*\) satisfy

\begin{align*} \phi(x) &= 1 \text{ when } C_1 < T(x) < C_2 \quad (C_1 < C_2), \\ \phi(x) &= \gamma_i \text{ when } T(x) = C_i, \quad i = 1, 2, \\ \phi(x) &= 0 \text{ when } T(x) < C_1 \text{ or } > C_2. \end{align*}

and

\begin{equation*} \mathbb{E}_{\theta_1}\phi(X) = \mathbb{E}_{\theta_2} \phi(X) = \alpha. \end{equation*}

then \(\phi = \phi^*\) with probability one.

4.7 DONE Unbiased tests [5/5]

☑ Definition 12.25
Keener

A test \(\varphi\) for \(H_0: \theta \in \Omega_0\) versus \(H_1: \theta \in \Omega_1\) with level \(\alpha\) is unbiased if its power \(\beta_\varphi(\theta) = \mathbb{E}_\theta \varphi\) satisfies

\begin{equation*} \beta_\varphi (\theta) \leq \alpha, \text{ for all } \theta \in \Omega_0 \text{ and } \beta_\varphi(\theta) \geq \alpha, \text{ for all } \theta \in \Omega_1. \end{equation*}

If there is a uniformly most powerful test, then it is automatically unbiased, since its power can be no smaller than that of the trivial test \(\varphi \equiv \alpha\).

☑ Lemma 4.1
Lehmann

If the distributions \(\mathbb{P}_\theta\) are such that the power function of every test is continuous, and if \(\phi_0\) is UMP among all tests that are similar on the boundary and is a level-\(\alpha\) test of \(H\), then \(\phi_0\) is UMP unbiased.

☑ Chapter 4 Section 2
Lehmann

Let \(\theta\) be a real parameter, and \(X = (X_1, \ldots, X_n)\) a random vector with probability density (with respect to some measure \(\mu\))

\begin{equation*} p_\theta(x) = C(\theta)e^{\theta T(x)}h(x). \end{equation*}

We have shown that a UMP test exists when the hypothesis \(H\) and the class \(K\) of alternatives are given by

(i) \(H: \theta \leq \theta_0, \quad K: \theta > \theta_0\) (Corollary 3.2)

(ii) \(H: \theta \leq \theta_1\) or \(\theta \geq \theta_2, \quad (\theta_1 < \theta_2)\), \(K: \theta_1 < \theta < \theta_2\) (Theorem 3.6)

We will now show that for

(iii) \(H: \theta_1 \leq \theta \leq \theta_2, K: \theta < \theta_1\) or \(\theta > \theta_2\),

there exists a UMP unbiased test given by

\begin{align*} \phi(x) &= 1 \text{ when } T(x) < C_1 \text{ or } > C_2, \\ \phi(x) &= \gamma_i \text{ when } T(x) = C_i, \quad i = 1, 2, \\ \phi(x) &= 0 \text{ when } C_1 < T(x) < C_2, \end{align*}

where the \(C\) 's and \(\gamma\) 's are determined by

\begin{equation*} \mathbb{E}_{\theta_1} \phi(X) = \mathbb{E}_{\theta_2} \phi(X) = \alpha. \end{equation*}

A closely related problem is that of testing

(iv) \(H: \theta = \theta_0\) against the alternatives \(\theta \neq \theta_0\).

For this there also exists a UMP unbiased test given by

\begin{align*} \phi(x) &= 1 \text{ when } T(x) < C_1 \text{ or } > C_2, \\ \phi(x) &= \gamma_i \text{ when } T(x) = C_i, \quad i = 1, 2, \\ \phi(x) &= 0 \text{ when } C_1 < T(x) < C_2, \end{align*}

but the constants are now determined by

\begin{equation*} \mathbb{E}_{\theta_0} [\phi(X)] = \alpha \end{equation*}

and

\begin{equation*} \mathbb{E}_{\theta_0} [T(X) \phi(X)] = \mathbb{E}_{\theta_0} [T(X)] \alpha. \end{equation*}
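
A sketch of how the second condition simplifies in a symmetric case (an assumed illustration): if the distribution of \(T(X)\) under \(\theta_0\) is continuous and symmetric about a point \(t_0\), and \(C_1, C_2\) are chosen symmetric about \(t_0\) with \(\mathbb{E}_{\theta_0} \phi(X) = \alpha\), then by symmetry

\begin{equation*} \mathbb{E}_{\theta_0}\big[(T(X) - t_0) \phi(X)\big] = 0, \quad \text{so} \quad \mathbb{E}_{\theta_0}[T(X) \phi(X)] = t_0 \alpha = \mathbb{E}_{\theta_0}[T(X)] \alpha, \end{equation*}

and both constraints hold. For \(X_1, \ldots, X_n\) i.i.d. \(N(\theta, 1)\) with \(T(x) = \sum_i x_i\), this gives the usual two-sided test rejecting when \(|\bar{x} - \theta_0| > z_{1 - \alpha/2} / \sqrt{n}\), where \(z_{1 - \alpha/2}\) is the standard normal \(1 - \alpha/2\) quantile.
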
☑ Example 4.1
Lehmann

Binomial example for one-parameter exponential family

☑ Example 4.2
Lehmann

Normal variance example for one-parameter exponential family
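
A sketch of the computation I believe is involved here, assuming \(X_1, \ldots, X_n\) are i.i.d. \(N(\xi, \sigma^2)\) with \(\xi\) known (taken as 0) and \(H: \sigma = \sigma_0\): writing \(V = \sum_i X_i^2 / \sigma_0^2 \sim \chi^2_n\) under \(H\), the two side conditions for the acceptance region \(C_1 < V < C_2\) become

\begin{equation*} \int_{C_1}^{C_2} f_n(v) \, dv = 1 - \alpha \quad \text{and} \quad \int_{C_1}^{C_2} v f_n(v) \, dv = n(1 - \alpha), \end{equation*}

where \(f_n\) is the \(\chi^2_n\) density. Since \(v f_n(v) = n f_{n + 2}(v)\), the second condition is \(\int_{C_1}^{C_2} f_{n + 2}(v) \, dv = 1 - \alpha\); in particular the UMP unbiased test is not the equal-tailed \(\chi^2\) test.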

Author: Elliot Shannon

Created: 2025-03-21 Fri 09:37
