GLMs are here, babby!!!

We need to understand more about exponential family distributions before we get into GLMs.

$$p(y;\eta) = b(y)\,\exp\left(\eta^T T(y) - a(\eta)\right)$$

Where the terms are:

- $\eta$: the natural (canonical) parameter
- $T(y)$: the sufficient statistic
- $a(\eta)$: the log partition function, which normalizes the distribution
- $b(y)$: the base measure

A quick observation: each of these functions depends only on its stated argument, so $T(y)$ is a function of $y$ only and not of $\eta$.

Some common members of the exponential family, written in terms of these functions, are:

Bernoulli:
$$\begin{aligned}
p(y;\phi) &= \phi^y (1-\phi)^{1-y} \\
&= \exp\left(y\log\phi + (1-y)\log(1-\phi)\right) \\
&= \exp\left(\left(\log\frac{\phi}{1-\phi}\right) y + \log(1-\phi)\right)
\end{aligned}$$

$$\eta = \log\frac{\phi}{1-\phi}, \quad T(y) = y, \quad a(\eta) = -\log(1-\phi) = \log(1+e^\eta), \quad b(y) = 1$$
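As a quick sanity check, here is a minimal Python sketch (assuming NumPy) verifying that the exponential-family form $b(y)\exp(\eta T(y) - a(\eta))$ recovers the Bernoulli pmf:

```python
import numpy as np

# Verify the Bernoulli exponential-family parameterization derived above.
phi = 0.3
eta = np.log(phi / (1 - phi))     # natural parameter
a = np.log(1 + np.exp(eta))       # log partition function, equals -log(1 - phi)

for y in (0, 1):
    b_y = 1.0                                # base measure b(y) = 1
    p_exp_fam = b_y * np.exp(eta * y - a)    # T(y) = y
    p_direct = phi**y * (1 - phi)**(1 - y)
    assert np.isclose(p_exp_fam, p_direct)
```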
Gaussian with variance = 1:
$$\sigma^2 = 1: \quad
\begin{aligned}
p(y;\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(y-\mu)^2\right) \\
&= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^2\right)\exp\left(\mu y - \frac{1}{2}\mu^2\right)
\end{aligned}$$

$$\eta = \mu, \quad T(y) = y, \quad a(\eta) = \mu^2/2 = \eta^2/2, \quad b(y) = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}$$
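The same kind of check works for the unit-variance Gaussian (again a sketch assuming NumPy):

```python
import numpy as np

# Verify the unit-variance Gaussian exponential-family parameterization.
mu = 1.5
eta = mu                          # natural parameter
a = eta**2 / 2                    # log partition function

for y in (-1.0, 0.0, 2.0):
    b_y = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)      # base measure
    p_exp_fam = b_y * np.exp(eta * y - a)             # T(y) = y
    p_direct = np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)
    assert np.isclose(p_exp_fam, p_direct)
```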

Overview of GLMs:

To construct a GLM for a prediction problem, we make three assumptions (assumption (3) is used in the softmax derivation below):

1. $y \mid x; \theta \sim \text{ExponentialFamily}(\eta)$.
2. Given $x$, we want to predict the expected value of $T(y)$, so the hypothesis is $h(x) = E[T(y) \mid x]$.
3. The natural parameter is linear in the input: $\eta = \theta^T x$.

![[Pasted image 20240612115714.png]]
![[Pasted image 20240612115721.png]]

Softmax Regression:

In the broader case, we can have multiple classes instead of the binary classes above. It is natural to model this with a multinomial distribution, which also belongs to the exponential family, so we can derive a Generalized Linear Model (GLM) from it.

For the multinomial, we define $\phi_1, \phi_2, \ldots, \phi_{k-1}$ to be the corresponding probabilities of the first $k-1$ classes. We do not need parameters for all $k$ classes, since the last one is determined once the previous $k-1$ are set. So we can write $\phi_k = 1 - \sum_{i=1}^{k-1} \phi_i$.

We first define $T(y) \in \mathbb{R}^{k-1}$ as:

$$T(1) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
T(2) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad
T(k) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

Note that $T(k)$ is the all-zeros vector, since each vector has length $k-1$. We let $(T(y))_i$ denote the $i$-th element of the vector, as sketched below.
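A minimal sketch of $T(y)$ in Python (assuming NumPy; the function name `T` is just for illustration):

```python
import numpy as np

def T(y: int, k: int) -> np.ndarray:
    """(k-1)-dimensional one-hot encoding of class y in {1, ..., k}."""
    t = np.zeros(k - 1)
    if y < k:              # T(k) stays all zeros
        t[y - 1] = 1.0
    return t

k = 4
print(T(1, k))   # [1. 0. 0.]
print(T(4, k))   # [0. 0. 0.]
```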

Now, we show the steps to derive the Multinomial distribution as an exponential family:

$$\begin{aligned}
p(y;\phi) &= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} \\
&= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}} \\
&= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i} \\
&= \exp\left((T(y))_1 \log\frac{\phi_1}{\phi_k} + (T(y))_2 \log\frac{\phi_2}{\phi_k} + \cdots + (T(y))_{k-1} \log\frac{\phi_{k-1}}{\phi_k} + \log\phi_k\right) \\
&= b(y)\exp\left(\eta^T T(y) - a(\eta)\right)
\end{aligned}$$

where

$$\eta = \begin{bmatrix} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{bmatrix}$$

and

$$a(\eta) = -\log(\phi_k), \qquad b(y) = 1.$$
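As with the Bernoulli and Gaussian cases, we can verify this parameterization numerically; a minimal sketch assuming NumPy:

```python
import numpy as np

# Check that exp(eta^T T(y) - a(eta)) recovers p(y) = phi_y (here b(y) = 1).
phi = np.array([0.2, 0.3, 0.4, 0.1])     # k = 4 class probabilities, sum to 1
k = len(phi)
eta = np.log(phi[:-1] / phi[-1])         # natural parameter, length k-1
a = -np.log(phi[-1])                     # log partition function

for y in range(1, k + 1):
    Ty = np.zeros(k - 1)                 # T(y) as defined above
    if y < k:
        Ty[y - 1] = 1.0
    p_exp_fam = np.exp(eta @ Ty - a)
    assert np.isclose(p_exp_fam, phi[y - 1])
```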

This formulates the multinomial distribution as an exponential family. The link function is then:

$$\eta_i = \log\frac{\phi_i}{\phi_k}$$

To get the response function, we need to invert the link function:

$$e^{\eta_i} = \frac{\phi_i}{\phi_k}
\;\implies\; \phi_k e^{\eta_i} = \phi_i
\;\implies\; \phi_k \sum_{i=1}^{k} e^{\eta_i} = \sum_{i=1}^{k} \phi_i = 1
\;\implies\; \phi_k = \frac{1}{\sum_{i=1}^{k} e^{\eta_i}}$$

Then, we have the response function:

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$

This response function is called the softmax function.
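A minimal implementation sketch (assuming NumPy); subtracting $\max_i \eta_i$ before exponentiating is a standard trick for numerical stability and cancels out in the ratio:

```python
import numpy as np

def softmax(eta: np.ndarray) -> np.ndarray:
    """Map natural parameters eta to probabilities phi that sum to 1."""
    z = np.exp(eta - np.max(eta))    # shift for numerical stability
    return z / z.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))   # e.g. [0.665 0.245 0.090]
```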

From assumption (3) of the GLM, we know that $\eta_i = \theta_i^T x$ for $i = 1, 2, \ldots, k-1$, where $\theta_i \in \mathbb{R}^{n+1}$ are the parameters of our GLM model, and $\theta_k$ is just $0$ so that $\eta_k = 0$. Now, we have the model in terms of $x$:

$$p(y=i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}$$

This model is called softmax regression, which is a generalization of logistic regression. Thus, the hypothesis will be:

$$h_\theta(x) = E[T(y) \mid x; \theta] = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} = \begin{bmatrix} \frac{e^{\theta_1^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \\ \frac{e^{\theta_2^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \\ \vdots \\ \frac{e^{\theta_{k-1}^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \end{bmatrix}$$
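Putting it together, a sketch of the model for a single example (assuming NumPy; the name `Theta` is illustrative and stacks $\theta_1, \ldots, \theta_{k-1}$ as rows, with $\theta_k = 0$ appended implicitly):

```python
import numpy as np

def h(Theta: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return all k class probabilities p(y = i | x; theta)."""
    eta = np.append(Theta @ x, 0.0)      # eta_k = theta_k^T x = 0
    z = np.exp(eta - np.max(eta))        # stable softmax
    return z / z.sum()

rng = np.random.default_rng(0)
Theta = rng.normal(size=(2, 3))          # k = 3 classes, n + 1 = 3 features
x = np.array([1.0, 0.5, -0.2])           # leading 1 is the intercept term
p = h(Theta, x)
print(p, p.sum())                        # probabilities summing to 1
```

Note that $h_\theta(x)$ above outputs only the first $k-1$ probabilities; $\phi_k$ follows as $1 - \sum_{i=1}^{k-1} \phi_i$, which the sketch returns explicitly.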

Now, we need to fit θ such that we can maximize the log-likelihood. By definition, we can write it out:

$$L(\theta) = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^{m} \log \prod_{l=1}^{k} \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)} = l\}}$$

We can use gradient ascent or Newton's method to find the maximum.
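A sketch of batch gradient ascent on the log-likelihood (assuming NumPy; `X`, `y`, and the helper names are illustrative), using the fact that $\nabla_{\theta_l} L(\theta) = \sum_{i=1}^{m} \left(1\{y^{(i)} = l\} - \phi_l^{(i)}\right) x^{(i)}$:

```python
import numpy as np

def probs(Theta, X):
    """Class probabilities per row of X; Theta is (k-1, n+1), theta_k = 0."""
    eta = np.hstack([X @ Theta.T, np.zeros((X.shape[0], 1))])   # eta_k = 0
    z = np.exp(eta - eta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def log_likelihood(Theta, X, y):
    """L(theta): y holds labels in {1, ..., k}."""
    p = probs(Theta, X)
    return np.log(p[np.arange(len(y)), y - 1]).sum()

def ascent_step(Theta, X, y, lr=0.1):
    """One batch gradient-ascent step on L(theta)."""
    p = probs(Theta, X)
    onehot = np.eye(p.shape[1])[y - 1]             # m x k indicators 1{y = l}
    grad = (onehot[:, :-1] - p[:, :-1]).T @ X      # gradient for theta_1..theta_{k-1}
    return Theta + lr * grad
```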

Note: Logistic regression is a binary case of softmax regression. The sigmoid function is a binary case of the softmax function.
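A quick numeric check of this correspondence (a sketch assuming NumPy): with $k = 2$ and $\eta = (z, 0)$, softmax reduces to the sigmoid.

```python
import numpy as np

z = 1.7
sigmoid = 1 / (1 + np.exp(-z))
softmax_first = np.exp(z) / (np.exp(z) + np.exp(0.0))   # first softmax output
assert np.isclose(sigmoid, softmax_first)               # e^z/(e^z+1) = 1/(1+e^-z)
```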

Overview simplified:

![[Pasted image 20240612121923.png]]