Classification and Logistic Regression
We can imagine classification as a special regression problem where we regress only to a set of binary values, 0 and 1 (sometimes the -1 and 1 notation is used instead). We call these the negative class and the positive class, respectively.
However, if we apply a plain linear regression model here, it makes little sense to predict values other than 0 and 1. Therefore, we modify the hypothesis function to be:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function or the sigmoid function.
Its output ranges from 0 to 1. This intuitively explains why we still call it regression: the output lives in a continuous space. However, the value indicates the probability of belonging to the positive class, so essentially it is a classifier.
Let’s look at what happens when we take the derivative of the logistic function:

$$g'(z) = \frac{d}{dz} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right) = g(z)\left(1 - g(z)\right)$$
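As a quick sanity check (a sketch of my own, not from the original notes), we can compare the identity $g'(z) = g(z)(1 - g(z))$ against a finite-difference approximation of the derivative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# Analytic derivative: g'(z) = g(z)(1 - g(z))
analytic = sigmoid(z) * (1 - sigmoid(z))
# Central-difference approximation of the derivative
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # ≈ 0
```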
With this prior knowledge, the question is: how are we supposed to find $\theta$?
So I always had this doubt: why do we even do maximum likelihood in the first place? It involves thinking of the problem as an optimization / search problem, where we seek the set of parameters under which the assumed probability model best fits the data sample.
We define the parameter vector $\theta$. We assume:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

which can be written more compactly as

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1 - y}$$

Assuming the $m$ training examples were generated independently, we have the likelihood as:

$$L(\theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$

Applying the log, we have:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)$$
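To make the log-likelihood concrete, here is a minimal sketch (the function name and toy data are my own):

```python
import numpy as np

def log_likelihood(theta, x, y):
    """ell(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = 1 / (1 + np.exp(-x.dot(theta)))  # sigmoid hypothesis
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: first column is the intercept feature x_0 = 1
x = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)  # with theta = 0, h = 0.5 for every example
print(log_likelihood(theta, x, y))  # 3 * log(0.5) ≈ -2.079
```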
Then, we can use gradient ascent to optimize the likelihood. In updating, we should have:

$$\theta_j := \theta_j + \alpha \frac{\partial \ell(\theta)}{\partial \theta_j}$$

Note we have a plus sign instead of a minus sign since we are finding the maximum, not the minimum. To find the derivative (for a single training example $(x, y)$):

$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \left( \frac{y}{g(\theta^T x)} - \frac{1 - y}{1 - g(\theta^T x)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x)$$

$$= \left( \frac{y}{g(\theta^T x)} - \frac{1 - y}{1 - g(\theta^T x)} \right) g(\theta^T x) \left(1 - g(\theta^T x)\right) x_j = \left(y - h_\theta(x)\right) x_j$$

From the first line to the second, we use the derivative of the logistic function derived above. This gives us the update rule $\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$ for each dimension of the feature vector. Although this is the same algorithm as LMS, the hypothesis in this case is different. It will not be surprising to see the same equation again when we talk about the Generalized Linear Model.
Newton's Method
So if we think of maximizing the likelihood function, there is a faster way to do it: the derivative $\ell'(\theta)$ must be zero at the maximum, and since $\ell$ can be shown to be concave, any such stationary point gives the global maximum. Newton's method finds this root of $\ell'(\theta)$ quickly.
Here's a good way to visualize what it does: starting from an initial point, on each iteration we take the tangent line at the current point and move to where that tangent crosses the x-axis, which becomes the next point. The error shrinks quadratically, so we reach the root quickly.
Mathematically, for finding a root of $f$, the update is:

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$

Derived from this idea, we can let $f(\theta) = \ell'(\theta)$, which gives the update for maximizing $\ell$:

$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$
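As a minimal sketch of the scalar update (the toy function $f(\theta) = \theta^2 - 2$, with root $\sqrt{2}$, is my own example):

```python
def newton(f, fprime, theta0, tol=1e-10, max_iters=50):
    """Scalar Newton's method: theta := theta - f(theta) / f'(theta)."""
    theta = theta0
    for _ in range(max_iters):
        step = f(theta) / fprime(theta)
        theta -= step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return theta

root = newton(lambda t: t * t - 2, lambda t: 2 * t, theta0=1.0)
print(root)  # converges to sqrt(2) ≈ 1.41421356 in a handful of iterations
```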
Newton's Method for Vector-Valued θ
If $\theta$ is vector-valued, the update generalizes to:

$$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$$

where $\nabla_\theta \ell(\theta)$ is the vector of partial derivatives of $\ell(\theta)$ with respect to each $\theta_j$, and $H$ is the Hessian, the $n \times n$ matrix with entries $H_{jk} = \frac{\partial^2 \ell(\theta)}{\partial \theta_j \, \partial \theta_k}$.
Although Newton's method converges quadratically, each update is more costly than a gradient descent step, since it requires computing and inverting the $n \times n$ Hessian.
Implementation:
Gradient Descent:
def fit(self, x, y, alpha=0.01, num_iters=1000):
    """Run gradient descent to minimize J(theta) for logistic regression.

    Args:
        x: Training example inputs. Shape (m, n).
        y: Training example labels. Shape (m,).
        alpha: The learning rate.
        num_iters: The number of iterations for gradient descent.
    """
    # Assumes numpy is imported as np.
    m, n = x.shape
    self.theta = np.zeros(n)
    for _ in range(num_iters):
        h = 1 / (1 + np.exp(-x.dot(self.theta)))  # hypothesis h_theta(x)
        gradient = x.T.dot(h - y) / m             # gradient of J(theta)
        self.theta -= alpha * gradient
Newton's Method:
def fit(self, x, y):
    """Run Newton's Method to minimize J(theta) for logistic regression.

    Args:
        x: Training example inputs. Shape (m, n).
        y: Training example labels. Shape (m,).
    """
    # Assumes numpy is imported as np.
    m, n = x.shape
    self.theta = np.zeros(n)
    eps = 1e-5
    while True:
        h = 1 / (1 + np.exp(-x.dot(self.theta)))      # hypothesis h_theta(x)
        gradient = x.T.dot(h - y) / m                 # gradient of J(theta)
        H = x.T.dot(np.diag(h * (1 - h))).dot(x) / m  # Hessian of J(theta)
        step = np.linalg.solve(H, gradient)           # solve rather than invert H
        self.theta = self.theta - step
        if np.linalg.norm(step, ord=1) < eps:         # stop when the update is tiny
            break
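As a hypothetical usage sketch (the standalone `fit_gd` wrapper and the toy data are my own, mirroring the gradient descent method above):

```python
import numpy as np

# Standalone version of the gradient-descent fit above (names are mine).
def fit_gd(x, y, alpha=0.1, num_iters=5000):
    theta = np.zeros(x.shape[1])
    for _ in range(num_iters):
        h = 1 / (1 + np.exp(-x.dot(theta)))
        theta -= alpha * x.T.dot(h - y) / x.shape[0]
    return theta

# Toy linearly separable data: label is 1 when the second feature is positive.
# First column is the intercept feature x_0 = 1.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_gd(x, y)
preds = (1 / (1 + np.exp(-x.dot(theta))) > 0.5).astype(float)
print(preds)  # matches y on this toy set
```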