Here we will discuss the basics of logistic regression and its implementation in Python for Data Science. Logistic regression is a supervised classification algorithm. In a classification problem, the target variable can take only discrete values for a given set of features.
Logistic regression is, at its core, a regression model: it is built to predict the probability that a given data entry belongs to the category numbered “1”. To do this, logistic regression uses the sigmoid function, σ(z) = 1 / (1 + e^(−z)).
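A minimal sketch of the sigmoid in code, showing how it squashes any real-valued score into the (0, 1) interval so the output can be read as a probability:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1), interpretable as a probability
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5 -> a raw score of 0 sits exactly at the boundary
print(sigmoid(5))     # close to 1 -> confidently category "1"
print(sigmoid(-5))    # close to 0 -> confidently category "0"
```

In a trained model the input z would be the linear combination of the features and the learned weights.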
Logistic regression becomes a classification technique only when a decision threshold comes into the picture. Setting the threshold value is an important aspect of logistic regression, and it depends on the classification problem itself.
The choice of threshold is largely driven by the values of recall and precision. Ideally both recall and precision would be 1, but that is seldom the case. When we face a precision-recall tradeoff, the following arguments are used to decide on the threshold:
Low precision/high recall –
When we want to reduce the number of false negatives, even at the cost of more false positives, we choose a decision threshold that gives high recall at the expense of precision. Example: in a cancer diagnosis application, we would not want any affected patient to be classified as not affected, even if some healthy patients are wrongly flagged. A false positive can still be caught by follow-up medical tests, but a patient who is wrongly cleared is unlikely to be examined again.
High precision/low recall –
When we want to reduce the number of false positives, even at the cost of more false negatives, we choose a decision threshold that gives high precision at the expense of recall. Example: when classifying whether customers will react positively or negatively to a personalized advertisement, we want to be sure the customer will react positively, because a negative reaction could cost us potential sales from that customer.
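The tradeoff above can be made concrete by sweeping the threshold over a set of predicted probabilities. The labels and probabilities below are made-up values for illustration:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities of class "1"
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs  = np.array([0.2, 0.4, 0.35, 0.8, 0.6, 0.55, 0.9, 0.1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (probs >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold catches more true positives (recall rises) but also lets in more false positives (precision falls); raising it does the opposite.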
Based on the number of categories, logistic regression can be classified as:
- Binomial: the target variable can have only two possible types, “0” and “1”, representing “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
- Multinomial: the target variable can have three or more possible types that are not ordered, such as “disease A” vs “disease B” vs “disease C”.
- Ordinal: the target variable has ordered categories. Example: a test score can be categorized as “very poor”, “poor”, “good”, “very good”, and each category can be given a score such as 0, 1, 2, 3.
We will now explore the simplest form of logistic regression, known as binomial logistic regression.
Binomial logistic regression:
We will consider an example that maps the number of hours of study to the result of an exam. The result takes only two values, pass (1) and fail (0).
Using gradient descent algorithm:
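A minimal sketch of fitting this model with batch gradient descent; the hours-studied data below is made up for illustration, and the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: hours studied vs. exam result (pass = 1, fail = 0)
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1])

X = np.column_stack([np.ones_like(hours), hours])  # add intercept column
theta = np.zeros(2)
lr = 0.1

# Each step moves theta against the gradient of the cross-entropy loss
for _ in range(5000):
    preds = sigmoid(X @ theta)
    grad = X.T @ (preds - passed) / len(passed)
    theta -= lr * grad

print("fitted parameters:", theta)
print("P(pass | 4 hours) =", sigmoid(theta @ np.array([1.0, 4.0])))
```

After training, a student who studies many hours gets a predicted pass probability above the 0.5 threshold, and a student who barely studies falls below it.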
Once we have defined our cost function and its gradient, we can also run other, more advanced optimization algorithms in Python, such as:
- BFGS (Broyden-Fletcher-Goldfarb-Shanno algorithm)
- L-BFGS (similar to BFGS, but uses limited memory)
- Conjugate gradient.
Advantages of using these algorithms over gradient descent:
- We don’t need to pick a learning rate.
- They often run faster, though not always.
- They can numerically approximate the gradient for us if we don’t supply one.
Disadvantages of using these algorithms over gradient descent:
- They are more complex.
- Until we learn their specifics, they behave like a black box.
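A sketch of handing the same cost function and gradient (on the same hypothetical hours-of-study data) to SciPy’s BFGS optimizer; note there is no learning rate to tune:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same hypothetical hours-of-study data as before
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1])
X = np.column_stack([np.ones_like(hours), hours])

def cost(theta):
    # Cross-entropy loss; eps guards against log(0)
    p = sigmoid(X @ theta)
    eps = 1e-12
    return -np.mean(passed * np.log(p + eps) + (1 - passed) * np.log(1 - p + eps))

def gradient(theta):
    return X.T @ (sigmoid(X @ theta) - passed) / len(passed)

# BFGS chooses its own step sizes; swap method="L-BFGS-B" for the
# limited-memory variant
res = minimize(cost, np.zeros(2), jac=gradient, method="BFGS")
print("fitted parameters:", res.x)
```

The optimizer is treated as a black box here: we supply the cost and gradient, and it handles step sizes and convergence checks internally.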
Multinomial logistic regression:
In this case the output variable can have more than two possible discrete values. If we consider the digits dataset, the output variable is the digit value and can take any value out of (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).
Below is an implementation of multinomial logistic regression using scikit-learn to make predictions on the digits dataset.
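A minimal sketch of that implementation; the train/test split ratio and iteration cap are arbitrary choices:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the 8x8 handwritten-digit images; each target is a digit 0-9
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# With 10 classes, scikit-learn fits a multinomial (softmax) model
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The classifier’s `predict_proba` method returns a probability for each of the ten digit classes, and `predict` picks the most probable one.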
Below are some points which we should think about in Logistic regression in python for data science:
- It does not assume a linear relationship between the dependent and independent variables, but it does assume a linear relationship between the logit of the response and the explanatory variables.
- The independent variables can be power terms or other nonlinear transformations of the original independent variables.
- The dependent variable need not be normally distributed, but it is assumed to follow a distribution from an exponential family. In binary logistic regression, we assume a binomial distribution of the response.
- Homogeneity of variance does not need to be satisfied.
- The errors must be independent, but they need not be normally distributed.
- Logistic regression uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, so it relies on large-sample approximations.