# Data Mining with Weka (4.4: Logistic regression)

Hi! Welcome back to Data Mining with Weka. In the last lesson, we looked at classification

by regression, how to use linear regression to perform classification tasks. In this lesson

we’re going to look at a more powerful way of doing the same kind of thing. It’s called

“logistic regression”. It’s fairly mathematical, and we’re not going to go into the dirty details

of how it works, but I’d like to give you a flavor of the kinds of things it does and

the basic principles that underlie logistic regression. Then, of course, you can use it

yourself in Weka without any problem. One of the things about data mining is that

you can sometimes do better by using prediction probabilities rather than actual classes.

Instead of predicting whether it’s going to be a “yes” or a “no”, you might do better

to predict the probability with which you think it’s going to be a “yes” or a “no”.
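To make that concrete, here is a minimal sketch in plain Python (not Weka; the 0.5 threshold and the weather numbers below are just the lesson's illustrative examples): a hard classifier keeps only the class label, while a probabilistic prediction keeps the confidence as well.

```python
# Two ways of reporting the same prediction: a hard class label,
# and a probability that also carries confidence information.

def hard_prediction(p_yes: float) -> str:
    """Collapse a probability into a black-or-white class label."""
    return "yes" if p_yes >= 0.5 else "no"

# Example probabilities from the lesson's weather illustration.
forecasts = {"rainy": 0.95, "sunny": 0.72}

for outcome, p in forecasts.items():
    print(f"{outcome}: hard = {hard_prediction(p)}, soft = {p:.0%} likely")

# Both forecasts collapse to the same hard answer, but 95% and 72%
# express very different degrees of confidence.
```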

For example, the weather is 95% likely to be rainy tomorrow, or 72% likely to be sunny,

instead of saying it’s definitely going to be rainy or it’s definitely going to be sunny. Probabilities are really useful things in

data mining. NaiveBayes produces probabilities; it works in terms of probabilities. We’ve

seen that in an earlier lesson. I'm going to open the diabetes dataset and run NaiveBayes. I'm going to use a percentage split with 90%, so that leaves 10% as a test set. Then I'm

going to make sure I output the predictions on those 10%, and run it. I want to look at

the predictions that have been output. This is a 2-class dataset, the classes are tested_negative

and tested_positive, and these are the instances — number 1, number 2, number 3, etc. This

is the actual class — tested_negative, tested_positive, tested_negative, etc. This is the predicted

class — tested_negative, tested_negative, tested_negative, tested_negative, etc. There's
a plus under the error column wherever there's an error, so there's an error with

instance number 2. These are the actual probabilities that come out of NaiveBayes. So for instance 1 we’ve got a 99% probability

that it’s negative, and a 1% probability that it’s positive. So we predict it’s going to

be negative; that’s why that’s tested_negative. And in fact we’re correct; it is tested_negative.
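What Weka is doing in those prediction rows can be mimicked in a few lines of plain Python (the probabilities and actual classes are the ones read off the screen for instances 1 and 2): the predicted class is simply the one with the larger probability, and a "+" flags rows where that disagrees with the actual class.

```python
# Reproduce Weka's predicted-class and error columns from a
# per-instance class distribution: predict the argmax class,
# flag an error when it differs from the actual class.
classes = ["tested_negative", "tested_positive"]

# (actual class, [P(negative), P(positive)]) -- numbers from the lesson
predictions = [
    ("tested_negative", [0.99, 0.01]),  # instance 1: confident and correct
    ("tested_positive", [0.67, 0.33]),  # instance 2: less sure, and wrong
]

for i, (actual, dist) in enumerate(predictions, start=1):
    predicted = classes[dist.index(max(dist))]
    error = "+" if predicted != actual else " "
    print(f"{i}  {actual:16} {predicted:16} {error}  {dist}")
```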

For this instance, which is actually incorrect, we're predicting 67% for negative

and 33% for positive, so we decide it’s a negative, and we’re wrong. We might have been

better saying that here we’re really sure it’s going to be a negative, and we’re right;

here we think it’s going to be a negative, but we’re not sure, and it turns out that

we’re wrong. Sometimes it’s a lot better to think in terms of the output as probabilities,

rather than being forced to make a binary, black-or-white classification. Other data mining methods produce probabilities,

as well. If I look at ZeroR, and run that, these are the probabilities — 65% versus

35%. All of them are the same. Of course, it’s ZeroR! — it always produces the same

thing. In this case, it always says tested_negative and always has the same probabilities. The

reason why the numbers are like that, if you look at the slide here, is that we’ve chosen

a 90% training set and a 10% test set, and the training set contains 448 negative instances

and 243 positive instances. Remember the “Laplace Correction” in Lesson 3.2? — we add 1 to

each of those counts to get 449 and 244. That gives us a 65% probability for being a negative

instance. That’s where these numbers come from. If we look at J48 and run that, then we get
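Those ZeroR numbers can be checked with one line of arithmetic in plain Python (the counts 448 and 243 are the ones from the slide):

```python
# ZeroR's class probabilities on the 90% training split,
# with the Laplace correction: add 1 to each class count.
neg, pos = 448, 243
p_neg = (neg + 1) / (neg + 1 + pos + 1)   # 449 / 693
p_pos = (pos + 1) / (neg + 1 + pos + 1)   # 244 / 693
print(f"P(tested_negative) = {p_neg:.3f}")  # about 0.648, i.e. the 65%
print(f"P(tested_positive) = {p_pos:.3f}")  # about 0.352, i.e. the 35%
```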

more interesting probabilities here — the negative and positive probabilities, respectively. You can see where the errors are. These probabilities are all different. Internally, J48 uses probabilities in order

to do its pruning operations. We talked about that when we discussed J48’s

pruning, although I didn’t explain explicitly how the probabilities are derived. The idea of logistic regression is to make

linear regression produce probabilities, too. This gets a little bit hairy. Remember, when we use linear regression for

classification, we calculate a linear function using regression and then apply a threshold

to decide whether it’s a 0 or a 1. It’s tempting to imagine that you can interpret

these numbers as probabilities, instead of thresholding like that, but that’s a mistake. They’re not probabilities. These numbers that come out on the regression

line are sometimes negative, and sometimes greater than 1. They can’t be probabilities, because probabilities

don’t work like that. In order to get better probability estimates,

a slightly more sophisticated technique is used. In linear regression, we have a linear sum. In logistic regression, we have the same linear

sum down here — the same kind of linear sum that we saw before — but we embed it in this

kind of formula. This is called a “logit transform”. In general it's multi-dimensional, with a lot of different a's, one weight per attribute. If we've got just one dimension, one variable,

a1, then if this is the input to the logit transform, the output looks like this: it’s

between 0 and 1. It's an S-shaped curve. Rather than a hard step function,
it's a soft version of a step function that never goes below 0, never goes above 1, and

has a smooth transition in between. When you’re working with a logit transform,
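That S-shaped curve is easy to sketch in plain Python (the weights and inputs below are made-up numbers, purely for illustration): the raw linear sum ranges over all real numbers, but passing it through the logistic function 1/(1+e^-z) always gives a value strictly between 0 and 1.

```python
import math

def linear_sum(a0: float, a1: float, x: float) -> float:
    """The raw regression output a0 + a1*x -- not a probability."""
    return a0 + a1 * x

def logistic(z: float) -> float:
    """Squash any real number z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

a0, a1 = -1.0, 0.8                     # made-up weights
for x in (-5, -1, 0, 1, 5):
    z = linear_sum(a0, a1, x)
    print(f"x={x:+d}  linear={z:+.2f}  logistic={logistic(z):.3f}")
# The linear column goes below 0 and above 1; the logistic column never does.
```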

instead of minimizing the squared error (remember, when we do linear regression we minimize the

squared error), it’s better to choose weights to maximize a probabilistic function called

the “log-likelihood function”, which is this pretty scary looking formula down at the bottom. That’s the basis of logistic regression. We won’t talk about the details any more:
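The scary-looking formula is less frightening in code. Here is a sketch with a made-up one-attribute dataset and two candidate weight vectors (in practice Weka finds the maximizing weights with an iterative optimizer, which this sketch does not attempt): for each training instance we add log p if its class is 1, and log(1 - p) if its class is 0.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, data):
    """Sum of y*log(p) + (1-y)*log(1-p) over the training instances."""
    a0, a1 = weights
    total = 0.0
    for x, y in data:
        p = sigmoid(a0 + a1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Made-up one-attribute dataset: (x, class)
data = [(0.5, 0), (1.0, 0), (2.0, 1), (3.0, 1)]

# Logistic regression chooses the weights that maximize this quantity;
# here we just compare two candidate weight vectors.
for w in [(0.0, 0.0), (-3.0, 2.0)]:
    print(f"weights {w}: log-likelihood = {log_likelihood(w, data):.3f}")
```

The second weight vector separates the classes better, so its log-likelihood is higher (closer to zero).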

let me just do it. We’re going to use the diabetes dataset. In the last lesson we got 76.8% with classification

by regression. Let me tell you if you do ZeroR, NaiveBayes,

and J48, you get these numbers here. I’m going to find the logistic regression

scheme. It’s in “functions”, and called “Logistic”. I’m going to use 10-fold cross-validation. I’m not going to output the predictions. I’ll just run it — and I get 77.2% accuracy. That’s the best figure in this column, though

it’s not much better than NaiveBayes, so you might be a bit skeptical about whether it

really is better. I did this 10 times and calculated the means

myself, and we get these figures for the mean of 10 runs. ZeroR stays the same, of course, at 65.1%;

it produces the same accuracy on each run. NaiveBayes and J48 are different, and here

logistic regression gets an average of 77.5%, which is appreciably better than the other

figures in this column. You can extend the idea to multiple classes. When we did this in the previous lesson, we

performed a regression for each class, a multi-response regression. That actually doesn’t work well with logistic

regression, because you need the probabilities to sum to 1 over the various different classes. That introduces more computational complexity, and it needs to be tackled as a joint optimization problem. To sum up: logistic regression is a popular
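The joint multi-class version can be sketched with the "softmax" normalization (a toy illustration, not Weka's actual implementation): compute one linear score per class, then normalize their exponentials so the class probabilities sum to 1.

```python
import math

def softmax(scores):
    """Turn one linear score per class into probabilities summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5, -1.0]     # made-up per-class linear sums
probs = softmax(scores)
print([f"{p:.3f}" for p in probs])
print("sum =", sum(probs))    # exactly the sum-to-1 constraint mentioned above
```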

and powerful machine learning method that uses the logit transform to predict probabilities directly. It works internally with probabilities, like

NaiveBayes does. We also learned in this lesson about prediction

probabilities that can be obtained from other methods, and how to calculate probabilities

from ZeroR. You can read in the course text about logistic

regression in Section 4.6. Now you should go and do the activity associated

with this lesson. See you soon. Bye for now!
