Hi! Welcome back to Data Mining with Weka. In the last lesson, we looked at classification
by regression, how to use linear regression to perform classification tasks. In this lesson
we’re going to look at a more powerful way of doing the same kind of thing. It’s called
“logistic regression”. It’s fairly mathematical, and we’re not going to go into the dirty details
of how it works, but I’d like to give you a flavor of the kinds of things it does and
the basic principles that underlie logistic regression. Then, of course, you can use it
yourself in Weka without any problem. One of the things about data mining is that
you can sometimes do better by using prediction probabilities rather than actual classes.
Instead of predicting whether it’s going to be a “yes” or a “no”, you might do better
to predict the probability with which you think it’s going to be a “yes” or a “no”.
For example, the weather is 95% likely to be rainy tomorrow, or 72% likely to be sunny,
instead of saying it’s definitely going to be rainy or it’s definitely going to be sunny. Probabilities are really useful things in
data mining. NaiveBayes produces probabilities; it works in terms of probabilities. We’ve
seen that in an earlier lesson. I’m going to open diabetes and run NaiveBayes. I’m going to use a percentage split with 90%, so that leaves 10% as a test set. Then I’m
going to make sure I output the predictions on those 10%, and run it. I want to look at
the predictions that have been output. This is a 2-class dataset, the classes are tested_negative
and tested_positive, and these are the instances — number 1, number 2, number 3, etc. This
is the actual class — tested_negative, tested_positive, tested_negative, etc. This is the predicted
class — tested_negative, tested_negative, tested_negative, tested_negative, etc. This
is a plus under the error column to say where there’s an error, so there’s an error with
instance number 2. These are the actual probabilities that come out of NaiveBayes. So for instance 1 we’ve got a 99% probability
that it’s negative, and a 1% probability that it’s positive. So we predict it’s going to
be negative; that’s why that’s tested_negative. And in fact we’re correct; it is tested_negative.
For this instance, which is actually incorrect, we’re predicting 67% for negative and 33% for positive, so we decide it’s a negative, and we’re wrong. We might have been better saying that here we’re really sure it’s going to be a negative, and we’re right; here we think it’s going to be a negative, but we’re not sure, and it turns out that we’re wrong. Sometimes it’s a lot better to think in terms of the output as probabilities, rather than being forced to make a binary, black-or-white classification.
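(If you prefer to script this rather than click through the Explorer, something along these lines works with Weka’s Java API. This is just a sketch of mine, not something shown in the lesson: it assumes diabetes.arff is on your path, and the 90%/10% split here only roughly imitates the Explorer’s percentage split.)

    // Sketch: per-instance class probabilities from NaiveBayes in Weka.
    import java.util.Random;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PredictionProbabilities {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff");   // assumes the file is local
            data.setClassIndex(data.numAttributes() - 1);        // class is the last attribute

            data.randomize(new Random(1));                       // shuffle, then take 90% / 10%
            int trainSize = (int) Math.round(data.numInstances() * 0.9);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);

            for (int i = 0; i < test.numInstances(); i++) {
                Instance inst = test.instance(i);
                // distributionForInstance gives one probability per class value,
                // in the order the class values are declared in the ARFF header.
                double[] dist = nb.distributionForInstance(inst);
                System.out.printf("actual=%s  p(neg)=%.2f  p(pos)=%.2f%n",
                        inst.stringValue(test.classIndex()), dist[0], dist[1]);
            }
        }
    }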
Other data mining methods produce probabilities as well. If I look at ZeroR, and run that, these are the probabilities — 65% versus 35%. All of them are the same. Of course, it’s ZeroR! — it always produces the same thing. In this case, it always says tested_negative and always has the same probabilities. The reason why the numbers are like that, if you look at the slide here, is that we’ve chosen a 90% training set and a 10% test set, and the training set contains 448 negative instances and 243 positive instances. Remember the “Laplace Correction” in Lesson 3.2? — we add 1 to each of those counts to get 449 and 244. That gives us a 65% probability for being a negative instance. That’s where these numbers come from.
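Written out, that ZeroR calculation is just

    \[
      \Pr[\text{tested\_negative}] = \frac{448 + 1}{(448 + 1) + (243 + 1)} = \frac{449}{693} \approx 0.65
    \]

and similarly about 35% for tested_positive, which is why every test instance gets the same 65%/35% pair.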
If we look at J48 and run that, we get more interesting probabilities here — the negative and positive probabilities, respectively. You can see where the errors are. These probabilities are all different. Internally, J48 uses probabilities in order
to do its pruning operations. We talked about that when we discussed J48’s
pruning, although I didn’t explain explicitly how the probabilities are derived. The idea of logistic regression is to make
linear regression produce probabilities, too. This gets a little bit hairy. Remember, when we use linear regression for
classification, we calculate a linear function using regression and then apply a threshold
to decide whether it’s a 0 or a 1. It’s tempting to imagine that you can interpret
these numbers as probabilities, instead of thresholding like that, but that’s a mistake. They’re not probabilities. These numbers that come out on the regression line are sometimes negative, and sometimes greater than 1. They can’t be probabilities, because probabilities don’t work like that.
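(To see what goes wrong, here’s a tiny made-up example — my own numbers, one attribute, thresholding at 0.5 — showing that the raw regression output can easily stray outside the 0–1 range even though thresholding it still gives a class.)

    // Made-up illustration: a linear model's raw output isn't a probability.
    public class RegressionThreshold {
        // w[0] is the intercept, w[1] multiplies the single attribute value.
        static double linearSum(double[] w, double a) {
            return w[0] + w[1] * a;
        }

        public static void main(String[] args) {
            double[] w = {-0.4, 0.9};                   // invented weights
            double[] attributeValues = {0.1, 0.8, 1.9}; // invented instances
            for (double a : attributeValues) {
                double out = linearSum(w, a);           // can be < 0 or > 1
                String predicted = out >= 0.5 ? "1" : "0";
                System.out.printf("raw output = %5.2f  ->  class %s%n", out, predicted);
            }
        }
    }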
In order to get better probability estimates, a slightly more sophisticated technique is used. In linear regression, we have a linear sum. In logistic regression, we have the same linear sum down here — the same kind of linear sum that we saw before — but we embed it in this kind of formula. This is called a “logit transform”.
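(Roughly, the formula has this shape — I’m writing w0, w1, …, wk for the weights and a1, …, ak for the attribute values, which may not match the slide’s exact notation:)

    \[
      \Pr[1 \mid a_1, a_2, \dots, a_k] = \frac{1}{1 + e^{-(w_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k)}}
    \]

The linear sum inside the exponent is the same kind of sum as in linear regression; the transform squashes it so the output always lies between 0 and 1.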
The logit transform is multi-dimensional, with a lot of different a’s here. If we’ve got just one dimension, one variable, a1, and that is the input to the logit transform, then the output looks like this: it’s between 0 and 1. It’s sort of an S-shaped curve. Rather than an abrupt step from 0 to 1, it’s a soft version of a step function that never gets below 0, never gets above 1, and has a smooth transition in between. When you’re working with the logit transform, instead of minimizing the squared error (remember, when we do linear regression we minimize the squared error), it’s better to choose weights to maximize a probabilistic function called the “log-likelihood function”, which is this pretty scary-looking formula down at the bottom.
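(For a two-class problem with the classes coded 0 and 1, that formula is usually written something like this — again my notation, where y(i) is the actual class of training instance i and a(i) its attribute values:)

    \[
      \sum_{i=1}^{n} \Big[ y^{(i)} \log \Pr[1 \mid \mathbf{a}^{(i)}]
        + \big(1 - y^{(i)}\big) \log\big(1 - \Pr[1 \mid \mathbf{a}^{(i)}]\big) \Big]
    \]

Choosing the weights to maximize this is what logistic regression does instead of minimizing the squared error.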
That’s the basis of logistic regression. We won’t talk about the details any more: let me just do it. We’re going to use the diabetes dataset. In the last lesson we got 76.8% with classification by regression. Let me tell you, if you do ZeroR, NaiveBayes, and J48, you get these numbers here. I’m going to find the logistic regression scheme. It’s in “functions”, and called “Logistic”. I’m going to use 10-fold cross-validation. I’m not going to output the predictions. I’ll just run it — and I get 77.2% accuracy.
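(Again, if you’d rather script it than click, a minimal sketch of the same run with Weka’s Java API looks something like this — assuming diabetes.arff is on your path; the exact accuracy you get will depend on the cross-validation random seed.)

    // Sketch: 10-fold cross-validation of Logistic on the diabetes data.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LogisticCrossValidation {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

            Logistic logistic = new Logistic();             // the "Logistic" scheme under "functions"
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(logistic, data, 10, new Random(1));

            System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
        }
    }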
That’s the best figure in this column, though it’s not much better than NaiveBayes, so you might be a bit skeptical about whether it
really is better. I did this 10 times and calculated the means
myself, and we get these figures for the mean of 10 runs. ZeroR stays the same, of course, at 65.1%;
it produces the same accuracy on each run. NaiveBayes and J48 are different, and here
logistic regression gets an average of 77.5%, which is appreciably better than the other
figures in this column. You can extend the idea to multiple classes. When we did this in the previous lesson, we
performed a regression for each class, a multi-response regression. That actually doesn’t work well with logistic
regression, because you need the probabilities to sum to 1 over the various different classes. That introduces more computational complexity and needs to be tackled as a joint optimization problem. The result is logistic regression, a popular
and powerful machine learning method that uses the logit transform to predict probabilities directly. It works internally with probabilities, like
NaiveBayes does. We also learned in this lesson about prediction
probabilities that can be obtained from other methods, and how to calculate probabilities
from ZeroR. You can read in the course text about logistic
regression in Section 4.6. Now you should go and do the activity associated
with this lesson. See you soon. Bye for now!
