Using getwd() I can see that my current working directory is the Desktop. I am going to read a CSV file whose first row contains the variable names, so I set header to TRUE, and we'll call this data frame binary. When we run this, you can see that binary has 400 observations and 4 variables, and we can look at its structure with str().
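In code, this step might look like the following; the local file name binary.csv is an assumption based on the narration (the code listing in the comments below reads the same data from a URL instead):

    # read the admissions data from the working directory; first row holds variable names
    binary <- read.csv("binary.csv", header = TRUE)
    str(binary)   # 400 observations, 4 variables: admit, gre, gpa, rank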
Using this data we want to create a predictive model that predicts whether or not a student will be admitted to this college, and the variables that help make this prediction are GRE, GPA, and rank. I am going to use the nnet package. Our variable of interest is admit, written as a function of a tilde and then a dot, which means I want to use all three predictors, GRE, GPA, and rank, and the data argument is the binary data frame. We run this model and get a fitted solution.
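A minimal sketch of the model fit; multinom() from the nnet package fits this two-level response as a logistic model:

    # install.packages("nnet")   # if nnet is not already installed
    library(nnet)

    # admit as a function of all remaining variables (gre, gpa, rank)
    mymodel <- multinom(admit ~ ., data = binary)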
Next, we predict admission for all the observations in the binary data set and store these predictions in p. Now we can create a table, label it tab, and look at it. What we have is the actual values on one side and the predicted values on the other. It shows 253 students who applied and were not admitted, and the model also predicts that they are not admitted. There are 20 students who were actually not admitted but whom the model predicts should be admitted, so those are misclassifications. Similarly, 98 students were actually admitted but the model says they should not be admitted, and 29 were actually admitted and the model also predicts that they should be admitted. So the correct classification rate based on this data set is (253 + 29) divided by the size of the entire data set, which is 400. Let's calculate it as the sum of the diagonal values of the table divided by the sum of the entire tab; this gives us about 0.705, which is the correct classification rate, and one minus that gives 0.295, which is the misclassification rate.
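The classification table and accuracy calculation might look like this:

    # class predictions at the default 0.5 cutoff
    p <- predict(mymodel, binary)

    # rows are predicted classes, columns are actual admit values
    tab <- table(p, binary$admit)
    tab

    sum(diag(tab)) / sum(tab)       # correct classification rate, about 0.705
    1 - sum(diag(tab)) / sum(tab)   # misclassification rate, about 0.295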
Now the question is whether this roughly 70% correct classification is good. Let's look at simply how many students were actually admitted and how many were not, with a quick table. You can see that in the data set 127 students were admitted and 273 out of 400 were not admitted. One way to predict whether these applicants will be accepted is to always go with the larger group: 273 divided by 400 means that if we predict that no student will be accepted, we will still be right 68.25% of the time. If we build a statistical model and find that its accuracy is less than this number, obviously we should not use that model. Right now our logistic regression model gives an accuracy of 70.5%, which is slightly better, so at least it beats this benchmark number.
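A quick sketch of the benchmark calculation:

    # how many applicants were actually admitted (1) vs. not admitted (0)
    table(binary$admit)   # 0: 273, 1: 127
    273 / 400             # 0.6825, accuracy of always predicting "not admitted"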
Next, for model performance evaluation, let's make use of the ROCR package. We make a prediction using the model we developed, mymodel, on the binary data set, and the type of prediction we want is probability, so we predict probability values and store them in pred. Now pred holds 400 prediction values; if you want to look at them you can type pred, or better, head(pred), which shows the first six probability values. If you look at head(binary), you can see that the first applicant was not admitted, and our predicted probability is 0.18, which is very low, so the prediction is also that this student should not be admitted. This is a correct prediction.
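The probability predictions, roughly:

    # predicted admission probabilities from the fitted model
    pred <- predict(mymodel, binary, type = "prob")
    head(pred)     # first six predicted probabilities
    head(binary)   # first six rows of the data, for comparison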
The classification table that we made earlier uses a cutoff of 0.5: if the probability is below 0.5 the prediction is 0, and if the probability is above 0.5 the prediction is 1. Here you can see the second probability is 0.3, so the prediction is that the student should not be admitted, whereas in reality this student was admitted; that is a classification error. Similarly, the third value is 0.71, which is more than 0.5, and you can see that this student was admitted, so that is a correct classification. Now let's look at the prediction values by making a histogram of pred. You can see these probabilities vary between zero and about 0.8, and most of the values are below 0.4. So with a cutoff of 0.5 we get one classification result, but with a cutoff of, say, 0.4 or 0.6, the accuracy and misclassification rate might change. Let's see what happens if we do that.
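The histogram is a single call:

    # distribution of the 400 predicted probabilities
    hist(pred)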
For that I am going to make use of the prediction() function in ROCR. We pass it the probabilities we calculated and stored in pred, along with the actual values, and store the result in pred again. Then we call performance() on this prediction object, asking for accuracy, and store the result in eval, for evaluation; finally we plot eval. We get this kind of curve: the cutoff values change from zero to one, and for each cutoff the plot shows the overall accuracy we would get. You can see that when the cutoff is close to 0.1 the accuracy is really very low, close to 30%, and it rises rapidly as the cutoff increases, reaching a peak somewhere in the middle. Remember, 0.5 was our default cutoff, and here we can see what the accuracy would have been for other cutoff values.
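Roughly, the commands behind this accuracy-versus-cutoff plot; as in the narration, the probability vector pred is reused to hold the ROCR prediction object:

    # install.packages("ROCR")   # if ROCR is not already installed
    library(ROCR)

    pred <- prediction(pred, binary$admit)   # predicted probabilities + actual labels
    eval <- performance(pred, "acc")         # overall accuracy at every cutoff
    plot(eval)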
If you want to identify the best value here, let's draw lines on this chart using abline(): a horizontal line at about 0.71, where the peak appears to be, and then, to identify the corresponding cutoff, a vertical line at about 0.45. This gives us more or less the highest accuracy value for this model.
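For example:

    # eyeballed peak accuracy (horizontal) and cutoff (vertical) from the plot
    abline(h = 0.71, v = 0.45)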
That was only an eye estimate, so we are going to use which.max() to find exactly which value is the maximum. The way the ROCR package is built, the results are stored in slots, so I am going to make use of the slot() function: our results are in eval, we are interested in y.values, and we specify with double square brackets that we want the first element; suppose we store this in max. Before you run this, if you simply want to see what is contained in eval, you can type eval and hit enter; you will notice there are a lot of values, y values, x values, and so on. When we run it, which.max() identifies the position of the maximum value; if you simply print max, it says it is the sixty-first value. Now we go into the slot of eval again, take y.values with double square brackets, and with one more square bracket pick out the element at the position max that we identified in the last step; let's store this accuracy value in acc. Looking at what is contained in acc, it is 0.7175, so the highest accuracy here is 0.7175. Now we want to figure out the optimal cutoff for that 0.7175 value; it may not be exactly the 0.45 we read off the graph, it may be slightly different. We use the same format, slot() on eval, but now we want the values on the x-axis, so x.values, again with the double square brackets and the max index, and let's call this cut, for cutoff. If you look at cut, it is about 0.468, so not exactly the 0.45 we were reading from the graph. Now we can print both, and it gives us the accuracy value and the cutoff value: compared to the default cutoff of 0.5, a cutoff of about 0.4683 gives us a better accuracy of 0.7175. Remember, the classification table we saw earlier was just for one situation, a cutoff of 0.5, and it only tells us how the model performed at that single cutoff.
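The exact best-cutoff lookup, as a sketch:

    # index of the maximum accuracy, then the accuracy and cutoff at that index
    max <- which.max(slot(eval, "y.values")[[1]])
    acc <- slot(eval, "y.values")[[1]][[max]]
    cut <- slot(eval, "x.values")[[1]][[max]]
    print(c(Accuracy = acc, Cutoff = cut))   # about 0.7175 at a cutoff near 0.468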
But sometimes, instead of focusing on the overall accuracy or misclassification, we are more concerned about predicting one group more accurately than the other. For example, if we have data on bankruptcies and we are trying to predict whether a company will go bankrupt, represented by 1, our interest may lie more in correctly predicting the 1s rather than the 0s. That is where the ROC curve is useful.
We will again make use of performance(), this time asking for tpr, the true positive rate, and fpr, the false positive rate. The true positive rate based on this table is 29 divided by (29 + 98), which is only about 23%; obviously that is a very low accuracy for correctly predicting the 1s, and most of the time a 1 is being misclassified as a 0, so this clearly needs improvement. When we look at the overall model and see that the accuracy is 71.75%, that looks very good, but when we focus on the 1s, an accuracy of about 23% is not good at all. Similarly, we can calculate the false positive rate from the same table: 20 cases are falsely predicted as 1 out of (20 + 253), so the false positive rate is 20 divided by 273, which is about 7%. We run this performance calculation and store it in roc, because we are going to make an ROC curve. Remember, the hand calculations above are based on the default cutoff of 0.5, but the ROC curve will also show the performance for different cutoff values; that is the idea. Now let's plot roc.
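In outline:

    # true positive rate against false positive rate across all cutoffs
    roc <- performance(pred, "tpr", "fpr")
    plot(roc)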
This is how the ROC curve looks: the true positive rate is on the y-axis and the false positive rate is on the x-axis. The ideal situation would be a curve that starts here at (0, 0), goes straight up to (0, 1), and then across to (1, 1); that would be perfect classification, one hundred percent accuracy. In reality, based on the data, we get curves that are not that close to the ideal. We can draw a line in the middle with intercept 0 and slope 1. This straight line represents predicting without any model; as with the earlier benchmark, if we simply reject all 400 students we will be right about 68% of the time. If a model does worse than that, its curve will fall below the line, but obviously in this case the model is doing better. These curves can be compared for different models to see which model is doing better and which is not doing as well.
We can customize this chart by adding a few more things. We can colorize it by setting colorize to TRUE; if you run that line you will see the curve now has color, and the color is based on the cutoff. For example, 0.5 is somewhere here, and that light green color marks a cutoff of 0.5; the cutoff values range from about 0.05 up to about 0.72 in this example. We can also add a title. Note that the y-label here is the true positive rate, and another name for that is sensitivity; for the x-label we can say 1 - specificity, which is another name for the false positive rate. If you run that, you get the title along with Sensitivity and 1 - Specificity on the axes, and of course we can draw the abline() again to see how the model performs against this benchmark.
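The customized plot might look like this:

    # colorized ROC curve with a title, renamed axes, and the no-model reference line
    plot(roc, colorize = TRUE,
         main = "ROC Curve",
         ylab = "Sensitivity",
         xlab = "1 - Specificity")
    abline(a = 0, b = 1)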
Another way people use the ROC curve is to calculate the area under the curve (AUC). Visually we can see that this curve does better than the benchmark, but when you have many curves on the chart it becomes difficult to differentiate between their performances, so we need a numeric value. What we do is find the area under the curve, and a higher area means better model performance. Note that the total area of the rectangle here is 1, and the area below the diagonal line is 50%; obviously, for the model we have built, the area under the curve will be more than 50%. Let's see how much we get. We will use performance() on the prediction object we calculated earlier, asking for auc, the area under the curve, and store this in auc. Then we apply unlist() to the y.values slot of auc and store the result back in auc. If you simply run auc, you see we get 0.6921 and so on; if you want fewer decimals, you can round auc to, say, four digits, so that you see only four decimal places.
Finally, let's add this number to the graph using legend(). Say we want the legend to start at x = 0.6 and y = 0.2; we pass the AUC value and can also give it the title AUC, and the value then appears on the graph. You can also change its size with cex: if you say 4 it will be very big and obviously will not fit, so let me run the line again with 1.2; this is what it looks like at about size 1.2.
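A possible version of that last step:

    # print the rounded AUC value on the ROC plot
    legend(0.6, 0.2, legend = auc, title = "AUC", cex = 1.2)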

100 thoughts on “ROC Curve & Area Under Curve (AUC) with R – Application Example”

  1. Thanks Bharatendra Rai Sir 🙂

    # ROC Curve & Area Under Curve (AUC) with R – Application Example

    install.packages('aod')
    install.packages('ggplot2')
    library(aod)
    library(ggplot2)

    binary <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/binary.csv")
    str(binary)

    #Logistic Regression Model
    install.packages("nnet")
    library(nnet)

    mymodel <- multinom(admit~.,data = binary)

    #mis classification rate
    p <- predict(mymodel,binary)
    tab <- table(p,binary$admit)
    tab
    1-sum(253,29)/400

    # Model Performance Evaluation
    install.packages("ROCR")
    library(ROCR)
    pred <- predict(mymodel,binary,type = "prob")
    hist(pred)
    pred <- prediction(pred,binary$admit)
    eval <- performance(pred,"acc")
    plot(eval)

    abline(h=0.71,v=.45)

    #Identifying the best cutoff and Accuracy
    eval
    max <- which.max(slot(eval,"y.values")[[1]])
    acc <- slot(eval,"y.values")[[1]][[max]]
    cut <- slot(eval,"x.values")[[1]][[max]]
    print(c(Accuracy=acc,Cutoff = cut))

    #Receiver Operating Characteristic (ROC) curve
    # pred is already an ROCR prediction object from the step above
    roc <- performance(pred,"tpr","fpr")
    plot(roc,colorize = T,
    main = "ROC Curve",
    ylab = "Sensitivity",
    xlab = "1-Specificity")
    abline(a=0,b=1)

    #AUC
    auc <- performance(pred,"auc")
    auc <- unlist(slot(auc,"y.values"))
    round(auc,3)

    legend(0.6,0.2,auc,title = "AUC",cex = .50)

  2. Hi Professor, I love your videos; they are very interesting. I'm a PhD student, and sometimes I find it difficult to see the link between my own variables (concentrations of elements) and the variables that you work with. I would be very grateful if you could send me well-explained documents on the data processing and analysis. Also, could you please send me the data file?

  3. We can use the Deducer package in R to directly produce the ROC curve:
    library(Deducer)
    mymodel <- glm(admit ~ ., data = binary, family = binomial)
    rocplot(mymodel)

  4. Hi sir,
    All your videos are good.
    A simple doubt: what is the statistics behind dividing the data set into train and test percentages?
    Some say 90 to 10,
    some say 80 to 20,
    some say 70 to 30%.

    What are your suggested percentages, and why?

  5. Hello sir, can you make a video on plotting the ROC curve for an SVM? I am getting an error in my code; the error I am getting is "format of prediction is invalid". Thank you

  6. How do I check performance for an ordinal logistic regression model? Using this method I am getting this error:
    'Error in prediction(pedi, heart_A$num) :
    Number of cross-validation runs must be equal for predictions and labels.'

  7. I'm a bit late to the party here, but surely you cannot compare the total number admitted vs. the accuracy of the model?
    In the dataset 68% were admitted, but that data is 100% accurate. If the model is 70% accurate you could get a result between 273 +-30%. Or am I missing something here? You are comparing apples with oranges?

  8. I have a question about the logistic regression model part. Does the code deal with the whole data? I thought when doing the logistic regression model, you have to divide the data into training set and test set. In the code you've used, does it divide training set and test set automatically?

  9. Sir, a) for multi-class, how will you come up with false positives and false negatives? b) how do you compute ROC for multi-class?

  10. Thank you for this great video! And thank you for prompt reply. I have questions.
    If we are doing machine learning, we need to create ROC using predictive model created by test set, correct?
    (in your "Logistic Regression with R" video, you created predictive model using test set. We need to validate the accuracy of the model). Also, if I want to use which.max func to plot the highest values on the eval plot, what code should I use?

  11. Hi,
    Firstly, great video this really helped me to understand the ROC curve and implement it with my data in R. I am analysing diagnostic data for a masters degree research project. I wanted to know how to identify the cutoff value from the value that we take from the accuracy versus cutoff curve or the final ROC curve. The scale goes from 0-1 but my independent variable data ranges from 100 to 10^7 . In short, how do I take the best cutoff value that this analysis outputs and relate/convert this to my independent variable and an exact cutoff value?
    Thanks very much.

  12. Thank you so much Sir..the video was really helpful in providing practical knowledge of dealing with predictive modelling problems in R..Can you please tell me how to apply weight of evidence/ fine classing in R – is there any ready made syntax?

  13. Thank you sir…very clear and crisp explanation. In one video I got all the information. From the explanation in the video, I got how to find cutoff for maximum accuracy, by doing this only one class has got more weight in my dataset. but how to find a threshold value of cutoff(which gives maximum of sensitivity and maximum of specificity).

  14. I have applied the same functions in the evaluation of my GAM model, where I am not able to produce the confusion matrix. The result shows a 2x132 table instead of a 2x2 matrix; moreover, I have 203 'Y' values in the validation data. Why is this happening? Please help me. Thank you.

  15. Thank you so much sir. Just want to ask whether type='response' is the same as type='prob'. When I try to give type='prob', R throws an error like "Error in match.arg(type) :
    'arg' should be one of “link”, “response”, “terms”"?

  16. How do I handle it if all my data is categorical? My predictor features are subject columns with grades 1 to 8 for each subject, and
    the response variable is a subject whose grades (1-8) we have to predict.
    Before applying the model I converted all features and the response variable into factors. Is this the right step, or should I only convert the response variable into a factor and keep the predictors in numerical format?

  17. Hi Sir,

    When I try using "prediction" function on a multinomial target variable and a matrix of predicted probabilities, I am getting the error below:

    Error in prediction(preTrainProb015, train015$Delq.Status) :

    Number of cross-validation runs must be equal for predictions and labels.

    In the above error: preTrainProb015 <- predict(model015,train015,type='prob')
    train015$Delq.Status is a multinomial target variable with levels 0, 1 and 5.

    Could you shed some light on this error please? Thank you!

  18. Hi Sir, why did you use the multinom function here? Isn't multinom used only if the target variable has more than 2 categories? In this video we have only 2 categories, yes or no.

  19. Hi, I'm on R studio v. 1.1.423 now and nnet package isn't available and I can't seem to find an equivalent… any ideas what I can use to get the same results? Thanks.

  20. Hi Bharatendra, why don't you use glm()? I looked it up and it seems like multinom is used when the dependent variable has more than 2 levels. In your example, the dependent variable is admit (no, yes). That's why I'm confused why you chose multinom() instead of glm(). Thank you.

  21. Sir, can you please also attach dataset files/link along with your videos. This would greatly help us in learning by practicing with same data set. Thanks for great videos sir.

  22. Hello, this is a great video, but I am slightly confused by the probability explanation. You mentioned that if the predicted probability score is less than 0.5 then the chances are less than average, but doesn't that depend on the percentage of events in the data on which the model is based? If in the sample data the event rate is 1 out of 4, then the probability is 0.25, so any score above 0.25 in the final output means the model is saying this case has higher chances. It doesn't necessarily have to be above 0.5.

  23. Thanks for the useful information. I would like to ask you if I can use ROC to measure the effectiveness of the prediction model? And can I use ROC in R software?

  24. Hi Sir, I have calculated the cutoff for accuracy for my data (~0.475). I would like to know where exactly I should replace the default 0.5 with this 0.475.

  25. I am getting the following error when I use: pred <- prediction(pred, test_data$Income)
    "Number of cross-validation runs must be equal for predictions and labels."
    What could be the reason?

  26. so simple and precise really easy to understand, I have taken BABI course from great lakes, but their videos sucks, I come here to learn from this guy. Phew.

  27. Further to my earlier comment, also wanted to ask what software you used to create these videos on data analytics. Thanks

  28. Sir, whenever I try to run the confusion matrix or the roc.curve function on the TEST data I get the error below:
    Error in roc.curve(test$LoanDefault, pred) :

    Response and predicted must have the same length.
    How do I debug this error? Kindly help.

  29. Hi Dr. Bharatendra Rai, would you be able to make a tutorial on building a logistic regression model using training and validation sets, with performance checking via ROC curve as you have done here? I know you posted one on linear regression, but I thought a logistic model would be very helpful too. Thank you!

  30. Hello Sir, I'm an avid viewer of your videos, which truly add value to our ML understanding. A quick question: once we determine the best cutoff value after model performance evaluation, should we go back and re-run the model performance with that best cutoff, replacing the 0.5 that we used as a rule of thumb?

  31. True Positive Rate = True Positive / Actual admit = 29/(29+20) = 0.591
    False Positive Rate = False Positive / Actual not admit = 98/(253+98) = 0.279
