# ROC Curve & Area Under Curve (AUC) with R – Application Example

Using getwd() I can see that my current working directory is the Desktop. I am going to read a CSV file; the first row contains the variable names, so I set header = TRUE, and I'll call this data frame binary. Running this, you can see that binary has 400 observations and 4 variables, and we can look at its structure with str().

Using this data we want to create a predictive model that predicts whether or not a student will be admitted to this college, and the variables that help make this prediction are gre, gpa, and rank. I am going to use the nnet package. Our variable of interest is admit, modeled as a function of the other variables: the tilde followed by a dot means I want to use all three predictors (gre, gpa, and rank), and our data is binary. We run this model and get a solution with fitted values for this data set.

Let's store these predictions in p. Now we can create a table of predicted versus actual values and label it tab; if you look at tab, the actual values are given along one side and the predictions along the other. It shows that 253 students who applied were not admitted, and the model also predicts that they are not admitted. There are 20 students who were actually not admitted but whom the model predicts should be admitted; those are misclassifications. Similarly, 98 students were actually admitted but the model says they should not be, while 29 were actually admitted and the model also predicts that they should be admitted. So the correct classification rate based on this data set is (253 + 29) divided by the entire data set of 400. Let's calculate it as the sum of the diagonal values of the table divided by the sum of the entire table: this gives about 0.705, the correct classification rate, and one minus that gives 0.295, the misclassification rate.

Now, is this 70% correct classification good? Let's look at how many students were actually admitted and how many were not, using a quick small table. In the data set, 127 students were admitted and 273 out of 400 were not. One way to predict whether these applicants will be accepted is to always use the larger group: if we predict that no student will be accepted, we are still right 273 / 400 = 68.25% of the time. If we create a statistical model whose accuracy is less than this number, we obviously should not use that model. The logistic regression model we developed gives an accuracy of 70.5%, which is slightly better, so at least it beats this benchmark.
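The arithmetic behind these rates can be double-checked outside R; here is a minimal sketch in Python, simply re-deriving the numbers from the table above (not part of the original R session):

```python
# Confusion-matrix counts from the table above (cutoff = 0.5)
# rows = predicted class, columns = actual class
tn, fp = 253, 20   # actual 0: predicted 0 / predicted 1
fn, tp = 98, 29    # actual 1: predicted 0 / predicted 1

total = tn + fp + fn + tp                  # 400 applicants
accuracy = (tn + tp) / total               # diagonal / grand total
misclassification = (fp + fn) / total      # off-diagonal / grand total

# Baseline: predict the majority class ("not admitted") for everyone
no_information_rate = (tn + fp) / total    # 273 / 400

print(accuracy, misclassification, no_information_rate)
# 0.705 0.295 0.6825
```

The model's 0.705 only narrowly beats the 0.6825 no-information baseline, which is exactly the point the transcript makes.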

Now, for model performance evaluation, let's make use of the ROCR package. We make a prediction using the model we developed, mymodel, on the data set binary, and the type of prediction we want is "prob", so we predict probability values and store them in pred. Now pred holds 400 prediction values; if you want to look at them you can type pred, or in fact head(pred), which shows the first six probabilities. If you look at head(binary), you can see that the first applicant was not admitted, and our predicted probability is 0.18, which is very low, so the prediction is also that this student should not be admitted. This is a correct prediction.

The classification table we made uses a cutoff of 0.5: if the probability is below 0.5, the prediction is 0, and if it is above 0.5, the prediction is 1. Here you can see the second probability is about 0.3, so the prediction is that the student should not be admitted, whereas in reality this student was admitted; that is a classification error. Similarly, the third value is 0.71, which is more than 0.5, and you can also see that this student was admitted, so that is a correct classification.
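The cutoff rule itself is just a comparison against 0.5. As a sketch, with made-up probabilities rather than the actual model output:

```python
# Hypothetical predicted probabilities and actual labels
# (0 = not admitted, 1 = admitted); not the real model output
probs  = [0.18, 0.30, 0.71, 0.45, 0.62]
actual = [0,    1,    1,    0,    1]

cutoff = 0.5
predicted = [1 if p > cutoff else 0 for p in probs]   # above cutoff -> class 1

correct = sum(p == a for p, a in zip(predicted, actual))
print(predicted)   # [0, 0, 1, 0, 1]
print(correct, "of", len(actual), "correct")
```

As in the transcript's examples, 0.18 yields a correct 0, 0.30 yields a wrong 0 for an admitted student, and 0.71 yields a correct 1.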

Now let's look at the prediction values by making a histogram of pred. You can see these probabilities vary between zero and about 0.8, and most of the values are below 0.4. If we use 0.5 we get one classification table, but if we use a cutoff of, say, 0.4 or 0.6, the accuracy and misclassification rates might change. To see what happens, I am going to use the prediction() function from ROCR with the probabilities we calculated, together with the actual values, and store the result back in pred. Next we use performance() on this pred, asking for accuracy values, store the result in eval (for evaluation), and then plot eval.

We get a curve where the cutoff values run from zero to one, and for each cutoff the picture shows the overall accuracy we would get. You can see that when the cutoff is close to 0.1, accuracy is really very low, close to 30%; it rises rapidly as we increase the cutoff and reaches a peak. Remember, 0.5 was our default cutoff, and here we can see what the accuracy would have been for different cutoff values. To identify the best value, let's draw lines on this chart using abline(): a horizontal line at about 0.71, where the peak is, and then a vertical line at about 0.45, which gives more or less the highest accuracy for this model.

That was based on an eyeball estimate. To get the exact value we use which.max(). The way the ROCR package is built, the results are stored in slots, so I use slot() on eval, take y.values, and specify with double square brackets that we want the first element; we store the resulting index in max. Before you run this, if you simply type eval and hit Enter, you will notice it contains a lot of values: y.values, x.values, and so on. Running which.max() identifies the position of the maximum; typing max shows it is the sixty-first value. Now we go into the y.values slot of eval, with double square brackets and then one more square bracket for the max index we identified in the last row, and store this accuracy value in acc.

Let's look at what acc contains: 0.7175, the highest accuracy. Now we want to figure out the optimal cutoff for that 0.7175; it may not be exactly the 0.45 we saw on the graph, it may be slightly different. We use the same format with slot() and eval, but now we want the values on the x-axis, so x.values, again with double square brackets and the max index, and we call this cut, for cutoff. Looking at cut, it is 0.4683, so not exactly the 0.45 we were reading off the graph. Now we can print both: compared to the default cutoff value of 0.5, a cutoff of 0.4683 gives a slightly better accuracy of 0.7175.

Remember, the classification table we saw earlier was just for one situation, where the cutoff is 0.5, and it tells us how the model performed there. But sometimes, instead of focusing on overall accuracy or misclassification, we are more concerned about predicting accurately in one group than in the other. For example, if we have data on bankruptcies and are trying to predict whether a company will go bankrupt, represented by 1, our interest may lie more in correctly predicting the ones than the zeros. That is where we can make use of the ROC curve. We use performance() and calculate tpr, the true positive rate. Based on this table, the true positive rate is 29 divided by (29 + 98), which is about 22%.

Obviously that is a very low accuracy level for correctly predicting the ones: most of the time a 1 is being misclassified as a 0, so this part of the model needs big improvement. When we look at the overall model and see an accuracy of 71.75%, that looks very good, but when we focus on the ones, an accuracy level of 22% is obviously not good. Similarly, we calculate the false positive rate from the same table: 20 cases are falsely predicted as 1 out of the 20 + 253 actual zeros, so the false positive rate is 20 divided by (253 + 20), about 7%. We do this calculation with performance() and store it in roc, because we are going to make an ROC curve. Remember, these calculations are based on the default cutoff of 0.5, but with the ROC curve we will also be able to see the performance for different cutoff values.
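Both rates come straight from the same 2×2 table; a quick arithmetic check, with the counts taken from the table above:

```python
# Counts from the classification table (cutoff = 0.5)
tn, fp = 253, 20   # actual 0: predicted 0 / predicted 1
fn, tp = 98, 29    # actual 1: predicted 0 / predicted 1

tpr = tp / (tp + fn)   # true positive rate: 29 / 127
fpr = fp / (fp + tn)   # false positive rate: 20 / 273

print(round(tpr, 3), round(fpr, 3))
# 0.228 0.073
```

So at the 0.5 cutoff the model catches only about 22.8% of the actual admits, while falsely flagging about 7.3% of the non-admits.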

That's the idea. Now let's plot roc; this is how the ROC curve looks. You have the true positive rate on the y-axis and the false positive rate on the x-axis. The ideal situation would be a curve that starts at (0, 0), goes straight up to (0, 1), and then across to (1, 1); that would classify in a perfect way, with 100% accuracy. In reality, based on the data, we get curves that are not really close to that ideal. We can draw a line through the middle with intercept a = 0 and slope b = 1. This straight line means that, without any model, if we simply rejected all 400 students we would be right about 68% of the time. If a model does worse than that, its curve falls below the line; obviously in this case the model is doing better. These curves can be compared across different models to see which model is doing better and which is not.

We can customize this chart by adding a few more things. Setting colorize to TRUE and running that line adds color to the curve, and the color is based on the cutoff: for example, 0.5 is where the light green is, and the cutoff values range from about 0.05 up to 0.72 in this example. We can also add a title. Note that while the y-label here is true positive rate, another name for it is sensitivity; for the x-label we can say 1 − specificity, which is another name for the false positive rate. If you run that, you get the title along with Sensitivity and 1 − Specificity on the axes, and of course we can draw the abline() diagonal and see how the model performs against this benchmark.
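The renaming is exact: sensitivity is the true positive rate, and 1 − specificity equals the false positive rate. A quick check with the counts from the table (a sketch, not part of the R session):

```python
# Counts from the classification table (cutoff = 0.5)
tn, fp = 253, 20   # actual 0: predicted 0 / predicted 1
fn, tp = 98, 29    # actual 1: predicted 0 / predicted 1

sensitivity = tp / (tp + fn)   # identical to the true positive rate
specificity = tn / (tn + fp)   # fraction of actual zeros correctly kept as 0
fpr = fp / (fp + tn)

print(round(1 - specificity, 3) == round(fpr, 3))   # True
```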

Another way people use the ROC curve is to calculate the area under the curve. Visually we can see that this curve is doing better than the benchmark, but when there are many curves on one chart it becomes difficult to differentiate the performances, so we need a numeric value. What we do is find the area under the curve; a higher area means better model performance. Note that for the entire rectangle here the total area is 1, and the area below the diagonal line is 50%, so the area under the curve for the model we have built should be more than 50%. Let's see how much we get. We use performance() on the pred object we calculated earlier, asking for auc, the area under the curve, and store this in auc. Then we use unlist() on the y.values slot of auc and store that in auc as well. If you simply run auc, you can see we get 0.6921 and so on; if you want fewer decimals you can round auc to, say, four, so that you see only four decimal places.

Let's add this number to the graph using legend(): we want the legend to start at x = 0.6 and y = 0.2, we want the AUC value, and we can also give it the title "AUC" so the value is indicated on the graph. You can also change the size with cex; if you say 4 it will be very big and obviously will not fit, so let me run the line again with about 1.2.


## R Code

```r
install.packages("nnet")   # run once
library(nnet)

# Read data: admit (0/1), gre, gpa, rank
binary <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/binary.csv")
str(binary)

# Logistic regression model (multinom with a two-level response)
mymodel <- multinom(admit ~ ., data = binary)

# Confusion matrix and misclassification rate at the default 0.5 cutoff
p <- predict(mymodel, binary)
tab <- table(Predicted = p, Actual = binary$admit)
tab
sum(diag(tab)) / sum(tab)       # accuracy, about 0.705
1 - sum(diag(tab)) / sum(tab)   # misclassification rate, about 0.295

# Model performance evaluation
install.packages("ROCR")   # run once
library(ROCR)
pred <- predict(mymodel, binary, type = "prob")
hist(pred)
pred <- prediction(pred, binary$admit)

# Accuracy versus cutoff
eval <- performance(pred, "acc")
plot(eval)
abline(h = 0.71, v = 0.45)

# Identify the best cutoff and accuracy
eval   # inspect the slots: x.values (cutoffs), y.values (accuracy)
max <- which.max(slot(eval, "y.values")[[1]])
acc <- slot(eval, "y.values")[[1]][[max]]
cut <- slot(eval, "x.values")[[1]][[max]]
print(c(Accuracy = acc, Cutoff = cut))

# Receiver Operating Characteristic (ROC) curve
roc <- performance(pred, "tpr", "fpr")
plot(roc, colorize = TRUE,
     main = "ROC Curve",
     ylab = "Sensitivity",
     xlab = "1 - Specificity")
abline(a = 0, b = 1)

# Area under the curve (AUC)
auc <- performance(pred, "auc")
auc <- unlist(slot(auc, "y.values"))
auc <- round(auc, 4)
legend(0.6, 0.2, auc, title = "AUC", cex = 1.2)
```


We can also use the Deducer package in R to plot the ROC curve directly from a fitted model (note that the glm() call needs family = binomial for logistic regression):

```r
library(Deducer)
mymodel <- glm(admit ~ ., data = binary, family = binomial)
rocplot(mymodel)
```


For those who are looking for the data: https://stats.idre.ucla.edu/r/dae/logit-regression/


True positive rate = true positives / actual admitted = 29 / (29 + 98) ≈ 0.228

False positive rate = false positives / actual not admitted = 20 / (253 + 20) ≈ 0.073