>>Hello everybody, it’s a pleasure to have
Tobias with us here, and he’ll be telling
us about, I presume, Interactive Machine Learning and thinking about the [inaudible] user, and he's done a lot
of interesting work.>>Hello everyone. Thanks for the
great introduction. And today, I want to talk
about how we can improve Machine Learning
beyond the algorithm by not only working with
the algorithm itself, but also by improving
it through working with the user interface,
as he mentioned. So, in order to better understand what
I exactly mean by that, I want to start with an example. Let’s say, we have
this fictitious bookstore, and we want to find out how good the inventory
of our bookstore is. So, that means we want
to find out what is the average rating
that people would give to books in our inventory. So, the good people we are, we go out and we collect data. And the way we do it is, we would show users a book along with
a star rating interface and say, rate this book on
a scale from one to five, how much you would
like to read it. So, we do this for
n randomly sampled books, and after we collected
all the ratings, and in our second step, we do a very sophisticated
inference, namely: we infer the mean parameter by simply averaging all the values. So, we're done. Right? No, because we have an angry boss coming
in and she’s saying, “Your estimates
are way too noisy, there’s too much variance, I don’t know what
to do with this.” So, you put your Machine Learning
thinking hat on, and you go back, and you look at
this and you say, “Well maybe this estimator
isn’t the best, let’s improve this by using
some prior knowledge.” So, we kick this out and assume that our ratings
actually come from a normal distribution that has a normal prior on the mean. All right. So,
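Concretely, the two estimators in this toy setup can be sketched as follows; the prior and noise parameters below are illustrative assumptions, not values from the talk:

```python
# Toy sketch of the two estimators for the bookstore example.
# mu0, tau2 (prior mean/variance) and sigma2 (rating noise) are made up.

def naive_mean(ratings):
    """Plain average of the observed ratings."""
    return sum(ratings) / len(ratings)

def map_mean(ratings, mu0=3.0, tau2=0.25, sigma2=1.0):
    """MAP (posterior mean) of the mean parameter under a
    Normal(mu0, tau2) prior and Normal(mu, sigma2) ratings."""
    n = len(ratings)
    precision = 1.0 / tau2 + n / sigma2
    return (mu0 / tau2 + sum(ratings) / sigma2) / precision

ratings = [5, 4, 5, 2, 4]
print(naive_mean(ratings))  # 4.0
print(map_mean(ratings))    # shrunk toward the prior mean of 3.0
```

With these made-up numbers the plain average is 4.0, while the MAP estimate is pulled toward the prior mean of 3.0 — exactly the variance reduction being described.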
notice that this is in principle an implicit
user model because it somehow assumes how
the data was generated. And then in the second step, you would just compute the map estimate for
your mean parameter. So, that should help us reduce the variance
in our estimates but, I’m going to argue
here that this is not the only way to do this. Actually, if we take a step back and think about
what we did here in our little toy problem,
in the first part, we actually had people giving us ratings through an interface, whereas in the
second part we had this very
sophisticated algorithm to give us the mean rating. But actually, we’ve only worked really with
the second part, and my argument is
that we could have equally worked also
with the first part. And what we could
have done is also swapped out the interface
and changed it, so that we would show
a similar book to the one that users are about to
rate that they've previously rated along with
the star ratings they gave. So, we had “The Hobbit”
here, and we would show: you gave “Lord of The
Rings” three stars. And it is known,
people have done that, and you can see in user studies
that this reduces also the variance in
the rating data that you get. So, you've achieved essentially the same or similar
result by working with the interface instead of
adjusting the algorithm. Now, what you've just seen is a very crude
interactive system, you could even argue
it’s not interactive. But my point is that, even when you move to more complicated
interactive systems such as recommender systems
or search engines, the same kind of
views should hold. So, you have a back-end
Machine Learning algorithm, and you have people
that interact with the algorithm
through an interface, and both are tightly coupled
together through the data. So, data that people generate
but also kind of output, like output predictions
from the ML algorithm. And in this talk, I want to highlight, my and our efforts to work with the two
different components here. So in the first part,
I want to talk about how we can design
better ML algorithms, taking into account biases from the users and from
the user interfaces. And then the second
part of my talk, I want to start with the people
and interfaces side, and think about how we can design better user interfaces for more effective Machine Learning, by shaping and designing the data that the algorithm
then takes as input. All right. So, in
this first part, I want to start
with a paper where we show how to
evaluate and train recommender systems
using a new approach that relies on
causal inference. And as a motivating example, I want to look at
movie recommendation. In movie recommendation, we have a population of users, so here, and then a
number of movies, and we want to recommend those movies to each user that
he or she values the most. In this toy example, we have two user groups, romance lovers
and horror lovers. Romance lovers,
love romance movies but hate horror movies,
and horror lovers, love horror movies but
hate the romance movies, and both user groups are indifferent with
respect to drama movies. Now, the problem is
that in practice, we don’t get to observe
this full ratings matrix but we only get to
observe a subset. And it is well known that
users are more likely to give more ratings for
things that they like. So, we would see much more five star ratings than one star ratings
in the data. And the fact that these five star ratings
are over represented is due to what is known as selection bias and this problem
is also sometimes called, “Data Missing Not
At Random (MNAR).” And I already mentioned
that there is one source of bias that we saw that
comes from the users. So, the user-induced bias
comes from the fact that people are more likely to rate good items, or from effects like looking at certain categories
more often and so on. But then there’s also
system-induced bias that comes from advertising: making certain items have a more prominent position
on the starting page means that users are more likely to click on them, and rate them. The obvious question is what
happens if we just ignore selection bias and apply our standard
machinery here? Notice that this is the most frequent approach
in the literature, modulo a handful of
generative approaches that we will compare against later. Yeah. What happens now? When we ignore selection bias, it turns out that we
can be horribly misinformed. Suppose we want to deploy
a new recommender policy here, Y-hat, and Y-hat actually turns out to be
pretty crummy, so it recommends mostly one
star movies to users. I’ve indicated that
here as the boxes. And what happens now, if we evaluate it on previously collected data
is the following, suddenly, our crummy policy looks
pretty good because the overlap with
the five star ratings is pretty significant so, we would be misled and arrive at
the wrong conclusion because of selection bias. So, that’s not good.
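A minimal numeric sketch of this failure mode; the numbers are made up to mimic the toy example, where five-star ratings are far more likely to end up in the log than one-star ratings:

```python
# Tiny illustration of how selection bias misleads naive evaluation.
# All numbers are illustrative, not from the talk.

true_ratings = [5] * 50 + [1] * 50   # the full matrix: true average is 3.0

# Selection bias: 90% of the five-star ratings are observed,
# but only 10% of the one-star ratings.
observed = [5] * 45 + [1] * 5

true_avg = sum(true_ratings) / len(true_ratings)
naive_avg = sum(observed) / len(observed)
print(true_avg)    # 3.0
print(naive_avg)   # 4.6 -- the naive estimate is badly misled
```

The naive average over the biased log is 4.6 even though the true average is 3.0, which is the kind of misleading conclusion just described.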
And a very similar thing happens also when evaluating
rating prediction, so you complete the matrix, like in the Netflix challenge, instead of policy evaluation. And we have two systems
that we are evaluating. So, the first system ignores the two groups and
just predicts all five. It makes a large error on
the full ratings matrix. So, you would have a lot of errors here and I’ve
indicated that in deep red. Now, the second policy
is better because it recognizes that there is
this diagonal structure. However, it makes
a small mistake, small error on the drama movies by predicting fives and I’ve
indicated that in light red. What happens now, if we evaluate these systems again
on observed data? Well, for the first system
we make five errors, since there's not
a lot of one star ratings,
and then for the second system, we are making far more errors
because there is just more of the three star
ratings in the log data. So, this actually makes
the second system's performance look much worse and we would prefer
the first system here. And again, we came to
the wrong conclusion, and pick the bad system because we were
ignoring selection bias. So, that should hopefully be clear. Yes?>>Also, five is not
as much different from three as five
is different from one.>>Yeah. But, I mean, in this example, if you kind of try to sum it up then you would still see
that this would outweigh, the number would just
outweigh the difference here. But that’s a valid point. So the key idea in
our paper is to kind of solve this by connecting it to the potential
outcomes framework from causal inference and in the
potential outcomes framework, you often think about
patients getting treatments, and for each patient
you would only get to observe the outcome of the treatment which he
or she got assigned to. And the counterfactual
question is then how a patient would have fared
under a different treatment, different than the one
that you prescribe. And you can think
in a similar way about movie
recommendation at least, in the policy setting
where, now, users get prescribed movies
and you want to find out how users would have enjoyed movies different from the ones that they have
potentially watched. And in order to do that, it turns out that
we mainly need to understand how people were assigned
movies or treatments, and this is also called
the assignment mechanism. To make this more precise
in recommendation, this actually boils
down to knowing what the marginal probability is for a rating in our ratings matrix
to be observed. And this is also sometimes
called the propensity. And then, we can use the inverse propensity
estimator which is well known, and I mean, I think both of you are
very familiar with it, has been used in
many other settings, domain adaptation as well
as reinforcement learning. And the way it works
here is so you would sum over all individual
losses and then re-weight each individual loss
by the inverse propensity. And that gives you an
unbiased estimate of the loss. And this also extends to other performance measures
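A minimal sketch of this estimator; the losses, propensities, and observation pattern below are made up for illustration:

```python
# Inverse-propensity-scoring (IPS) estimate of the average loss:
# each observed entry's loss is re-weighted by 1 / propensity.
# All values are illustrative, not from the paper.

def ips_estimate(losses, observed, propensities, n_total):
    """Estimate the average loss over ALL entries, using only observed ones."""
    total = 0.0
    for loss, obs, p in zip(losses, observed, propensities):
        if obs:
            total += loss / p
    return total / n_total

# Popular entries are observed with probability 0.8, unpopular with 0.2.
losses       = [1.0, 1.0, 0.0, 0.0]
propensities = [0.8, 0.8, 0.2, 0.2]
observed     = [True, True, False, True]

print(ips_estimate(losses, observed, propensities, n_total=4))  # 0.625
```

Because each entry contributes its loss divided by its observation probability, the estimate is unbiased in expectation over which entries happen to be observed.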
that you can write as a sum of individual losses. And on the right, you can see
the propensity matrix that I used to generate the example, the observed pattern
you saw earlier. So we had a high probability
here, a small one here. And if we had just used these propensities along with the IPS estimator back then, the problems that I
mentioned earlier, coming from selection bias
would have gone away. So really, all we need is
this propensity matrix here. Now, how do we get this? Well, there’s
actually two settings, and in the first setting, the experimental one, we know the propensities
because they were under our control. So we had an ad
placement system, we had something that stochastically put things in front of users and we just
recorded these probabilities. The second setting is
a little bit more intricate: in the observational setting,
users self-select. And in that case, we need to
estimate these propensities, and that corresponds to
inferring the parameters of Bernoulli random variables,
namely the ones in the observation matrix. Note that this observation
matrix is fully observed. So, it contains a one whenever a rating was
observed, and zero otherwise. And to do this estimation, we can include
side information such as user item features,
X if available. And since this is a standard
supervised learning task, we can use a variety of models such as logistic regression or Bernoulli matrix
factorization, et cetera. So, now that we talked
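As a minimal sketch of that estimation step, here is a deliberately crude propensity model: per-item observation frequencies computed on the fully observed matrix O. In practice a logistic regression or Bernoulli matrix factorization on user/item features would take its place:

```python
# Crude propensity estimation sketch (illustrative, not the paper's model):
# the observation matrix O is fully observed (1 if a rating was revealed,
# 0 otherwise), so fitting propensities is ordinary supervised learning.
# Here the propensity of item j is just the fraction of users who rated it.

def estimate_item_propensities(O):
    n_users = len(O)
    n_items = len(O[0])
    return [sum(O[u][j] for u in range(n_users)) / n_users
            for j in range(n_items)]

O = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
print(estimate_item_propensities(O))  # [1.0, 0.25, 0.5]
```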
about how to fix evaluation, we can apply the same ideas to learning and our idea was
simply to couple together ERM, empirical risk minimization with the inverse propensity
scoring estimator. So now we would
pick the hypothesis, the model that performs
best under the IPS estimator, instead of the
naive one that just has an unweighted loss. To make this more concrete, the objective below is
probably all known to you. This is a standard mean squared error loss
matrix factorization with the regularization terms, and we know how to solve
this, and scale this, and the only thing that
changes now when you use the IPS estimator is that you get this
propensity weight here. So, really not that
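A sketch of the resulting objective; the flattened representation, variable names, and numbers are illustrative assumptions, not from the paper:

```python
# Propensity-weighted ERM objective for matrix factorization (sketch):
# the squared error of each observed entry is divided by its propensity,
# plus an L2 regularizer on the factors. All values are made up.

def ips_mf_objective(Y, Y_hat, O, P, U, V, lam=0.1):
    """Y: true ratings, Y_hat: predictions, O: observation indicators,
    P: propensities, U/V: factor parameters (flattened), lam: reg weight."""
    loss = sum((Y[i] - Y_hat[i]) ** 2 / P[i]
               for i in range(len(Y)) if O[i])
    reg = lam * (sum(u * u for u in U) + sum(v * v for v in V))
    return loss + reg

Y     = [5.0, 1.0, 3.0]
Y_hat = [4.5, 2.0, 3.0]
O     = [1, 1, 0]
P     = [0.5, 0.25, 0.5]
U, V  = [0.3, 0.1], [0.2]
print(ips_mf_objective(Y, Y_hat, O, P, U, V))
```

Setting all propensities to 1 recovers the ordinary unweighted objective, which is the only difference the talk is pointing at.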
much but I think, conceptually, this
is a bigger step. So, just to make this framework
a little bit more clear, it’s very modular in that
you first kind of pick and estimate
a propensity model. And then in a second
step you would use your ERM objective, together with your
estimated propensities. And this is different from a generative approach where
you often kind of reason about how the data is missing and then you have
a model that explains how all that data comes about. And those two are coupled
together via latent variables and that makes it a little bit
more sophisticated, but we’ll also compare
against them later. So before I get to
the empirical results, I want to quickly talk about some theoretical insights
that we provide in the paper. It turns out that there’s this additional tradeoff
between bias and variance that comes from
the propensity estimation. So, if you bound
the true error with the empirical error you get a very familiar looking bound except for the
colored terms here. So what you get is
this bias term, which is a penalty that tells you how
far the estimated propensities are off from
the true propensities. And then there’s also this variance component that comes due to the
estimated propensities. So, just to instantiate this: the naive method
would be high in bias because its implied propensities are mismatched, but low in variance because
those are constant. But our method would be, like for perfectly
estimated ones, we would have no bias but
a higher variance, usually. And this also kind of
shows that it might be beneficial in some scenarios
to tradeoffs some bias, some lower variance
at the cost of a small bias by overestimating
the propensities. Okay. So, to evaluate whether our method improves performance we did an empirical study, on two real world data sets. The first data set was one
that we collected ourselves. People went shopping for coats, and also rated them and
then that second one is from Yahoo where people listen
to songs and rated them. And both data sets
were special in that they contained
a missing completely at random test data
set where people were actually more or less
forced to rate a randomly subsampled set of items, different from
the ones that they were browsing earlier
and rating themselves. And so what we did
is we trained on the missing-not-at-random training
data and then evaluated on the missing
completely-at-random test data. And we compared against the latest generative
approach that we can find from 2014. And on both data sets, the propensities were estimated via logistic regression using user item features in
the coat dataset and using naive Bayes for
the song rating dataset. So the results show that our method outperforms
both kind of naive matrix factorization as well as the more
sophisticated generative model on both losses and
on both datasets. So encouraging. And so, by the end of this
kind of part, I hope that I was able
to convince you that propensity scoring is a nice approach,
because it's modular, it directly optimizes
the target loss and not just some log likelihood. There are no latent
variables and it's usually scalable
in the same way that the original problem
was scalable in. Yes?>>So, assuming
both these datasets did not have
known propensity scores.>>Known propensities.>>So, the actual
presentation was not done, using randomization that you.>>Yeah. So both data sets were using the
observational setting.>>Just curious if you did
any experiment where you tested like how close can
you come to the oracle. If you actually have a randomized
experiment in order to.>>We have some, we have some synthetic
experiments where we, I can show these later I think, they are at the end. So, we had some
synthetic experiments where we tested
how robust this kind of, because you actually just care about the final learning step, how robust it is to errors in the estimated
propensities as well, and it turns out it
is quite robust. Okay. So, for the
remainder of this part, I want to talk about
a follow up project, where we apply
very similar ideas to debias Learning-to-Rank. Just to get everyone back into the
Learning-to-Rank setting, we work with
the query xi, usually. Let’s say it’s winter shoes, and then we have
a ranking algorithm in production so far, and that outputs a ranking, and then we collect click logs from that ranking algorithm. So, people click on B and
D, and we would store that. And then our task is
to take all this data. Have a learning algorithm that hopefully outputs
a better ranker. The traditional way of
doing Learning-to-Rank, is to hire judges that
annotated results. So, you would hire people and they would
go through this ranking, for the query shoes,
and then would say, well result C is relevant, F and G is relevant. So, when you evaluate, this is also called Full
Information Learning-to-Rank. And it’s straightforward
to evaluate the new ranker, because on this query, you just
reorder the results, and then you can compute
the loss function here. Everything is known. However, it’s often more
convenient and also cheaper to work with implicit feedback
from click logs, and to be still able to learn
what you do is you assume a certain user model
and the weakest case that a click
indicates relevance. So, here user
clicked on result C, means that we assume
this result to be relevant. The other ones have
question marks, and the problem is, however, that we only
have partial information, and so when we want to
evaluate a new ranker, we can’t compute
our loss function directly, because there is
missing data and it is missing not at random. Was that a question?>>Yeah. Should you assume
A and B are not relevant?>>Yeah, or at least not, well, that is
something you could do, that some click models do. But, in even a simpler case
you could just say, I assume only that
a click means relevant. That’s kind of the click model that’s one of
the most simple ones. Okay, so if we want to
evaluate the new ranker, we can’t compute
the loss function here because there’s
so many question marks. And it turns out however, that we can compute it in expectation in
a similar way as before. So, just to introduce
a little bit of formal notation, we’re working with
a loss function delta here, giving the relevance labels,
binary relevance labels. It tells you how good, how good a ranking is. And in our paper, we use the sum of ranks
of the relevant documents. So, if this was the ranking
being presented, then you would end
up rank 3, rank 6, and 7, that would
give you a loss of 16. Our user model was again the very simple assumption
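The loss just described can be sketched directly; the document ids below are made up:

```python
# Rank-based loss from the talk: sum the (1-based) ranks of the
# relevant documents in the presented ranking.

def sum_of_ranks_loss(ranking, relevant):
    """ranking: list of doc ids in presented order;
    relevant: set of relevant doc ids."""
    return sum(rank for rank, doc in enumerate(ranking, start=1)
               if doc in relevant)

ranking  = ["a", "b", "c", "d", "e", "f", "g"]
relevant = {"c", "f", "g"}        # relevant docs sit at ranks 3, 6 and 7
print(sum_of_ranks_loss(ranking, relevant))  # 3 + 6 + 7 = 16
```

For relevant documents at ranks 3, 6, and 7 this gives 16, matching the example.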
of a click means relevant, but that kind of poses the question
what does no click mean? Well, we could either make the assumptions that you made, but in general we could
reason about, well, either a user did
not observe a result, or the result is not relevant. It turns out that we don’t even, we don’t need to think
about this so much, because it’s all solved by just again knowing the assignment and
observation mechanism. And again, we need to estimate propensities
that indicate how likely it is for relevance
label to be observed. So, here we would say
that the probability of observing this
relevance label here is 0.5 and then again, we can use the IPS
Estimator to evaluate, to compute, the loss
of a new ranking. But we only sum over
the clicked results here, because of our user model, because of the
assumption, that’s where the assumption
comes in. Yes.>>Do you assume that
this probability depends, or is independent of the
location where you show that?>>This- you can plot in different kind of
propensity model here, what we did is that, it is
dependent only on the rank. So a position bias model.>>It’s like there’s
only position bias?>>In this, but
you could include, you could think
about other biases. But we only, I mean, that is the main part.>>And in your data there
is only one click per query.>>No. It's multiple.>>There's multiple, multiple.>>Multiple clicks per query. And then, and then.>>[inaudible] binary now?>>Yeah. It all simplifies to kind
of that setting. And then, to get
the unbiased risk estimate for the entire dataset
for an entire ranker, you would just go
and average over all queries in your log. And so we coded this up as an SVM-Rank, so there's a propensity weighted
SVM-Rank which optimizes the loss you saw earlier
on the previous slide, and we compared it against
a standard SVM-Rank, which didn't use any weights, and a hand-tuned production
algorithm that was in place. And we tested it on the academic search
engine arXiv, that you probably all know. We estimated
the propensities via a small intervention experiment where we swapped
pairs of results, and then evaluated whether propensity weighted SVM-Rank is better
by interleaving the results of it
with another method, and then counting
how many clicks each method got. And what you can see here is
that our propensity weighted SVM-Rank significantly
wins against both the production
and the Naive Method, by just taking, and
that one is just taking into
account position bias. So, yeah, that kind of concludes my first part. But, I also want
to briefly mention a bunch of observations.
Did you have a question?>>Yes. Did you just
assume that people saw everything above when
they click but nothing below?>>No, that. We didn't even do that. We had a really simple Bernoulli model,
like each position was like a coin flip and that was
kind of the propensity.>>Yeah, but I’m curious
how strong of a baseline is that compared to
a position-dependent one?>>How strong of a baseline that is compared to
a position-dependent one?>>It is position dependent, because the coin is
different at each position.>>Yeah. Exactly. Yeah.>>What he’s asking
about is your baseline.>>The baseline. You
just could have filled in missing data as
Dave was suggesting and just assume everything
above was a true zero.>>So, I think we talk
in that paper also about, you can use more
sophisticated click models, there's all kinds of processes that you
can put into, you can make it
more sophisticated, how you estimate
these propensities. But, I think this
really just is, I mean, it’s general enough
to incorporate these more sophisticated
propensity models. We just wanted to show that even by doing this simple adjustment, you can do much better than.>>[inaudible].>>Less sophisticated?>>Yes. We're suggesting to not use
propensities at all, and just treat everything
above as a zero.>>Oh, the.>>It is a better ranker
on that paper [inaudible].>>Okay. That’s
more like the kind of the pairwise preference
is one which is, I don’t, yeah, I don’t know why we wish, I think there were
some experiments I need to, I need to look that up again. Yeah, that’s
a good point. Thanks.>>I’m also curious in archive data we’re
actually see multiple clicks, I mean, my own use
of archive searches, very much like I’m looking
for various specific paper. I find it, I read it, and I don’t know I
think others users do it in a different manner,
but I’m curious if.>>This is, yeah. This is
the full text search though. This is different from the, just where you want to
find the specific one. Notice that there is
this full text search drop-down, and then you land on
the full text search. And, there’s, this one
is usually more fuzzy, like the full text search
is not as precise, so, you would specify keywords, and then you would
get it back a ranking, and not so much matching
like in a boolean way. So, yeah, I mean, you would see people click
multiple times because, yeah, to find something
they were interested in. Just in general
exploring. All right. All right, so I
want to move on to the second part where
as an intermediate goal, I want to, yeah, I want to talk about how we
can aim to design interfaces that allow us to obtain
more and better feedback data, all while not hurting
user satisfaction, and that is because better data will then allow us to improve our predictions. So, really I want to focus on this outline
connection here, like users and interfaces, like affecting the data that
the Machine Learning gets. In this first, yes, so how can we get
better and more feedback data? In this first project
that I want to talk about, we came up with
a new interface that allowed us to collect
a new type of feedback signal. So, I know it’s lunchtime soon, so I want to run
a little experiment with you. So, I want you to stare
at this for 10-15 seconds. Figure out what you
want to have for lunch. All right, who knows already what he or
she wants for lunch? Whoo. Wow, we have
some, yeah, okay. Well, not too many but
some decisive people. I like it. But in general, making
a decision is hard here. Why is that? Well, there’s
a large set of options. You are probably not familiar
with all of the inventory, like what are
the options out there? And you’re also often uncertain about
your own preferences. Did you have pizza yesterday? Like are you going to go
out for Chinese tomorrow? So, it takes a lot of thinking. And our starting point
here was really to think about how we can support users to drive
the feedback generation. And obviously, in the long term, we want to provide
better recommendations. But for that, we need that improved feedback first. But in the short-term,
we can think about how to reduce cognitive burden, and I’ll come back
to that in a second. So what you’ve just
seen in this example of picking something to eat is an instance of what we call session-based decision making, where your goal is
to choose one option, and the information
need is fixed. So you wouldn’t just
do something else. Examples for that are
picking a movie for tonight, searching for recipe,
comparing laptops online, planning a trip, booking
hotel, and so on. It turns out that a common
strategy employed by people in this session-based
decision making is called consideration
set formation. And basically, it
works in two steps. In the first step you
would narrow a large set of options down to a smaller set called
the consideration set, and then so I would
go through and I would look at well, that looks good. Oh, the quiche looks also good. And then based on
that consideration set, you would then make
a decision and so I would now reason and think well, I had pizza
yesterday for dinner, how about the quiche? So decision done. So this kind of strategy
or this kind of insight was the main inspiration
for our interface. The session-based decision
making we studied was obviously not on food choices
but on movie choices. So below is an interface
that resembles that of many online streaming providers. So you can scroll, you
can kind of filter, and get more information
by clicking on it. And our idea was to
augment that interface with what we call
a Shortlist Component so users could click
on a “Plus” button or “Drag and Drop” buttons up
there to keep track of them. So this was kind of
a list of items that are currently considering
or are interested in. And to find out whether
or how that shapes kind of the feedback data
that we want, we ran a user study
where we compared the interface with
a Shortlist Component against the one without one. And the task that we gave people on Amazon Mechanical Turk was, imagine a very good friend of yours is coming to
your place to visit. After hanging out
for a while you plan to watch
a nice movie together. In this experiment
you’ll be asked to select a movie to
watch with your friend. And so, we had
60 people come in, most of them PhD students, three-quarters were male,
and one-quarter female. And we randomly assigned
them to one of two flights. In the first flight, they started with
the Shortlist first, and in the second flight
it was last. And in each flight, they had to choose
a movie eight times. So each of these corresponds to a new session, and in each session they
had a fresh set of 1,000
movies to pick from. And then we collected
the feedback data and also user feedback
through surveys. So, now we can look at, do shortlist lead to more data? That was kind of our goal
here: can we drive users somehow
to give us more data? And if you look at
the movies with interactions, that is,
the unique number of movies, then without the
Shortlist people click on 2.7 items on average. But with a Shortlist
it’s more than twice the amount of movies. So you get 5.7 kind of positive interactions in
each session on average.>>By positive interaction, you mean like click
on details and.>>Click on the details
or shortlists it. And something,
notice that something that was Shortlist
didn’t necessarily have to be clicked on because it might have
been a known movie and then in that case
you don’t want to look up more information, but you would still shortlist it because you want
to keep track of it. You're like, oh,
that's a classic.>>How is that
1,000 movies sorted? What is the order when
it’s displayed to the user?>>It was recent movies first, and then by the number of stars. So by year, and then within
each year by the number of stars.>>Number of stars
is the public.>>IMDb score.>>I see. Okay.>>Yes, back there.>>Two questions,
first, in the interface do you categorize the movies
into different genres?>>Yeah, you can.>>And how do you browse 1,000?>>You can browse by
selecting the genre here.>>And second is, how is this different from something that Amazon
is doing like I can add some movie to my watch
later or my favorite list. I don’t mind now
watching right now, I just want [inaudible]
to that section.>>That’s a good question. So shortlists are
different in that they are only kind of temporary and
really tied to the session. It’s not something
that you add to your, like I want to watch
later, it’s not persistent. This is really just for
the decision making, for this single task
and not so much. I mean people hack this
in all kinds of ways, they open tabs, they add, I add stuff to
the shopping cart just to remember it even, though I don’t want to buy it. But yeah it is different,
that’s hopefully. Okay. So we have more
than twice the amount of training data but that does
also help recommendation and that’s what we
looked at next. Oh sorry.>>How is the Shortlist
constructed?>>The Shortlist is
something users actually, users come up with so they add something that
they’re interested in during a session to it. It’s empty when they start and then it’s just like a stack of books when you go to a store
that you are considering.>>Without the shortlist they
don’t have the ability to make the shortlist,
is that what you mean?>>Sorry.
>>With or without shortlist.>>Without they don’t
have that ability. They have to remember
it themselves, or write it down, or do something else. Yes.>>[inaudible] people can
put in the shopping basket without purchasing
in the end, right? Just put to shopping basket and then the time to check out, I decide which one, I
want to buy for sure.>>Right. Yeah, that's what
we call kind of a hacked way of shortlisting
and that was also the inspiration to
make this kind of an explicit interface
component rather than. I mean sure you can
find ways around it, but our idea was really to study how this influences when we
give it to people explicitly.>>Going beyond
the movie category, do you think this could be useful in other domains
or in practice?>>Oh, I think having,
shortlist stuff and remember things you’re
currently considering, I think that goes
for many, many, applies to many of
these one choice tasks, where you want to figure
and compare things. Like, choose one thing
among and then whenever you can
support that I think Shortlist are
good interface component. Whether or not it
should have this form as a visual component or something else maybe maybe it’s
smaller or like on the side, that’s I think up to
the interface designer. But I think it is important to reason about how easy
it is to add things to it and that it really supports
users in their task. You don't want to impose
too high a cost, because adding something to
the shopping basket hides it again and
then you need to make an extra navigational effort to go to the shopping basket. So these are all things
to consider. Yes?>>So in the 5.71 that
you got from the data, you're counting a
one every time a person adds something
to the shortlist?>>Yes, but I would
count the unique items. So if something was
clicked on and shortlisted, it would only count as one.>>Okay.>>Yeah. So it's not just.>>So it's a different comparison of how many clicks you
have out of the session, between the one with the shortlist
and the one without.>>Why would that be?>>Because you're
counting more clicks. If I'm just adding
things to the shortlist, not clicking anything, in
the other case it would be the
same as zero clicks. And there would be no shortlists. But are you counting
with or without the shortlist?>>I'm counting with.>>With the 5.71.>>Yeah. I'm counting
with the shortlist because that was the whole goal, to kind of collect
a new type of feedback data.>>Okay. So even if a person
doesn’t watch anything, you will still have more data.>>Yeah.>>If it’s a little bit
less [inaudible].>>I mean, in this case,
they were forced in the end to choose a movie
to terminate the session. So in the end, they had
to choose something, whether or not they used
the shortlist. Yes.>>I don't think you quite
understood the last question. It’s a little bit of an apples
and oranges comparison. I could just drag the movie into the shortlist
or I could click on it, look at it, and then move
it into the shortlist. And if I didn’t
have the shortlist, there's no equivalent
of the first action. So to understand
the difference: I might just move a movie into the shortlist without
looking at it. Just imagine I had like
two levels of interest. One is like, "Oh, I don't
immediately want to reject that movie,”
and another is like, “Oh, I’m actually seriously
considering that.” And, I might put both
in the shortlist. But if there’s
no shortlist category, then I’m only going to get to see the
second out of click.>>Yeah, exactly. That’s the point
of the shortlist, I think, to elicit
that type of feedback.>>But it’s not clear that
it’s doing anything, right? You could just be
buffering the [inaudible]>>Yeah.>>I think his next slide will.>>So I think the point is to give
more data to the algorithm, so you can do better in the future. And if you're doing
it in your head, maybe you accomplish
your task, but it's one-off. So there is less training data
for the algorithms.>>Okay, but you're interpreting the different number
of clicks as meaningful and it’s
not [inaudible].>>No, not yet.>>Okay.
>>Not yet.>>That hopefully answers
a bunch of the questions. And, I’ll also
talk more about how users feel about the shortlist. So in the prediction task, we took the movie
that people finally chose from each session
and held it out, and our goal was to rank it to the top of a set
of candidate movies. And the training data were all displayed movies
in the session. We used a ranking
SVM, and as feedback we said that items
that were examined, clicked on, or if available, shortlisted, should be
ranked above skipped items. We didn’t discriminate between the two types of
feedback signal, and then in the test data, we embedded the chosen movie among 99 random movies, and we measured
the mean reciprocal rank, which is one over the position
where that movie occurred. So 1 over 100 would be worst, and 1 is the best. So if you consider the MRR
for random baseline, and you learn from the sessions
that had no shortlist, you get a small improvement. Well, that's not so
great, you think. But once you learn from sessions
that had the shortlist, so you use that kind
of feedback data, the new type of feedback signal
that you collected, you're actually
able to increase your MRR quite significantly. So why are people willing
to put up with this? Why are they even
using the shortlist? In order to understand
this better, I want to break it down into
three smaller questions. So first, do users appreciate
the shortlist interface? Second, do shortlists
increase choice satisfaction? And third, how do users
adapt their strategies? And yes, users really
preferred their shortlists. We asked them to state
a preference between the two interfaces
they interacted with, and most of them
either strongly preferred or preferred the shortlist over the non-shortlist
interface. Moreover, people used the shortlist in over 93 percent of all sessions, even though they could've
just skipped the shortlist. Are people more satisfied
with their choices, is another question
we looked at. So we asked them, "Which interface
do you think gave you more satisfaction in terms
of your final choice?" And again, most people strongly preferred or preferred the shortlist interface. So that's another positive
benefit of shortlists. And then, lastly, how did
users adapt their strategies? This was more to understand how adding a new interface component also changes the way
people behave, and so what we did is, we asked them to self-report the strategies
that they used. The first one, "first good," which was the most frequent one when they didn't
have a shortlist, was to pick
the first good item, which is called satisficing
sometimes in psychology. Track multiple is
the consideration set strategy. Track 1 is just one
best option currently, and then choosing
that in the end, and other is something else. Now, when you give
them the shortlist, then most people switch
to this strategy, to the optimizing one. So with shortlist, they
satisfice really less, and they optimize more. So you can really influence also by the interface
how people interact. And moreover, in
their statements, people reported that they
had lower cognitive load. It was easier to keep track
of things with shortlists. So, I hope that I was able to motivate that shortlists
not only helped long term, because
we were able to provide improved
recommendations, but also short term: we reduced the cognitive
burden on users, and we were able to increase user satisfaction.
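Just to make the held-out evaluation concrete, here is a minimal sketch of the mean reciprocal rank computation described above. All movie names and scores are made up for illustration; this is not the actual experiment code, and in the real setup there would be 99 random distractors per session.

```python
def mean_reciprocal_rank(sessions, score):
    """For each session, embed the finally chosen movie among random
    distractor movies, rank all candidates by the model's score, and
    average 1 / rank of the chosen movie. With 100 candidates,
    1/100 is worst and 1 is best."""
    total = 0.0
    for chosen, distractors in sessions:
        candidates = [chosen] + list(distractors)
        # Higher score means ranked closer to the top; rank is 1-based.
        ranked = sorted(candidates, key=score, reverse=True)
        rank = ranked.index(chosen) + 1
        total += 1.0 / rank
    return total / len(sessions)

# Toy example: the "model" is just a lookup table of made-up scores.
fake_scores = {"chosen": 0.9, "a": 0.5, "b": 0.7, "c": 0.95}
sessions = [("chosen", ["a", "b", "c"])]
mrr = mean_reciprocal_rank(sessions, lambda movie: fake_scores[movie])
# "chosen" ranks 2nd of 4 candidates, so the MRR here is 1/2
```

In the experiments, the scoring function would come from the ranking SVM trained on pairwise preferences, with examined, clicked, or shortlisted items ranked above skipped ones.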
Is that a question?>>So shortlists, people are
very familiar with, right? So there's a habit that goes into it, because a lot of people are
familiar with shortlists. But did you test this with
a completely new UI mechanism? Not a shortlist, something
very different.>>That's a very good
question, and we haven’t. But I’m sure there’s many
other ways to support users. I mean, this consideration
set formation is just one theory
of how people, or one strategy that people employ when it comes
to making decisions. But there are other
strategies as well, and I think a good UI
should make it possible to support a multitude of
these strategies really. So thinking about,
along these lines, is very interesting. Yeah.>>Did people take longer to make their choice
with the shortlist?>>I think yes. We
found a slight increase in session length. Yes.>>It could be good
or bad [inaudible].>>Could be good or bad
depending on what you’re optimizing for as
a systems designer, but we looked outside
our perceived satisfaction. Were they happier
with what they did? Because giving up might
not make you happier, even though you’re
more effective.>>So you may have talked about this:
in the algorithm, for the signals. The movies in the shortlist, and the movies people may click on but didn't put on the shortlist, those signals, do you
give them different weights in the algorithm, or do you find the weights influence the
performance of the algorithm?>>We didn't use weights, but what we did
is we had another.>>Then shortlisted could be
two, [inaudible] could be one. There are even weights to play
with maybe [inaudible].>>What we tested, what
we did in the paper is that we ran also an experiment where we said that shortlisted items should
be ranked above “Clicked”, just clicked items that
were not shortlisted. So shortlisted and clicked
is stronger than just clicked, and then that is stronger
than any of the skipped items. But that didn’t
improve performance.>>Another note, I think, is that maybe something interesting
that you can experiment with is the number of spots
available in the shortlist. Because maybe without a challenge, people are just thinking,
just [inaudible]. But having this box,
[inaudible], people will have a mindset, "Maybe I need to fill all of them. The longer it
stays, I'm going to have more and
[inaudible] more." To see whether
that can influence things. We give them some goal, and suddenly people will
be feeling the goal.>>Like the big bowls, when you go and get
frozen yogurt, and you're like, "Oh, this is nothing," and you pay $15 for it. That's a good idea,
it's really cool. Just to move
a little bit further, I want to quickly talk
about other ways of collecting more and
better feedback data. And one way was to come
up with a new interface. But another avenue is to keep
the existing interface and just work with the incentives, changing certain things. And that's what we explore in this and the
following two projects. So here, we
looked at how we can shape the feedback data we get, inspired by Information
Foraging theory, which explains how people seek information in
unknown environments. And I'm not going to go into
too much detail about it, but practically what that
meant is that we examined how two things influence users'
implicit feedback. First was information
access cost, which we mapped to whether you could click on an item for
more information or hover. So hovering obviously, is lower effort because
information just pops up. Clicking requires
you to open a pop-up and then close the pop-up,
so it's higher cost. And then, information
scent was the second axis, where we varied the length of the information descriptions. So in the weak scent
condition there was nothing shown below
the posters, just the posters. And in the strong
scent condition, you have the number of stars, the title, and the genres
along with the year.>>[inaudible]>>What? Scent? Because it's inspired by how predators hunt for prey. So Information Foraging theory comes really from
these carnivores: they smell something, and then they follow a lead, and if the scent is strong
enough, they follow the cue. So that's really how
this was inspired. I don't know, it's just
the vocabulary they use. So what we found, by exploring these
interventions, is that feedback quality, as we measured it in the paper,
was not affected. But feedback quantity
is something that we were able to shape. And we were able to
increase this significantly by lowering the
information access costs. So, hovering really made people more likely
to provide feedback, and we were able to increase it moderately
by weakening scent, by showing them less information
so they had to click. This is also what
you would expect. But this is another kind
of knob that allows you to tweak the feedback
that you get. However, notice that here, in the user survey
that we asked, people preferred the strong scent. That is because it shows
more information upfront. But from an ML perspective, you might sometimes prefer showing less information so
people would also give you the feedback you want, even for items they're just
mildly interested in. So the optimal ML
and HCI design points do not have to coincide. Even though, for the shortlist,
that was the case. In general, it doesn’t
have to be that way. And then, this is the final project that
I want to talk about. This is this year’s
wisdom paper, where we looked at
the cost of exploration from the users perspective. Our goal was to kind of
collect feedback data, look at how we can
collect feedback data from exploration while maintaining
user satisfaction. Just to familiarize,
I mean most of you are very familiar with this, but let’s quickly talk about
the exploration trade off. Let’s say you have a user
that likes Finding Nemo. Based on that information, you could exploit that fully and recommend
movies that are very similar in order to maximize short term
user satisfaction, or you could try to explore and cover all possible user
interests, so you learn more about the user in
the long run, and make a diverse set of
recommendations here. I mean, there's obviously
this trade-off, and we studied this trade-off
from a user perspective. And so what we did is, we created what's called
mix-in exploration, where we mixed items
from the exploration policy into items from
the exploitation policy, which was just a
content-based recommender. And then, we looked
at how people behaved, as well as their satisfaction, and the feedback signal. So what we found is that, good news, limited
exploration is cheap. Meaning that if the amount of exploration is not too large, the impact on user satisfaction
and on feedback quality and quantity is
minimal, not significant. But once you move beyond
a certain point, you get this non-linear
increase in costs. So if you use some slots for
exploration, that's fine. But go past a certain point, and you really hurt users.>>Is this figure like a cartoon
illustration or [inaudible]?>>This is a cartoon
illustration, I should say at this point,
if it wasn't clear. Because it just has good and bad on it, and I don't know,
I haven't quantified it.>>I thought you
were just hiding the Y axis because it’s too
complicated [inaudible].>>Yeah, this is just a sketch. But for more details, I’m happy to talk to you. But this is essentially
what we found. So I think this is also
a step towards better understanding how we
can gather data. Yes?>>But it’s not fair
that you have to do like the learning performance
might not be [inaudible] in the amount of exploration. And the only explorer
[inaudible] suffer. [inaudible] So it’s
that sweet spot of not hurting the users is the same as doing nothing
exploratation for learning.>>But is that
user’s satisfaction, kind of like the objective
that you want to exploit out there?>>Not for long. But I want to
maximize long term [inaudible].>>I’d rather say that
this is kind gives you an upper bound on how much, at any given point,
you should explore. And then you can obviously, the more you stay on the exploitation site
the better it will be. Once you’ve acquired
enough information. All right, so that brings
us close to the end. Just want to quickly touch upon some other work that I’ve done. Mainly on how to
evaluate systems better. So I’ve worked on evaluation of ranking functions,
of word embeddings, and then how to
combine data from multiple logging policies
to evaluate a new policy. And most importantly, I want to thank
my great collaborators, all the Cornell professors
I worked with, my MSR mentors, and all
the great Cornell students, both Undergraduate and Graduate, that I had a chance
to work with. And moving forward, I
just want to give you a brief outlook on
what my vision is. My vision is really to have highly accurate and
useful interactive systems through holistic design. And as I've argued, that requires us to think about how components of an interactive
system work together. And I’ve already laid out
some examples of how to do this but there’s plenty of other exciting questions left like on the algorithmic side, you can think about
how to learn or how to assign online experiments so that you can learn
optimally from locked data, take care of the biases there. You can think about
including other components, which is growing area, fairness, transparency and
dependability, and so on. And then, from
the user standpoint, I like thinking about putting users in control of
their predictions. Because the shortlist gives them a way to express what
they're interested in. But you could think even further about what a good mechanism
is for users to specify what
they’re interested in, or to give feedback to the
system in a more explicit way. That also maybe benefits
them in the short term. In general, I think I've only explored a handful of incentives for
data generation. But there should
be a whole zoo of possible patterns and
mechanisms out there. And to wrap everything up, I hope I was able to
show that there are really multiple
complementary pathways to improving Machine Learning
beyond the algorithm. In the first part, I talked about how to improve ML in interactive systems by understanding
the biases in the data, and then we used techniques from counterfactual inference
that allowed us to account for these biases in learning to build more robust ML systems. And then in the second part, we started by thinking
about how people work, and how they interact, and that allowed us to
design interfaces so that we can shape and maybe
collect new feedback signals. That I think concludes
my talk. Thanks.