>> The first speaker is Andrew Ng and he’s going
to speak about “Unsupervised Feature Learning and Deep Learning.” He’s done lots and lots of great work and there’s many things I could say, but he said I should just keep it short and let him get up and give his talk. His research interests include machine learning, reinforcement learning and robotic control, and he’ll share some of his work with us now.
>>NG: Okay. Thanks, Allie. I’ll tell you a bit about Unsupervised Feature Learning and Deep Learning. I know that most of us in this room care primarily about computer vision, so I’m going to develop these ideas using computer vision, as well as examples drawn from audio and text. Let’s say you want to do image classification, right. What you do is you take an image and you feed it to a learning algorithm and it learns to classify it as a motorcycle or not. Almost none of us actually does it like this; what we do is we actually take the image and we construct a feature representation, and it’s that feature representation that we feed to, say, a support vector machine. More generally, this type of pipeline is pervasive
in computer vision and in other domains, where some low-level feature is what we actually feed to the learning algorithm. In fact, this holds more broadly across computer perception, right: in vision we don’t feed the images to learning algorithms, we feed low-level features like edge detectors or SIFT or other things. If you have an audio application, you don’t feed the raw audio; you feed instead, you know, various audio
features to do, say, speaker identification. And in natural language processing, when we do machine learning on text, you feed various text features to the applications you actually care about. Given that this pipeline is what almost all of our learning algorithms, almost all of our vision and other applications, use, the low-level features in the middle are the primary lens through which our learning algorithms see the world. And so, this gives them a certain degree of primacy. And what I want to talk about today is: where do these features come from? So let’s actually look at what we do right now. In computer vision, where do we get these features from? Well, in computer vision, the answer is that over
the last few decades our community has spent a lot of effort hand designing or hand engineering
these features and as a result of several decades of work we’ve now found some, you
know, pretty good features, but that was a long process. So we have to ask, “Can we do better than this?” That’s vision; how about audio? In audio, the state-of-the-art answer is that over the last few decades there’s been an entirely separate community of audio researchers that’s been putting a lot of insight into coding up good audio features, and some of these are incredibly clever ideas, like MFCC. That one is actually, what, 20, 30 years old. It’s actually hard to beat, but again you have to ask, you know, is there a better way to come up with features than to have a whole community of researchers trying to design these things by hand? And for text, same thing, right? In fact, a good fraction
of NLP research today is just coming up with, say, better parsers. And what is a parser good for? Well, they’re primarily used as a feature for other things. So [INDISTINCT] extraction, you know, you don’t actually use that for–or semantic [INDISTINCT]. Most of these things are designed to be used as features for later learning algorithms. So, is there a better way? Is there maybe a better way to come up with better features, and to do so with less of our time and our [INDISTINCT] time? The one I’m going to talk about takes
some inspiration from biology and I want to share with you some of that. Like many of
you, I tend to treat biological inspiration with great caution and often a healthy dose
of skepticism but some of this is actually really cool, so I’m going to share that with
you. So it turns out that there’s a fascinating hypothesis that a lot of human perception,
a lot of biological perception can be explained by a single learning algorithm. Okay, this
is a hypothesis so this is not really proved, but let me share with you some of the evidence
for that. And this idea has been around for several decades; recently popularized by Jeff Hawkins, it’s an idea that’s been around for at least three decades. In that image, the part of your brain shaded red, that’s your auditory cortex. The way you’re understanding my words now is that your ears are taking the sound signal and sending it to the auditory cortex, which is at the lower part of your brain, and that part of your brain is the part, you know, that figures out what I’m saying, right, processes the sound. So, what neuroscientists, [INDISTINCT] at MIT, did was they took a number of animals and they essentially cut the wire between the ear and the auditory cortex and rewired the brain so that the signal from the optic nerve gets routed to the auditory cortex. And what you find if you do this is that the
auditory cortex learns to see, okay? And this is actually not controversial. This is replicated
in multiple labs in at least four species of animals and these animals can do behavioral
tasks. They can do visual discrimination tasks, you know, with the auditory cortex. The same rewiring experiment also works for your somatosensory cortex: you can rewire to the part of your brain responsible for touch processing, and it will learn to see with a totally different part of your brain. And more broadly, it turns out this is evidence that the same piece of brain tissue can process sight or sound or maybe touch. And if the same piece of brain tissue can process sight or sound or touch, then maybe there’s actually one learning algorithm, maybe there’s one algorithm that can process sight or sound or touch, and can we discover what that algorithm is? Just a few other quick examples: on the bottom is an example of seeing with your tongue.
This is something called BrainPort, undergoing [INDISTINCT]. The way it works, this is being used to help blind people see, right: you strap a camera to your forehead, it takes a gray-scale image of what’s in front of you, and it impresses that gray-scale image onto a rectangular array of electrodes on your tongue, so that, you know, a high voltage corresponds to a dark pixel and a low voltage corresponds to a light pixel. So you see through the array of electrodes on your tongue. And it turns out that with training, you and I would be able to learn to see with our tongues. Human
echolocation: you can do this either by snapping your fingers or clicking your tongue–there’s no, well, this room has good acoustics, it’s hard to tell–but by doing that, humans can learn to interpret the pattern of sounds bouncing off the environment as sonar. And if you search on the Internet, there are amazing videos of this kid with no eyeballs. Tragically, there was a kid whose eyeballs were physically removed because of cancer. But there’s a kid with no eyeballs who can, by snapping his fingers, walk around and not hit anything. He can ride a skateboard, shoot a basketball into a hoop. Okay. And the examples just go on and on. There’s a haptic belt that gives humans a sense of direction. You can implant a third eye in an animal, and it will learn to use the third eye. And there’s actually
a surprising wealth of evidence that suggests, you know, to some extent you can take any
of a large range of senses and plug them into your brain and very often–not always, but
very often the brain will figure out what to do with it. So the question that excites me is, is there some fundamental principle that underlies perception, and could we figure out what it is and use it for computer vision, of course, because that’s what we want to solve. So,
just to summarize this, right, and to philosophize a little bit, I think there are two possible approaches to doing computer vision. We all know that the adult visual system is incredibly complicated, so one approach to computer vision is to try to directly implement what the adult visual system is doing. So, you know, you can implement features that capture different types of invariance, capture context, capture relations between object parts, 3D structure, and so on, and maybe we can implement a visual system that way. Or, and no one can prove this, it may go wrong, but if you believe that there is a more general computational principle that underlies perception, maybe not even just vision, then discovering what that algorithm is may turn out to be much simpler than capturing, you know, what the trained adult visual system can do, okay? So let’s talk about this research agenda.
And, of course, this research agenda–you know, I started working on this five years
ago but now this is an agenda shared by many researchers in machine learning. And I should
say there are very few people that have been working on this for, like, 30 years, okay.
So [INDISTINCT] here’s a problem I want to propose: can I give your learning algorithm a set of images and have it find a better way to represent images than the raw input pixels? And if we can do that, maybe I can show the same learning algorithm a bunch of audio and have it find a better way to represent audio than raw waveforms. So, as a concrete example, just to really, you know, formalize what I mean by that, here’s a concrete instantiation of the problem. Given a 14x14 image patch x, one way to represent it is to use 196 real numbers, the 196 pixel intensity values. The problem I want to pose is, can you find a better feature vector to represent that 14x14 image patch than the raw pixel intensity values? Okay. And we know how to hand-design things, but can you learn something automatically? And in order to really evaluate this and test it in a formal machine learning framework, let’s actually pose a formal learning problem. So, this is a standard supervised learning problem where maybe I give you a small number of examples of cars and a small number of examples of motorcycles, and then you learn, and at test time I say, you know, what is that at the bottom? Three
years ago, we posed an unsupervised feature learning formalism we call self-taught learning. In it, the challenge is: in addition to your labeled examples, we’re going to give you a ton of unlabeled images, maybe random Internet images, and the challenge is, can those random images down on the bottom left help you figure out that that test example on the bottom right is a car? Okay. And so concretely, one natural thing you might try to do, if we can come up with an algorithm, is to take all the unlabeled images, learn a better representation for images, and then use that better representation to learn to distinguish between cars and motorcycles. Okay. So how do we do that? So it turns out,
there’s an algorithm that was originally invented by the neuroscientists Bruno Olshausen and David Field that actually works for this task. So, Bruno and David are neuroscientists, and this was originally conceived as a theoretical neuroscience piece of work on how the brain’s visual cortex works. And when they invented it, I don’t think they ever had in mind that this would be used for machine learning. So five years ago, we proposed to use this for machine learning, and let me share with you how we actually went about doing that. So, this is how Sparse coding works.
Sparse coding is an unsupervised learning algorithm, and we’re going to give the algorithm a set of images x1 through xm, okay? Each of these images, let’s say an n-by-n image patch, is like a 14x14 image. What Sparse coding does is it learns a dictionary of basis functions φ1 through φK, let’s say 64 basis functions, so that each input x can be approximately decomposed as a linear combination of your basis functions, x ≈ Σj aj φj, where each aj is a real number, subject to the constraint that the aj’s are mostly zero.
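(To make that decomposition concrete: here is a minimal, illustrative sketch of approximating an input x as a sparse combination of dictionary elements φj. Real sparse coding solves an L1-regularized optimization; greedy matching pursuit, used here on a toy hand-picked dictionary, is a simpler stand-in that gives the same flavor.)

```python
# Approximate x as a sparse combination of dictionary elements phi_j.
# Greedy matching pursuit stands in for the L1-regularized solver that
# sparse coding actually uses; the dictionary below is a toy example.

def matching_pursuit(x, dictionary, n_nonzero=3):
    """Greedily pick bases so x ~= sum_j a_j * phi_j with at most
    n_nonzero of the coefficients a_j being nonzero."""
    residual = list(x)
    coeffs = {}
    for _ in range(n_nonzero):
        # pick the basis most correlated with the current residual
        best_j, best_dot = None, 0.0
        for j, phi in enumerate(dictionary):
            dot = sum(r * p for r, p in zip(residual, phi))
            if abs(dot) > abs(best_dot):
                best_j, best_dot = j, dot
        if best_j is None:
            break  # residual already fully explained
        coeffs[best_j] = coeffs.get(best_j, 0.0) + best_dot
        residual = [r - best_dot * p
                    for r, p in zip(residual, dictionary[best_j])]
    return coeffs, residual

# toy unit-norm dictionary (standard basis vectors, standing in for edges)
dictionary = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]
coeffs, residual = matching_pursuit([0.8, 0.0, 0.3, 0.0], dictionary, 2)
print(coeffs)  # -> {0: 0.8, 2: 0.3}: most coefficients stay at zero
```

The dictionary in the talk is learned from data rather than hand-picked; only the sparse encoding step is sketched here.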
And then that’s [INDISTINCT] in terms of sparsity. Okay, this is my only equation in this talk; let me actually re-explain this in pictures. So, this is what Sparse coding does. We train Sparse coding by taking a set of natural images, cutting out random patches, and feeding those to the learning algorithm. What the learning algorithm does is it learns, in this case, 64 basis functions, shown on the upper right, and you see that those basis functions correspond roughly to edges, right, like Gabor edge detectors. When you give the algorithm a new test example, what it does is it selects out a small subset of your 64 basis functions and it expresses your test example as a linear combination of just that small subset of your 64 basis functions, okay? And so you can think of this as saying that test example x is 0.8 times, you know, basis 36 plus 0.3 times basis 42 and so on. And because we’re using only three out of 64 of these basis functions, this is a very sparse vector, right? Most of the coefficients are equal to zero because we have only three nonzero coefficients.
Speaking informally, what this algorithm has done is it has taken your image patch x and decomposed it in terms of what edges appear in the image, right? It’s saying that your image x is about 0.8 of edge number 36 plus 0.3 of edge number 42 plus 0.5 of edge number 63. And so, speaking loosely, this is an algorithm that has “invented” edge detection. And if you look at the vector at the bottom, a1 up to a64, that vector of coefficients, right, that vector a1 through a64 is now a new feature vector for representing my image patch, and it expresses the image in terms of what edges appear in it rather than in terms of the raw pixel intensity values. Okay. Just a couple more examples. And as you can imagine,
by taking an image and representing it in terms of what edges appear in it rather than in terms of the raw pixel intensity values, this is a much better representation, right, for, you know, the various computer vision tasks we want to do. And just to be really concrete, right: if you take one of those vectors, the a1 through a64, you can now feed that vector to your learning algorithm, to your support vector machine or other learning algorithm, instead of feeding it the raw pixel intensity values, and hopefully this is now a better representation to do, you know, standard vision tasks. Okay? Finally, other authors [INDISTINCT] have shown that the output of this algorithm is actually quantitatively similar on many dimensions, not all, but quantitatively similar to primary visual cortex, early vision in the [INDISTINCT], right. So, that
was vision. How about audio? And I’m going to mostly talk about vision but I think it’s
interesting to think more broadly about perception. So Evan Smith–Evan was a, what, postdoc [INDISTINCT] with Jay McClelland–what Evan did was he took Sparse coding and applied it to raw audio data, and this figure shows 20 basis functions that he learned by applying Sparse coding to audio. What Evan did next was he went to the cat auditory processing system, and for each of the 20 basis functions, he found the closest match, the closest analog, in the cat’s biological auditory processing system. And so the blue functions are those learned by Sparse coding on audio, and if you take the closest matches in, you know, biology, those are shown overlaid in red. Okay. And so,
this is an algorithm that, on the one hand, gives a good explanation for early visual processing–we know early V1 is doing edge detection–and it gives a good explanation for early audio processing, again not on all dimensions, but on surprisingly many. It turns out–and this is work done by Andrew Saxe, a student at Stanford who worked with me–what Andrew Saxe did was he actually collected touch data, you know, following the way that animals use their hands in the wild, and actually demonstrated that in some respects this sort of model is a good explanation for somatosensory processing as well. So, just mapping this back to vision, right?
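(The self-taught learning pipeline described earlier–learn a representation from unlabeled images, re-encode the labeled examples, then classify–can be sketched end to end as follows. A one-pass k-means-style dictionary and a nearest-neighbor classifier stand in for sparse coding and an SVM, and all data and labels here are made up for illustration.)

```python
# Toy self-taught learning pipeline: (1) learn a dictionary from
# unlabeled data, (2) re-encode labeled examples using it, (3) classify
# in the new feature space. Illustrative stand-ins only.

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def learn_dictionary(unlabeled, k):
    """One assign-and-average pass of k-means over the unlabeled data."""
    step = max(1, len(unlabeled) // k)
    centroids = [list(p) for p in unlabeled[::step]][:k]
    buckets = [[] for _ in range(k)]
    for p in unlabeled:
        nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
        buckets[nearest].append(p)
    for j, bucket in enumerate(buckets):
        if bucket:
            centroids[j] = [sum(vals) / len(bucket) for vals in zip(*bucket)]
    return centroids

def encode(x, dictionary):
    """The learned feature vector: similarity to each dictionary atom."""
    return [-dist2(x, atom) for atom in dictionary]

def classify(labeled, dictionary, query):
    """Nearest labeled example in the learned feature space."""
    q = encode(query, dictionary)
    best = min(labeled, key=lambda xy: dist2(encode(xy[0], dictionary), q))
    return best[1]

# unlabeled "images" (toy 2-d descriptors) fall into two loose clusters
unlabeled = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
dictionary = learn_dictionary(unlabeled, k=2)
labeled = [([0.1, 0.1], "motorcycle"), ([0.9, 0.9], "car")]
print(classify(labeled, dictionary, [0.8, 1.0]))  # -> car
```

The point of the sketch is only the shape of the pipeline: the dictionary is learned from data that carries no labels at all, and the labeled car/motorcycle examples are only used at the last step.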
We now have an algorithm that can invent edge detection; it gives us a somewhat better representation for images than the raw pixel intensity values. It turns out that the person who’s done probably the most amazing work applying these ideas is Kai Yu; he’s going to talk more about that, so I’ll instead take this in a different direction, which is: can we go beyond edges? How else can we extend this? So, let’s see, imagine in this diagram that the
lowest level is the input image. One level up, you can apply Sparse coding and learn edge detectors. It turns out you can repeat this process: just as you can group together pixels to form edges, you can group together edges to form more complex feature detectors, and build more and more complex features. The details of this would be either a sparse version of Geoff Hinton’s DBN or a sparse autoencoder. But when you do this–let me just show you the results. So, with one layer of Sparse coding or one layer of sparse DBN, we group together pixels to form edges. Okay. And again, just to be really concrete about what this means: there are 24 basis functions here, and this one on the upper left, that’s an edge detector, you know, detecting edges at, like, what, an eighty-five degree angle. So what this visualizes is, this says that I have one feature that says, “Do I have an edge at that orientation?” Okay? When you repeat this process–this is the
result of training the algorithm on faces–what you find is that at the next level up you find detectors for object parts, and another level up you find detectors for more complete models of faces. So concretely, this is a feature detector that, you know, fires whenever it sees, what, like maybe an eyebrow. This is a detector that fires whenever it sees, what, like the side of a nose. These are all visualizations of very complicated nonlinear functions, but this is our attempt to visualize, sort of, [INDISTINCT]. And you can imagine, if you want to do various things with faces, these sorts of features could be very useful. And all of this is learned from raw [INDISTINCT] data, right? If you train these algorithms on different object types, you get a similar sort of edges, then object parts, then complete object models decomposition. For example, if you train these on cars, you get [INDISTINCT] of cars and then more complete models of cars, and if you train this sort of model on multiple object classes, then, you know, at intermediate levels you get features that are shared amongst object classes, and at the highest level you get class-specific features. So this shows at least that we can
learn interesting features but what is this all good for? So let’s talk a bit about machine
learning applications. The Hollywood 2 dataset is a standard benchmark in video activity recognition, and, you know, the task is you watch a short snippet of a Hollywood movie and then you recognize whether any of a large number of activities occur. I don’t remember the exact list, but things like, you know, whether two people hug, whether they shake hands, whether they kiss, whether someone’s running, activities like that. So over the last few years, various authors, referenced here, have developed and used fairly sophisticated features to do activity recognition on Hollywood 2. And over the last few years, there’s been substantial progress: by using unsupervised feature learning, you actually get significantly better results than any of the previous work. Okay. That’s
images. How about audio? So it turns out that you can take the same sort of algorithms and apply them to audio. What’s shown on top is a spectrogram, and what you can do is take snippets of spectrograms and apply sparse coding to that spectrogram data. And if you do that, then it learns to decompose, you know, spectrogram snippets in terms of sparse linear combinations of basis functions like that. And if you look at the basis functions learned by sparse coding or by sparse DBNs–these sorts of sparse learning algorithms–this is an example of the first-layer dictionary of basis functions learned by the algorithm. We find that many of these basis functions seem to correspond to phonemes, and so just as sparse coding applied to images invented edge detection, the same algorithm applied to audio invented phoneme detection when applied to speech. Okay. And then you can apply this recursive idea again. So this is a sparse DBN,
so we take the spectrogram, throw on top a layer of sparse DBN nodes like that, and then, you know, learn additional layers of features on top. And again, just to be really clear, right? This is a graphical model, and the way we’re extracting features from it is: the input is the spectrogram, and then we compute the posterior mean of each of these nodes at the top. So this gives me, you know, the numbers in this representation, and those numbers are my feature representation for this spectrogram input; it’s, you know, the posterior mean of those graphical model nodes that gives me a vector. That’s what I feed to my learning algorithm to do, you know, various
audio tasks. So, the TIMIT benchmark is a standard benchmark in phoneme classification, and, you know, audio researchers compete on this benchmark. I should say this is not the benchmark that audio researchers compete most intensely on of all the benchmarks, but this is one of those datasets where, you know, if you do 0.1% better on this dataset, you write a paper on it. So over the last decade, there was 1.6% worth of progress on this dataset. And a couple of years ago, Honglak, a former student, was able to make 1.1% progress; that’s like two-thirds of a decade of progress on a pretty significant audio benchmark.
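(The feature extraction described a moment ago–computing the posterior mean of each hidden node given the input–reduces, for a layer of binary sigmoid units, to computing sigmoid(Wx + b), and stacking layers just feeds each layer’s posterior means to the next. A minimal sketch with made-up toy weights; a real sparse DBN’s weights are learned, not hand-set:)

```python
# Posterior-mean feature extraction for stacked sigmoid layers: each
# hidden unit's posterior mean is sigmoid of its net input, and layers
# are stacked by feeding one layer's output vector to the next.
# All weights below are toy values chosen purely for illustration.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_posterior_mean(x, weights, biases):
    """Posterior mean of each hidden unit: sigmoid(W x + b), row by row."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def extract_features(x, layers):
    """Stack layers: each layer's posterior means feed the next layer."""
    for weights, biases in layers:
        x = layer_posterior_mean(x, weights, biases)
    return x

# toy spectrogram frame and two toy layers (4 -> 3 -> 2 units)
frame = [0.2, 0.9, 0.1, 0.4]
layer1 = ([[1.0, -1.0, 0.5, 0.0],
           [0.0, 2.0, -1.0, 1.0],
           [-0.5, 0.0, 0.0, 1.0]], [0.0, -1.0, 0.2])
layer2 = ([[1.0, 1.0, -1.0],
           [-1.0, 0.0, 1.0]], [0.0, 0.5])
features = extract_features(frame, [layer1, layer2])
print(features)  # the vector that would be fed to the phoneme classifier
```

The resulting top-layer vector is what plays the role of the hand-engineered audio features in the TIMIT experiments.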
And the thing I like about this is that when Honglak and Peter Thomason and I were doing this work, none of us actually knew that much about audio. None of us are experts in audio. So Honglak, you know, not an expert in vision, not an expert in audio, but by being only an expert in machine learning, was able to compete with, you know, audio researchers on their home turf. All right. So, what are the challenges that face unsupervised
feature learning and deep learning? So there’s actually a large community of people, certainly not just me, developing unsupervised feature learning and deep learning algorithms. I’m going to very briefly share with you what I think is one of the main technical challenges, which is just the idea of scaling up. So, Adam Coates ran an interesting experiment, taking different unsupervised feature learning algorithms, shown in different colors, and scaling them up to different degrees. So the different colors, the different lines, correspond to different algorithms. The vertical axis is performance and the horizontal axis is the size of these models, the number of features you learn. And the trend is actually incredibly clear, right? The larger you can
scale up these models, the better it does. And very often even a simpler algorithm, if you can scale it to learn from more data and learn more features, will do better. And so, over the last few years, I feel like one of the main technical challenges facing unsupervised feature learning is how to come up with efficient algorithms to scale these algorithms up. And I don’t really want to go into this in detail, but concretely, you know, this formalism was proposed in, I guess, 2006; [INDISTINCT] was the first to apply GPUs to this problem; now, essentially everyone in deep learning is using GPUs, plus various other ideas for scaling up these algorithms. I actually don’t want to go into the technical details of these today. But by scaling up these algorithms, we were able to get state-of-the-art results on many machine learning tasks, you know, image, audio and so on. So as I was preparing this talk, I asked my students to summarize
for me all the datasets on which we have the state-of-the-art result, meaning superior to the best published result. And I was actually surprised: there are actually quite a lot more of these than I realized. There’s audio, which I talked about already. Images–I should say on images I believe there’s one unpublished result that may have surpassed us now, but I think we’re superior to all published results. We got state-of-the-art results on many of the most standard video activity recognition benchmarks–all of these surpassed what were, I’d say, the best human-designed features–and in multimodal classification. Unsupervised feature learning and deep learning is a rapidly growing area of ICML, of machine learning.
So I’m certainly not the only one working on this. Just to mention other state-of-the-art type results: I believe, though I’m not 100% sure, [INDISTINCT] may have the best pedestrian detector. Phoneme recognition–we were the first to get the state-of-the-art result for audio on TIMIT; more recently Geoff Hinton has also gotten a state-of-the-art result on a different phoneme recognition task. And Kai has, perhaps more than anyone else, gotten really amazing results on images, so I’ll leave him to talk about that. When I give talks where Kai isn’t following me, I usually have one or two slides on his work, but I won’t do that today. So that’s the state of, you know, a lot of what we and other groups are doing in unsupervised
feature learning and deep learning. In the last, what–how much time do I have, 10, 15, half an hour, 30, 40, 50 minutes?
>>No. Ten minutes and then questions.
>>NG: Okay, thanks. Thank you. So in the last 10 minutes, why don’t I just talk about something more speculative and future-looking, you know, exciting ideas: the idea of learning recursive representations. This is one direction of several in which we’re trying to extend these ideas. So what I’m going to do is actually develop these ideas in the context of text, and then, I promise, I’ll take these ideas and apply them to computer vision and images again. Okay? So it turns out–let’s say I have an
English sentence, “The cat sat on the mat.” And I want to learn a feature representation
for this sentence, “The cat sat on the mat.” There’s a standard, you know, maybe the standard, approach, as follows. First, we take each word and, you know, encode each word using a feature representation. You can think of this as a one-hot vector representation, a vector with a single one indicating which word it is. We usually use other representations, like an LDA representation, or–a few people, [INDISTINCT] Jason Weston [INDISTINCT], have done very exciting work learning distributional similarities of words. We can basically take each word and represent each word with a feature vector. And if you want, think of this as a one-hot feature vector or think of it as an LDA or a PCA vector; there are cleverer things, but I don’t even want to talk about that. Now, what you can
do is throw on top a standard, you know, generic structure like a graphical model or an autoencoder or a neural network or whatever. But if you look at that node, it turns out this sort of generic hierarchy doesn’t really make sense for text, right? Because that node where the arrow is pointing sees the input “cat sat on,” and so it’s sort of the job of that node to represent the phrase “cat sat on.” But the phrase “cat sat on” is not even a proper English phrase. It doesn’t mean anything, right? So it doesn’t make sense to try to represent the phrase “cat sat on”; it just doesn’t mean anything. In contrast, because text, language, has a natural hierarchy, what we’d like to do is learn a representation that respects the natural hierarchy of language.
And concretely, we know that “the cat” is a noun phrase, “the mat” is a noun phrase, and so on. And what we’d like to do is learn a feature representation that respects the natural parse tree, the natural hierarchy, that’s implied by this English sentence. And if you look at that node, that’s called a prepositional phrase; it’s the job of that node to represent the phrase “on the mat.” You can imagine, you know, “on the mat” has a meaning, so we can think about how to learn a feature representation for the phrase “on the mat.” And concretely, just as we were taking image patches and coming up with a feature vector to represent an image patch, we’d like to do the same thing here: we want to take any English sentence or any English phrase–I’m just drawing these as trees as illustration–and learn a feature vector to represent, you know, the “meaning” of each of these phrases. So,
just half a slide on how we do this. The basic unit of computation in the learning algorithm is what’s called a recursive neural network, and that’s a neural network that takes as input the two children nodes and outputs the feature representation of the parent node in this hierarchy. Okay. So, Richard Socher implemented this, and it allows you to recursively compute the feature you want for the parent node as a function of the features you have for the children nodes, and you can train this at large scale–train all of this in one, you know, discriminatively trained framework with one optimization objective–and run the whole thing and it converges. So, what you can do is apply this idea to parse sentences.
I should say this algorithm actually discovers the structure of the parse tree as well as the representations. So, what we can do is actually apply this to parse natural language sentences. Given a sentence like, you know, “A small crowd enters the historic church,” you can now take as input, say, a representation for “the” and a representation for “historic church” and use the recursive neural network to compute the representation for the parent [INDISTINCT]. At each node, we have a feature vector that represents that node.
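(A minimal sketch of that recursive unit: one common form computes the parent vector as a squashed linear function of the two children’s vectors, parent = tanh(W·[c1; c2] + b), applied bottom-up over a parse tree given here as nested pairs. The weights and leaf vectors below are toy values; in the system described in the talk they are trained discriminatively.)

```python
# Recursive neural network unit applied bottom-up over a parse tree.
# parent = tanh(W [c1; c2] + b); W, b and the leaf vectors are toy values.
import math

def rnn_parent(c1, c2, W, b):
    """Compute the parent vector from two child vectors."""
    child = c1 + c2  # concatenation [c1; c2]
    return [math.tanh(sum(w * v for w, v in zip(row, child)) + bi)
            for row, bi in zip(W, b)]

def encode_tree(tree, leaf_vectors, W, b):
    """Recursively compute a vector for every node; return the root's."""
    if isinstance(tree, str):  # leaf: a word
        return leaf_vectors[tree]
    left, right = tree
    return rnn_parent(encode_tree(left, leaf_vectors, W, b),
                      encode_tree(right, leaf_vectors, W, b), W, b)

# 2-d toy word vectors; W maps 4 -> 2 so parents match the leaf dimension
leaf_vectors = {"on": [0.1, 0.9], "the": [0.5, 0.5], "mat": [0.9, 0.1]}
W = [[0.5, -0.5, 0.5, -0.5],
     [-0.5, 0.5, -0.5, 0.5]]
b = [0.0, 0.0]
# parse tree for the prepositional phrase "on (the mat)"
root = encode_tree(("on", ("the", "mat")), leaf_vectors, W, b)
print(root)  # a 2-d vector standing in for the "meaning" of "on the mat"
```

Because every internal node gets a vector of the same dimension as the leaves, the same unit can be reapplied all the way up the tree, which is exactly what makes the construction recursive.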
And it turns out you can apply the same idea to images. So, given an image like this, there's a natural hierarchy, right, to this image. So, you can think of it as a decomposition of the image into parts, into object parts and sub-parts. So, at the bottom is a superpixel segmentation of the image. And there's actually a natural hierarchy in which, you know, this part of the building, where my mouse cursor is pointing, combines with the window to form a bigger chunk of the building. These things combine with the roof, right, to form the building, and so on. And so, there's a natural hierarchy of objects and object sub-parts in many of these images. And you can apply the same recursive neural network formalism, so that you have feature representations for the low-level parts and can recursively combine the children in order to learn representations for every node in this hierarchy. Okay. So, the cool thing about this is that each node in the hierarchy has a [INDISTINCT] vector representation, right? We started off with features at the bottom-level leaf nodes, and as we merge the leaf nodes to form larger and larger chunks of the image, we're recursively computing a feature vector for every node in this tree, okay.
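One simple way to picture how a tree and its node features can be built together, in the spirit of the merging just described, is a greedy parser: compose every adjacent pair, merge the pair whose score is highest, and repeat until one root remains. The scoring function here is a random stand-in for what the real system learns, so this is an illustrative sketch, not the actual training procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.normal(scale=0.1, size=(d, 2 * d))   # composition weights (illustrative)
w_score = rng.normal(scale=0.1, size=d)      # stand-in for a learned scorer

def compose(a, b):
    return np.tanh(W @ np.concatenate([a, b]))

def score(parent):
    # In the real system this score is trained; here it is just a placeholder.
    return float(w_score @ parent)

def greedy_parse(leaves):
    """Merge the best-scoring adjacent pair until a single root node remains."""
    nodes = list(leaves)
    while len(nodes) > 1:
        parents = [compose(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
        best = int(np.argmax([score(p) for p in parents]))
        nodes[best:best + 2] = [parents[best]]   # replace the pair with its parent
    return nodes[0]

root = greedy_parse([rng.normal(size=d) for _ in range(5)])
```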
What you can do, therefore, is select a node in your tree and ask what other nodes have features similar to the node I just selected. All right. And this is a figure that shows that. So, the leftmost patch is, you know, a node we selected in the tree, that is, the union of superpixels, or whatever, in one node of the tree. And then we look through a large database of images and ask, "What other image patches get mapped to the most similar feature vector as this one?"
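That query amounts to a nearest-neighbor search in feature space. A minimal sketch follows; cosine similarity is an illustrative choice here, not necessarily the metric used in this work.

```python
import numpy as np

def nearest(query, database, k=9):
    """Indices of the k database rows whose feature vectors best match the query."""
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = db @ q                            # cosine similarity to every row
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(2)
feats = rng.normal(size=(1000, 8))           # feature vectors for a patch database
idx = nearest(feats[0], feats)               # the query patch is its own best match
```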
And the nine images on the right are the most [INDISTINCT]. And so you kind of see that, you know, somehow semantically, or maybe visually, similar things get mapped to similar feature vectors. So, for example, in the third row, right, that fragment of a car: well, what's most similar to that fragment of a car? Mostly these other fragments of cars. More
importantly, you can apply this sort of feature to standard vision tasks. All this work is somewhat preliminary, but Richard showed that using these features you can outperform, you know, many hand-designed features and actually pretty sophisticated hand-designed models on multi-class segmentation. This is work on the Stanford background dataset; it wasn't collected by me, this is data collected by Stephen Gould as part of his thesis work. And he actually had a pretty darn complicated model for his thesis work, and these learned features outperform that. Just to wrap up, applying the same ideas to text: each sentence has a feature vector representation [INDISTINCT]. The recursive neural network idea applies to text too. So what we can do
is pick a sentence, which I'm calling the center sentence in the second column, compute the feature representation for that center sentence, and ask, "What other sentences get mapped, you know, to nearby locations in feature space?" Those are shown in the right column, all right. So, in this example at the bottom, the things that are similar to "Columbia, South Carolina" are "Greenville, Mississippi" and "UNK, Maryland." UNK is the parser's annotation for a word that's never been seen before. So, even though it's never seen this word before, it knows that "unknown word, Maryland" is probably a similar sort of thing to "Columbia, South Carolina"; that's pretty cool. Just a few more examples, all right? So, what sentences are similar to "Hess declined to comment"? Well, the fourth most similar sentence is "Coastal wouldn't disclose the terms." Now, "Hess declined to comment" has no words in common with the phrase "Coastal wouldn't disclose the terms," but we learn a representation for these two phrases that says, you know, these two sentences are similar to each other. It's a useful, useful representation for other NLP tasks,
okay? And finally, you can evaluate this sort of algorithm, you know, using more traditional NLP metrics, like how well it works as a parser. Right now, it's a respectable parser, though not superior to the state-of-the-art hand-designed parsers, but since [INDISTINCT] some of more [INDISTINCT], okay? So, one last slide and then I'm done. What I've done is
describe a research agenda on unsupervised feature learning and deep learning. And over the last five years, as I've been going around talking about this, you know, I've heard frankly many weaknesses and criticisms of this work, and there are weaknesses and criticisms of this agenda. [INDISTINCT] just mention some of them and, sort of, acknowledge them, and maybe try to add something as well. One criticism that I often hear is this: you know, your algorithm is trying to learn everything, but we know lots about images, so it's better to encode the prior knowledge we have about images. It turns out that there was a similar debate in NLP, in Natural Language Processing, about 20 years ago. 20 years ago, there were the linguists, who had been building pretty sophisticated linguistic theories, and then there were the dumb machine learning guys, you know, guys like me [INDISTINCT], who just wanted to apply learning algorithms to tons of data and see what happens. 10 years
later, it's actually really clear which side has carried the argument in NLP. And what we, what I think, you know, NLP researchers realized was that language is an incredibly complex phenomenon and there's no way for us to encode everything we need to know about language. Language was just so complex, you can't write down everything you need to know about it. And if your approach to solving NLP is hand-tuning features a little bit at a time, or if your approach to solving NLP is to write down every single phenomenon and then manually add one graphical-model node for every phenomenon you can think of in language, right, you're basically dead. Language is so complicated that if you do this one graphical-model node at a time, or tune one feature at a time, you're just never going to get there. I think the debate is still
playing out in computer vision, and maybe vision is a simpler phenomenon than language, and maybe it's possible to, you know, build models complex enough to capture all the phenomena there are in images. But again, it depends on what you believe is true for images versus for language. Over the years,
I have heard a lot of criticisms of unsupervised feature learning of the form, you know, "unsupervised feature learning cannot currently do X," where X is many of the things on that list below. So, it turns out all of these actually were valid criticisms. Many of these were true at the time but are not true anymore, and, you know, I think there is certainly still work to be done; there are still things at the bottom of that list. As we strike things off the bottom of this list, of course, we ourselves, as well as our colleagues and our friends, help add more things to the bottom, which is the way it should be. So there's certainly work to be done and there are certainly problems we have not yet solved. And finally, one criticism I hear that actually
is a serious problem is that we don't understand the learned features. We're applying these learning algorithms and they learn something; they learn some horribly complicated nonlinear function, and, you know, there are people working on visualization tools to understand these things, but we certainly don't understand them, and that actually is a serious problem. I think many vision features are also hard to understand, especially when you, sort of, use these multiple kernel learning type methods to combine tons of other features; you know, what is the output? I don't know. But I think the fact that we don't understand the output of these algorithms actually is a problem. Okay, just to wrap up. I've talked
about this agenda of unsupervised feature learning and, you know, let's try to learn the feature representations. Then, of course, feature learning is just a narrow technical formalism of the problem. This problem isn't really about learning features, right? It's about understanding: is there some fundamental computational principle that underlies perception? That underlies maybe not just computer vision, but perception more broadly. And can we discover computational principles that underlie perception and implement them? Because that might be a much simpler way to, you know, solve vision and make progress in vision than to just implement, one by one, the things that we think the vision system should do. I talked about sparse coding and the deep learning variants, which have proved
very successful for various tasks. And finally, I know this talk has been light on technical details, and if you are [INDISTINCT] into your usual, or if you are a researcher and you want to use these ideas, the best way to pick these things up is actually not by getting the details from, like, the top of my mind; the best way is probably to actually look at a tutorial and some [INDISTINCT], you know, [INDISTINCT] read through some technical things and [INDISTINCT] exercises and so on. Stanford students and I have been working pretty hard on setting up online tutorials, with the goal of actually bootstrapping people up to really apply all these techniques to your problems. There's an extremely preliminary version up at that URL, but we are working on a more sophisticated tutorial, hopefully a seriously well-designed tutorial, with readings and [INDISTINCT] exercises, to really get people up to speed as quickly as possible on using these things to solve your problems. So, email me if you'd like to be pointed to that; I'm certainly happy to share it. Actually, I was designing this for Stanford students (which is the selfish reason I was designing it), but I'm happy to share it with others as well. And then, of course, unsupervised feature learning is a large community. So, you know, there are many other people working on this; I'm certainly not the only proponent. You should take ideas from this community and apply them to your work.
Thank you very much.
>>We have time for some questions. I think [INDISTINCT], if you can just repeat it, so [INDISTINCT]…
>>I'm just curious what happens when your data is [INDISTINCT] a little bit [INDISTINCT], et cetera, et cetera. How well does unsupervised feature learning really work [INDISTINCT]?
>>NG: So the question is, what happens if your data isn't perfect? So humans can interpret noisy images, with blurry [INDISTINCT] unsupervised feature learning. So I think this is a broad problem that, you know, holds true for all of computer vision, not just unsupervised feature learning. I'm not aware that we have a huge advantage or disadvantage, but we're driving people to work on these problems as well.
>>As a practical matter, you know, when you learn image features, of course, there are properties like scale invariance, [INDISTINCT] learning algorithm. Would you recommend, if somebody wants to solve the [INDISTINCT], maybe putting these into the actual feature learning and maybe getting better scalability?
>>NG: Yes, it's very clear. Bryan's question is, you know, we know that we want things like scale, rotation, and translation invariance. So, should we put these things into the algorithms to get better scalability? The answer to that is subtle, I think. You know, yes, it's sort of a good idea; in fact, using convolutional nets, for example, hard-codes translation invariance and gives easy gains in scalability. That's the sort-of-yes answer. Here's the no answer, which is, what I see many people do, right?
This notion of invariances is actually very seductive. There are like five or six types of invariances that we know about and can hard-code: translation, rotation, scale, luminance, you know, maybe one or two more. Scale, did I say that? And so, what I see many people do is, it actually turns out to be possible to embark on a six-month or two-year research agenda to take these types of invariances and code them, one at a time, into your learning algorithm. And you can write papers, you can keep going, but then after you've encoded these five or six invariances, everyone hits the same brick wall, right? So, I think in the short term you can code these things up, encode these invariances [INDISTINCT], but there are all these other types of invariances, like out-of-plane rotation. So if I rotate my face this way, I'm still the same person, right? You know, more complex ones, like deformable [INDISTINCT], the fact that I can move my arm and deform [INDISTINCT] and I'm still the same person. Those types of invariances we don't know how to hand-code, so this is a trade-off. Do you want to code up what you know, and then maybe hit a brick wall and need to learn things anyway, or do you want to try to learn everything? And I think there are valid points on both approaches. But just be aware: if you are on the agenda of coding up the small set of invariances that we know how to code up, everyone hits the same brick wall. Questions? Yes?
>>What are the arguments filed to [INDISTINCT] sparse coding [INDISTINCT], so that doesn't matter. Do you think you can get better performance
[INDISTINCT]?
>>NG: Yes, so, right. So, one of the motivations for sparse coding is energetic efficiency in the brain. Just repeating the question: it's much more energetically expensive to send a spike, a pulse of electricity, than to not do so. And this was, I think, Bruno Olshausen and David Field's original motivation for sparse coding. So the question is, in the [INDISTINCT] context where we don't have this constraint, can we relax it? So, I don't know. I think feature learning is still a relatively new field, and we're all trying lots of different ideas. Sparsity seems to make our algorithms work much better, and for those of us, you know, doing vision and audio, our main motivation for doing sparse coding has been the performance of the algorithms. I don't know, maybe there are different tools that could do even better; temporal coherence, or slow feature analysis, is one of our hypotheses, many groups' hypothesis. So, I don't really know. It turns out there is some fascinating work on computer hardware where sparsity turns out to be more energetically efficient too, if we end up using those types of hardware. So, yes, I don't know; it's an open field.
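For readers who want one concrete formula behind this exchange: the sparse coding objective under discussion is reconstruction error plus an L1 sparsity penalty on the activations, and the activations can be inferred by iterative soft-thresholding (ISTA). The dimensions and penalty weight below are made up for illustration; this is a generic textbook sketch, not the exact setup from the talk.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 64, 128                          # input dimension, number of basis vectors
B = rng.normal(size=(n, k))
B /= np.linalg.norm(B, axis=0)          # overcomplete basis with unit-norm columns
x = rng.normal(size=n)                  # an input patch, flattened to a vector
lam = 0.5                               # weight on the sparsity penalty

def ista(x, B, lam, steps=300):
    """Minimize 0.5 * ||x - B a||^2 + lam * ||a||_1 over the activations a."""
    a = np.zeros(B.shape[1])
    L = np.linalg.norm(B, 2) ** 2       # Lipschitz constant of the smooth term
    for _ in range(steps):
        grad = B.T @ (B @ a - x)        # gradient of the reconstruction term
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

a = ista(x, B, lam)                     # many entries of a end up exactly zero
```

The soft-thresholding step is what drives most activations to exactly zero, which is the "sparse" in sparse coding (and, in the energetic reading above, the neurons that do not fire).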
>>Okay, one more question, and then we'll cut it off; questions [INDISTINCT].
>>NG: Yes, right. So, I… Sure, fair enough. So you are taking issue with the word "unsupervised" and the fact that we encode, you know, the local topology of the patch. So, I'm happy to concede that we encode that amount of prior knowledge. What we're doing is essentially telling the algorithm which pixels are adjacent to which other input pixels. This is a type of prior knowledge that we did encode into the algorithm. We haven't really explored algorithms without it; it turns out that algorithms are much more scalable if you actually tell them which pixels are next to which pixels, and, you know, evidence suggests that the way your brain gets images actually captures the adjacency information as well. It turns out these algorithms work just fine if you make some local reshuffling of the pixels; they're actually surprisingly robust to that. But I agree, this is a type of information that we coded into the algorithm. And with that, I mean, we should wrap up and pass it to Kai. So let me just say again, thank you very much.
