So in learning from text, one of the fundamental problems we have is that the length of each document we learn from, each email or each book title, is non-standardized. So you can’t just use each individual word as an input feature, because then long emails would require a different input space than short emails. Instead, one of the coolest ideas in machine learning from text, which underlies all of these approaches, is called bag of words. The basic idea is to take any text and count the frequency of its words, which you’ll do for me in a second.

The trick is to impose a dictionary: the set of all the words you care about. Obviously, it includes words like nice, very, day. It might also include words that don’t occur on the left side, like he, she, love. Notice that there’s one word on the left side that doesn’t occur on the right side. Then you can map each phrase, each email, each title, each text, into a frequency count defined over these six features over here.

So let’s do this. I give you six boxes, one for each of the six words in our dictionary. For the very first phrase, “nice day”, can you fill in the frequency count of those words in this text and give me that vector? In particular, for each word over here, you tell me how often it occurs in “nice day”.
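The counting step described above can be sketched in a few lines of Python. This is not the lecture's code; the six-word dictionary is an assumption based on the words mentioned in the transcript:

```python
from collections import Counter

# Hypothetical six-word dictionary, as in the lecture's example
# (the exact word list on the slide may differ).
dictionary = ["nice", "very", "day", "he", "she", "love"]

def bag_of_words(text, dictionary):
    """Map a text to a frequency vector over a fixed dictionary.

    Words outside the dictionary are simply ignored, so texts of
    any length map to a vector of the same fixed size.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

print(bag_of_words("nice day", dictionary))         # [1, 0, 1, 0, 0, 0]
print(bag_of_words("a very nice day", dictionary))  # [1, 1, 1, 0, 0, 0]
```

Because every text maps into the same fixed-size vector, long and short emails end up in the same input space, which is the whole point of the representation.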

3 thoughts on “Bag of Words – Intro to Machine Learning”

  1. Thanks for the video! Could you tell me what's the difference between the bag-of-words model and the Vector Space Model (VSM)?

  2. Can the bag of words dictionary add new words used in the inputs? I'm trying to encode a twitter dataset, which has many hashtags that wouldn't show up in a dictionary search – but many of these tweets use the same hashtags so I'd like to have these counted by the bag of words.
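One common way to handle tokens that a fixed dictionary would miss, such as hashtags, is to build the dictionary from the dataset itself rather than from a preset word list. A minimal sketch (not from the video; the tweet strings are made up for illustration):

```python
def build_dictionary(corpus):
    """Collect every distinct whitespace-separated token in the corpus."""
    vocab = set()
    for text in corpus:
        vocab.update(text.lower().split())
    return sorted(vocab)

tweets = ["great game #worldcup", "watching #worldcup tonight"]
dictionary = build_dictionary(tweets)
# "#worldcup" is now a feature, so repeated hashtags get counted
# like any other word when the frequency vectors are built.
```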
