Hello World - Machine Learning Recipes #1
Six lines of code is all it takes to write your first Machine Learning program. My name's Josh Gordon, and today I'll walk you through writing Hello World for Machine learning. In the first few episodes of the series, we'll teach you how to get started with Machine Learning from scratch. To do that, we'll work with two open source libraries, scikit-learn and TensorFlow.
We'll see scikit in action in a minute. But first, let's talk quickly about what Machine Learning is and why it's important. You can think of Machine Learning as a subfield of artificial intelligence. Early AI programs typically excelled at just one thing. For example, Deep Blue could play chess at a championship level, but that's all it could do.
Today we want to write one program that can solve many problems without needing to be rewritten. AlphaGo is a great example of that. As we speak, it's competing in the World Go Championship. But similar software can also learn to play Atari games. Machine Learning is what makes that possible. It's the study of algorithms that learn from examples and experience instead of relying on hard-coded rules. So that's the state-of-the-art. But here's a much simpler example we'll start coding up today. I'll give you a problem that sounds easy but is impossible to solve without Machine Learning.
Can you write code to tell the difference between an apple and an orange?
Imagine I asked you to write a program that takes an image file as input, does some analysis, and outputs the types of fruit. How can you solve this? You'd have to start by writing lots of manual rules. For example, you could write code to count how many orange pixels there are and compare that to the number of green ones. The ratio should give you a hint about the type of fruit. That works fine for simple images like these. But as you dive deeper into the problem, you'll find the real world is messy, and the rules you write start to break. How would you write code to handle black-and-white photos or images with no apples or oranges in them at all?
In fact, for just about any rule you write, I can find an image where it won't work. You'd need to write tons of rules, and that's just to tell the difference between apples and oranges. If I gave you a new problem, you need to start all over again. Clearly, we need something better. To solve this, we need an algorithm that can figure out the rules for us, so we don't have to write them by hand. And for that, we're going to train a classifier.
For now you can think of a classifier as a function. It takes some data as input and assigns a label to it as output. For example, I could have a picture and want to classify it as an apple or an orange. Or I have an email, and I want to classify it as spam or not spam. The technique to write the classifier automatically is called supervised learning. It begins with examples of the problem you want to solve.
To code this up, we'll work with scikit-learn. Here, I'll download and install the library. There are a couple different ways to do that. But for me, the easiest has been to use Anaconda. This makes it easy to get all the dependencies set up and works well cross-platform. With the magic of video, I'll fast forward through downloading and installing it.
Once it's installed, you can test that everything is working properly by starting a Python script and importing SK learn. Assuming that worked, that's line one of our program down, five to go. To use supervised learning, we'll follow a recipe with a few standard steps. Step one is to collect training data. These are examples of the problem we want to solve.
For our problem, we're going to write a function to classify a piece of fruit. For starters, it will take a description of the fruit as input and predict whether it's an apple or an orange as output, based on features like its weight and texture. To collect our training data, imagine we head out to an orchard. We'll look at different apples and oranges and write down measurements that describe them in a table.
In Machine Learning these measurements are called features. To keep things simple, here we've used just two -- how much each fruit weighs in grams and its texture, which can be bumpy or smooth. A good feature makes it easy to discriminate between different types of fruit. Each row in our training data is an example. It describes one piece of fruit. The last column is called the label.
It identifies what type of fruit is in each row, and there are just two possibilities -- apples and oranges. The whole table is our training data. Think of these as all the examples we want the classifier to learn from. The more training data you have, the better a classifier you can create. Now let's write down our training data in code. We'll use two variables-- features and labels. Features contains the first two columns, and labels contains the last.
You can think of features as the input to the classifier and labels as the output we want. I'm going to change the variable types of all features to ints instead of strings, so I'll use 0 for bumpy and 1 for smooth. I'll do the same for our labels, so I'll use 0 for apple and 1 for orange. These are lines two and three in our program. Step two in our recipes to use these examples to train a classifier.
The type of classifier we'll start with is called a decision tree. We'll dive into the details of how these work in a future episode. But for now, it's OK to think of a classifier as a box of rules. That's because there are many different types of classifier, but the input and output type is always the same. I'm going to import the tree. Then on line four of our script, we'll create the classifier. At this point, it's just an empty box of rules.
It doesn't know anything about apples and oranges yet. To train it, we'll need a learning algorithm. If a classifier is a box of rules, then you can think of the learning algorithm as the procedure that creates them. It does that by finding patterns in your training data. For example, it might notice oranges tend to weigh more, so it'll create a rule saying that the heavier fruit is, the more likely it is to be an orange.
In scikit, the training algorithm is included in the classifier object, and it's called Fit. You can think of Fit as being a synonym for "find patterns in data." We'll get into the details of how this happens under the hood in a future episode. At this point, we have a trained classifier. So let's take it for a spin and use it to classify a new fruit. The input to the classifier is the features for a new example. Let's say the fruit we want to classify is 150 grams and bumpy.
The output will be 0 if it's an apple or 1 if it's an orange. Before we hit Enter and see what the classifier predicts, let's think for a sec. If you had to guess, what would you say the output should be? To figure that out, compare this fruit to our training data. It looks like it's similar to an orange because it's heavy and bumpy. That's what I'd guess anyway, and if we hit Enter, it's what our classifier predicts as well.
If everything worked for you, then that's it for your first Machine Learning program. You can create a new classifier for a new problem just by changing the training data. That makes this approach far more reusable than writing new rules for each problem. Now, you might be wondering why we described our fruit using a table of features instead of using pictures of the fruit as training data.
Well, you can use pictures, and we'll get to that in a future episode. But, as you'll see later on, the way we did it here is more general. The neat thing is that programming with Machine Learning isn't hard.
But to get it right, you need to understand a few important concepts. I'll start walking you through those in the next few episodes. Thanks very much for watching, and I'll see you then.