An Introductory Guide to Machine Learning
It can seem from the mainstream media that every company is working on a machine learning project that promises to increase their revenue, eliminate competition, and provide revolutionary business insights.
With big promises like that, you might wonder what machine learning is and how it can help your business.
This guide will help to answer some of your questions.
1. The Definition of ML
The definition of machine learning is not set in stone. There are many different variations out there, and their nuances depend on whom you ask and what field they work in.
1.1 Machine learning vs. AI vs. deep learning
Traditional media often portrays deep learning as a synonym for AI. That is incorrect.
Machine learning is sometimes identified as one of the main techniques to achieve true AI, but it is just a subfield of AI, not its counterpart.
Deep learning is further a subfield of machine learning, which makes both areas subfields of AI.
Pictured as a Venn diagram, AI contains machine learning, which in turn contains deep learning. What such a diagram leaves out is computer science, which encompasses all three fields.
Data science is another term that you might have heard in relation to machine learning.
Data science is a bit more challenging to place, as it is an umbrella term that includes machine learning, as well as other disciplines of computer science that have little to do with AI.
1.2 Machine learning in one elegant sentence
When searching for a complete definition of machine learning, one has to turn to the leading authorities in the field.
In its course on AI, the University of Helsinki defines machine learning as “systems that improve their performance in a given task with more and more experience or data”.
Andrew Ng, a Stanford University professor and a Coursera co-founder, has another definition. In his widely popular online course on machine learning, Andrew defines machine learning as “the science of getting computers to act without being explicitly programmed”.
That is, as you will learn in this guide, not entirely accurate, as a big part of the machine learning process requires much effort from data scientists.
Another prominent figure in the machine learning field is Tom Mitchell.
His definition of machine learning dates back to 1997. He framed machine learning as a well-posed learning problem:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”.
What can you take from these definitions? First of all, machine learning revolves around computers progressively getting better at problem-solving.
Secondly, the problem-solving process should happen as autonomously as possible.
All this is achieved through the use of various algorithms that are designed to enable learning and improvement over time when exposed to new data. It is the algorithms that enable a computer to make data-driven decisions.
Let us now peek behind the curtain and see how machine learning processes work.
2. How Machine Learning Actually Works
A lot of information online about machine learning talks about what a machine learning algorithm can do, but not a lot about how it can do it.
2.1 What the machine learning process looks like
The flowchart below is a good representation of all the steps in a machine learning process. It might look complicated at first, but fear not.
It is sufficient to know only a few basic steps to get an entry-level understanding of machine learning.
Step 0: Decide on a specific business problem to solve
This may seem simple, even trivial, yet surprisingly many companies fail in their efforts to get value out of machine learning simply because they don’t ask the right questions.
The kind of business problem you want to solve with the help of machine learning determines everything in the steps that follow, from the type of algorithm you use and the data you gather to the metrics you’ll use to evaluate the performance of your model.
Step 1: Gather and prepare the data
Once you know what problem you need to solve, you can start collecting the data you need. For example, if your company wants to detect clients that are at risk of churn, you might need to collect data about their purchase activity, customer service interactions, basket size, etc.
Each of these factors is regarded as one feature of the data. The more complicated the problem at hand is, the more features a data set will need to have.
The data preparation process consists of actions like removing duplicates, formatting the data sets, randomizing the order of the data entries, and checking for data imbalances.
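To make these preparation steps more concrete, here is a minimal Python sketch that removes duplicates, randomizes the order of the entries, and checks for imbalance between labels. The churn records, the column layout, and the function names `prepare` and `label_shares` are made up for illustration; a real project would usually rely on a dedicated data library.

```python
import random
from collections import Counter

# Hypothetical churn records:
# (customer_id, purchases_last_90_days, support_tickets, churned)
raw_rows = [
    (1, 12, 0, "no"),
    (2, 1, 4, "yes"),
    (2, 1, 4, "yes"),  # exact duplicate of the previous row
    (3, 7, 1, "no"),
    (4, 9, 0, "no"),
    (5, 0, 3, "yes"),
    (6, 15, 1, "no"),
]


def prepare(rows, seed=0):
    """Remove exact duplicates, then shuffle the rows reproducibly."""
    unique_rows = list(dict.fromkeys(rows))  # keeps the first occurrence of each row
    random.Random(seed).shuffle(unique_rows)
    return unique_rows


def label_shares(rows):
    """Return the share of each label (last column) to reveal class imbalance."""
    counts = Counter(row[-1] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


clean_rows = prepare(raw_rows)
shares = label_shares(clean_rows)
print(len(clean_rows), shares)
```

In this toy data set, one third of the remaining customers churned. A much more lopsided split would be a signal to look into the imbalance before training.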
Step 2: Choosing a machine learning algorithm
There are many different algorithms that each fit different problems. Some are better suited to work with images, others with sounds or text files. The size and quality of your data will also play a deciding role. For example, a data set with a lot of features will require longer training time, because the algorithm needs more time to make sense of the data.
What algorithm you should choose also heavily depends on what you want to do with the results the algorithm will produce. How accurate does your result need to be?
How much time do you have to train the model? Because training time and accuracy are closely related (one tends to go down with the other), the answers to these questions will push you in the direction of different algorithms.
A good data science project will also use several types of algorithms to build robust models that help companies make decisions across a variety of business challenges.
Step 3: Training the algorithm
In this step, an algorithm becomes your model. When you introduce the chosen algorithm to your training data set, it learns the patterns and correlations in the data, thereby establishing rules for future use.
As a rule of thumb, you should divide your data into a training set (about 70-80% of your data) and a validation set (the remaining 20-30%). The training time will depend on the complexity of the algorithm and the amount of data you have.
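The split described above can be sketched in a few lines of Python. The function name `train_validation_split`, the fixed seed, and the 100 placeholder entries are assumptions made for this example only.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (training, validation) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))  # stand-in for 100 data entries
train_set, validation_set = train_validation_split(data, train_fraction=0.8)
print(len(train_set), len(validation_set))  # 80 20
```

Shuffling before the cut matters: if the data is sorted in some way, taking the first 80% as is could leave the validation set with entries that look nothing like the training set.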
Step 4: Evaluation of the model
The validation data set you set aside in the previous step will come in handy now. Because the model has never seen this data before, the performance of the model on this data set will be an indication of how well the model will work in real life.
What metrics should you choose to evaluate the performance of your model on the validation data? The answer is: it depends. Dean Abbott put it best in his book “Applied Predictive Analytics”:
“If the purpose of the model is to provide highly accurate predictions or decisions to be used by the business, measures of accuracy will be used. If interpretation of the business is what is of most interest, accuracy measures will not be used; instead, subjective measures of what provides maximum insight may be most desirable.”
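As an illustration of the simplest accuracy measure, the sketch below compares a model's predictions on the validation set with the known answers. The labels and predictions are invented for the example, and `accuracy` is a hypothetical helper rather than a function from any particular library.

```python
def accuracy(predictions, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predictions) != len(actual):
        raise ValueError("predictions and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, a in zip(predictions, actual) if p == a)
    return correct / len(actual)


# Hypothetical validation labels and model outputs for a churn model
actual_labels = ["yes", "no", "no", "yes", "no"]
predicted_labels = ["yes", "no", "yes", "yes", "no"]

validation_accuracy = accuracy(predicted_labels, actual_labels)
print(validation_accuracy)  # 0.8
```

Plain accuracy can be misleading when one label is much more common than the other, which is one reason why the choice of metric depends on the business problem.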
2.2 Who performs the machine learning process
All of the steps mentioned above are done by a group of data scientists.
Ideally, each step will be executed by different data scientists who are specialized in their respective areas.
If you have trouble wrapping your head around why you might need a whole team of data scientists instead of just one, think of a machine learning project as a menu at a restaurant.
A menu would typically include starters, a main, a dessert, and maybe some snacks. All of these dishes require different skill sets to prepare.
Baking is not the same as making a main, savoury dish. And even there, there are huge differences in how you prepare meat, seafood or vegetables.
You can be good at all of these dishes, but that would make you an outlier, not the norm.
In the same fashion, each step of the machine learning process calls for different skill sets.
3. Types of machine learning problems
Often, when you read articles about machine learning, you will stumble upon a question or a phrase that refers to “machine learning problems”.
3.1 Supervised learning
In supervised learning, you begin with a labeled data set to show the algorithms what the correct output should look like.
The term supervised itself refers to the fact that data scientists need to tell the algorithm what they want it to predict. The job of the algorithm is to learn the patterns in the data and, when introduced to a data set different from the training set, make correct predictions on its own.
For that reason, supervised learning is sometimes referred to as predictive modeling. Supervised learning can be further divided into two groups: classification and regression.
Both can be applied to the same questions, but they will produce vastly different results.
For example, let’s take the problem of determining tomorrow’s weather. A classification algorithm will produce the answers “Hot” or “Cold”, while a regression algorithm will predict a value for the temperature that day.
So, what does this example tell us about these two concepts?
3.1.1 Classification problem
Classification is about predicting a discrete value, such as “yes/no”, “spam/not spam”, or “dog/muffin”. In other words, you are asking a model to sort data entries into two (or more) groups.
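A minimal sketch of a classifier for the “Hot”/“Cold” weather example is shown below. It uses a nearest-centroid rule, which is only one of many possible classification techniques and is chosen here for brevity. The features (temperature and hours of sunshine), the data, and the function names are all assumptions.

```python
from collections import defaultdict


def fit_centroids(samples, labels):
    """Average the feature vectors of each class (nearest-centroid training)."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical labeled data: (today's temperature in °C, hours of sunshine)
samples = [(30, 10), (28, 9), (33, 11), (5, 2), (8, 3), (2, 1)]
labels = ["Hot", "Hot", "Hot", "Cold", "Cold", "Cold"]

centroids = fit_centroids(samples, labels)
print(classify(centroids, (27, 8)))  # Hot
print(classify(centroids, (6, 2)))   # Cold
```

Whatever the input, the output is always one of the labels seen during training, which is what makes this a classification problem.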
3.1.2 Regression problem
Regression, on the other hand, is about predicting a continuous value. In other words, you aim to predict a number that can range anywhere between negative and positive infinity. In doing so, the regression algorithm also estimates the relationship between two or more variables.
To solve a regression problem, you need a set of data with predictor (explanatory) variables and a continuous response variable (outcome or target).
Once the underlying relationship (or lack thereof) is uncovered, it can be applied to new data sets in the future to make real-life predictions. Unlike the classification problem, there are many different regression types.
In its purest form, regression shows the relationship between one independent variable (X) and a dependent variable (Y), as in the formula below:
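In the usual textbook notation for simple linear regression (the symbols are assumed here), the intercept is written as \beta_0, the slope as \beta_1, and the error term as \varepsilon:

```latex
Y = \beta_0 + \beta_1 X + \varepsilon
```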
All regression models start off with this formula and get progressively more elaborate as we increase the number of independent variables, increase the complexity of the data distribution, and look at different types of dependent variables.
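For the simplest case of one independent variable, the relationship can be estimated with ordinary least squares. The sketch below is a bare-bones illustration with made-up temperature data; `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: today's temperature (X) and tomorrow's temperature (Y), in °C
today = [10, 12, 15, 18, 20]
tomorrow = [11, 13, 16, 19, 21]  # constructed so that Y = 1 + 1 * X exactly

intercept, slope = fit_line(today, tomorrow)
prediction = intercept + slope * 25  # predicted value for a new X of 25
print(intercept, slope, prediction)
```

Unlike the classifier, the fitted line returns a number on a continuous scale, so it can produce values that never appeared in the training data.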
Here is a short list of types of regression you will encounter as you learn more about machine learning:
- Linear regression
- Logistic regression
- Polynomial Regression
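- Gradient Descent (strictly speaking an optimization method used to fit regression models, rather than a regression type)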
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
3.3 Unsupervised learning
In unsupervised learning, you start with data sets without labels or description and ask the algorithm to find the structures in the given data.
Unsupervised learning is mainly used to find patterns, rules, and groups, which show meaningful insights and describe the data better to whoever needs to use it. In other words, you use unsupervised learning when you don’t know what the data can tell you.
Often, the algorithm might be able to teach you new things after it learns patterns in data. Unsupervised learning can also be used to tackle more complex data sets that can’t simply be clustered into clear groups or patterns.
This is often referred to as “the cocktail party” problem, and in such cases, unsupervised algorithms are used to find structure in a chaotic environment. The name of this problem stems from the famous example that was used by Andrew Ng in his Stanford course on machine learning.
In his example, Andrew plays a recording of a man speaking while music is playing in the background. Two microphones were used to make the recording:
- Mic 1: http://cnl.salk.edu/~tewon/Blind/Demos/rsm2_mA.wav
- Mic 2: http://cnl.salk.edu/~tewon/Blind/Demos/rsm2_mB.wav
Applying an unsupervised algorithm to these recordings produced two new recordings in which the voice and the music were clearly separated.
- Voice only: http://cnl.salk.edu/~tewon/Blind/Demos/ssm1.wav
- Music only: http://cnl.salk.edu/~tewon/Blind/Demos/ssm2.wav
As ordinary as it may sound to you, the fact that a simple algorithm was able to separate the different audio sources without instruction is incredible.
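Separating audio sources is a fairly involved task, so a simpler way to see an algorithm find structure without labels is clustering. The sketch below runs a plain k-means procedure on one-dimensional data. The basket sizes, the starting centers, and the function name `k_means_1d` are assumptions made for this example, and k-means is not the technique used in the audio demo.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical basket sizes with no labels attached
basket_sizes = [2, 3, 4, 20, 22, 24]
centers, clusters = k_means_1d(basket_sizes, centers=[0, 10])
print(centers)   # [3.0, 22.0]
print(clusters)  # [[2, 3, 4], [20, 22, 24]]
```

Nobody told the algorithm that there are "small basket" and "large basket" customers; the two groups emerge from the data alone.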
3.4 Reinforcement learning
Reinforcement learning can best be described by the saying “learning by doing”. This subfield of machine learning teaches a computer about its environment by allowing it to perform actions and see the results.
The idea behind reinforcement learning is that a computer can learn from the environment by interacting with it and receiving rewards for performed actions.
This closely resembles how you learned as a child, modifying your future actions based on the feedback from your current ones.
The goal of reinforcement learning is to maximize the expected cumulative reward. This is based on the Reward Hypothesis, which states that all goals can be described as the maximization of the expected cumulative reward. In practice the rewards are usually discounted: long-term rewards get less weight than short-term rewards, because the probability of actually receiving them is lower.
There are two types of reinforcement learning tasks: episodic and continuous. An episodic task has a start and an end, for example, a game of Super Mario Bros. The computer receives feedback at the end of each episodic task, and adjusts its behavior for the next task accordingly.
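The weighting of short-term versus long-term rewards is commonly expressed with a discount factor, often called gamma. The sketch below computes such a discounted cumulative reward for a fixed sequence of rewards. The reward values and the choice of gamma are arbitrary, and `discounted_return` is a hypothetical helper.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future total
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1, 1, 1]  # the same reward at three consecutive steps
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma set to 1, every reward counts equally. The closer gamma gets to 0, the more the computer focuses on immediate rewards.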
A continuous task runs until a human terminates it, and the feedback is constantly evaluated by the computer. A good example of a continuous task is a stock market trading algorithm.
3.5 Deep learning
It’s impossible not to mention deep learning when talking about machine learning. It’s a subfield of machine learning that is making some of the most interesting breakthroughs.
However, it is difficult to talk about deep learning in great detail in this guide, as it is a subfield that deserves its own article.
What is deep learning? What sets deep learning algorithms apart from other machine learning algorithms is their capabilities.
Basic machine learning models can learn progressively, but they still need training and well-labeled data. If a machine learning model makes an inaccurate prediction, a data scientist needs to examine the problem and make adjustments accordingly.
A deep learning model, on the other hand, can to a large degree train itself and determine on its own whether its results are accurate or not.
Deep learning is the closest you can come to the futuristic promises of AI, and some of them are already a reality.
4. Types of The Most Popular ML Algorithms
Now that you are familiar with the different types of machine learning, we can talk about the stars of the show - the algorithms.
Naive Bayes Classification
Boosting and AdaBoost
5. Challenges And Limitations
What the most pressing challenges of machine learning are depends on whom you ask. A data engineer might have a different answer than a data analyst because they approach machine learning problems from different angles.
5.1 The need for data
In many ways, machine learning algorithms are worse than babies. They need to be taught everything from scratch, which in machine learning terms translates to the need for lots and lots of data.
The problem is the cost of collecting and processing all that data. Many of the machine learning algorithms used in practice are supervised, which means that the data needs to be properly labeled and converted to a single format.
The latter is seldom the case in an average company. Most companies have their data in PDFs, Excel sheets, online databases and even on paper.
Formatting all of these documents into a uniform format takes time.
Labeling data can be costly too. Annotating and labeling data often needs to be done by hand, and the more intricate the problems you work with, the higher the cost will be.
For example, annotating pictures of different animals can be done by practically anyone. In other words, it would not be a high-paying job.
Reviewing and labeling MRI scans of tumors, on the other hand, will require a trained eye of a health practitioner who is familiar with tumors and how they show up on the scans.
Such a professional will require higher pay.
5.2 Black box AI
The black box problem applies largely to deep learning. Nonetheless, this challenge deserves a spot on this list as it touches upon one of the most fundamental things in human society: trust.
As of today, no one really knows how the most advanced deep learning algorithms arrive at the solutions to the problems they are asked to solve. Not even their creators.
Even the data scientists who build these models may struggle to identify the reason for any single action.
This may become a big problem in many industries as the companies who use deep learning algorithms to make decisions won’t be able to explain the reasoning behind them.
The inner workings of deep learning algorithms must be made more understandable to their users and creators for several reasons.
Firstly, to keep the tech accountable.
Secondly, to predict when failures might occur.
Lastly, without the why, people won’t be able to trust the technology to make decisions that are in their best interest.
5.3 The hype
The field of AI, and machine learning in particular, is in a state of overhype at the moment.
While it makes it easier for new and experimental projects to get funding, the hype has also introduced some problems.
Firstly, we have the expectations. People’s expectations of what machine learning can do far exceed what is possible today. Gary Marcus, a professor of psychology and neural science, and a former director of Uber AI Labs, addressed the negative effects of machine learning hype in his widely discussed paper. He says:
“One of the biggest risks in the current overhyping of AI is another AI winter. [..] When high-profile figures […] promise a degree of imminent automation that is out of step with reality, there is fresh risk for seriously dashed expectations. Executives investing massively in AI may turn out to be disappointed. Already, some major projects have been largely abandoned, like Facebook’s M project.”
Another area where the hype is having a negative impact is the access and the cost of acquiring talent.
With the role of data scientist voted “the sexiest job of the 21st century”, many people without the necessary skill sets have transitioned into the field.
This has made it harder to find people with the technical ability to understand and implement machine learning algorithms properly.
The cost and difficulty of finding the right people for the job put a lot of machine learning projects on hold, thereby creating a backlog of exciting machine learning discoveries.
Peter Kudlacek is the CEO of Apro Software. He has been in the software development business for the last 15 years and has successfully built several IT companies.