Data touches every aspect of our lives. Every time we go to the store, visit a website, make an appointment with the doctor, travel by airplane, or take a photo, we are creating, using, and leaving behind data. The modern world is measured, mapped, and recorded in ways that were difficult to imagine even just a decade ago.
The presence of all this data is having a profound impact on the way we live. Smartphones, internet-enabled devices, self-driving vehicles, satellite imagery, and countless smaller services that we use every day all rely on data. Data allows us to make discoveries about ourselves and to make connections within an increasingly connected world.
Machine learning is one of the most powerful tools we have today to help us make sense of data. It can help us uncover what data is important and provide some insight into why. In this article, we will introduce the concept of machine learning, some ways it can be used to gain insight from data, and a few of the algorithms that make it possible.
What is Artificial Intelligence, and What is Machine Learning?
Few trends have been hyped as heavily as artificial intelligence. In the popular imagination, AI conjures pictures of supercomputers that talk to us, killer robots, and machine companions. But beyond fantasy, these portrayals do not tell us what AI means today or how it might impact our lives. To that end, what is AI and how does it relate to machine learning? Why are they significant, and why do they garner such attention?
An AI system, in the sense used here, is a type of program that has the capacity to learn without being explicitly programmed. Programs that incorporate AI are exposed to different types of input, are capable of "acquiring experience", and are able to make choices when faced with similar situations in the future.
The types of problems to which artificial intelligence can be applied are very broad. They include predicting the price of stocks, estimating the marketability and performance of products, recognizing faces in a photo, and transcribing speech from a recording.
The enormous diversity of tasks that can be performed with AI can make it seem intimidating. At the same time, it is this seemingly infinite spectrum of possible uses that makes it so powerful. AI may be built into almost any type of technology, allowing that technology to learn from its experiences. As AI becomes more capable, it will allow the world around us to come alive and enable the types of futures we see in movies and books.
Using Data to Solve Problems, Investigate Relationships, and Detect Fraud
Data is a powerful resource that humans can use to solve their problems. We use data to recognize patterns in behavior, and with this abstraction, we can predict what events might follow.
For example, let's say you are investigating a scandal that took place in an organization. Your job is to identify employees who were potentially involved. A few executives might already be on trial for fraudulent actions, but there may still be hundreds (or thousands) of employees who need to be investigated. The number of potential persons of interest is too great for human investigators to explore every lead. Might AI help to identify persons of interest and prioritize follow-up?
In December 2001, Enron, a Houston-based energy company, filed for bankruptcy. The company's former CEO, Jeffrey Skilling, along with a large staff of executives, had spent years using accounting loopholes, special purpose entities, and poor financial reporting to hide billions of dollars in debt from investors. When the situation came to light, the resulting scandal destroyed the company along with its auditing firm, Arthur Andersen, and investigators found themselves in precisely the situation described above.
Can data help to provide insight on the question of "who to investigate" and might AI help to narrow down the list? The figure below shows a set of employee names, their annual salaries, and an "outcome" column indicating whether their activity appears fraudulent. Does this information give us any insight?
If we used this data to predict which employees are engaged in corporate fraud, we might infer that a last name starting with A, or an annual salary greater than $100,000, is a red flag that merits additional investigation. But are those appropriate conclusions?
Unlikely. With only the information above, a more appropriate conclusion is: the data is inconclusive. But what happens as we expand the pool of available information? What if instead of five individuals we are able to explore data from hundreds or thousands?
As the amount of data expands, we are better able to understand whether trends in name or salary might be indicative of fraud. But we hit a second limitation: the human mind is usually only able to consider a handful of variables. When considering name and salary, we might be able to extract some trends or patterns. But are these the best indicators of fraud? There are potentially dozens (maybe even hundreds) of variables that might be predictive of nefarious action. Have we selected the best?
While the human mind may not be able to consider hundreds of potential relationships, AI is not so encumbered. Computers are happiest when confronted with enormous datasets, perhaps including hundreds of thousands or millions of rows and thousands or tens of thousands of variables. We might expand the available variables to include:
- What's the employee's position in the company?
- What tasks do they do?
- What are their employee benefits?
- How often does the employee email higher-level executives of the organization?
in addition to many other factors that might occur to us.
AI systems are capable of analyzing the variables, fitting models to the outcome being explored, and not only predicting which individuals might be persons of interest, but also explaining which features are most predictive of criminal activity.
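As a rough illustration of what "which features are most predictive" can mean, the sketch below scores each variable by the absolute Pearson correlation between that variable and a 0/1 outcome, then sorts the variables from strongest to weakest. The records, the feature names (`emails_to_execs`, `salary_k`, `years_employed`), and the helper names are all invented for this example; they are not taken from the Enron data, and a real analysis would fit a proper model on far more observations.

```python
from math import sqrt

# Hypothetical, hand-made records: NOT real data.
# Each row: (emails to executives per week, salary in $1000s,
#            years employed, flagged as person of interest 0/1)
records = [
    (42, 95, 3, 1),
    (38, 210, 12, 1),
    (45, 120, 7, 1),
    (5, 88, 4, 0),
    (8, 190, 11, 0),
    (3, 130, 6, 0),
    (10, 101, 2, 0),
    (40, 150, 9, 1),
]
feature_names = ["emails_to_execs", "salary_k", "years_employed"]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_features(rows, names):
    """Return (name, |correlation with outcome|) pairs, strongest first."""
    outcome = [row[-1] for row in rows]
    scores = []
    for i, name in enumerate(names):
        column = [row[i] for row in rows]
        scores.append((name, abs(pearson(column, outcome))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_features(records, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Correlation looks at one variable at a time, so it is only a first pass; models that weigh many variables together can surface relationships that this kind of ranking misses.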
Where Does Machine Learning Fit In?
So where does machine learning fit into all of this? The various applications of AI can be divided into several different classifications, including general purpose AI and narrow AI.
- General Purpose AI attempts to teach machines to understand the world as humans do. The goal is to allow a machine to think, reason, plan, learn, and communicate as a person might.
- Narrow AI, in contrast, is designed to perform very specific tasks (such as diagnosing disease from medical images, predicting stock prices, filtering spam email messages, finding the most efficient route to a location, or determining if an employee might be a person of interest) with the highest degree of accuracy possible.
Machine Learning (or ML) falls into the latter type of AI. It uses statistical techniques to create predictive models that fit a set of data that is used to "train" them. Once a model has been trained, it is capable of providing new predictions by applying the model to new sets of data.
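The train-then-predict workflow can be sketched with one of the simplest statistical models, a least-squares line. The numbers below are made up, and the function names `fit_line` and `predict` are illustrative only; this is a minimal sketch of the idea rather than a recommended approach.

```python
def fit_line(xs, ys):
    """Train: find the slope and intercept that minimize squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a trained (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: years of experience -> salary in $1000s
train_x = [1, 2, 3, 4, 5]
train_y = [45, 50, 55, 60, 65]

model = fit_line(train_x, train_y)   # "training" step
new_prediction = predict(model, 8)   # prediction for an unseen input
print(model, new_prediction)
```

The same two-step shape (fit on known data, then predict on new data) carries over to more sophisticated models.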
How is Machine Learning Used?
Machine learning is used across a wide variety of industries and applications.
Types of Machine Learning
Algorithms make up the core of machine learning programs. When implemented in code as a set of rules to calculate, they let us frame data problems as a set of inputs that map to a target. The input is our data; the output is the solution that most accurately uses the data to predict an outcome. There are two primary approaches in machine learning: supervised learning and unsupervised learning.
- Supervised learning is the most common type of machine learning. It can be thought of as telling the computer what type of conclusions to draw from a set of data and building systems that reach those conclusions in the most efficient manner possible. Supervised learning is often described as "putting things in buckets."
- Unsupervised learning is an approach where the computer attempts to find meaningful differences among the observations in a set of data. It might be used to infer which neighborhood restaurants are popular by observing foot traffic and tidiness, or even to make assumptions about food quality. It can also group customers based on their purchasing behavior and recommend products that similar customers have purchased in the past.
Put succinctly, in "supervised learning" we tell the computer what we think is important. In unsupervised learning, the computer tells us. In the remainder of this article, we will look at some of the most common machine learning algorithms, talk about how they work, and discuss the types of problems to which they are applied.
Terms and Definitions
Before doing so, however, let's briefly introduce some of the terms used when discussing machine learning.
- Classification: assigning an observation in a dataset to a specific, discrete category or class (True or False, executive or not executive, 1 or 0). Classification problems have a small, or relatively small, number of possible outcomes.
- Example: classifying executives into persons of interest or not
- Prediction/Regression: predicting a new continuous value/observation (home prices, stocks, the value of items). In contrast to classification, predictions from regressors can fall anywhere within a large range.
- Example: predicting a sales forecast based on historical trends
- Labeled Data: Input data for which the outcome of interest is already known. Sometimes called training data because it is used to generate the model.
Supervised learning algorithms try to model the best relationships and dependencies between the input values and outcomes of interest (targets). All supervised learning algorithms require labeled input training data. From that data, the algorithm picks up patterns and produces a solution (the model) that can be used on new information (testing data).
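To make the distinction between labeled training data and testing data concrete, here is a minimal sketch of a supervised classifier: a one-nearest-neighbor rule that gives a new observation the label of the closest training example. The data points, the labels, and the function names are invented for illustration. Nearest-neighbor is used only because it is short to write, not because this article prescribes it.

```python
from math import dist

# Invented labeled training data: (features, label).
# Features: (emails to executives per week, expense reports per month)
training_data = [
    ((40.0, 9.0), "person_of_interest"),
    ((35.0, 12.0), "person_of_interest"),
    ((45.0, 10.0), "person_of_interest"),
    ((4.0, 2.0), "not_of_interest"),
    ((6.0, 1.0), "not_of_interest"),
    ((3.0, 3.0), "not_of_interest"),
]

# Held-out testing data that is never used to build the model.
testing_data = [
    ((38.0, 11.0), "person_of_interest"),
    ((5.0, 2.0), "not_of_interest"),
]


def predict_label(train, features):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(train, key=lambda example: dist(example[0], features))
    return nearest_label


def accuracy(train, test):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(predict_label(train, x) == y for x, y in test)
    return correct / len(test)


test_accuracy = accuracy(training_data, testing_data)
print(test_accuracy)
```

Scoring the model on examples it never saw during training gives a more honest picture of how it might behave on genuinely new information.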
Supervised Learning Algorithms
Unsupervised learning is used to create clusters of data based on a set of input data.
While supervised learning requires substantial human input to determine what is correct, unsupervised learning requires no labeled outcomes. The input data has no corresponding output values; instead, the algorithm attempts to find rules, patterns, and summaries for groups of data points.
Unsupervised learning uses "clustering algorithms" and "association rule learning algorithms".
- Clustering: A "cluster" is a group of closely related data points that share similar features and values.
- Association: Used to discover rules and behavior that largely apply to the data, such as viewers that watch X, tend to watch Y as well.
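The clustering idea can be sketched with k-means, one widely used clustering algorithm. The points below are invented, and the function name `k_means` is illustrative. To keep the sketch deterministic it starts from the first `k` points, which is a simplification; library implementations typically use randomized or smarter initialization.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Plain k-means: returns (centroids, assignments)."""
    # Deterministic start for this sketch: the first k points are the centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments


# Invented observations with no labels attached, e.g. (visits, spend).
points = [
    (1.0, 1.0), (1.5, 2.0), (8.0, 8.0),
    (9.0, 9.5), (1.0, 0.5), (8.5, 9.0),
]

centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

No outcome column is supplied anywhere; the grouping comes entirely from how close the points are to one another.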
Unsupervised Learning Algorithms
Tooling & Libraries
Python has become the dominant platform for machine learning (and data science in general). The links below point to tools, runtimes, and libraries that can be used to get up and running with machine learning.
- Jupyter Notebook - simple and powerful tool for interactive computing and data analysis, widely used by data scientists.
- Anaconda - powerful all-in-one data science package that includes a Python / R distribution, Jupyter Notebook, a package manager, and an environment manager. For more information, read our article about the Anaconda Python Distribution.
- Python - simple, high-level, general-purpose programming language.
- R - a programming language used for statistical programming, data analysis, and machine learning.
- Scikit-learn - widely used machine learning library for building supervised and unsupervised machine learning models.
- TensorFlow - machine learning and deep learning library that is more heavyweight than scikit-learn, well suited to building and deploying neural networks.
- PyTorch - deep learning library that supports GPU-accelerated computations.
- Pandas - widely used library for loading, preparing, and organizing your datasets.
- NumPy - a core dependency of scikit-learn and pandas that provides multi-dimensional arrays.
- SciPy - library for scientific computing, a core dependency of scikit-learn.
- Matplotlib - powerful graphing library to visualize data, very useful for monitoring a model and its datasets.
Machine Learning at Oak-Tree Technologies
Oak-Tree has been developing data-driven AI applications for over a decade. The links below include more information on how AI can be applied to hard problems (such as those found in healthcare), where you can learn more, and how to get started with machine learning in Python.