Regression techniques are essential for uncovering relationships within data and building predictive models for a wide range of enterprise use cases, from sales forecasts to risk analysis. Here's a deep dive into this powerful machine learning technique.
What is regression in machine learning?
Regression in machine learning is a technique used to capture the relationships between independent and dependent variables, with the main purpose of predicting an outcome. An algorithm is trained on a data set to learn the patterns that relate the input variables to the outcome. With those patterns identified, the model can then make accurate predictions for new data points or input values.
There are different types of regression. Two of the most common are linear regression and logistic regression. In linear regression, the goal is to fit a line that best represents the data points. Logistic regression instead determines whether each data point falls above or below a dividing line, which is useful for sorting observations into distinct buckets such as fraud/not-fraud, spam/not-spam or cat/not-cat.
Regression is a fundamental concept in statistics. Machine learning kicks things up a notch by using algorithms to distill these fundamental relationships through an automated process, said Harshad Khadilkar, senior scientist at TCS Research and visiting associate professor at IIT Bombay.
"Regression is what scientists and enterprises use when answering quantitative questions, specifically of the type 'how many,' 'how much,' 'when will' and so on. In machine learning, it discovers any measurement that is not currently available in the data," Khadilkar explained.
Two common techniques used in regression in machine learning are interpolation and extrapolation. In interpolation, the goal is to estimate values within the available data points. Extrapolation aims to predict values beyond the bounds of existing data, based on the existing regression relationships.
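Both ideas can be illustrated with a few lines of code. The sketch below fits a line by ordinary least squares to made-up monthly sales figures, then predicts one value inside the observed range (interpolation) and one beyond it (extrapolation); the data and numbers are hypothetical.

```python
# Toy data: monthly sales that grow linearly with the month.
xs = [1, 2, 3, 4, 5]                  # observed months
ys = [10.0, 12.0, 14.0, 16.0, 18.0]   # observed sales

# Fit y = a + b*x by ordinary least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def predict(x):
    return a + b * x

print(predict(3.5))  # interpolation: between observed months -> 15.0
print(predict(8))    # extrapolation: beyond the data -> 24.0
```

Note that extrapolated values rest on the assumption that the fitted relationship continues to hold outside the observed range, which is why extrapolation is generally riskier than interpolation.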
Why is regression in machine learning important?
Regression is an essential concept not only for machine learning experts, but also for all business leaders, as it is a foundational technique in predictive analytics, said Nick Kramer, vice president of applied solutions at global consulting firm SSA & Company. Regression is commonly used for many types of forecasting; by revealing the nature of the relationship between variables, regression techniques give businesses insight into key issues, such as customer churn, price elasticity and more.
David Stewart, head of data science at Legal & General, a global asset manager, noted that regression models are used to make predictions based on information we already know, making them widely relevant across different industries. For example, linear regression, which forecasts a numerical outcome, could be used to gauge someone's height based on factors such as age and sex. In contrast, logistic regression could help predict a person's likelihood of buying a new product by using their past product purchases as indicators.
How linear and logistic regression work
Linear regression has a fixed or constant sensitivity to the variables it depends on -- whether that's forecasting stock prices, tomorrow's weather or retail demand. For example, a given change in one variable will always produce the same proportional change in the output, Khadilkar said. Linear regression underpins many industry-standard applications, such as time series demand forecasting.
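That fixed sensitivity can be seen directly in a linear model's coefficients. The sketch below uses a hypothetical demand model with made-up coefficients:

```python
# Hypothetical linear demand model: demand = 50 + 3*discount + 2*ad_spend
def demand(discount, ad_spend):
    return 50 + 3 * discount + 2 * ad_spend

# Increasing ad_spend by 1 always shifts the output by exactly 2,
# regardless of the starting point -- the hallmark of a linear model.
print(demand(0, 1) - demand(0, 0))    # 2
print(demand(10, 6) - demand(10, 5))  # 2
```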
Logistic regression, by contrast, focuses on measuring the probability of an event on a scale of 0 to 1 or 0% to 100%. The core idea in this approach is to create an S-shaped curve that shows the probability of an event occurring, with the event -- such as a system failure or a security breach -- being highly improbable on one side of the curve and near certain on the other.
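The S-shaped curve described here is the logistic (sigmoid) function, which maps any score to a probability between 0 and 1. A minimal sketch:

```python
import math

# The logistic (sigmoid) function maps any real-valued score
# to a probability in the open interval (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Far left of the curve: event highly improbable.
# Far right: event near certain.
print(round(sigmoid(-6), 3))  # 0.002
print(round(sigmoid(0), 3))   # 0.5 -- the midpoint of the curve
print(round(sigmoid(6), 3))   # 0.998
```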
Regression and classification
As noted, linear regression techniques focus on fitting a line to the data points, making them valuable for predictive analytics.
In contrast, logistic regression aims to determine the probability that a new data point falls above or below the line -- that is, that it belongs to a particular class. Logistic regression techniques are useful in classification tasks such as those mentioned above: determining whether a transaction is fraudulent, an email is spam or an image is a cat.
The main difference between these approaches lies in their objectives. Classification is particularly useful in supervised machine learning processes for categorizing data points into different classes, which then can be used to train other algorithms. Linear regression is more applicable for problems such as identifying outliers from a common baseline, as seen in anomaly detection, or for predicting trends.
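The classification side of this comparison can be sketched in a few lines. This is an illustrative example with made-up, pre-fitted weights; in practice the weights come from training on labeled data.

```python
import math

# Toy spam classifier built on logistic regression.
# Feature: number of suspicious keywords in an email.
w, b = 1.5, -3.0  # hypothetical learned weight and bias

def p_spam(keyword_count):
    # Probability the email is spam, via the logistic function.
    return 1.0 / (1.0 + math.exp(-(w * keyword_count + b)))

def classify(keyword_count):
    # Threshold the probability at 0.5 to assign a class.
    return "spam" if p_spam(keyword_count) >= 0.5 else "not spam"

print(classify(0))  # low score  -> "not spam"
print(classify(5))  # high score -> "spam"
```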
Artificial neural networks in regression
The use of artificial neural networks is one of the most important and newest approaches in regression, Khadilkar said. These approaches use deep learning techniques to create some of the most sophisticated regression models available.
"It allows us to approximate quantities with far more complex interrelationships than ever before," he explained. "These days, neural networks are taking over pretty much all forms of regression applications."
Of the approaches discussed above, linear regression is the easiest to apply and understand, Khadilkar said, but it is sometimes not a great model of the underlying reality. Nonlinear regression -- which includes logistic regression and neural networks -- provides more flexibility in modeling, but sometimes at the cost of lower explainability.
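To illustrate that extra flexibility, here is a toy one-hidden-layer network with hand-picked weights that computes y = |x| -- a V-shaped relationship no single straight line can represent:

```python
# A neural network composes linear maps with nonlinear activations,
# letting it model relationships a single straight line cannot.
def relu(z):
    # Rectified linear unit: the standard hidden-layer nonlinearity.
    return max(0.0, z)

def tiny_net(x):
    h1 = relu(x)    # hidden unit 1: active for positive x
    h2 = relu(-x)   # hidden unit 2: active for negative x
    return 1.0 * h1 + 1.0 * h2  # output layer sums the two pieces

print(tiny_net(-2.0))  # 2.0
print(tiny_net(3.0))   # 3.0
```

Real networks learn such weights from data rather than having them set by hand, and they use many more units, but the principle is the same: stacking simple nonlinear pieces yields flexible, hard-to-interpret models.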
Types of regression
Regression models will obediently produce an answer, but can hide inaccuracies or oversimplifications, Kramer agreed. And a wrong prediction is often worse than no prediction. It's important to understand that one approach might work better than others, depending on the problem.
"I've been known to use the tip of the blade in my Swiss Army knife and make it work when the screwdriver would be more effective. Similarly, we often see analysts apply the type of regression they know, even when it's not the best solution," Kramer said.
Here are five types of regression and what they do best.
- Linear regression models assume a linear relationship between a target and predictor variables. The model aims to fit a straight line representing the data points. Linear regression is useful when there is a linear relationship between the variables, such as predicting sales based on advertising expenditure or estimating the impact of price changes on demand.
- Logistic regression is used when the target variable is binary or has two classes. It models the probability of an event occurring -- for example, yes/no or success/failure -- based on predictor variables. Logistic regression is commonly used in business contexts for binary classification tasks such as customer churn prediction or transaction fraud detection.
- Polynomial regression extends linear regression by adding polynomial terms, such as quadratic and cubic ones, to transform the predictor variables and capture cases where a straightforward linear relationship doesn't exist, such as estimating the impact of ad spending on sales.
- Time series regression models, such as autoregressive integrated moving average (ARIMA), incorporate time dependencies and trends to forecast future values based on past observations. These are useful for business applications such as sales forecasting, demand prediction and stock market analysis.
- Support vector regression (SVR) is a regression version of support vector machines and is particularly suitable for handling nonlinear relationships in high-dimensional spaces. SVR can be applied to tasks such as financial market prediction, customer churn forecasting or predicting customer lifetime value.
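The polynomial case above is a good example of how several of these types reduce to linear regression on transformed features. The sketch below fits y = c * x**2 -- a single transformed feature with no intercept, chosen for brevity -- by ordinary least squares on made-up data that is clearly nonlinear in x:

```python
# Polynomial regression is linear regression on transformed features:
# fit y against x**2 (and, in general, x, x**3, ...) instead of x alone.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 8.0, 18.0, 32.0]   # exactly 2 * x**2 -- nonlinear in x

feats = [x ** 2 for x in xs]  # the polynomial feature transform

# One-parameter least squares: c = sum(f*y) / sum(f*f)
c = sum(f * y for f, y in zip(feats, ys)) / sum(f * f for f in feats)

print(c)           # 2.0 -- recovered coefficient
print(c * 5 ** 2)  # prediction at x = 5 -> 50.0
```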
Applications of regression
Kramer offered the following specific applications of regression frequently used in business:
- Sales forecasting. Predicting future sales based on historical sales data, marketing expenditure, seasonality, economic factors and other relevant variables.
- Customer lifetime value prediction. Estimating the potential value of a customer over the customer's entire relationship with the company based on past purchase history, demographics and behavior.
- Churn prediction. Predicting the likelihood of customers leaving the company's services based on their usage patterns, customer interactions and other related features.
- Employee performance prediction. Predicting the performance of employees based on various factors such as training, experience and demographics.
- Financial performance analysis. Understanding the relationship between financial metrics (e.g., revenue, profit) and key drivers (e.g., marketing expenses, operational costs).
- Risk analysis and fraud detection. Predicting the likelihood of events such as credit defaults, insurance claims, or fraud based on historical data and risk indicators.
- Maintenance prediction. Predicting time to failure of critical parts and machinery.
Advantages and disadvantages of regression
Stewart said one of the main advantages of regression models is that they are simple and easy to understand. They are highly transparent, and it is easy to explain clearly how a model arrives at its predictions.
Another advantage is that regression models have been used in industries for a long time and are well understood. For example, generalized linear models are heavily used within the actuarial profession, and their use is well established. "The models are well understood by regulatory bodies, making it simple to have informed discussions about model implementation and associated risk, governance and oversight," Stewart said.
Their simplicity, however, is also their limitation, he said. Regression models rely on several assumptions that rarely apply in real-world scenarios, and they can only handle simple relationships between predictors and the predicted value. Therefore, other machine learning models usually outperform regression models.
In Khadilkar's view, regression provides the greatest value as a quantitative measurement, interpolation and prediction tool -- and is incredibly good at this. "Its properties are well known, and we have great ways of quantifying our confidence about our predictions as well," he said. For example, one can predict stock market prices with a specific range of possible variations around the predicted quantity.
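One simple way to attach a confidence range to a regression prediction is to report the prediction plus or minus roughly twice the standard deviation of the model's residuals. The sketch below uses made-up data and this simplified band; proper prediction intervals also account for uncertainty in the fitted line itself.

```python
import math

# Made-up data, roughly y = 2x with some noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Fit y = a + b*x by ordinary least squares.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Residual standard deviation (n - 2 degrees of freedom).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))

x_new = 6
pred = a + b * x_new
print(f"prediction: {pred:.2f} +/- {2 * sigma:.2f}")
```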
However, there are many applications where regression is not well suited. "For example, it is less useful for recognizing faces from images. Also, it is not a fit when trying to mine data for pattern recognition or automating decisions," Khadilkar said.
"The key disadvantage of regression is possibly the fact that it only gives us a prediction of the quantity of interest without suggesting what you should do with the information," Khadilkar explained. "That is up to the human to decide."