By: Natasha Mashanovich, Senior Data Scientist at World Programming, UK
Part 4: Variable Selection
"Doing more with less" is the main philosophy of credit intelligence, and credit risk models are the means to achieve this goal. Using an automated process and focusing on the key information, credit decisions can be made in seconds – and can eventually reduce operational cost by making the decision process much faster. Fewer questions and rapid credit decisions ultimately increase customer satisfaction. For lenders this means expanding their customer base, taking on board less risky customers and increasing the profit.
How to achieve parsimony and what is the key information to look for? The answer is found during the next step of the credit risk modeling process – the variable selection process.
The mining view created as the result of data preparation is a unique, multi-dimensional customer signature, used to discover potentially predictive relationships and to test the strength of those relationships. A thorough analysis of the customer signature is an important step in creating a set of testable hypotheses based on the characteristics it contains. Often referred to as business insights, this analysis provides an interpretation of trends in customer behavior and helps direct the modeling process.
The purpose of the business insights analysis is to:
- Validate that the derived customer data is in line with business understanding. For example, insight analysis should support the business statement that customers with a higher debt-to-income ratio are more likely to default;
- Provide benchmarks for analyzing model results;
- Shape the modeling methodology.
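As a rough illustration of the first point, the check that bad rate rises with debt-to-income ratio could be sketched as a simple cross-tabulation. The function name `bad_rate_by_band`, the band edges, and the sample records are all hypothetical and made up for illustration; real analysis would run on the mining view.

```python
from collections import defaultdict

def bad_rate_by_band(records, edges):
    """Return [(band_label, n_customers, bad_rate)] for ascending band edges."""
    counts = defaultdict(lambda: [0, 0])  # band index -> [customers, bads]
    for dti, is_bad in records:
        # Band index = number of edges at or below the value
        band = sum(dti >= e for e in edges)
        counts[band][0] += 1
        counts[band][1] += is_bad
    bounds = [None] + list(edges) + [None]  # open-ended first and last bands
    result = []
    for band in sorted(counts):
        lo, hi = bounds[band], bounds[band + 1]
        if lo is None:
            label = f"< {hi}"
        elif hi is None:
            label = f">= {lo}"
        else:
            label = f"{lo} to < {hi}"
        n, bads = counts[band]
        result.append((label, n, bads / n))
    return result

# Hypothetical (debt_to_income, defaulted) pairs, invented for this sketch
sample = [(0.10, 0), (0.15, 0), (0.18, 0), (0.19, 1),
          (0.25, 0), (0.30, 1), (0.35, 0), (0.38, 1),
          (0.45, 1), (0.50, 1), (0.55, 0), (0.60, 1)]

table = bad_rate_by_band(sample, edges=[0.2, 0.4])
rates = [rate for _, _, rate in table]
# True when the bad rate never falls from one band to the next
is_monotonic = all(a <= b for a, b in zip(rates, rates[1:]))

for label, n, rate in table:
    print(f"{label:>14}  n={n}  bad rate={rate:.2f}")
```

If `is_monotonic` came back false on real data, that would be a prompt to revisit either the derived variable or the business assumption, not proof that either is wrong.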
Business insights analysis utilizes similar techniques to exploratory data analysis by combining univariate and multivariate statistics and different data visualization techniques. Typical techniques are correlation, cross-tabulation, distribution, time-series analysis, and supervised and unsupervised segmentation analysis. Segmentation is of special importance, as it determines when multiple scorecards are needed.
Variable selection, based on the results of the business insights analysis, starts by splitting the mining view into at least two partitions: training and testing. The training partition is used to develop the model, and the testing partition is used to assess the model’s performance and validate the model.
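A minimal sketch of such a partition is shown here, assuming a stratified split so that both partitions keep the same bad rate. Stratification, the 70/30 ratio, the function name `stratified_split`, and the toy mining view are assumptions for illustration; the text only requires that training and testing partitions exist.

```python
import random

def stratified_split(rows, target_index, test_fraction=0.3, seed=42):
    """Split rows into (train, test), preserving the class mix in each partition."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_index], []).append(row)
    train, test = [], []
    for class_rows in by_class.values():
        shuffled = class_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Hypothetical mining view: (customer_id, is_bad) with a 10% bad rate
mining_view = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
train, test = stratified_split(mining_view, target_index=1)

train_bad_rate = sum(row[1] for row in train) / len(train)
test_bad_rate = sum(row[1] for row in test) / len(test)
```

Splitting within each class matters most when bads are rare, since a purely random split can leave the testing partition with too few bad accounts to assess the model.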
Figure 1. Simplified Scorecard Model Building Process
Variable selection tests a collection of candidate model variables for significance during model training. Candidate model variables are also known as independent variables, predictors, attributes, model factors, covariates, regressors, features, or characteristics.
Variable selection is a parsimonious process that aims to identify a minimal set of predictors for the maximum gain (predictive accuracy). This approach is the opposite of data preparation, where as many meaningful variables as possible are added to the mining view. These opposing requirements are reconciled through optimization; that is, by finding the minimal selection bias under the given constraints.
The key objective is to find the right set of variables so that the scorecard model can not only rank customers by their likelihood of bad debt but also estimate the probability of their bad debt. This usually means selecting statistically significant variables for the predictive model and having a balanced set of predictors (usually 8–15 is considered a good balance) to converge on a 360-degree customer view. In addition to customer-specific risk characteristics, we should also consider including systematic risk factors to account for economic drifts and volatilities.
Easier said than done – when selecting variables, there are a number of limitations. First, the model will usually contain some highly predictive variables whose use is prohibited by legal, ethical or regulatory rules. Second, some variables might not be available or might be of poor quality during the modeling or production stages. In addition, there might be important variables that have not been recognized as such, for example, because of a biased population sample, or because their model effect would be counter-intuitive as a result of multicollinearity. And finally, the business will always have the last word, and might insist that only business-sound variables are included, or request monotonically increasing or decreasing effects.
All of these constraints are potential sources of bias, which leaves data scientists with the challenging task of minimizing the selection bias. Typical preventive measures during variable selection include:
- collaboration with experts in the field to identify the important variables;
- awareness of any problems in relation to data source, reliability or mismeasurement;
- cleaning the data;
- using control variables to account for banned variables or specific events such as an economic drift.
It is important to recognize that variable selection is an iterative process that occurs throughout the model building process.
- It starts prior to model fitting by reducing the number of variables in the mining view to a manageable set of candidate variables;
- continues during the model training process, where further reduction is implemented as a result of statistical insignificance, multicollinearity, low contributions or penalization to avoid overfitting;
- carries on during model evaluation and validation; and
- concludes during business approval, where model readability and interpretability play an important part.
Variable selection finishes after the "sweet spot" has been reached – meaning that no more improvement can be achieved in terms of model accuracy.
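One of the reductions named above, removing variables because of multicollinearity, can be sketched as a greedy pairwise-correlation filter. This is only one simple way to do it and is not claimed to be the method used in this series; the 0.8 threshold, the `strength` scores, the function names, and the sample columns are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # raises ZeroDivisionError for constant columns

def drop_correlated(columns, strength, threshold=0.8):
    """Visit variables from strongest to weakest; keep one only if it is not
    highly correlated with any variable already kept."""
    kept = []
    for name in sorted(columns, key=lambda n: strength[n], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical candidate variables; the first two carry nearly the same information
columns = {
    "debt_to_income": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "debt_to_income_pct": [10, 21, 29, 41, 50, 61],
    "months_at_address": [24, 3, 60, 12, 48, 6],
}
# Hypothetical univariate strength scores used to decide which duplicate survives
strength = {"debt_to_income": 0.35, "debt_to_income_pct": 0.33, "months_at_address": 0.12}

kept = drop_correlated(columns, strength)
print(kept)
```

Pairwise correlation only catches redundancy between two variables at a time, so in practice it would typically be combined with the other checks listed above.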
Figure 2. Iterative Nature of Variable Selection Process
A plethora of variable selection methods are available, and with advances in machine learning their number keeps growing. Variable selection techniques differ in whether we use variable reduction or variable elimination (filtering); whether the selection process is carried out inside or outside predictive models; whether we use supervised or unsupervised learning; and whether the underlying methods are based on specific embedded techniques such as cross validation.
|Variable selection method||Examples|
Table 1. Variable Selection Methods Typical in Credit Risk Modeling
Figure 3. Variable Selection using Bivariate Analysis
In credit risk modeling, two of the most commonly used variable selection methods are information value, for filtering prior to model training, and stepwise selection, for variable selection during the training of a logistic regression model. Although both receive some criticism from practitioners, it is important to recognize that no ideal methodology exists, as each method for variable selection has its pros and cons. Choosing which one to use and how best to combine them is not easy and requires solid domain knowledge, a good understanding of the data, and extensive modeling experience.
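For readers unfamiliar with information value (IV), the sketch below uses the commonly cited definition: the weight of evidence (WoE) of a bin is the log of the ratio of its share of goods to its share of bads, and IV sums the difference in shares times WoE over all bins. The function name `information_value` and the bin counts are hypothetical, and the variable is assumed to be binned already.

```python
import math

def information_value(bins):
    """bins: list of (n_good, n_bad) counts per bin. Returns (woe_per_bin, iv)."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woe, iv = [], 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            # WoE is undefined for an empty class; sparse bins should be merged first
            raise ValueError("each bin needs at least one good and one bad")
        dist_good = good / total_good  # share of all goods falling in this bin
        dist_bad = bad / total_bad     # share of all bads falling in this bin
        bin_woe = math.log(dist_good / dist_bad)
        woe.append(bin_woe)
        iv += (dist_good - dist_bad) * bin_woe
    return woe, iv

# Hypothetical (good, bad) counts for three bins of one candidate variable
bins = [(400, 20), (300, 30), (200, 50)]
woe, iv = information_value(bins)
print([round(w, 3) for w in woe], round(iv, 3))
```

Candidate variables can then be ranked by IV and the weakest filtered out before model training; the cut-off to apply is a modeling judgment and is not prescribed here.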