By: Natasha Mashanovich, Senior Data Scientist at World Programming, UK
"Garbage in, garbage out" is a commonly used axiom in computer science and a threat to a project’s success – the quality of output is largely determined by the quality of input. Therefore, Data preparation is a key aspect of any data mining project including development of a credit scorecard. This is the most challenging and time consuming phase of the CRISP-DM cycle. At least 70%, sometimes more that 90%, of total project time is dedicated to this activity. It involves data collection, combining multiple data sources, aggregations, transformations, data cleansing, "slicing and dicing", and looking at the breadth and depth of data to get a clear understanding and to transform quantity of data into quality of data so that we can prepare confidently for the next phase – Model Building.
The previous article in this series, Credit Scorecard Modelling Methodology discussed the importance of a model design and identified its main components including unit of analysis, population frame, sample size, criterion variable, modelling windows, data sources and data collection methods. Careful consideration of each of the components is imperative for successful data preparation. The final product of this stage is a Mining View encompassing the right level of analysis, modelling population, independent and dependent variables.
|Mining view component||Application scorecard case study example|
|Unit of analysis||Customer level|
|Population frame||Loan applicants with previous bad debt history|
|Sample size||Through the door applicants during 2015 and 2016|
|Data sources||Credit bureau data, applicant data, age debt history|
|Independent variables||Mixture of nominal, ordinal and interval data, such as aggregated values, flags, ratios, time and date values|
|Dependent variable||Default status (1 or 0)|
|Operational definitions||Default: 90 days past due|
|Observation window||Historical credit bureau customer information over the period of three years|
|Performance window||One year|
Table 1. Model Design Components
"The more, the merrier" – As part of data understanding step, any external and internal data sources should provide both quantity and quality. Data utilised must be relevant, accurate, timely, consistent and complete, while being of sufficient and diverse volume to provide a useful result in analysis. For application scorecards where there is limited amount of internal data, external data has prevalence. In contrast, behaviour scorecards utilise more of the internal data, and are typically superior in terms of predictive power. The common data sources that would be required for customer verification, fraud detection or credit grant are outlined below.
|Source||Category||Supplied by authority|
|External||Address, postcode||Credit Bureaux|
|Electoral roll data|
|Court and insolvency|
|Generic bureau scores|
Table 2. Data Sources Diversity
The data preparation process starts with data collection, commonly referred as ETL process (Extract-Transform-Load). Data integration combines different sources using data merging and concatenation. Typically, it requires manipulation of relational tables using a number of integrity rules such as entity, referential and domain integrity. Using one-to-one, one-to-many or many-to-many relationships, the data is aggregated to the desired level of analysis so a unique customer signature is produced.
Figure 1. Data Preparation Process
Data exploration and data cleansing are mutually iterative steps. Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions, to correlations, cross-tabulation and characteristic analysis.
Figure 2. EDA (Univariate View)
Figure 3. EDA (Characteristic Analysis)
Following exploratory data analysis (EDA), the data is treated in order to increase the quality. Data cleansing requires good business and data understanding so the data can be interpreted in the right way. This is an iterative process designed to remove irregularities and replacing, modifying or deleting these irregularities as appropriate. Two major issues with unclean data are missing values and outliers; both can heavily affect the model accuracy and careful intervention is therefore imperative.
Before a decision is made on how to treat missing values, we need to understand the reason for missing data and understand distribution of missing data so we can categorise it as:
- Missing completely at random (MCAR);
- Missing at random (MAR), or;
- Missing not at random (MNAR).
Missing data treatment often assumes MCAR and MAR, while NMAR are more difficult to deal with. The list below provides the common treatments ordered by complexity.
|Missing Data Treatment||Description|
|Leave missing data|
|Delete missing data|
Table 3. Missing Data Treatments
Outliers are another "beast" in our data as their presence can violate statistical assumptions under which we develop a model. Once identified, it is important to understand the reasons for having outliers before applying any treatment. For example, outliers could be valuable source of information in fraud detection; hence, it would be a bad idea to replace them with a mean or median value.
Outliers should be analysed using univariate and multivariate analysis. For detection, we can use visual methods such as histograms, box-plots or scatter plots and statistical methods, such as mean and standard deviation, clustering by examining distant clusters, small decision tree leaf nodes, Mahalanobis distance, Cook’s D or Grubbs' test. Judgement of what should be considered as an outlier is not as straightforward as identifying missing values. The decision should be based upon a specified criterion, for example: any value outside ±3 standard deviations, or ±1.5IQR, or 5th–95th percentile range would be labelled as an outlier.
Outliers can be treated in a similar way as missing values. Other transformations can also be utilised including: binning, weights assignment, conversion to missing values, logarithm transformations to eliminate influence of extreme values or Winsorization.
As discussed above, data cleansing may involve implementation of different statistical and machine learning techniques. Even though these transformations could create a more superior scorecard models, the practically of implementation must be considered as complex data manipulations can be difficult to implement, be costly and slow down model processing performance.
Once the data is clean, we are ready for a more creative part – data transformations. Data transformation or feature engineering is the creation of additional (hypothesised) model variables that are tested for significance. The most common transformations include binning and optimal binning, standardisation, scaling, one hot encoding, interaction terms, mathematical transformations (from non-linear into linear relationships and from skewed data into normally distributed data) and data reduction using clustering and factor analysis.
Apart from some general recommendations on how to tackle this task, it is the responsibility of the data scientist to suggest the best approach to transforming the customer data signature into a powerful information artefact – the mining view. This is probably the most creative, and the most challenging, aspect of the data scientist role as it requires a solid grasp of business understanding in addition to statistical and analytical skills. Very often, the key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
The rest is art of feature creation…