Credit Scoring: Part 3 - Data Preparation and Exploratory Data Analysis

By: Natasha Mashanovich, Senior Data Scientist at World Programming, UK


"Garbage in, garbage out" is a common axiom in computer science and a threat to any project's success: the quality of the output is largely determined by the quality of the input. Data preparation is therefore a key aspect of any data mining project, including the development of a credit scorecard. It is the most challenging and time-consuming phase of the CRISP-DM cycle; at least 70%, and sometimes more than 90%, of total project time is dedicated to this activity. It involves data collection, combining multiple data sources, aggregation, transformation, data cleansing, "slicing and dicing", and looking at the breadth and depth of the data to gain a clear understanding and to turn quantity of data into quality of data, so that we can prepare confidently for the next phase – Model Building.

The previous article in this series, Credit Scorecard Modelling Methodology, discussed the importance of model design and identified its main components, including the unit of analysis, population frame, sample size, criterion variable, modelling windows, data sources and data collection methods. Careful consideration of each component is imperative for successful data preparation. The final product of this stage is a mining view encompassing the right level of analysis, the modelling population, and the independent and dependent variables.

Mining view component | Application scorecard case study example
Unit of analysis | Customer level
Population frame | Loan applicants with previous bad debt history
Sample size | Through-the-door applicants during 2015 and 2016
Data sources | Credit bureau data, applicant data, aged debt history
Independent variables | Mixture of nominal, ordinal and interval data, such as aggregated values, flags, ratios, time and date values
Dependent variable | Default status (1 or 0)
Operational definitions | Default: 90 days past due
Observation window | Historical credit bureau customer information over a period of three years
Performance window | One year

Table 1. Model Design Components

Data Sources

"The more, the merrier" – as part of the data understanding step, any external and internal data sources should provide both quantity and quality. The data utilized must be relevant, accurate, timely, consistent and complete, and of sufficient and diverse volume to yield a useful analysis. For application scorecards, where the amount of internal data is limited, external data predominates. In contrast, behavior scorecards utilize more of the internal data, and are typically superior in terms of predictive power. The common data sources required for customer verification, fraud detection or credit granting are outlined below.

Source | Category | Supplied by
External | Address, postcode; bureau searches; electoral roll data; financial accounts; court and insolvency; generic bureau scores | Credit bureaus
Internal | Account management; product details; performance data; marketing campaigns; customer interactions | Lenders

Table 2. Data Sources Diversity

The Process

The data preparation process starts with data collection, commonly referred to as the ETL (Extract-Transform-Load) process. Data integration combines the different sources using data merging and concatenation. Typically, this requires manipulating relational tables under a number of integrity rules, such as entity, referential and domain integrity. Using one-to-one, one-to-many or many-to-many relationships, the data is aggregated to the desired level of analysis so that a unique customer signature is produced.
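As a minimal sketch of this integration and aggregation step, the following uses pandas with made-up tables and column names (the data and identifiers are hypothetical, not from the case study): a one-to-many join of applicant data to bureau records, then aggregation to one row per customer – the unit of analysis.

```python
import pandas as pd

# Hypothetical applicant and bureau tables (illustrative data only)
applications = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "loan_amount": [5000, 12000, 7500],
})
bureau = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "balance": [200, 450, 900, 100, 50, 300],
    "days_past_due": [0, 30, 0, 90, 0, 15],
})

# One-to-many join: each applicant may have several bureau records
merged = applications.merge(bureau, on="customer_id", how="left")

# Aggregate to customer level to build a unique customer signature
signature = merged.groupby("customer_id").agg(
    loan_amount=("loan_amount", "first"),
    total_balance=("balance", "sum"),
    max_dpd=("days_past_due", "max"),
    n_bureau_records=("balance", "count"),
).reset_index()
print(signature)
```

The aggregations chosen (sum, max, count) are examples of the flags, ratios and aggregated values listed as independent variables in Table 1.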

Figure 1. Data Preparation Process

Data exploration and data cleansing are mutually iterative steps. Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions, to correlations, cross-tabulation and characteristic analysis.
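The univariate and bivariate steps above can be sketched in a few lines of pandas, again on hypothetical data (the variables `age`, `region` and `default` are illustrative): summary statistics and frequency distributions for the univariate view, cross-tabulation against the default flag and correlation for the bivariate view.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 45, 29, 52, 41],
    "region": ["N", "S", "N", "S", "N", "N"],
    "default": [1, 0, 0, 1, 0, 0],
})

# Univariate: summary statistics and frequency distributions
print(df["age"].describe())
print(df["region"].value_counts())

# Bivariate: cross-tabulation of a characteristic against the default flag
# (normalize="index" gives the default rate within each category)
print(pd.crosstab(df["region"], df["default"], normalize="index"))

# Correlation between numeric variables
print(df[["age", "default"]].corr())
```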

Figure 2. EDA (Univariate View)

Figure 3. EDA (Characteristic Analysis)

Following exploratory data analysis (EDA), the data is treated in order to increase its quality. Data cleansing requires good business and data understanding so that the data can be interpreted correctly. It is an iterative process of identifying irregularities and replacing, modifying or deleting them as appropriate. The two major issues with unclean data are missing values and outliers; both can heavily affect model accuracy, so careful intervention is imperative.

Before deciding how to treat missing values, we need to understand the reason the data is missing and the distribution of the missing data, so that we can categorize it as:

  1. Missing completely at random (MCAR);
  2. Missing at random (MAR), or;
  3. Missing not at random (MNAR).

Missing data treatments often assume MCAR or MAR, while MNAR is more difficult to deal with. The list below gives the common treatments, ordered by complexity.

Missing data treatment | Description
Leave missing data
  • A small percentage of missing values can be tolerated
  • Missing values have a special meaning and are treated as a separate category
Delete missing data
  • Listwise (complete case) or
  • Pairwise
    Pros: simple and fast
    Cons: reduces statistical power; problematic on small datasets
Single imputation
  • Mean, mode or median; add a missing flag for adjustment
    Pros: simple, fast and uses the complete dataset
    Cons: reduces variability and ignores relationships between variables; not effective when the data contains a large proportion of missing values (typically more than 5%)
Model-based imputation
  • Regression
    Pros: simple
    Cons: reduced variance
  • KNN imputation
    Pros: imputes categorical and numeric data
    Cons: performance issues on large datasets
  • Maximum likelihood estimation
    Pros: unbiased; uses the complete dataset
    Cons: complex
  • Multiple imputation
    Pros: accurate, cutting-edge technique
    Cons: difficult to code without a special function

Table 3. Missing Data Treatments
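As a concrete sketch of the "single imputation" row in Table 3, the snippet below imputes a hypothetical `income` column with its median and keeps a missing-value indicator, so the fact that a value was missing remains available to the model (the column name and data are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [2500.0, np.nan, 4200.0, 3100.0, np.nan]})

# Record a missing flag before imputing, so "missingness" itself
# can still act as a predictor
df["income_missing"] = df["income"].isna().astype(int)

# Single imputation with the median (more robust to outliers than the mean)
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```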

Outliers are another "beast" in our data, as their presence can violate the statistical assumptions under which we develop a model. Once identified, it is important to understand the reasons for the outliers before applying any treatment. For example, outliers can be a valuable source of information in fraud detection, so it would be a bad idea to replace them with a mean or median value.

Outliers should be analyzed using univariate and multivariate analysis. For detection, we can use visual methods, such as histograms, box plots or scatter plots, and statistical methods, such as mean and standard deviation, clustering (examining distant clusters), small decision tree leaf nodes, Mahalanobis distance, Cook's D or Grubbs' test. Judging what should be considered an outlier is not as straightforward as identifying missing values. The decision should be based on a specified criterion, for example: any value outside ±3 standard deviations, outside ±1.5 IQR, or beyond the 5th–95th percentile range is labelled an outlier.
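Two of the statistical criteria above can be sketched as follows, on a hypothetical series with one obvious outlier (the data is illustrative):

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 16, 14, 120])  # 120 is an obvious outlier

# Criterion 1: values outside +/-3 standard deviations.
# Note: on very small samples a single extreme value inflates the
# standard deviation, so this rule can mask the outlier it should catch.
z = (s - s.mean()) / s.std()
outliers_z = s[z.abs() > 3]

# Criterion 2: values beyond 1.5 * IQR outside the quartiles
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers_iqr = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers_iqr.tolist())
```

The IQR rule flags the extreme value here, while the 3-sigma rule does not, which is one reason the choice of criterion should be an explicit, documented decision.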

Outliers can be treated in a similar way to missing values. Other transformations can also be applied, including binning, weight assignment, conversion to missing values, logarithmic transformation to reduce the influence of extreme values, and Winsorization.
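Two of these treatments, Winsorization and a log transform, can be sketched as below on illustrative data (the percentile bounds are an assumed choice, not a prescription):

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 120, 95, 110, 105, 5000])

# Winsorization: clip extreme values to chosen percentiles (here 5th-95th)
lower, upper = s.quantile([0.05, 0.95])
winsorized = s.clip(lower=lower, upper=upper)

# Log transform to reduce the influence of extreme values
# (log1p is safe for zero values)
logged = np.log1p(s)
print(winsorized.max())
```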

As discussed above, data cleansing may involve a variety of statistical and machine learning techniques. Even though these transformations can produce superior scorecard models, the practicality of implementation must be considered, as complex data manipulations can be difficult to implement, costly, and can slow down model processing performance.

Once the data is clean, we are ready for a more creative part – data transformations. Data transformation or feature engineering is the creation of additional (hypothesized) model variables that are tested for significance. The most common transformations include binning and optimal binning, standardization, scaling, one hot encoding, interaction terms, mathematical transformations (from non-linear into linear relationships and from skewed data into normally distributed data) and data reduction using clustering and factor analysis.
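A few of these transformations can be sketched with pandas on hypothetical applicant data (the column names, bin edges and the debt-to-income ratio are illustrative assumptions): binning a continuous variable, deriving a ratio feature, and one-hot encoding a nominal variable.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 37, 58, 45, 30],
    "income": [1800, 3200, 5400, 4100, 2500],
    "debt": [900, 800, 1200, 2000, 500],
    "region": ["N", "S", "N", "S", "N"],
})

# Binning: convert a continuous variable into ordinal groups
df["age_band"] = pd.cut(df["age"], bins=[18, 30, 45, 70],
                        labels=["18-30", "31-45", "46-70"])

# Derived ratio: debt-to-income, often more informative than raw amounts
df["dti"] = df["debt"] / df["income"]

# One-hot encoding of a nominal variable
df = pd.get_dummies(df, columns=["region"], prefix="region")
print(df.columns.tolist())
```

In a real scorecard, binning would usually be optimal (supervised) binning driven by the default flag rather than fixed cut points as here.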

Apart from some general recommendations on how to tackle this task, it is the responsibility of the data scientist to suggest the best approach to transforming the customer data signature into a powerful information artifact – the mining view. This is probably the most creative, and the most challenging, aspect of the data scientist role as it requires a solid grasp of business understanding in addition to statistical and analytical skills. Very often, the key to creating a good model is not the power of a specific modeling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.

The rest is the art of feature creation…
