1 Setting the scene

1.1 Introduction

Onde upon a time, there was a bank offering services to private persons. The services include managing of accounts, offerings loans, etc.

1.2 Task descripion

The bank wants to improve their services. For instance, the bank managers have only vague idea, who is a good client (whom to offer some additional services) and who is a bad client (whom to watch carefully to monimize the bank loses).

Fortunately, the bank stores data about their clients, the accounts (transactions within several months), the loans already granted, the credit cards issued.

The bank managers hope to improve their undestanding of customers and seed specific actions to improve services. A mere application of discovery tool will not be convincing for them.

1.3 Data description

This database was prepared by Petr Berka and Marta Sochorova.

Simplified logical data model of Berka Bank.

Simplified logical data model of Berka Bank.


2 Data ingestion, cleaning, translation and enhancement

Before starting the Berka Analysis, a few important steps were taken in order to prepare the source data files. These steps are listed below:

2.1 Create Functions

This step create functions to be used in the next steps. Following, all functions created are described.

2.1.1 GetGenderFromBirthnumber

The birth_number column is given in the form of YYMMDD for men, and YYMM+50DD for women. The objective of this function is to return the gender of the client via the birth_number.

2.2 Data Ingestion

The process of data ingestion — preparing data for analysis — usually includes steps called extract (taking the data from its current location), transform (cleansing and normalizing the data), and load (placing the data in a database where it can be analyzed).

During this step, in addition to the loading data processes, it was performed data casting, column renaming and small touch-ups. The list below describe each table adjustment taken:

  • District: renaming columns and casting columns with decimal or “?” values;
  • Credit Card: casting column issued in creditcard table from string to datetime data type;
  • Account: casting column date in account table from string to datetime data type;
  • Loan: casting columns in table loan to the right data types;
  • Permanent Order: casting columns with decimal values;
  • Transaction: casting columns in table transaction to the right data types.

2.3 Data Cleaning

The objective of this step is analysing missing values and other strange conditions. In order to accomplish this task, a few R functions were used to quickly discover missing values, like NA and empty fields.

First thing done, was fixing observations in k_symbol transaction table with ’ ’ (one space) to empty string (’’), using the following command.

Then, the command below was used to find out any NA values in each table.

Solely the transaction table has NA values, in the following columns:

x
trans_id 0
account_id 0
date 0
type 0
operation 0
amount 0
balance 0
k_symbol 0
bank 0
account 760931

Finally, the following command was used in each table to find out where empty values was hidden.

Again, only the transaction table had empty values, according to the table below:

x
trans_id.NA NA
account_id.NA NA
date.NA NA
type.NA NA
operation.TRUE 183114
amount.NA NA
balance.NA NA
k_symbol.TRUE 535314
bank.TRUE 782812
account.NA NA

For the exploration analysis report, we did not take any additional action, since the missing values was not relevant.

2.5 Data Enhancement

This step aims to improve the analysis by adding auxiliary information. Data enhancement is all about making sure any data that is coming into the business is being looked at with a critical eye and is being filtered down to maximize its value.

The code below get gender, birthday and age from birth_number column using GetGenderFromBirthnumber and GetBirthdateFromBirthnumber functions.

The code below improved loan data by having a classification regarding its payment status.

The code below improved client data by having its age group.

The code below calculate an additional table with current and average account balance for each account.

The code below calculate an additional table with the proportion of each transaction type (k_symbol) on total transaction amount of each account. That data will be used to fit a logistic model on loan dataset.


3 The explorations

3.1 Gender Exploration

At first glance, gender equality is well balanced in the bank, even when observed over the decades. Even more impressive, gender equality is everywhere in the country.

3.2 Loan Exploration

Here we investigate if there is any association between the region and the likelihood of default in the 682 loan observations in the dataset.

We perform the same investigation on the association between the client gender and the likelihood of default in the 682 loan observations in the dataset.

We finnaly do the same investigation on the association between the client age and the likelihood of default in the 682 loan observations in the dataset.

It seems that none of these features, alone, are determinant on the odds of a client default.

3.3 Account Balance Exploration

As we do not have the individual income of each client, we tried to perform a proxy of the client’s wealth by calculating the average account balance of each account and investigating the distribution by region and age.

One interest pattern shows up, males in prague region tend to have a bigger average account balance them females, and the oposite trend in South, North and West Bohemia.

3.4 District exploration

Regarding the Czech regions, we can notice in the exploration below which regions are more likely to have defaulters based on the finished loans operations so far.