A Pragmatic Approach to Data Science

Organizations can create more effective collaborative teams with data science. In part one of this two-part series, we explored best practices for data science and how to avoid common challenges. Now, let’s examine how to make data science more pragmatic through a case study example.

Three Approaches to Data Science

Depending on the maturity of the organization, there are three main approaches to deliver business value from data science:

  1. Run data science workshops to identify initiatives and rank them by return on investment and time to completion.
  2. Conduct data capability and infrastructure assessments to make sure the team can achieve the initiatives outlined.
  3. Follow the Cross-Industry Standard Process for Data Mining (CRISP-DM). This allows teams to get value quickly, iterate and fail fast. The CRISP-DM method makes data science pragmatic for most organizations: it keeps the business engaged, keeps the work focused on business problems and lets iterations happen quickly so the organization gets value sooner.

The following real-life example explores how these approaches make data science pragmatic, walking through the phases of the CRISP-DM method described above.

Real-Life Data Science Example:

Business Understanding: What does the business need?

In this use case, a federal credit union obtained a new member base from a partnership with a retailer. Together, they offered a co-branded credit card which expanded the credit union’s geographic footprint in several states. However, the new member base was riskier than its core member base.

After the data science workshops, this new member base was identified as a top strategic priority for the data science team. Leadership wanted to know how to make the portfolio more profitable for the organization. Because the member base was riskier, the data science team focused initially on reducing exposure from risky account holders. The main objective for the project was to predict unhealthy accounts for credit line decreases.

Data Understanding

To begin, they explored the following questions to help identify the data sources:

  • What data do we have?
  • What data do we need?
  • Is the data clean?
  • Is the data ready for processing?

The credit union found three disparate data sources that were not located in a centralized system for easy access. The data was also vendor maintained, meaning there was an extract file component. While the data science team could review historical one-time data dumps, they knew that once the model was ready for production, the data wouldn’t be available for real-time scoring.

The data science team made a conscious decision to work in parallel with the data engineers to make sure they could get the data and pipelines ready for production for as close to real-time scoring as possible.

Next, the data science team worked to understand the data well enough to define an unhealthy account with the data provided. The same information can be represented in the data in multiple ways, so it was important to work with the business and the domain experts for those systems of record to find the most accurate representation and to confirm whether processes had been consistent over time. It’s critical to get feedback at this stage so that valuable time isn’t wasted only to find out that the data was bad to begin with. Therefore, it was important to make sure the business was engaged at this step.

Data Preparation: How to organize the data for modeling

The data had to be set up in a way that would mimic how it works in reality and production. This means each row, or observation, must contain only the predictive variables that could actually be known about an account at the moment it is scored. For example, if all accounts were scored after being open for one week, then the data science team needed to ensure it was training the model only on one-week-old accounts. Each historical record of the account is then paired with a target variable, which represents whether the account ultimately became healthy or unhealthy.
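The row-building rule above can be sketched in a few lines of Python. This is a minimal illustration, not the credit union's pipeline: the account and transaction records, the field names and the `build_training_rows` helper are all hypothetical, and the one-week scoring point is taken from the example in the text.

```python
from datetime import date, timedelta

# Hypothetical toy data: account id -> (open date, eventual outcome).
ACCOUNTS = {
    "A1": (date(2023, 1, 2), "unhealthy"),
    "A2": (date(2023, 1, 5), "healthy"),
}

# Hypothetical transactions: (account id, transaction date, amount).
TRANSACTIONS = [
    ("A1", date(2023, 1, 3), 120.0),
    ("A1", date(2023, 1, 20), 900.0),  # after the cutoff, must not leak in
    ("A2", date(2023, 1, 6), 40.0),
    ("A2", date(2023, 1, 10), 15.0),
]

SCORING_AGE = timedelta(days=7)  # assumed scoring point: one week after opening


def build_training_rows(accounts, transactions, scoring_age=SCORING_AGE):
    """Return one row per account, using only data visible at scoring time."""
    rows = []
    for account_id, (opened, outcome) in sorted(accounts.items()):
        cutoff = opened + scoring_age
        visible = [
            amount
            for acc, day, amount in transactions
            if acc == account_id and opened <= day <= cutoff
        ]
        rows.append({
            "account_id": account_id,
            "spend_first_week": sum(visible),
            "txn_count_first_week": len(visible),
            # Target variable: 1 if the account ultimately became unhealthy.
            "target": 1 if outcome == "unhealthy" else 0,
        })
    return rows


training_rows = build_training_rows(ACCOUNTS, TRANSACTIONS)
```

The important design choice is the cutoff filter: anything that happened after the scoring point is excluded from the predictive variables, while the target is allowed to look at the final outcome.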

The process of feature engineering involves calculating additional features that improve the accuracy of the model. In this case, the data science team created hundreds of calculated features like rate of spend at different increments, moving averages, etc. The team set up a feature store to capture and automatically calculate these engineered features every month. This enabled the reuse of the same feature set for other models.
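As a rough illustration of the kind of calculated features mentioned here, the sketch below derives a moving average and spend rates over different windows from a list of monthly spend totals. The function names, feature names and numbers are hypothetical; the team's actual feature definitions are not given in this article.

```python
def moving_average(values, window):
    """Trailing moving average; positions without a full window are None."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


def spend_rate(values, periods):
    """Average spend per period over the most recent `periods` entries."""
    recent = values[-periods:]
    return sum(recent) / len(recent)


# Hypothetical monthly spend totals for one account, oldest first.
monthly_spend = [100.0, 200.0, 300.0, 400.0]

features = {
    "spend_ma_3m": moving_average(monthly_spend, 3)[-1],
    "spend_rate_2m": spend_rate(monthly_spend, 2),
    "spend_rate_4m": spend_rate(monthly_spend, 4),
}
```

Keeping each feature as a small named function is one way to make the same calculations reusable across models, which is the role the feature store plays in this case study.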

Modeling & Evaluation: Selecting modeling techniques. Which model best meets the business goal?

Once the data was structured and in the right format for modeling, the team needed to determine which modeling techniques to apply and which model best met the business goals. The data science team also had to make sure that the models were compliant with the Credit Card Protection Act, meaning the models could not unfairly discriminate against protected classes.

The most important facet of modeling is ensuring that the model will generalize to reality. Data scientists estimate a model’s performance by withholding a percentage of the available training rows in what is called a test set. The model is trained on the remaining rows, the training set, and then predicts each row of the test set. This estimates how the model will perform on unseen data, meaning data that wasn’t used to train it. Depending on the type of prediction problem, data scientists calculate different error metrics to rank the performance of different types of models with different sets of features. The model and feature set combination with the lowest error rate is chosen as the best model and the candidate for production.

However, it is critical to make sure the business understands this error rate and is comfortable with it, which may mean translating the statistical metric into something the business can understand. For example, based on the test data, the team estimated that it could identify 80% of the bad accounts at the expense of flagging 10% of the good accounts. Context here matters. What’s the baseline rate without using the model? Is this a lift over the baseline?
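To make the 80% and 10% figures concrete, here is a small sketch of a holdout split and of the two rates involved: the share of bad accounts flagged and the share of good accounts flagged. Everything in it is an assumption for illustration. The helper names are hypothetical, no model is trained, and the class balance of 10 bad accounts per 100 is invented, since the article does not state the real one.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and withhold a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def capture_rates(y_true, y_pred):
    """Return (share of bad accounts flagged, share of good accounts flagged)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    bad = sum(y_true)
    good = len(y_true) - bad
    return true_pos / bad, false_pos / good


# Hypothetical test-set labels (1 = bad account) and model flags.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

bad_caught, good_flagged = capture_rates(y_true, y_pred)

# Baseline: share of bad accounts if accounts were picked at random.
base_rate = sum(y_true) / len(y_true)
# Precision: share of flagged accounts that really are bad.
precision = sum(t for t, p in zip(y_true, y_pred) if p == 1) / sum(y_pred)
lift = precision / base_rate

# The split helper applied to 100 placeholder rows.
train_rows, test_rows = train_test_split(list(range(100)))
```

With these invented numbers the model flags 17 accounts, 8 of which are bad, so a flagged account is several times more likely to be bad than a randomly chosen one. That ratio is the lift over the baseline, and it depends heavily on the true class balance.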

Deployment: How do stakeholders access the results?

Once the business was comfortable with the error rate, the data science team had to figure out how to deploy the model so that stakeholders and the wider organization could access the results and the scores. In reality, this is something an experienced data scientist would think through at the onset of model creation, or at least have a general idea of how the scores will be consumed.

In this case, the data science team was augmenting an existing quarterly process for manual credit line decreases based on a FICO-based business-rule approach. It was incredibly manual and involved a lot of data stitching in Excel. Then, accounts had to be formatted properly in a file that was sent to the credit card vendor to actually implement the line decreases with the appropriate legal documentation. Therefore, a dashboard connected to the model scoring process was created to automate most of the process. With the dashboard, the credit card team could click a button and get all of the accounts that needed a credit line decrease, with the right amount for each, in the format the credit card vendor needed to implement the changes, along with the appropriate legal documentation.

Another consideration was exception handling when scoring records in production. There are many reasons why a machine learning model may not score a particular record, usually because data is missing, because of how missing values are imputed, or because the data is not allowed to be used. Therefore, it’s important to have a clear audit trail of model scoring, especially when it has legal consequences.
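One simple way to keep such an audit trail is to write an entry for every input record, whether or not it was scored, along with the reason. The sketch below assumes hypothetical field names and uses a placeholder `toy_model` function in place of the real model; it only covers the missing-data case.

```python
# Hypothetical list of fields the model needs in order to score a record.
REQUIRED_FIELDS = ("account_id", "spend_first_week", "txn_count_first_week")


def toy_model(record):
    """Placeholder scoring function, not the credit union's model."""
    return min(1.0, record["spend_first_week"] / 1000.0)


def score_with_audit(records, model=toy_model):
    """Score each record and log an audit entry for every input."""
    audit_log = []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
        if missing:
            # Record why the account was skipped instead of failing silently.
            audit_log.append({
                "account_id": record.get("account_id"),
                "status": "not_scored",
                "reason": "missing: " + ", ".join(missing),
                "score": None,
            })
            continue
        audit_log.append({
            "account_id": record["account_id"],
            "status": "scored",
            "reason": "",
            "score": model(record),
        })
    return audit_log


# Hypothetical production batch: the second record lacks a spend value.
batch = [
    {"account_id": "A1", "spend_first_week": 500.0, "txn_count_first_week": 3},
    {"account_id": "A2", "spend_first_week": None, "txn_count_first_week": 1},
]
audit_log = score_with_audit(batch)
```

Because every account appears in the log exactly once, the credit card team can later show which accounts were scored, which were skipped and why.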

Data Science Results

Now that the system is used in production, company leaders can see what data science can do for their business. The goal for the data science team was to make the portfolio more profitable. This was accomplished through data science by reducing exposure to risky accounts and reducing a lot of the manual processing time that the credit card team historically had to do to implement credit line changes.

The data science team set up the predictive features in such an automated and streamlined way that they can be pivoted and used in any other use case for this portfolio. This allows the business to get to value quickly, and to iterate and make sure that data science is solving business goals in a shorter amount of time. For example, another model was created to predict healthy accounts that should get a credit line increase following a similar process.

The data science team implemented this model in two three-week sprints, with most of the model development time spent validating that the model is meeting regulations. With the right foundation laid and collaborative work across the organization, it is possible to build data science models quickly and effectively.


By Max Carduner, Data Science Solution Architect at Exadel
Carduner has more than seven years of experience in professional analytics and has spent more than four years in data science. He holds an MS in Data Science