27-01-2019

How the Data Science Elite helped uncover a gold mine at Experian


Experian’s Business Information Services (BIS) unit spent years building a unique and valuable database of corporate relationships to better serve customers. The unit built this database of corporate hierarchies by combining entity-matching technology with people who evaluated the matches and updated them based on their own research.

This process was time-consuming and limited the number of company hierarchies the team could evaluate and match. Because of the intense manual effort required, the team maintained hierarchies for only a subset of the full universe of companies available to it. And because so much human involvement was needed, the team was limited in how often it could refresh the hierarchies.

The IBM Data Science Elite team had a simple mission: apply AI to learn what Experian has done over the years building corporate hierarchies and then apply that to the full universe of companies that they traditionally couldn’t evaluate. The goal was to increase the number of corporate hierarchies and increase the frequency of corporate hierarchy matching.

The results? AI and machine learning are now helping Experian solve the problem of building and maintaining business families and corporate linkages, with a potential 500 percent increase in coverage and an 80 percent reduction in cost.

Our team began with a discovery workshop to understand the problem. We dug into the data and held an open discussion of the current process and what Experian would like it to be. The workshop brought together business experts from Experian, stakeholders who understood the value of a new solution, and a data scientist from the IBM Data Science Elite team who could help develop a new approach. After the workshop, we put together a plan to train new machine learning models on Experian’s existing, validated hierarchies. Each hierarchy is a company with thousands of subcompanies, matched using years of Experian expertise, intellectual property and software.

We then documented our approach, defined some agile sprints and moved to project kickoff.

Next we needed a platform: something with key open source data science and machine learning libraries that would meet Experian’s strict guidelines on encryption and key management. It would have to scale with the horsepower needed for such a complex problem. We especially needed GPUs, and we needed something that was quick to get started with.

With that in mind, we spun up Watson Studio for modeling, Watson Machine Learning for model deployment and scoring, Object Storage for data storage, and Key Protect for data encryption and security. All components were spun up on IBM Cloud for the project.

We started with the data. We uploaded several extracts that included Experian’s base data files. Those files contained corporate hierarchies and relevant features such as address, city, website and many others, adding up to millions of rows of data. Working very closely with business stakeholders, the team performed several cycles of data sampling, understanding, preparation and definition.

In the next sprint, the team performed modeling, which included feature engineering, blocking and evaluation of several machine learning techniques, including binary classification algorithms, logistic regression, neural networks and recurrent neural networks (RNN).
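The blocking step mentioned above can be illustrated with a minimal sketch. Blocking limits comparisons to records that share a cheap key, so a model only scores plausible candidate pairs instead of every possible pair. The article does not say which blocking keys the team used; the key below (city plus the first three letters of the name), the field names and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: city plus the first three letters of the name."""
    name = record["name"].strip().lower()
    city = record["city"].strip().lower()
    return (city, name[:3])


def candidate_pairs(records):
    """Group records by blocking key and return the id pairs found inside each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records in the same block are ever compared with each other.
        for left, right in combinations(sorted(ids), 2):
            pairs.add((left, right))
    return pairs


# Toy records; all values are invented for illustration.
records = [
    {"id": 1, "name": "Acme Holdings", "city": "Austin"},
    {"id": 2, "name": "ACME Logistics", "city": "austin"},
    {"id": 3, "name": "Acme Europe", "city": "Dublin"},
    {"id": 4, "name": "Borealis Ltd", "city": "Austin"},
]

pairs = candidate_pairs(records)
print(pairs)
```

With four records there are six possible pairs, but only the two Austin records starting with "acm" share a block, so a single candidate pair is passed on for scoring. On millions of rows, this kind of pruning is what keeps pairwise matching tractable.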

The team determined that an RNN used as a binary classifier achieved the best results, with 95 percent accuracy. Matching the hierarchies had previously taken years of application work and manual effort. The new RNN model found more matches than the existing process, with very good accuracy. In the final sprint, the team deployed, validated and scored additional hierarchies using the IBM Watson Machine Learning deployment service.
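To make the idea of an RNN binary classifier for matching more concrete, here is a small sketch of one way a candidate pair of company names could be turned into fixed-length integer sequences, which is the kind of input a sequence model consumes. The article does not describe the team’s actual encoding; the alphabet, the padding and unknown-character indices, the sequence length and the function names are all assumptions.

```python
PAD = 0       # index reserved for padding (assumed convention)
UNKNOWN = 1   # index reserved for characters outside the alphabet
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .&-"
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode(text, max_len=12):
    """Map a string to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in text.lower()[:max_len]]
    # Right-pad short strings so every sequence has the same length.
    return indices + [PAD] * (max_len - len(indices))


def encode_pair(left, right, max_len=12):
    """Encode both sides of a candidate pair for a two-input sequence model."""
    return encode(left, max_len), encode(right, max_len)


left_seq, right_seq = encode_pair("Acme Inc.", "ACME, Inc")
print(left_seq)
print(right_seq)
```

A model would then read both sequences and output a single probability that the two records belong to the same hierarchy, which is what makes the task a binary classification.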

In a few months, with the goal of scaling AI to impact all corporate hierarchies in BIS, the team had validated an approach to a new, innovative AI system for corporate hierarchy matching. One aspect of the project was to estimate the computational needs and system design for a full entity matching system. We estimated that to launch a full entity matching system with the current data and a 4-way ensemble of RNNs, Experian would initially have to train hundreds of models. This would require access to a great amount of GPU processing, and we would need to build several components that would have to interact.
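As a rough illustration of how a 4-way ensemble could produce one decision per candidate pair, the sketch below averages the match probabilities of four models and applies a threshold. The article does not say how the ensemble members were combined; simple averaging, the 0.5 threshold, the function names and the scores are assumptions.

```python
def ensemble_probability(member_probabilities):
    """Average the match probabilities produced by the ensemble members."""
    return sum(member_probabilities) / len(member_probabilities)


def is_match(member_probabilities, threshold=0.5):
    """Label a pair a match when the averaged probability reaches the threshold."""
    return ensemble_probability(member_probabilities) >= threshold


# Invented scores from four hypothetical models for two candidate pairs.
scores_by_pair = {
    ("parent_a", "child_1"): [0.75, 0.5, 1.0, 0.75],
    ("parent_a", "child_2"): [0.25, 0.0, 0.5, 0.25],
}

decisions = {pair: is_match(scores) for pair, scores in scores_by_pair.items()}
print(decisions)
```

Averaging smooths out the errors of individual models, at the price of training and serving four models wherever one would otherwise be used, which is part of why the compute estimate grows so quickly.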

We sketched a workflow for the entity matching system that we proposed to be run on IBM Cloud.


[Figure: sketch of the proposed entity matching workflow on IBM Cloud]



This was just the start. In a short time, the team developed 16 notebooks for data preparation, blocking, modeling and predictions. The language of choice was Python, with a heavy reliance on libraries including pandas, NumPy and Keras with a TensorFlow backend.

The work set the BIS team on a path to free itself from a manual process by using AI.

Carlo Appugliese is Machine Learning Program Director, IBM Analytics.
