Getting to grips with data version control
Willem Conradie, CTO of PBT Group
Just like data quality, data version control is essential for the effective application of artificial intelligence (AI) and data science in organisations.
Wikipedia defines data version control as: “a method of working with data sets. It is similar to the version control systems used in traditional software development, but is optimised to allow better processing of data and collaboration in the context of data analytics, research, and any other form of data analysis. Data version control may also include specific features and configurations designed to facilitate work with large data sets and data lakes.”
For AI and data science projects in organisations to be productive and properly governed, capabilities such as tracking changes to datasets, reproducibility, collaboration, integration with version control systems, and storage and retrieval of datasets are essential.
Many examples can be found on the Internet of AI being applied in real-world situations where things went horribly wrong. One such example, according to an article on CIO.com, is iTutorGroup's use of AI-enabled software to approve or reject job applicants. The software automatically, but erroneously, rejected female applicants aged 55 and older and male applicants aged 60 and older. In total, over 200 qualified applicants were incorrectly rejected. The US Equal Employment Opportunity Commission (EEOC) sued iTutorGroup for age discrimination, and iTutorGroup subsequently paid $365 000 to settle the suit.
There are many reasons why situations like these occur.
AutoML tools, like DataRobot and Dataiku, are widely used by so-called citizen data scientists. A citizen data scientist is a person who does not have the formal background or training of a data scientist, but who does some of the work a data scientist does. AutoML tools make the whole data science lifecycle extremely easy, so getting an AI or machine learning (ML) model into a production application is fairly straightforward. However, the practical implications of what might go wrong once in production are not always well understood. This is where challenges tend to occur, and the wheels come off.
These AutoML tools don’t necessarily store the datasets that were used for testing. When things do go wrong and the AI acts erroneously, root cause analysis becomes very difficult and time-consuming for the citizen data scientists: to reproduce model results, the datasets need to be manually recreated, and even then they must hope the ML model results are similar. Only after all of this manual prework is concluded can they start delving into the actual erroneous actions the AI performed, to try to understand what went wrong at a detailed level, and then they still have to try to fix it. A lot of time is wasted in such an onerous manual process.
On the other hand, data scientists face many challenges during the data science lifecycle. As part of training an ML model, data scientists make changes to input datasets to try to improve model performance. If these changes are not tracked, a new dataset may cause model performance to deteriorate, having overwritten the previous dataset that produced better results. The data scientist then has to spend a lot of time preparing a dataset similar to the overwritten one in an attempt to get back to a better-performing ML model.
Given these scenarios, had the citizen data scientists and data scientists used data version control to version their datasets, switching between dataset versions and reproducing the better-performing ML model would have taken minimal effort. A lot of time and energy would have been saved compared to doing it manually.
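To make this concrete, the sketch below illustrates the core idea behind data version control: immutable, content-addressed snapshots of a dataset that can be restored on demand. It is a minimal illustration only; the file names and snapshot directory are assumptions, and in practice a purpose-built tool such as DVC, lakeFS or Git LFS would be used rather than hand-rolled code.

```python
import hashlib
import shutil
from pathlib import Path

SNAPSHOT_DIR = Path("dataset_versions")  # assumed local snapshot store


def snapshot(dataset_path: str) -> str:
    """Store an immutable, content-addressed copy of a dataset file."""
    data = Path(dataset_path).read_bytes()
    version_id = hashlib.sha256(data).hexdigest()[:12]  # content hash as version ID
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    shutil.copy(dataset_path, SNAPSHOT_DIR / f"{version_id}.csv")
    return version_id


def restore(version_id: str, dataset_path: str) -> None:
    """Roll the working dataset back to a previously snapshotted version."""
    shutil.copy(SNAPSHOT_DIR / f"{version_id}.csv", dataset_path)


# v1 = snapshot("training_data.csv")  # record the version before each experiment
# ... modify training_data.csv and retrain ...
# restore(v1, "training_data.csv")    # performance dropped? Roll back instantly.
```

Because each version is identified by the hash of its contents, a data scientist can log the version ID alongside a model's metrics and later retrieve exactly the dataset that produced a given result, instead of reconstructing it by hand.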
Another challenge faced during the data science lifecycle is making results reproducible. At a high level, reproducibility in data science concerns code versioning, data versioning and, where randomness is involved, random seeds.
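Of these three, random seeds are the easiest to overlook. The snippet below is a minimal sketch of pinning them so that repeated runs of the same code on the same data produce identical results; the seed value and function name are illustrative.

```python
import random

import numpy as np


def set_global_seeds(seed: int) -> None:
    """Pin random seeds so repeated runs produce identical results."""
    random.seed(seed)     # Python's built-in random module
    np.random.seed(seed)  # NumPy, which most data science libraries build on
    # Frameworks with their own generators (e.g. scikit-learn's random_state
    # parameters, or torch.manual_seed) must be seeded separately.


set_global_seeds(42)
print(np.random.rand(3))  # prints the same three numbers on every run
```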
To put this into context, returning to the iTutorGroup example: root cause analysis of why the AI wrongly rejected qualified candidates would be far easier with data version control implemented as part of the data science lifecycle. Without it, figuring out why the AI made erroneous decisions becomes a tedious and time-consuming task, as the datasets and ML model(s) the AI is based on must be manually reconstructed to understand what went wrong.
So, how can these challenges be addressed practically, in a proactive rather than reactive manner?
It can be immensely valuable to educate the various stakeholders impacted by AI and data science initiatives within an organisation. Different types and levels of training can be provided to the different types of stakeholders. Training should focus on the AI and data science lifecycle, best practices within it, and their practical application.
It is vital to highlight the importance of version control in general, and of data version control in particular, and its value to an organisation. Not all organisations are the same, nor do they have the same constraints imposed on them. Accordingly, different data version control strategies exist, and organisations can adopt the one that best suits their unique circumstances.
Furthermore, it is just as important to incorporate AI and data science best practice into business processes, deployment procedures, and support and maintenance procedures. Organisations can also consider AI and data science platforms, coupled with code version control and data version control software, to automate the above-mentioned processes and make the end-to-end lifecycle more efficient.