The Center for Data Innovation spoke with Priscilla Alexander, vice-president of engineering and co-founder of ArthurAI, a startup based in New York whose software helps companies monitor, audit, and maintain the performance of machine learning systems. Alexander discussed how ArthurAI enables companies to explain the decision-making of the models they use, such as for investment decisions or in healthcare, while maintaining accuracy.
Eline Chivot: Why did you co-found ArthurAI, and what does your company aim to achieve?
Priscilla Alexander: At Capital One, I led teams that built machine learning applications to solve specific problems for our card business teams. I was surprised by how many unique challenges there are in building working machine learning software, which at its core is a non-deterministic decisioning process. It was often an active, tedious, and manual process to make sure our data streams were always working and were accurate. There was a consistent need from our business partners to understand why our models were making certain decisions in order to trust the system. And we, of course, had to also actively monitor the machine learning decisions themselves to make sure the decisions were right and would not negatively impact our customers and our business, while staying compliant with banking regulations.
I also saw that these monitoring needs were consistent across the industry. I think there is a tendency to just “let the model figure it out” once it is in production—partly because some of these machine learning models are really complex and opaque, partly because building in good monitoring is just a hard and costly problem, and partly because data scientists and engineers don’t find monitoring a really engaging challenge. It’s a ton of work, but going without monitoring is also a huge risk: it can have very real financial consequences for your business when something goes wrong and you don’t address it in real time. At the time, there weren’t good third-party solutions that you could just plug in, so teams would build custom solutions that solved part of these problems, if they built anything to automate monitoring at all. I kept thinking: Why isn’t there a tool that just does this for me, so my team can focus on our business-critical challenges?
I am helping to build Arthur because it fills a gap in the market which I experienced first-hand, and because I am driven by the idea that it could enable other teams like mine to move more quickly! Building the machine learning model is the fun stuff, so let us take care of the operational production monitoring for you.
Chivot: It is challenging to operationalize concepts currently discussed as AI principles, such as fairness and explainability, across the various applications of machine learning. How does ArthurAI achieve this in practice? What are some of the misconceptions you could debunk when it comes to what AI can or cannot do?
Alexander: All models follow the same pattern: Take input, run some calculations, then spit out an output. The inputs are most often just a set of tabular data, an image, or a piece of text. Similarly, the outputs are often just a single number for a regression model or one or more labels for a classifier. There is a ton of variability in the model itself, but 90 percent of our metrics—for performance, fairness, and some explainability—don’t require that we understand the guts of the model. Because of that, we have built a very simple interface that allows us to instrument a model in our customer’s model-serving architecture with just a few lines of code, in a matter of minutes.
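To make that input-to-output pattern concrete, here is a minimal sketch assuming a scikit-learn tabular classifier; the wrapper that records each inference is hypothetical and simply stands in for whatever monitoring service is attached, not Arthur's own interface.

```python
# Illustrative sketch of the input -> prediction -> monitoring pattern.
# This is NOT Arthur's actual SDK; the logging call stands in for
# whatever service ingests each inference in production.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# A toy tabular model: a few features in, a single label out.
X = pd.DataFrame({"income": [40_000, 85_000, 23_000, 61_000],
                  "age":    [34, 52, 27, 45]})
y = [0, 1, 0, 1]
model = RandomForestClassifier(random_state=0).fit(X, y)

def predict_and_log(features: pd.DataFrame):
    """Wrap the model call so every inference is also recorded for monitoring."""
    preds = model.predict(features)
    for row, pred in zip(features.to_dict("records"), preds):
        # In production this record would be sent to a monitoring service;
        # here we simply print it.
        print({"inputs": row, "prediction": int(pred)})
    return preds

predict_and_log(X)
```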
Now, it was certainly a challenge for us to make this process look so simple to our customers. There was a lot of experimentation and tinkering in environments that replicate our customers’ environments. We’re probably on our third iteration of our software development kit and we are still finding new ways to tweak it to make it even easier to use. Recently, we added more detailed documentation and clearer error messages, but it’s definitely an ongoing process. We, of course, also have to keep up with any new trends in machine learning development and tooling.
One misconception is that AI can remove bias from decision making simply by automating a decision that was previously made by a human. But AI gets its intelligence from data, so if that raw data inherently contains unfairness, then it will just automate the unfairness. This has been a factor in well-known cases, such as the COMPAS recidivism algorithm.
The good news is that if you use AI and you monitor for bias, you can potentially automate the mitigation of some of it. It does require some effort but doesn’t have to be an overwhelming endeavor. Our platform can detect unfairness in your model, answering questions like “Was the outcome different for subpopulation A than for subpopulation B?” and “What features in the model drove the differences?” Once you find evidence of unfairness, there are different techniques to mitigate it, some of which can be automated, and we support those in our platform. The trick is making sure humans are involved in deciding which technique to use, as each of them has different tradeoffs—sometimes you may sacrifice some model performance for more fairness, or you may impact subpopulation B a bit more negatively by trying to make the model more fair for subpopulation A. Once you implement that mitigation, you will still need to monitor continually to ensure that the balance remains in place. Our chief scientist John Dickerson wrote a detailed blog post about this process.
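As a rough illustration of the kind of question described above, here is a minimal sketch of a disparate-impact check between two subpopulations; the data and column names are hypothetical, and this is not Arthur's implementation.

```python
# Hedged sketch of a basic fairness check: compare positive-outcome rates
# for subpopulation A and subpopulation B (a disparate-impact ratio).
# The data and column names are made up for illustration.
import pandas as pd

decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   0],
})

rate_a = decisions.loc[decisions.group == "A", "approved"].mean()
rate_b = decisions.loc[decisions.group == "B", "approved"].mean()

print(f"Approval rate A: {rate_a:.2f}, B: {rate_b:.2f}")
print(f"Disparate impact ratio (B/A): {rate_b / rate_a:.2f}")
# A ratio well below 1.0 (a common rule of thumb flags < 0.8) is a signal
# that a human should investigate and decide how to mitigate.
```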
Chivot: What are the industries you work with and the challenges you help them address? Why would companies find it valuable to work with organizations like ArthurAI rather than having their own AI teams build their own tools?
Alexander: More and more industries are adopting machine learning, but organizations that manage risk, particularly in healthcare, finance, and insurance, have had statistical models at the core of their business for decades.
Healthcare companies have more recently focused on biased decision-making. They have a unique challenge: protected attributes, such as sex and gender, are important indicators of disease risk, which means they should be features in their models. It’s a delicate balance, keeping protected attributes in your model while also trying to make it as fair as possible. The Arthur platform finds disparate impact by computing the difference in outcomes across subpopulations, and we are able to scale this comparison to hundreds or thousands of subpopulations—something that would be impossible to do manually.
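One way to picture scaling that comparison is sketched below with pandas and hypothetical column names: group model outcomes by every combination of attributes, then flag subpopulations whose rates diverge from the overall rate. This is an illustration of the idea, not Arthur's internal method.

```python
# Illustrative sketch only: computing outcome rates across many
# subpopulations at once. Column names and data are hypothetical.
import pandas as pd

outcomes = pd.DataFrame({
    "sex":        ["F", "F", "M", "M", "F", "M", "F", "M"],
    "age_band":   ["18-30", "31-50", "18-30", "51+", "51+", "31-50", "18-30", "51+"],
    "prediction": [1, 0, 1, 1, 0, 1, 1, 0],
})

# One row per (sex, age_band) subpopulation: positive rate and group size.
by_group = (outcomes
            .groupby(["sex", "age_band"])["prediction"]
            .agg(positive_rate="mean", n="count")
            .reset_index())

# Flag groups whose positive rate diverges notably from the overall rate.
overall = outcomes["prediction"].mean()
by_group["gap_vs_overall"] = by_group["positive_rate"] - overall
print(by_group.sort_values("gap_vs_overall"))
```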
In financial services, explainability is critical from a regulatory perspective. Many banks would love to use models like deep learning to get increased accuracy, but they often can’t explain the decisions, and that explanation is required by regulation. The Arthur platform operationalizes explainability for every inference and across all model decisions, such as loan recommendations or investment decisions. We are able to tell how much each feature impacted a prediction.
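For a sense of what per-inference feature attribution can look like, here is a hedged example using SHAP, one widely used explainability technique (not necessarily what the Arthur platform uses internally); the model, features, and data are hypothetical.

```python
# Illustration of per-inference feature attribution with SHAP.
# This shows the general idea only; it is not Arthur's implementation.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical loan-scoring data.
X = pd.DataFrame({
    "credit_score": [620, 710, 580, 750],
    "income":       [38_000, 92_000, 27_000, 120_000],
    "loan_amount":  [12_000, 30_000, 8_000, 45_000],
})
y = [0.2, 0.8, 0.1, 0.9]  # toy approval propensities

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# TreeExplainer yields one attribution per feature for every single inference.
explainer = shap.TreeExplainer(model)
attributions = explainer.shap_values(X)

for i, row in enumerate(attributions):
    print(f"Inference {i}:", dict(zip(X.columns, row.round(3))))
```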
Arthur is built with the enterprise in mind because a few of our co-founders, like myself, came from an enterprise environment. Large companies typically have multiple data science teams across the enterprise, each using a different set of tools, but model governance and risk are often managed centrally. For that reason, Arthur is platform-agnostic and we can integrate with any model-serving platform, so you get the best of both worlds. That also future-proofs monitoring for any new model-serving technology that comes along next.
While we have a software-as-a-service (SaaS) version of the product, we are also able to deploy the same architecture anywhere—whether you’re running on AWS, on GCP, or in a data center that has no access to the Internet. Arthur can run anywhere your models run, so you are in full control of your data.
Chivot: Can you give an example to illustrate how an organization you work with concretely benefits from your platform, in a sector where one might not expect it?
Alexander: We recently worked with Harvard’s Dumbarton Oaks Institute, a center for research in Byzantine Studies. They have a collection of thousands of photographs capturing architectural ruins in Syria, and identifying those ruins took on a new urgency with the civil war there, since many of them may have been destroyed. Through the Arthur platform, we provided explainability for a computer vision model that labeled each photograph, showing which pixels influenced the software’s decision to apply particular labels. This allowed them to quickly understand where the training examples weren’t sufficient, greatly improving model accuracy and ultimately accelerating the process of categorizing artifacts from our global heritage. It was so exciting to see our platform have a very real impact in an area where machine learning is not commonly applied.
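As a rough sketch of what pixel-level attribution can mean in practice, one simple approach is occlusion sensitivity: slide a patch over the image and measure how much the model's confidence drops. This is an illustrative technique only, not necessarily the method Arthur applied for Dumbarton Oaks, and the "model" below is a toy stand-in.

```python
# Hedged illustration of occlusion-based pixel attribution for an image
# model. The scoring function below is a toy stand-in, not a real classifier.
import numpy as np

def occlusion_map(image: np.ndarray, score_fn, patch: int = 8) -> np.ndarray:
    """Slide a neutral patch over the image and record how much the model's
    score drops at each position; big drops mean influential pixels."""
    base = score_fn(image)
    heat = np.zeros(image.shape[:2])
    for r in range(0, image.shape[0], patch):
        for c in range(0, image.shape[1], patch):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = image.mean()
            heat[r:r + patch, c:c + patch] = base - score_fn(occluded)
    return heat

# Toy stand-in for a model: the score grows with the brightness of the center.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
center_score = lambda im: float(im[12:20, 12:20].mean())
print(occlusion_map(img, center_score).round(2))
```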
Chivot: What may be holding back the adoption of AI by companies, and where do you see AI monitoring technology expanding to other areas in the future?
Alexander: I’d say that there are four big factors holding companies back from adopting AI: availability of data; scarce and expensive talent, particularly data scientists and data engineers; business leaders who haven’t been educated about where machine learning can drive value; and worry about regulatory compliance. These are all valid reasons to hesitate, but I firmly believe that companies that do adopt AI will have a strategic advantage, and experimentation with it is critical to innovation. Monitoring automates a huge part of the machine learning lifecycle, which we believe makes adoption easier, ensures that data scientists can focus on a firm’s differentiating capabilities, and helps allay fears around regulatory risk.
The public is starting to become more skeptical of algorithms making decisions without any oversight, which is why we are starting to see more policies like the GDPR enacted. Companies are also beginning to recognize the reputational risk of not finding issues with their models before the public does.
These events are driving the need for AI transparency and monitoring, but the techniques that enable observability are still very much in their infancy. We’re just beginning to see the research community dedicate more resources to transparency and explainability, as the focus has been predominantly on making models more accurate. The explainability and bias mitigation techniques in the Arthur platform are bleeding edge, but there is a lot more research ahead of us in these fields and we’re excited to work with our research fellows to make a contribution. Explainability in particular has to keep up with the latest advances in fields like deep learning: model algorithms are growing more complex, and explainability techniques will need to keep pace.
At this moment, Arthur focuses solely on AI observability, so it’s incredibly exciting to be on the cutting edge of operationalizing the latest research, actively participating in that research community, and helping companies use this technology so that they can get all the benefits of the latest model algorithms while managing the risks.