Nearly five years later, Cathy O’Neil’s book has held up well in the fast-moving field of machine learning. Awareness of the risks and real problems with naive algorithm development is much greater than when the book was written, though many, if not most, of the same issues persist. Just as it is harder and more costly to build software that protects people’s data privacy, it is harder and more costly to build algorithms that are relatively unbiased and treat different groups equitably. I think the design and deployment of these algorithms is generally not malicious. Pressure to be fast and cheap is the root of many sins related to quality. In this case, a lack of training, education, and awareness of how to detect and reduce bias is also a big factor.
One point O’Neil made in the conclusion that really hit home was that the negative impacts of many of the algorithms she described tend to be worse for groups that already experience a great deal of inequity, and thus they reinforce those groups’ situations and make it even harder for them to escape. While not quite the same, this reminded me of a recent article I read on data cascades, which focused on how data quality issues can grow as the data flows into downstream use cases. A simple analogy is the old telephone game, in which a relatively short message is relayed one by one through a series of people. The errors introduced at each step may be small, but they compound as the number of people grows, often resulting in a very different story after a relatively small number of retellings.
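To put rough numbers on the telephone game, here is a minimal sketch. It assumes, purely for illustration, that each retelling independently preserves a fixed fraction of what it received; the function name `surviving_fidelity` and the 95% figure are my own hypothetical choices, not something taken from the book or the data cascades article.

```python
def surviving_fidelity(per_step_accuracy: float, steps: int) -> float:
    """Fraction of the original message expected to survive `steps` retellings.

    Assumption (illustrative only): every retelling independently keeps
    `per_step_accuracy` of whatever it was given, so the losses multiply.
    """
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 95%.
    for steps in (1, 5, 10, 20):
        print(steps, round(surviving_fidelity(0.95, steps), 3))
```

Under these assumptions, a message that is 95% intact after each retelling is only about 60% intact after ten retellings and about 36% intact after twenty. Real data pipelines are not this tidy, but the multiplicative shape is the point: small per-step errors add up quickly.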
One reason this is relevant is that data provenance is often not well understood, leading to bad assumptions about the appropriateness of the data used to train a model. Data sets are sometimes used without an understanding of how noisy the data is or the conditions under which it was collected. Inappropriate usage invariably leads to incorrect conclusions. But at a more conceptual level, it made me realize that as bad as the direct inequity of some of these algorithms may be, the indirect inequity, compounded by certain groups’ past experiences and carried forward into their likely future experiences, is even worse.