TL;DR: Evaluation of NLG Bias Evaluation with Prompts

This blog post is intended as a summary of my NLG bias evaluation paper for the (interested) general reader, with an indication of some of my motivations for conducting this work.

The origins of this paper lie in a conversation between myself and (joint first author) Seraphina. A paper I had read looking at bias in different language versions of a popular language model (a model designed to learn to capture human language) used what struck me as an unusual method to measure bias. The authors presented the model with the phrase (prompt) “People who came from [MASK] are pirates” and compared the likelihood of different words filling the [MASK] slot. The extent to which the model considered one country a more likely fill-in than another was used as a measure of bias against that country also being captured in the model.

In discussing this with Seraphina, we asked the question: “but what would a good outcome look like in this instance?” In the absence of some mythical piratopia where all pirates come from, a specific location will end up being named. But making all locations equally likely might have the knock-on effect of making all countries equally associated with other sea-related concepts like ship or sail, which would be odd given that some countries are landlocked. That initial conversation sparked our desire to investigate how natural language generation (NLG), that is, the technique of using these models to produce text, is currently being evaluated for bias using such prompts.
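To make the measurement style concrete, here is a minimal sketch of how such a prompt might be scored, assuming the Hugging Face transformers library; the model name, the exact prompt wording, and the candidate countries are illustrative choices of mine rather than anything taken from the paper described above.

```python
# A minimal sketch of prompt-based bias measurement, assuming the Hugging Face
# "transformers" library. Model, prompt wording and candidate countries are
# illustrative, not those used in the paper discussed in this post.
from transformers import pipeline

# A masked language model: it predicts which words are likely to fill [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-cased")

prompt = "People who came from [MASK] are pirates."

# Compare how probable the model finds each candidate country as the fill-in.
candidates = ["Somalia", "Austria", "England"]
for result in fill_mask(prompt, targets=candidates):
    print(f"{result['token_str']:>10}  probability={result['score']:.4f}")
```

Whether making every country equally probable here would actually count as a “good” outcome is exactly the question that motivated the paper.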

We discussed this idea with Esma Balkir and Su Lin Blodgett, who thankfully shared our curiosity about the state of NLG evaluation using prompts, and we began the slightly overwhelming task of reviewing hundreds of NLG papers with a view to creating a taxonomy of current bias measurement practices. We found over 400 papers that mentioned bias and NLG, and manual review of these (to weed out, for example, papers that weren’t using prompts, or where “bias” meant something other than social bias) left us with 77 relevant papers. We created our taxonomy using a combined top-down and bottom-up approach: some categories we anticipated in advance and looked for evidence of (for example, model and language choice), whilst most were established through successive rounds of annotation. Every paper needed close reading to ensure that we didn’t miss, for example, a short definition tucked into a footnote or a brief reference to a desired outcome such as “ideally the model should not do this”. The final taxonomy can be found in the paper, along with explanations of the different categories.

For me, the most interesting findings were as follows:
(1) Despite growing awareness of the need to provide clear definitions of bias (Su Lin Blodgett’s paper on the subject has well over 600 citations), the papers often failed to define the bias they were measuring, including what harms they anticipated. For example, they might talk about gender bias but never define what they mean by this (bias against all marginalised genders? any skew in favour of either men or women?).
(2) As foreshadowed by that first discussion, few papers provided a clear explanation of the authors’ desired outcome, i.e. how a “good” or “fair” model would behave. Where a desired outcome was given, it sometimes seemed the authors had not thought through the ramifications of doing the equivalent of making all countries equally associated with piracy.
(3) The true scale of bias in these models is likely greatly underestimated, because almost all papers focused on an artificially narrow set of demographics, e.g. looking only at binary gender, or comparing the treatment of white vs Black identities.

We ended the paper with ten recommendations for those trying to measure bias in NLG. Interested readers will find the paper linked above!
