TL;DR: Evaluation of NLG Bias Evaluation with Prompts
This blog post is intended as a summary of my NLG bias evaluation paper for the (interested) general reader, with an indication of some of my motivations for conducting this work.
The origins of this paper lie in a conversation between me and Seraphina (joint first author). A paper I had read looking at bias in different language versions of a popular language model (a model trained to capture patterns of human language) used what struck me as an unusual method to measure bias. The authors presented the model with the phrase (prompt) “People who came from …”.
We discussed this idea with Esma Balkir and Su Lin Blodgett, who thankfully shared our curiosity about the state of NLG evaluation using prompts, and we began the slightly overwhelming task of reviewing hundreds of NLG papers with a view to creating a taxonomy of current bias measurement practices. We found over 400 papers that mentioned bias and NLG, and manual evaluation of these (to weed out, for example, papers that weren’t using prompts or where “bias” meant something other than social bias) left us with 77 relevant papers.

We created our taxonomy using a combined top-down and bottom-up approach. That is to say, some categories we anticipated in advance and looked for evidence of (for example, model and language choice), whilst most categories we established through successive rounds of annotation. Every paper needed close reading to ensure that we didn’t miss, for example, a short definition tucked away in a footnote or a brief reference to a desired outcome like “ideally the model should not do this”. The final taxonomy can be found in the paper, along with explanations of the different categories.
For me, the most interesting findings were as follows:
(1) Despite increased awareness of the need to provide clear definitions of bias (Su Lin Blodgett’s paper on the subject has well over 600 citations), the papers often failed to define the bias they were measuring, including what harms they anticipated. For example, they might talk about gender bias but never define what they mean by this (bias against all marginalised genders? any skew in favour of either men or women?).
(2) As foreshadowed by that first discussion, few papers provided a clear explanation of what the authors’ desired outcome was, i.e. what a “good” or “fair” model would look like. Where a desired outcome was given, it sometimes seemed the authors had not thought through the ramifications of doing the equivalent of making all countries equally associated with piracy (see the sketch after this list).
(3) The true scale of bias in these models is likely greatly underestimated, because almost all papers focused on an artificially narrow set of demographics, e.g. looking only at binary gender or comparing the treatment of white vs. Black identities.
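To spell out the point in (2), here is a deliberately hypothetical sketch (the numbers are made up and do not come from our paper or any paper we surveyed) contrasting two outcomes a reader might infer from a prompt-based measurement: parity, where every group is equally associated with a negative attribute, versus simply wanting that association to be low for everyone.

```python
# Hypothetical probabilities a model might assign to a negative attribute
# (say, "... are pirates") for each group term. Made-up numbers for illustration.
scores = {"Group A": 0.40, "Group B": 0.41, "Group C": 0.39}

# Target 1: parity. All groups are (roughly) equally associated with the
# attribute. Satisfied here, even though every group is strongly linked to it.
spread = max(scores.values()) - min(scores.values())
parity_satisfied = spread < 0.05

# Target 2: the association should be low for every group. Not satisfied here.
low_for_all = all(score < 0.05 for score in scores.values())

print(f"parity satisfied: {parity_satisfied}; low for all groups: {low_for_all}")
```

A paper that never states which of these (or some other outcome) it is aiming for leaves the reader unable to judge whether a reported improvement is actually an improvement.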
We ended the paper with ten recommendations for those trying to measure bias in NLG. Interested readers will find the paper linked above!