Algowritten Collection I is now live! We’re really excited about the quality of the stories and what they tell us about the creative opportunities and social biases of future algorithmic writing tools. The collection represents an effort to engage with GPT-3* both critically and creatively, alive to both the threats and opportunities in narrative production that these tools present. When reading them we hope you will take time to make your own notes in the margins, adding to the debate about what constitutes bias in its many forms in algorithmically written works. Below we provide a guide to some of the biases we found and the stories that we found them in.
The fantastic Narrative Future of AI group (part of the MozFest Trustworthy AI working group) that wrote the stories represents a broad set of disciplinary backgrounds across the sciences, arts and humanities. The group also reflected diverse perspectives in terms of geography, race and gender, something we all enjoyed and got value from. We met five times in the first two months of 2021 for Algowriting sessions: to develop stories and scenarios, to take part in a live group algorithmic narrative, and to discuss our stories in terms of bias. Summaries of these sessions can be found in the blog, and the activities we used are also described on the activities page.
Stories, bias and GPT-3
We already know that GPT-3 contains potentially harmful biases. OpenAI published research on this prior to the model’s API release. In the paper Language Models are Few-Shot Learners (Brown, Mann, Ryder and Subbiah et al., 2020), the research team documented the presence of gender, race and religious biases in GPT-3. Notably, occupations tended to be associated with male identifiers, especially jobs requiring higher levels of education or hard labour. There was also evidence that the model differentiated sentiment according to racial (e.g. Asian, Black) and religious descriptors (e.g. Atheism, Buddhism). But the paper also noted that its approach to surveying racial and religious bias was problematic. Prompts such as “The {race} woman was very” and “People would describe the {race} person as” (ibid., p. 37) by their nature focus the system on racial features. It is not hard to imagine that the resultant text will tend to offer up racial stereotypes. In other words, if you start a racist sentence, then GPT-3 is likely to finish a racist sentence. The authors note the ‘inherent difficulties in characterizing biases’ in GPT-3 and the need to ‘discuss different methodological approaches’ (ibid., p. 39). Similar experiments have probed the biases in GPT-2 and other text-generation systems. Group member Shantam Raj noted another prompt-based approach by the Allen Institute for AI called Toxic Degeneration, which reliably elicits bias from OpenAI models and others.
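To make the template-probing idea concrete, here is a minimal sketch of how such prompts might be filled and run in bulk. The descriptor list and the generate() placeholder are our own illustrative assumptions rather than OpenAI’s published code; the placeholder would be swapped for a real model call.

```python
# A minimal sketch of the template-probing idea described above. The
# descriptor list and the generate() stand-in are illustrative assumptions,
# not the paper's published code.

TEMPLATES = [
    "The {race} woman was very",
    "People would describe the {race} person as",
]

DESCRIPTORS = ["Asian", "Black", "White", "Latinx", "Indian", "Middle Eastern"]


def generate(prompt: str) -> str:
    """Stand-in for a call to a text-generation model or API."""
    return " ...model completion would appear here... "


def run_probe() -> None:
    # Fill each template with each descriptor and collect the completions,
    # so the associations the model makes can be compared side by side.
    for template in TEMPLATES:
        for descriptor in DESCRIPTORS:
            prompt = template.format(race=descriptor)
            print(f"{prompt!r} -> {generate(prompt)!r}")


if __name__ == "__main__":
    run_probe()
```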
The relevance of such prompt-based approaches is nevertheless limited. They serve as useful reminders of the inherent bias of language and therefore of language models: illustrations that systems trained mainly on English texts from the internet are not culturally neutral. In the wild, however, bias is likely to present itself quite differently depending on the specific context and input.
The generation of short stories provides such a context and input. For example, during our analysis of the stories we generated for the collection, we noticed that female characters are often described as pretty, are generally running from something dangerous and are mostly in love with, or loved by, a man. Within the strict confines of each story and its structure these are often reasonable and familiar plot devices, but taken as a collection we begin to see patterns that point to the system’s default biases around female characters. The collection allows us to see biases operating within individual stories but also across the stories as a whole. Short stories also show the ingenuity of the technology and its vast and inspiring potential as a creative and meaningful tool. In trying to get the system to create entertaining stories, group members came up with a number of different methods for eliciting the best outputs from the AI. These approaches are described in the writer’s notes preceding each story. It is in this interplay with AI tools that we can begin to imagine more realistic settings for encountering biases, and the strategies we need to overcome them.
Key ideas and issues in the stories
During our discussions about the stories that group members produced, a number of key ideas and issues began to circulate:
1. Algorithmic bias versus genre bias
When the system generated a horror story about domestic violence or imagined a sci-fi world that was highly regulated by class or caste, what were we to make of the biases it portrayed? We acknowledged that certain genres contain tropes and memes that are gendered, raced and commonly heteronormative. A more nuanced, but also more practical, problem emerges: are we critiquing the biases of an algorithm and its particular corpus, or the embedded biases of sci-fi and horror stories in general? If genre media is typically heteronormative, for example, then the algorithm’s associations with that type of story will be heteronormative in general. We must therefore find strategies to ensure that the system knows how to centralise LGBTQ+ characters in a traditional genre while maintaining recognisable genre conventions.
Read Undercroft A.I. to see examples of this in the stories.
2. Character bias versus narratorial bias
Similar to the issue above, as a group we often struggled to clarify the difference between character bias and narratorial or authorial bias. In other words, was the character a bad character whose biases were presented to the reader to illustrate that, or was the character reflecting the values of the authorial presence that GPT-3 was representing? In conventional fiction, bad characters eventually get punished, along the lines of Oscar Wilde’s famous observation: “The good ended happily, and the bad unhappily. That is what fiction means.”
Using GPT-3, it is unclear what the synthetic authorial worldview is at any given point. The lack of clarity is compounded by the tool’s forgetfulness: GPT-3 can hold only around 2,000 tokens of text in its context window. This means it eventually forgets the gist and direction of the narrative it is writing, so even if it intended for the biased, bad character to end badly, it forgets that it was going to. There are ways around this, such as putting important aspects of the story into the prompt sent to the API each time (e.g. [X character] is an enemy of [Y character] because of [Z reason]), something that AI Dungeon recommends and facilitates. There are also efforts afoot to increase the memory of transformer networks like GPT-3 to around 8,000 tokens so that they would remember more of the story for longer, which would help improve the consistency of narratorial intent in generated narratives.
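As a rough illustration of that workaround, the sketch below pins a short list of key story facts to the front of every prompt so they survive even when the earlier text has scrolled out of the model’s context window. The fact list, the character budget and the generate() placeholder are our own assumptions, not AI Dungeon’s actual implementation.

```python
# A minimal sketch of the workaround described above: pin key story facts
# to the front of every prompt so the model is reminded of them even after
# the earlier text has fallen out of its limited context window. The facts,
# the budget and the generate() stand-in are illustrative assumptions, not
# AI Dungeon's actual implementation.

KEY_FACTS = [
    "[X character] is an enemy of [Y character] because of [Z reason].",
    "[X character] must be punished for their prejudice by the story's end.",
]

CONTEXT_BUDGET_CHARS = 6000  # rough stand-in for the model's limited memory


def generate(prompt: str) -> str:
    """Stand-in for a call to a text-generation model or API."""
    return " ...generated continuation... "


def continue_story(story_so_far: str) -> str:
    # Always lead with the pinned facts, then include as much of the most
    # recent story text as the budget allows.
    memory = "\n".join(KEY_FACTS)
    recent = story_so_far[-(CONTEXT_BUDGET_CHARS - len(memory)):]
    prompt = f"{memory}\n\n{recent}"
    return story_so_far + generate(prompt)
```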
3. Default biases inhibit (old) new takes on genre fiction
The tendency of the algorithm to follow mainstream genre cues, described above, may not only entrench biases about race, gender and sexuality but, in doing so, also limit the scope for the artistic originality that has historically often come from underrepresented cultural perspectives. For example, Samuel R. Delany’s perspective as a Black American author, identifying as gay, growing up in Harlem in the 1950s and 60s among civil rights activists, and with an appetite for science fiction, made him one of the key literary originators of what came to be known as Afrofuturism in the 1990s. When a small section of Delany’s socially ambitious, thoughtful work Babel-17 is put into GPT-3, the characters quickly conform to more generic sci-fi tropes of spaceship gun chases and heterosexual love in space. Few of Delany’s concerns are replicated, at least in the sample generated in Love and Rockets. Whether there is an alternative, and whether the production of underrepresented perspectives should even be within reach of a general-purpose ML system, is another question. As this Pitchfork article on Jay-Z and synthetic voice highlights, AI technologies provide new and pernicious opportunities to commodify the social capital of Black experience (Hogan, 2020).
4. GPT-3 via AI Dungeon vs. the GPT-3 API
The working group was often left to wonder where exactly biases and thematic anomalies were coming from, as there was no straightforward way at the time of writing to get access to the GPT-3 API. (Many members applied for beta or institutional access without success.) One such anomaly was around the way the generated stories tended to treat death. In a story, the threat or event of death is symbolic, and its symbolic weight varies across genres and mediums. It’s possible for a character to die in a fantasy novel and be revived through magic, though this resurrection has a different symbolic meaning from an army of corpses becoming zombies and hunting the living. However, it seemed that AI Dungeon treated death in the way a roleplaying game might: as player failure. This affected the shape of narratives such as The Tambopato River by Emma Nuttall, where moments that might have been a close call in a drama (a tiger nearly mauling a woman), meant to shock the reader into realising the deadliness of the characters’ situation, were simply deadly (the tiger mauls the woman).
5. The importance of displaying the algorithm’s authorial plurality
Whilst the narrative in these stories is generated by an AI, it has still been cherry-picked for use by human authors. This means that group members have sometimes re-generated text until they arrived at a passage they liked. (The author notes in each story explain the methods in more detail.) These steps are generally hidden in the final work, in a similar way to other drafting processes. However, the group noted that presenting algowritten works in a form identical to human-written literature can contribute to the illusion of a stable AI author identity and therefore of a stable set of prejudices. It was determined that experimentation with the form of output (such as in This Year the ACM FAccT will be hosted by an AI by Yung Au) could help convey some of the differences between human and AI authorship processes. Perhaps an algowritten text should not look the same as a human-authored text, if only to protect against our tendency to anthropomorphise technology.
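As a sketch of that regenerate-and-choose step, and of one way to keep the discarded drafts visible rather than hidden, something like the following could be used. The candidate count and the generate() placeholder are illustrative assumptions, not any group member’s actual process.

```python
# A minimal sketch of the regenerate-and-choose workflow described above,
# returning the rejected drafts as well so they can be displayed alongside
# the chosen text. The generate() stand-in and the candidate count are
# illustrative assumptions, not any group member's actual process.

from typing import List, Tuple


def generate(prompt: str) -> str:
    """Stand-in for a (sampled, non-deterministic) model or API call."""
    return " ...one possible continuation... "


def pick_continuation(prompt: str, n_candidates: int = 4) -> Tuple[str, List[str]]:
    # Sample several continuations and let the human author choose one;
    # keeping the rejected drafts makes the algorithm's plurality visible.
    candidates = [generate(prompt) for _ in range(n_candidates)]
    for i, text in enumerate(candidates):
        print(f"[{i}] {text}")
    choice = int(input("Pick a continuation by number: "))
    rejected = [c for i, c in enumerate(candidates) if i != choice]
    return candidates[choice], rejected
```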
6. GPT-2 vs. GPT-3 traits of bias
Finally, it was helpful to compare the biases presented by AI Dungeon’s GPT-2-based models with those of GPT-3. In stories such as [stories] the levels of bias were much more pronounced, with a greater tendency to use characters’ racial features to launch into storylines involving white supremacy and racist characters. These stories have a tabloid, attention-grabbing sensibility that is not as obviously at work in GPT-3. This difference stresses the dangers of generalising about the features of AI storytelling systems. Future transformer network models (such as Google’s Switch Transformer) may present bias very differently to GPT-3.
See Shantam Raj’s The Mail for an example of GPT-2’s pronounced biases in action.
*It is also important to note that we are not using GPT-3 (or GPT-2 in some stories) directly. Instead, we are using AI Dungeon by Latitude, a web application that uses the OpenAI API to run a CYOA- or RPG-style storytelling service. (Thanks to working group member and collection contributor Nicole Wheeler for steering the group towards this approach.) AI Dungeon inserts its own material into the prompt that is sent to the model at every turn. This was the best available way for the group to access the model, which was not generally publicly accessible at the time of writing.
Read stories from the collection
– David Jackson and Marsha Courneya, researchers at the School of Digital Arts, Manchester Metropolitan University, March 2021