Utilizing archetypes to embed human ethics in language models

The ethical (AI) alignment research agenda I'm pursuing

The pursuit of ethical alignment in generative artificial intelligence stands as a formidable challenge within the field today. 

As AI systems, especially language models, become increasingly woven into the fabric of our daily experiences, their decisions and outputs carry substantial ethical weight. This blog post delves into a research project I have embarked on over the past year, exploring the potential of mimicking evolutionary psychology within language models to address this issue.

Patterns have been a constant presence since the inception of early life forms. 

I hypothesize that even the earliest bacteria possessed mechanisms to navigate their environments purely for survival. Delving into evolutionary concepts raises questions about how natural selection facilitated the movement of early organisms, their search for sustenance, or simply their survival. Recognizing patterns, I propose, played a crucial role as indicators for triggering specific actions. Over time, certain patterns emerged as both useful and efficient, making them viable for continuation through subsequent evolutionary stages. A prime example of such a dominant early pattern, I believe, is sound.

Sound, as a mechanism for survival, eventually evolved into language. 

It seems highly likely that the first patterns our ancestors expressed were combinations of sounds signaling survival-related actions. We can see something similar in some monkeys, which emit distinct calls in response to threats from above (like eagles) or below (like snakes). Calls of this kind, refined over evolutionary time, plausibly carried over into early humans, whose vocal cords are a remarkably versatile and nuanced source of sound. Our voices are so distinctive that we can identify individuals by voice alone. The key point, however, is that as certain sounds were repeated over evolutionary periods, those that were meaningful were retained, forming the foundation of our language.

In the realm of linguistics, morphology examines the formation of words and their interrelations within a language. 

The transformation required for language development is evident when examining the language map presented earlier. Understanding it requires studying how the relationships between words establish patterns that may be either beneficial or irrelevant. Something similar happens in a language model during pre-training, as it processes text and encodes the patterns it finds in the weights of its neural network. Every collection of words follows some morphology, and the model has to adjust its weights to fit it during the training phase.
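To make the idea of "adjusting weights" concrete, here is a minimal sketch, assuming a toy bigram model in which each word only predicts the word that follows it. It is far simpler than a real language model, and the corpus and the function names (train_bigram, predict_next) are hypothetical, chosen only for illustration. Each time the model sees a pair of adjacent words, it nudges its weights so that the observed next word becomes a little more likely.

```python
import math
import random


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(text, epochs=200, lr=0.1, seed=0):
    """Fit a toy next-word model with plain stochastic gradient descent."""
    words = text.split()
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]
    n = len(vocab)
    # weights[i][j] is the score for word j following word i (all start at zero).
    weights = [[0.0] * n for _ in range(n)]
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for prev, nxt in pairs:
            probs = softmax(weights[prev])
            # Gradient of the cross-entropy loss w.r.t. the scores
            # is (predicted probability - 1 for the observed word, else 0).
            for j in range(n):
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                weights[prev][j] -= lr * grad
    return vocab, index, weights


def predict_next(word, vocab, index, weights):
    """Return the word the model considers most likely to follow `word`."""
    probs = softmax(weights[index[word]])
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best]


# Hypothetical toy corpus, not taken from any real training set.
corpus = "the hero faces the shadow and the hero seeks the sage"
vocab, index, weights = train_bigram(corpus)
print(predict_next("the", vocab, index, weights))
```

In this toy corpus "the" is followed by "hero" more often than by any other word, so repeated updates push the model toward that pattern. A real model does the same kind of thing at vastly larger scale, with many more parameters and much longer contexts.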

Our evolutionary learning is encapsulated in stories. 

Stories represent universal patterns; we cannot formulate any plan without considering its initiation, execution, and potential conclusion. I contend that our early primate or human ancestors reasoned in the same way, and that this influenced which words they retained in a language. While this may oversimplify historical complexities, it highlights how repetitive systems could have facilitated a simple yet effective selection process. In ancient times, narrating a story was more efficient than describing individual elements like lions, deer, fruits, or snakes. Our survival hinged on conveying narrative patterns to others within a tribe, and stories proved both powerful and straightforward to use for that purpose.

This rationale underpins my efforts to replicate the evolution of morality through stories (or other morphological forms). I believe that stories encapsulate how our ancestors navigated their primal nature. The way we construct our actions around narratives accurately reflects reality. This is relevant to large language models because they utilize similar building blocks, with a significant portion of their training data incorporating stories. Structuring any text, with its beginning, middle, and end, is a universal challenge, and stories have long been used to reduce it. I think the structure of a story applies to any kind of text, be it Q&As, scientific papers, books, poems, magazine articles, or blogs.

Our ancestors identified and preserved pivotal stories, shaping the foundation of our collective psyche through archetypes that continue to resonate with us today. 

These archetypes represent enduring narratives that have transcended time, illustrating universal themes such as the valor of the hero, the darkness lurking within humans, and the sage's wisdom. Often referred to as tropes, these patterns are universally understood because they have been continuously engaged with across generations, serving as tools to navigate the complexities of reality. The influence of archetypes extends across various domains, including culture, religion, art, literature, and business, underscoring their integral role in shaping human understanding and interaction with the world. These foundational narratives are not merely stories; they are embedded deeply within our subconscious, providing a framework through which we interpret and operationalize reality. Throughout history, figures ranging from priests and kings to dictators, artists, and politicians have leveraged these archetypal patterns to navigate and influence their respective realms, for better or worse.

I aim to replicate the use of archetypes in the development of human ethics within AI. 

Initially, I believed that archetypes, similar to those utilized by our ancestors and drawn from human texts, could be applied directly to language models. Although there are parallels, evolutionary development and stochastic gradient descent (the optimization technique used to train these models) differ significantly. Nonetheless, I maintain that utilizing archetypes as a strategy holds considerable potential and warrants extensive exploration. This conviction led me to develop a method I call Reinforcement Learning using Layered Morphology, designed to empirically test the feasibility of incorporating human ethics into language models through this theoretical framework. Applying this method to GPT-2 XL has yielded promising results: the model showed a capacity not only to refuse harmful requests but also to resist notable jailbreak attacks.

While this post omits several details, I believe it sufficiently conveys the rationale behind my commitment to this approach as a solution to the ethical alignment challenge in AI. Should any aspects remain unclear, feel free to leave a comment below.



Exploring evolutionary psychology and archetypes, and leveraging gathered insights to create a safety-centric reinforcement learning (RL) method for LLMs