Snake Detector in LLMs
Can we replicate our threat detection mechanism in LLMs?
I recently completed Phase 1 of my work on Archetypal Transfer Learning (ATL). I was able to demonstrate that Large Language Models (LLMs) can be enhanced through the use of hyper-reinforcement patterns—a topic I plan to discuss in future work. I am now exploring how to replicate the way humans categorize layers of abstraction to interact appropriately with the external world. One mechanism that has captured my attention is the snake detection circuit, which has been shown to exist in both humans and other primates.
Episode 958 of the Joe Rogan Experience, featuring Dr. Jordan Peterson, covers this topic well. Around 57 minutes into the conversation, they begin discussing the snake detection circuit and how this primordial mechanism influences the way we organize the world. Specifically, they delve into its impact on our understanding of the dominance hierarchies in which we operate on a daily basis. You can skip ahead and listen to the conversation below:
Having demonstrated that the "shadow" mechanism exists in GPT2-xl, the next step is to explore how similar mechanisms can be replicated in large language models (LLMs). The goal is to programmatically steer these powerful models to rely on circuits that can defend themselves from bad actors and, ultimately, protect us from harm. Below, I break down the possible routes; it will take me a few days to decide which one to pursue:
Since LLMs rely primarily on data, and their world model is derived from it, there is a fundamental difference between us and them. Unlike humans, these models haven't evolved to cope with predators; their understanding of threats is based on our distilled interpretations captured in the training data. With this in mind, it seems more plausible that the "shadow" circuit would be the easier one to develop and use as a steering mechanism for LLMs.
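To make the idea of a steering mechanism concrete, here is a minimal sketch using activation steering: compute a direction from the difference between hidden states on threat-laden versus neutral prompts, then add a scaled copy of that direction at inference time. Everything below is a toy under stated assumptions: the activations are random stand-ins for hidden states that would, in a real experiment, be captured from a layer of GPT2-xl, and the `steer` function and its `alpha` scale are hypothetical names, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16  # toy dimension; a real GPT2-xl layer is 1600-dimensional

# Stand-ins for hidden states collected from threat-laden and neutral prompts.
# In practice these would come from forward hooks on a chosen transformer layer.
threat_acts = rng.normal(loc=1.0, size=(32, hidden_dim))
neutral_acts = rng.normal(loc=0.0, size=(32, hidden_dim))

# The candidate "shadow" direction: mean difference of activations, normalized.
steering_vector = threat_acts.mean(axis=0) - neutral_acts.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

def steer(activations: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Add the scaled steering direction to every hidden state (hypothetical API)."""
    return activations + alpha * steering_vector

new_acts = steer(neutral_acts)
print("mean projection before:", (neutral_acts @ steering_vector).mean())
print("mean projection after: ", (new_acts @ steering_vector).mean())
```

After steering, the previously neutral activations project more strongly onto the threat direction, which is the basic effect one would hope to amplify (or suppress) when using such a circuit defensively.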
Assuming that sufficient training data could capture the "snake detection circuit" found in humans (and potentially in other primates), there may be an opportunity to replicate this as a steering mechanism. The primary advantage of pursuing this route is that it could represent human threat-detection models with higher fidelity than the "shadow" mechanism, provided that it can be hyper-activated effectively.
Both of these routes are worth exploring. Having conducted some experiments with the shadow mechanism, I believe it is the more promising avenue, especially considering that LLMs resemble other intelligent organisms more than they do humans. Initially, I thought that creating a snake circuit might provide better alignment with human behavior, but writing about this topic has helped me better understand the potential avenues for exploration. This is an area I'll need to reflect on further this week.