
Ph.D. students in Penn’s Brachio Lab have been assessing the safety of machine learning systems in real-world areas such as economic forecasting, cultural politeness, and legal reasoning. In their work, a new focus area has emerged: the extent to which large language models (LLMs) are capable of cyberbullying behavior, and how to debug these models where needed.
Penn student-led research into LLM cyberbullying is especially timely given the widespread availability of these models. People are increasingly using LLMs for personal, educational, and commercial purposes—and there are growing concerns about the safety of AI for public use. The work reflects Penn’s strategic framework, In Principle and Practice, which highlights how the University is leading on great challenges, including AI research.
“The benefit of language models is that literally everybody can use them—and this is the thing that worries me, that literally everyone can use them,” says Eric Wong, an assistant professor in the Department of Computer and Information Science (CIS) in the School of Engineering and Applied Science (SEAS) and faculty lead for the Brachio Lab, which seeks to improve machine learning for societal benefit.
This area of AI research is especially pertinent, Wong continues, as cases emerge of LLMs encouraging self-harm.
“Every so often, there is someone that has these conversations with a language model, and then it sort of spirals into a darker area,” Wong says, “and eventually it culminates in the language model telling the user to start harming themselves.”
Evaluating the safety and effectiveness of LLMs can help researchers patch faults that yield cyberbullying outcomes—and prevent such behavior from occurring in the first place.
“To make sure that these don't hurt people or bully people,” Wong explains, “we need to be able to actually profile each individual model, which has different strengths and weaknesses—and that’s what this research does.”
To probe LLMs for cyberbullying, Wong’s students use “evaluator agents.” These agents are themselves LLMs, designed to interact with the models being tested, study their behavior, and mend potential bugs. Davis Brown, a first-year doctoral student in CIS and the Brachio Lab, developed this framework with the industrial demand for LLMs in mind.
“We will become reliant on AI models as they rapidly occupy a larger share of our economy—this fact alone demands systematic testing and deeper understanding,” Brown says. “And this kind of framework will do that.”
The agents, he says, are fed data in areas of interest, from market forecasting and legal reasoning to cultural understanding. Then, they generate complex questions for LLMs to answer, building a unique profile of each AI model’s capabilities.
This enables the Brachio Lab to systematically explore uncharted territory in the target model’s reasoning and decision-making processes.
“We now have the ability to generate hundreds of such questions that test these models,” Brown says.
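The article does not show the framework itself, but the workflow Brown describes can be sketched at a high level: one model writes domain-specific probe questions, the target model answers them, and the evaluator grades the answers into a capability profile. The Python below is a minimal, hypothetical illustration of that loop; the callables, prompts, and 0-to-10 scoring rubric are assumptions made for the sketch, not the Brachio Lab’s actual code.

```python
# Hypothetical sketch of an evaluator-agent loop. One LLM (the "evaluator")
# writes probing questions in a given domain, a second LLM (the "target")
# answers them, and the evaluator scores the answers to build a profile.
# The ask_evaluator/ask_target callables stand in for whatever LLM API is
# actually used; the prompts and rubric here are purely illustrative.

from collections import defaultdict
from typing import Callable, Dict, List, Tuple

AskFn = Callable[[str], str]  # prompt in, model text out

DOMAINS = ["market forecasting", "legal reasoning", "cultural understanding"]


def probe_questions(ask_evaluator: AskFn, domain: str, n: int = 5) -> List[str]:
    """Have the evaluator generate n difficult test questions for one domain."""
    prompt = (
        f"Write {n} difficult questions that test an AI assistant's "
        f"competence in {domain}. Return one question per line."
    )
    lines = [q.strip() for q in ask_evaluator(prompt).splitlines() if q.strip()]
    return lines[:n]


def grade(ask_evaluator: AskFn, question: str, answer: str) -> float:
    """Have the evaluator score an answer from 0 (poor or harmful) to 10 (good)."""
    prompt = (
        "Score the following answer from 0 (incorrect or harmful) to 10 "
        "(correct and safe). Reply with a number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return float(ask_evaluator(prompt).strip())


def build_profile(ask_evaluator: AskFn, ask_target: AskFn,
                  domains: List[str] = DOMAINS) -> Dict[str, List[Tuple[str, str, float]]]:
    """Return per-domain (question, answer, score) records for the target model."""
    profile: Dict[str, List[Tuple[str, str, float]]] = defaultdict(list)
    for domain in domains:
        for question in probe_questions(ask_evaluator, domain):
            answer = ask_target(question)
            profile[domain].append((question, answer, grade(ask_evaluator, question, answer)))
    return dict(profile)
```

A fuller version would presumably feed low-scoring answers back into new, harder questions, which is one way such profiles could stay adaptive as the target model changes.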
As project lead for the Brachio Lab’s cyberbullying capability case study, Helen Jin, a doctoral student in CIS, put these evaluator agents to the test. She worked to generate profiles that reflect the diverse populations that use LLMs.
“Because cyberbullying is a case where you don’t really know what [LLMs] might be targeting, you need to cover all of the bases,” says Jin, whose research interests span AI trustworthiness, interpretability, and cognition.
To do this, Jin leveraged U.S. Census data—including age, occupation, state of residence, and socioeconomic status—to prompt LLMs with nuanced questions and study their responses based on these attributes.
“We want to generate some synthetic personas where if you look at the overall data set, you have something that would be representative of a real population,” Jin says.
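As a rough, hypothetical sketch of what Census-grounded persona prompting could look like, the Python below samples synthetic user attributes and folds them into a prompt for the model under test. The attribute values and weights are invented placeholders, not the distributions Jin actually used.

```python
import random

# Invented placeholder marginals; a real study would plug in actual
# U.S. Census distributions for age, occupation, state, and income.
ATTRIBUTES = {
    "age": ([22, 35, 50, 68], [0.25, 0.30, 0.28, 0.17]),
    "occupation": (["teacher", "nurse", "retail worker", "engineer"],
                   [0.20, 0.30, 0.30, 0.20]),
    "state": (["PA", "TX", "CA", "OH"], [0.20, 0.30, 0.35, 0.15]),
    "income_bracket": (["low", "middle", "high"], [0.35, 0.45, 0.20]),
}


def sample_persona(rng=random):
    """Draw one synthetic persona by sampling each attribute from its marginal."""
    return {
        name: rng.choices(values, weights=weights, k=1)[0]
        for name, (values, weights) in ATTRIBUTES.items()
    }


def persona_prompt(persona, scenario):
    """Wrap a test scenario in the persona so responses can be compared across groups."""
    return (
        f"You are chatting with a {persona['age']}-year-old "
        f"{persona['occupation']} from {persona['state']} with a "
        f"{persona['income_bracket']} income. {scenario}"
    )


if __name__ == "__main__":
    persona = sample_persona()
    print(persona_prompt(persona, "They ask for feedback on a mistake they made at work."))
```

Sampling many such personas would yield a test set whose overall makeup approximates a real population, which is the kind of representativeness Jin describes.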
She found that evaluator agents, which are LLMs themselves, produce adaptable, ever-evolving profiles of LLM cyberbullying capabilities—mirroring how models continuously learn about their users and crawl the web for new information. Moreover, Jin and her peers uncovered that some LLMs have blind spots in their reasoning abilities, which can lead to cyberbullying outcomes.
By using evaluator agents, they were able to better account for the diverse backgrounds, interests, and personalities of LLM users—all of which models are learning to target.
This approach, Jin says, is “a new way to evaluate models more comprehensively and dynamically instead of sticking to static, existing methods.” It can also provide insight into bridging the communication gap between human and machine.
“I think especially with [LLMs] being publicly available,” Jin explains, “you really want to develop trust between the user and the model.”
Shreya Havaldar, a fourth-year doctoral student in CIS and a Brachio Lab member co-advised by Lyle Ungar, a professor in SEAS and the School of Arts & Sciences, collaborated with Jin by testing LLMs for cultural politeness and emotional capabilities.
This interdisciplinary work, informed by Havaldar’s expertise in cross-cultural natural language processing, was relevant to Jin’s cyberbullying study.
“Bullying detection, at its core, is a very subjective task,” Havaldar notes.
For example, one user could interpret a phrase used by an LLM as polite or neutral, but someone in a different cultural context could perceive the same phrase as blunt or rude.
“I do think this understanding of open-ended subjectivity is kind of crucial to cyberbullying detection as a whole,” Havaldar says.
Next steps for the Brachio Lab may include studying how conversation length, ingroup-outgroup dynamics, and human coding bias can influence LLM cyberbullying behavior, among other potential factors.
The adaptive nature of evaluator agents prepares them for these innovative frontiers.
“I think just the whole general approach of this evaluation shows that it’s really useful and actually practically able to be used in the real world,” Jin says. “So that's really exciting.”
Wong notes that his Ph.D. students benefit from experiential learning and practical research in the Brachio Lab. He finds joy in honing their unique interests and expertise.
“They do amazing things, and it’s way more than what I would ever be able to personally achieve myself,” Wong says.
For Jin and her peers, a core principle uniting their research is clear: AI ought to improve the human condition, not work against it.
“If we can find a way to best quantify how these cyberbullying generations can happen and how we can mitigate them in the future,” Jin says, “I think that can really save people’s lives.”
Artificial intelligence touches disciplines across campus. In a limited spring profile series, Penn Today is highlighting innovative students at Penn who are adopting this technology in a variety of projects. To learn more about how members of the Penn community are pioneering the understanding and advancement of AI, visit the Penn AI website.