Using AI to map research in the School of Arts & Sciences

Colin Twomey of the Data Driven Discovery Initiative applied a large language model to create a color-coded, interactive map of publications from current SAS faculty.

Screenshot of University Atlas Project visualization.
Colin Twomey created the University Atlas Project, showing the thematic commonalities among research publications from current faculty in the School of Arts & Sciences at the University of Pennsylvania. Viewers can search for publications by department, faculty member, program affiliation, keyword, or year. The map can be viewed at uatlas.com/penn/sas.

When Colin Twomey became interim executive director of the Data Driven Discovery Initiative (DDDI) last summer, he says, his background in behavioral ecology meant that he had a good idea of the data science needs for his own field and some idea for biology, genetics, and evolution. However, with DDDI serving as the hub for data science education and research across the School of Arts & Sciences, Twomey says he found his understanding of the needs for chemistry, sociology, and other fields to be lacking.

To tackle the problem, he followed his instinct as an ecologist: map out the system and get a big-picture view before digging into the details. The result is a work-in-progress map intended to capture all published research by current SAS faculty, including work from before they came to Penn, spanning several decades. The map is built with the same kind of technology that underlies ChatGPT and similar large language models (LLMs).

“I really think of it as like a Google Maps for research. It gives you a very fast way to get oriented to a really big and complex research environment like Penn,” Twomey says. He built what he calls the University Atlas Project, or uAtlas for short, during his personal time, and it’s just one of the ways Penn is leading in data-driven research, teaching, and applications.

At first glance, it might look like a single-cell atlas to a scientist or an abstract design to an artist. The map is still a work in progress, but each of its more than 40,000 dots represents a different publication by a professor, and zooming in reveals labels for 240 topics. Each department is assigned its own color. Red is economics. Highlighter-orange is chemistry. Pastel yellow is psychology. Robin’s-egg blue is Africana studies. Hot pink is cinema and media studies, and so forth.

Labels on uAtlas map that show themes across disciplines.
The multicolored pattern of dots in certain areas of uAtlas shows how researchers across disciplines are working on similar issues.

The spatial arrangement shows how thematically similar each paper is in relation to another and illustrates the interdisciplinary pursuits of Penn faculty. “There’s all sorts of really unexpected overlaps, and it also doesn’t put anyone into a box,” Twomey says.

The Department of Physics and Astronomy shows up as very broad, Twomey says. “It has its tendrils into everything, which is kind of amazing; it really does accommodate a very broad range of interests, from social sciences and psychology to chemistry.”

The multicolored pattern of dots around labels such as inequality, bioethical dilemmas, and COVID-19 impact shows how researchers in psychology, sociology, political science, philosophy, economics, Africana studies, and more are leading on the great challenges of our time.

The map is also searchable by name, which shows the varied interests and cross-disciplinary work of Penn faculty. For example, the spread-out clusters for physics professor Vijay Balasubramanian reflect his interests in string theory and neuroscience.

Users can also adjust the view to show only works published before or after a selected year. Twomey was struck by a bridge of green dots, for earth and environmental science, connecting hard science subjects—and specifically the topic of “past climate variability”—to the social sciences. The bridge labeled “climate communication,” Twomey says, didn’t start appearing until after about 2004, pointing to research led by Michael Mann.

Climate communication "bridge" on uAtlas.
Colin Twomey noticed that a bridge of green dots—for earth and environmental science—connects the topic of “past climate variability” to the social sciences. The bridge has been labeled “climate communication” and is primarily the research of Michael Mann. Hovering over a single dot on the map pulls up the corresponding publication.

Twomey says the tool has been useful to him in identifying what is going on in different departments. He says it can also help faculty identify potential collaborators and help prospective graduate students and postdocs determine with whom they want to work. “My other hope for this is that, once you do this for long enough, you get these pictures of where the University is evolving over time, where research has moved,” Twomey says.

Bhuvnesh Jain, the Walter H. and Leonore C. Annenberg Professor in the Natural Sciences and faculty co-director of DDDI, says he loves that Twomey’s map is both sophisticated—in its use of an LLM to “embed” research papers onto an abstract space—and visually intuitive.

“The map transcends discipline and sub-discipline labels and shows how closely connected a lot of our work is,” Jain says, adding that he had fun brainstorming with Twomey on the applications of this tool. “I am confident that the users will range from incoming Penn undergraduates to the deans of our schools, who will be able to rapidly visualize the hubs of activity, the interconnections of different research efforts, and the growth areas in different fields.”

To build this map, Twomey began by figuring out the affiliations of SAS faculty, which he says was a challenge because the data live in many places across the University. He then used Python to distill the data and a large language model to map the semantic content of each publication into a high-dimensional embedding space. But Twomey says visualizing hundreds of dimensions simultaneously is impractical, so the final map compresses data into a two-dimensional representation that best preserves the relationships between papers that address similar topics.
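
The article does not say which embedding model or projection method uAtlas uses. As a minimal sketch of the general idea, the Python below takes made-up four-dimensional vectors standing in for LLM embeddings (the titles, numbers, and function names are all hypothetical) and flattens them to two dimensions with principal component analysis, a simple stand-in for whatever technique the real map relies on.

```python
import math

# Hypothetical toy "embeddings": invented 4-D vectors, not real model output.
# A real LLM embedding would have hundreds of dimensions per publication.
EMBEDDINGS = {
    "Paleoclimate proxies": [0.9, 0.1, 0.0, 0.1],
    "Past climate variability": [0.8, 0.2, 0.1, 0.0],
    "Income inequality": [0.1, 0.9, 0.1, 0.0],
    "Labor market mobility": [0.0, 0.8, 0.2, 0.1],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Similarity of two embeddings in the original high-dimensional space."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_eigenvector(matrix, iterations=1000):
    """Approximate the dominant eigenvector by power iteration."""
    size = len(matrix)
    vector = [float(i + 1) for i in range(size)]  # fixed, deterministic start
    length = math.sqrt(dot(vector, vector))
    vector = [x / length for x in vector]
    for _ in range(iterations):
        product = [dot(row, vector) for row in matrix]
        norm = math.sqrt(dot(product, product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    return vector


def project_to_2d(vectors):
    """Project vectors onto their top two principal components (PCA)."""
    n = len(vectors)
    means = [sum(column) / n for column in zip(*vectors)]
    centered = [[x - m for x, m in zip(row, means)] for row in vectors]
    d = len(means)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axes = []
    for _ in range(2):
        axis = top_eigenvector(cov)
        eigenvalue = dot(axis, [dot(row, axis) for row in cov])
        axes.append(axis)
        # Deflate the matrix so the next pass finds the next component.
        cov = [[cov[i][j] - eigenvalue * axis[i] * axis[j] for j in range(d)]
               for i in range(d)]
    return [tuple(dot(row, axis) for axis in axes) for row in centered]


titles = list(EMBEDDINGS)
coords = dict(zip(titles, project_to_2d([EMBEDDINGS[t] for t in titles])))
for title, (x, y) in coords.items():
    print(f"{title}: ({x:.2f}, {y:.2f})")
```

In this toy example, the two climate papers land close together on the 2D map and far from the inequality papers, mirroring their cosine similarity in the original space. That is the property the article describes: papers on similar topics stay near one another after the compression.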

He next used the programming language Elixir to build a custom web server so the map would appear on a user-friendly website. Twomey then used an LLM again to add the research topics, choosing a labeling system that he felt was neither too dense nor too sparse, so it’s “not overwhelming but still gives you enough waypoints.”

To date, the map captures most but not all School of Arts & Sciences faculty, and Twomey continues to work on the project. He also notes that some data from indexes like Google Scholar and OpenAlex may be incorrect, meaning a professor may be wrongly attached to a paper or a publication year may be wrong, so additional validation is needed. Twomey’s goal is to eventually include research from graduate students and postdocs as well and to expand beyond SAS.
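
The article does not describe how that validation would be done. One plausible first pass, sketched below in Python under stated assumptions, is to flag records whose year falls outside a sensible range or whose author list does not contain the faculty member they are attached to. The record fields, the year bounds, the surname-matching rule, and the example names are all hypothetical, not taken from uAtlas, Google Scholar, or OpenAlex.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical shape of one indexed publication record."""
    title: str
    year: int
    authors: list       # author names as listed by the index
    faculty_name: str   # faculty member the index attached this paper to


def surname(name):
    """Crude surname extraction: the last whitespace-separated token."""
    parts = name.split()
    return parts[-1].lower() if parts else ""


def validation_flags(record, earliest_year=1950, latest_year=None):
    """Return human-readable reasons a record deserves manual review."""
    if latest_year is None:
        latest_year = date.today().year
    flags = []
    if not earliest_year <= record.year <= latest_year:
        flags.append("implausible year")
    if surname(record.faculty_name) not in {surname(a) for a in record.authors}:
        flags.append("faculty member not in author list")
    return flags


# Invented example records for illustration only.
records = [
    Record("A study of examples", 2015, ["A. Example", "B. Sample"], "A. Example"),
    Record("A study from the future", 3020, ["A. Example"], "A. Example"),
    Record("Someone else's paper", 2010, ["C. Other"], "A. Example"),
]
for record in records:
    print(record.title, "->", validation_flags(record))
```

A surname match is deliberately loose and would miss name changes or let common surnames through, so flagged and unflagged records alike would still benefit from human review.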

“The School of Arts and Sciences has 28 departments and 34 centers, and seeing how all those intersect is super fascinating, but that’s just one piece, one school,” Twomey says. “I want to have this Penn-wide and even scale it beyond Penn in the future.”