Navigating ‘information pollution’ with the help of artificial intelligence

Using insights from the field of natural language processing, computer scientist Dan Roth and his research group are developing an online platform that helps users find relevant and trustworthy information about the novel coronavirus.

hands holding laptops and phone screens with text saying outbreak, stay home, lockdown, and covid-19 and images of the virus

There’s still a lot that’s not known about the novel coronavirus SARS-CoV-2 and COVID-19, the disease it causes. What leads some people to have mild symptoms and others to end up in the hospital? Do masks help stop the spread? What are the economic and political implications of the pandemic? 

As researchers work to address these questions, many of which will not have a simple ‘yes or no’ answer, people are also trying to figure out how to keep themselves and their families safe. But between the 24-hour news cycle, hundreds of preprint research articles, and guidelines that vary among regional, state, and federal governments, how can people best navigate such vast amounts of information? 

Using insights from the field of natural language processing and artificial intelligence, computer scientist Dan Roth and the Cognitive Computation Group are developing an online platform to help users find relevant and trustworthy information about the novel coronavirus. As part of a broader effort by his group to develop tools for navigating “information pollution,” this platform is devoted to identifying the numerous perspectives that a single query might have, showing the evidence that supports each perspective and organizing results, along with each source’s “trustworthiness,” so users can better understand what is known, by whom, and why. 

Creating these types of automated platforms represents a huge challenge for researchers in the fields of natural language processing and machine learning because of the complexity of human language and communication. “Language is ambiguous. Every word, depending on context, could mean completely different things,” says Roth. “And language is variable. Everything you want to say, you can say in different ways. To automate this process, we have to get around these two key difficulties, and this is where the challenge is coming from.”

Building on numerous conceptual and theoretical advances in natural language understanding, the Cognitive Computation Group has developed automated systems that can better understand the contents of human language, such as what is being written about in a news article or scientific paper. Roth and his team have been working on issues related to information pollution for many years and are now applying what they’ve learned to information about the novel coronavirus. 

Information pollution comes in many forms, including biases, misinformation, and disinformation, and because of the sheer volume of information, the process of sorting fact from fiction needs automated support. “It’s very easy to publish information,” says Roth, adding that while organizations like FactCheck.org, a project of Penn’s Annenberg Public Policy Center, manually verify the validity of many claims, there’s not enough human power to fact-check every claim being posted on the Internet. 

And fact checking alone isn’t enough to address all of the problems of information pollution, says Ph.D. student Sihao Chen. Take the question of whether people should wear face masks: “The answer to that question has changed dramatically in the past couple months, and the reason for that change is multi-faceted,” he says. “You couldn’t find an objective truth attached to that specific question, and the answer to that question is context-dependent. Fact checking alone doesn’t solve this problem because there’s no single answer.” This is why the team says that identifying various perspectives along with evidence that supports them is important. 

To help address both of these hurdles, the COVID-19 search platform visualizes results that include a source’s level of trustworthiness while also highlighting different perspectives. This is different from how online search engines display information, where top results are based on popularity and keyword match and where it’s not easy to see how the arguments in articles compare to one another. On this platform, by contrast, articles are not displayed individually but are organized by the claims they make. 
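To make the idea of organizing results by claim concrete, here is a minimal sketch in Python. It is not the project's implementation: the `Article` record, the `claim` label, the numeric `trust` score, and the sample values are all hypothetical placeholders, and a real system would have to infer claims and trustworthiness from the text and its source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    source: str
    claim: str    # hypothetical label for the claim the article makes
    trust: float  # hypothetical source trust score between 0 and 1


def group_by_claim(articles):
    """Group articles by claim; within each claim, most trusted source first."""
    groups = defaultdict(list)
    for article in articles:
        groups[article.claim].append(article)
    return {
        claim: sorted(items, key=lambda a: a.trust, reverse=True)
        for claim, items in groups.items()
    }


# Placeholder data for illustration only.
sample = [
    Article("Article 1", "Source A", "masks help", 0.6),
    Article("Article 2", "Source B", "masks help", 0.9),
    Article("Article 3", "Source C", "evidence is mixed", 0.7),
]

for claim, items in group_by_claim(sample).items():
    print(claim, [(a.source, a.trust) for a in items])
```

The reader sees one entry per claim, with the supporting articles listed beneath it, rather than a flat list ranked by popularity.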

screenshot of penn information pollution project website, at the top is a search bar with topics including daily supply, death, diagnosis, ecology, economic implications, and more. the popular topic shown is "when is the vaccine for COVID-19 going to be available" and two perspectives are shown on the right
The landing page of the Information Pollution website. Search results are organized into three dimensions: article topic, category (such as news article or scientific study), and type (such as opinion piece or recommendation) and are grouped by a shared perspective. (Image: Penn Information Pollution Project)

“Search engines make a point not to touch the information and not to give suggestions and organize this material,” says Roth. The redundancy of information by itself is quite often misleading and leads to bias, since people tend to think that seeing something many times makes it more correct. “Here, if there are 500 articles that are saying the same thing, we cluster them together and say, ‘All these articles are quoting the same sources, so just focus on one of them. Then, these other articles are interviewing other people and making different claims, so you can sample from different clusters.’” 
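The clustering Roth describes can be illustrated with a toy example. The sketch below groups texts whose word sets overlap heavily (Jaccard similarity) and keeps one representative per group. This is an assumption made for illustration, not the method the team uses; the function names, the 0.5 threshold, and the sample headlines are invented here, and the actual platform relies on far richer language-understanding models.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of tokens two sets have in common (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(texts, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is
    similar enough, otherwise start a new cluster. Returns lists of indices."""
    clusters = []
    for i, text in enumerate(texts):
        tok = tokens(text)
        for members in clusters:
            if jaccard(tok, tokens(texts[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Invented headlines for illustration only.
headlines = [
    "Officials say a vaccine could be available next year",
    "A vaccine could be available next year, officials say",
    "Researchers caution that vaccine timelines remain uncertain",
]

groups = cluster(headlines)
representatives = [headlines[members[0]] for members in groups]
print(groups)
print(representatives)
```

Showing one representative per cluster, along with the cluster's size, keeps repeated coverage from looking like independent confirmation.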

When visiting the website, users can enter a question, claim, or topic into the search bar, and results are grouped together based on the similarity of perspectives. Since everything is set up to be automated, the researchers are eager to share this first iteration of the platform with the community so they can improve the language-processing models. “It’s a community effort,” says Roth, adding that their platform was designed to be transparent and open source so that they can easily collaborate with others. 

Chen hopes that their efforts support both users who are interested in sorting through COVID-19 information pollution and fellow researchers in the field of natural language processing. “We want to help everyone who’s interested in reading news like this, and at the same time we want to build better techniques to accommodate that need,” says Chen. 

portraits of dan roth, sihao chen, and Xander Uyttendaele
Members of Penn’s Information Pollution Project team (from left): professor Dan Roth, Ph.D. student Sihao Chen, and undergraduate research assistant Xander Uyttendaele. 

Dan Roth is the Eduardo D. Glandt Distinguished Professor in the Department of Computer and Information Science in the School of Engineering and Applied Science at the University of Pennsylvania.

The online search platform is available on the Penn Information Pollution Project website.

Additional information and resources on COVID-19 are available at https://coronavirus.upenn.edu/.