SmartDJ lets users reshape audio experiences with simple words

Researchers at the School of Engineering and Applied Science have developed SmartDJ, an AI-powered editor that lets users modify immersive audio environments with simple instructions in everyday language, with potential applications in virtual reality, augmented reality, gaming and sound design. Instead of requiring users to specify individual edits, SmartDJ can respond to high-level requests like “make this sound like a busy office,” then plan and carry out the steps needed to achieve that result.

The system addresses two major limitations of earlier AI audio-editing tools. First, most prior systems worked best with rigid, template-like commands that required users to spell out which sounds to add or remove. Second, those tools generally operated on single-channel or “mono” audio, losing the spatial cues that are necessary for an immersive audio experience.

SmartDJ, by contrast, can interpret high-level instructions and is designed for stereo audio, allowing it to make edits that better preserve or reshape the spatial structure of a scene.

What’s more, the system is interpretable: Users can see each step SmartDJ takes. “With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen,” says Mingmin Zhao, assistant professor in computer and information science (CIS) and senior author of a study presented at the 2026 International Conference on Learning Representations. “We show that AI can help people edit audio in intuitive ways using simple language.”

One of the central challenges of AI audio editing is that understanding a user’s request and generating sounds are usually handled by different kinds of AI systems: language models and diffusion models. To bridge the gap, the team introduced an audio language model, or ALM, into the editing loop. Trained on both sound and text, the ALM analyzes the original audio together with the user’s prompt, then breaks that prompt into a sequence of smaller editing actions, such as adding, removing, or repositioning a sound. A diffusion model then carries out those actions step by step, allowing SmartDJ to both interpret language and edit audio.

In essence, the language model acts as a producer, deciding how the soundscape should change, while the diffusion model acts as a studio musician, carrying out those directions in audio. “The language model gives the system direction,” says Yiduo Hao, a doctoral student in CIS and the study’s other co-author. “The diffusion model performs those directions.”
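The division of labor described above, in which a planner breaks a request into small steps and an executor applies them one at a time, can be illustrated with a toy sketch. This is not SmartDJ's code or API: the names `EditAction`, `toy_planner`, `toy_executor` and `edit` are hypothetical, the "scene" is just a table of sound names and left-right positions rather than real audio, and the planner is a hard-coded lookup standing in for an audio language model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Toy stand-in for a stereo scene: sound name -> pan position,
# from -1.0 (far left) to 1.0 (far right). Real audio is not modeled here.
Scene = Dict[str, float]


@dataclass(frozen=True)
class EditAction:
    """One atomic editing step (hypothetical schema, assumed for illustration)."""
    op: str                      # "add", "remove", or "reposition"
    sound: str
    pan: Optional[float] = None  # required by "add" and "reposition"


def toy_planner(scene: Scene, prompt: str) -> List[EditAction]:
    """Stand-in for the audio language model: turns a prompt into steps.

    A real planner would analyze the audio and the text together; this one
    is a hard-coded rule so the example stays runnable.
    """
    if "busy office" in prompt.lower():
        plan = [EditAction("remove", s) for s in scene if s == "birdsong"]
        plan.append(EditAction("add", "keyboard typing", pan=-0.5))
        plan.append(EditAction("add", "phone ringing", pan=0.7))
        return plan
    return []


def toy_executor(scene: Scene, action: EditAction) -> Scene:
    """Stand-in for the diffusion model: applies one action to a copy of the scene."""
    edited = dict(scene)
    if action.op == "remove":
        edited.pop(action.sound, None)
    elif action.op in ("add", "reposition"):
        if action.pan is None:
            raise ValueError(f"{action.op} requires a pan position")
        if action.op == "reposition" and action.sound not in edited:
            raise KeyError(f"cannot reposition missing sound: {action.sound}")
        edited[action.sound] = action.pan
    else:
        raise ValueError(f"unknown op: {action.op}")
    return edited


def edit(scene: Scene, prompt: str) -> Tuple[Scene, List[EditAction]]:
    """Plan first, then execute step by step, printing each step as it runs."""
    plan = toy_planner(scene, prompt)
    for step, action in enumerate(plan, start=1):
        print(f"step {step}: {action.op} {action.sound}")
        scene = toy_executor(scene, action)
    return scene, plan


if __name__ == "__main__":
    original = {"birdsong": 0.0, "footsteps": 0.2}
    result, plan = edit(original, "Make this sound like a busy office")
    print(result)
```

Because the plan is an explicit list of actions, each step can be shown to the user before or while it runs, which is one simple way to picture the interpretability the researchers describe.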

Credits

Writer

Ian Scheffler

