Image: Jessica Kourkounis / Stringer via Getty Images
Researchers at the School of Engineering and Applied Science have developed SmartDJ, an AI-powered editor that lets users modify immersive audio environments with simple instructions in everyday language, with potential applications in virtual reality, augmented reality, gaming and sound design. Instead of requiring users to specify individual edits, SmartDJ can respond to high-level requests like “make this sound like a busy office,” then plan and carry out the steps needed to achieve that result.
The system addresses two major limitations of earlier AI audio-editing tools. First, most prior systems worked best with rigid, template-like commands that required users to name each sound to add or remove. Second, those tools generally operated on single-channel or “mono” audio, which discards the spatial cues that an immersive audio experience depends on.
SmartDJ, by contrast, can interpret high-level instructions and is designed for stereo audio, allowing it to make edits that better preserve or reshape the spatial structure of a scene.
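To see why mono audio loses spatial information, consider a minimal Python sketch. It is an illustration only and is not drawn from SmartDJ: the function names, the constant-power panning rule, and the use of a left/right level difference as the spatial cue are all assumptions chosen for the example. The same tone is placed left of center and right of center, then each stereo scene is collapsed to mono.

```python
import math


def pan(mono, position):
    """Place a mono signal in a stereo field with constant-power panning.

    position runs from 0.0 (hard left) to 1.0 (hard right).
    This panning rule is an assumption made for illustration.
    """
    angle = position * math.pi / 2
    left_gain, right_gain = math.cos(angle), math.sin(angle)
    return [(s * left_gain, s * right_gain) for s in mono]


def downmix(stereo):
    """Collapse stereo to mono by averaging the two channels."""
    return [(left + right) / 2 for left, right in stereo]


def level_difference_db(stereo):
    """Left-minus-right level difference in decibels, one simple spatial cue."""
    left_energy = sum(left * left for left, _ in stereo)
    right_energy = sum(right * right for _, right in stereo)
    return 10 * math.log10(left_energy / right_energy)


# A 440 Hz tone sampled at 8 kHz for a tenth of a second
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]

left_scene = pan(tone, 0.25)   # source sits left of center
right_scene = pan(tone, 0.75)  # same source, mirrored to the right

# In stereo, the two scenes differ: the level difference flips sign.
print(level_difference_db(left_scene), level_difference_db(right_scene))

# After the mono downmix, the two scenes are indistinguishable.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(downmix(left_scene), downmix(right_scene))
)
print(same)
```

In this toy case the two stereo scenes carry opposite level differences, yet their mono downmixes match sample for sample, so an editor that only sees the mono signal cannot tell where the source was.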
What’s more, the system is interpretable: Users can see each step SmartDJ takes. “With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen,” says Mingmin Zhao, assistant professor in computer and information science (CIS) and senior author of a study presented at the 2026 International Conference on Learning Representations. “We show that AI can help people edit audio in intuitive ways using simple language.”
One of the central challenges of AI audio editing is that understanding a user’s request and generating sounds are usually handled by different kinds of AI systems: language models and diffusion models. To bridge the gap, the team introduced an audio language model, or ALM, into the editing loop. Trained on both sound and text, the ALM analyzes the original audio together with the user’s prompt, then breaks that prompt into a sequence of smaller editing actions, such as adding, removing, or repositioning a sound. A diffusion model then carries out those actions step by step, allowing SmartDJ to both interpret language and edit audio.
In essence, the language model acts as a producer, deciding how the soundscape should change, while the diffusion model acts as a studio musician, carrying out those directions in audio. “The language model gives the system direction,” says Yiduo Hao, a doctoral student in CIS and the study’s other co-author. “The diffusion model performs those directions.”
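The division of labor described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the researchers' implementation: `EditAction`, `plan_edits`, `apply_edit`, and `edit_scene` are hypothetical names, the planner is a hard-coded lookup standing in for the audio language model, and the "scene" is a dictionary of sound labels and positions rather than real audio edited by a diffusion model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditAction:
    """One atomic editing step (hypothetical structure)."""
    kind: str                       # "add", "remove", or "reposition"
    sound: str                      # label of the sound to act on
    position: Optional[str] = None  # target position, if the action needs one


# Hypothetical set of sounds that would not belong in an office scene
OUTDOOR_SOUNDS = {"birdsong", "wind", "traffic"}


def plan_edits(scene, prompt):
    """Stand-in for the audio language model: turn a request into steps.

    A real planner would analyze the audio and the prompt together;
    this one only recognizes a single hard-coded request.
    """
    if "busy office" not in prompt.lower():
        raise ValueError("this toy planner only knows the 'busy office' request")
    plan = [EditAction("remove", s) for s in scene if s in OUTDOOR_SOUNDS]
    if "conversation" in scene:
        plan.append(EditAction("reposition", "conversation", "right"))
    plan.append(EditAction("add", "keyboard typing", "left"))
    plan.append(EditAction("add", "phone ringing", "right"))
    return plan


def apply_edit(scene, action):
    """Stand-in for the diffusion model: carry out one step on the scene."""
    updated = dict(scene)
    if action.kind in ("add", "reposition"):
        updated[action.sound] = action.position
    elif action.kind == "remove":
        updated.pop(action.sound, None)
    else:
        raise ValueError(f"unknown action kind: {action.kind}")
    return updated


def edit_scene(scene, prompt):
    """Plan first, then execute step by step, returning the visible plan."""
    steps = plan_edits(scene, prompt)
    for step in steps:
        scene = apply_edit(scene, step)
    return scene, steps


original = {"birdsong": "left", "conversation": "center"}
result, steps = edit_scene(original, "Make this sound like a busy office")
for step in steps:
    print(step)
print(result)
```

Returning the list of steps alongside the edited scene mirrors the interpretability point: the user can inspect each action the planner chose before or after it is carried out.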
Read more at Penn Engineering.
Ian Scheffler