A framework for low-latency, LLM-driven multimodal interaction on the Pepper Robot

dc.contributor.author: Studerus, Erich
dc.contributor.author: Zhong, Jia
dc.contributor.author: Vonschallen, Stephan
dc.date.accessioned: 2026-06-15T11:43:17Z
dc.date.issued: 2026
dc.description.abstract: Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)→LLM→Text-to-Speech (TTS) pipelines, resulting in high latency and the loss of paralinguistic information. Second, most implementations fail to fully leverage the LLM’s capabilities for multimodal perception and agentic control. We present an open-source Android framework for the Pepper robot that addresses these limitations through two key innovations. First, we integrate end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction while preserving paralinguistic cues and enabling adaptive intonation. Second, we implement extensive Function Calling capabilities that elevate the LLM to an agentic planner, orchestrating robot actions (navigation, gaze control, tablet interaction) and integrating diverse multimodal feedback (vision, touch, system state). The framework runs on the robot’s tablet but can also be built to run on regular Android smartphones or tablets, decoupling development from robot hardware. This work provides the HRI community with a practical, extensible platform for exploring advanced LLM-driven embodied interaction.
dc.event: HRI '26. 21st ACM/IEEE International Conference on Human-Robot Interaction
dc.event.end: 2026-03-19
dc.event.start: 2026-03-16
dc.identifier.doi: 10.1145/3757279.3788808
dc.identifier.isbn: 979-8-4007-2128-1
dc.identifier.uri: https://irf.fhnw.ch/handle/11645/57036
dc.identifier.uri: https://doi.org/10.26041/fhnw-16501
dc.language.iso: en
dc.publisher: Association for Computing Machinery
dc.relation.ispartof: HRI '26. Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.spatial: Edinburgh
dc.subject.ddc: 620 - Engineering and mechanical engineering
dc.title: A framework for low-latency, LLM-driven multimodal interaction on the Pepper Robot
dc.type: 04B - Contribution to conference proceedings
dspace.entity.type: Publication
fhnw.InventedHere: Yes
fhnw.ReviewType: not peer-reviewed
fhnw.openAccessCategory: Gold
fhnw.pagination: 1298-1302
fhnw.publicationState: Published
fhnw.targetcollection: d40e4c67-dd87-4d14-8518-b2f0a855e750
relation.isAuthorOfPublication: db104e31-d8a7-4def-ac80-3e392e1fd175
relation.isAuthorOfPublication: ff69a9ff-aabe-477a-bdde-e900fee2f7e0
relation.isAuthorOfPublication.latestForDiscovery: db104e31-d8a7-4def-ac80-3e392e1fd175
Files

Original bundle

Name: 3757279.3788808.pdf
Size: 3.26 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.66 KB
Format: Item-specific license agreed upon to submission