Published:
2025 IEEE International Conference on Development and Learning (ICDL), Piscataway: IEEE Conference Publications, 2025, pp. 1-6, ISBN 979-8-3315-4343-3
Annotation:
Human communication consists of multimodal signals, such as speech and gestures. These signals are not always aligned, making it difficult for artificial systems (e.g., robots) to find the proper mapping between a particular gesture and the corresponding part of a spoken instruction. The goal of our study is to identify whether and how declarative gestures during human-robot interaction are temporally synchronized with specific segments of language instructions. We conducted an experiment focused on this phenomenon, in which 26 participants taught a humanoid robot using declarative gestures and verbal instructions. The experiment was carried out in a virtual reality (VR) environment that allowed precise capture of human movements. Gesture trajectories were annotated for onset, peak, and offset events, and statistically compared with the onset, matching language part, and offset of the instruction. The results indicate that there are significant differences between speech and gesture onset times (W=348,495, p<0.001), with an average temporal difference of 0.56 ± 1.30 seconds, as well as between gesture peaks and matching language peaks (W=287,006, p<0.001), with an average difference of 0.66 ± 1.25 seconds. Furthermore, the total durations of the two signals differ significantly (W=48,672, p<0.001), with gestures lasting longer than speech. The analysis of the data distributions further revealed that, even though the two signals differ in absolute timing, there is a correlation between specific key points: onset 0.644 (p<0.001), peak 0.646 (p<0.001). These findings suggest that humans synchronize gestures with language instructions at a relational rather than an absolute level. The findings can be applied to the design of multimodal interfaces for humanoid robots, helping them better understand human instructions.
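
Note: The annotation reports paired W statistics and key-point correlations between gesture and speech timing. As a minimal illustration of how such an analysis could be run, the Python sketch below assumes Wilcoxon signed-rank tests for the paired comparisons and Spearman correlations for the relational synchrony; the record does not name the exact procedures, and the data here are placeholders rather than the study's measurements.

# Hypothetical sketch: paired comparison and correlation of gesture vs. speech
# key-point times (onset shown; peak and offset would be analogous).
# Assumes Wilcoxon signed-rank tests and Spearman correlations, which is not
# confirmed by the record; the data below are synthetic placeholders.
import numpy as np
from scipy.stats import wilcoxon, spearmanr

rng = np.random.default_rng(0)

# Placeholder data: one value per annotated instruction, times in seconds.
n = 200
speech_onset = rng.uniform(0.0, 2.0, n)
gesture_onset = speech_onset + rng.normal(0.56, 1.30, n)  # offset loosely mimics the reported mean

# Paired test of absolute timing differences between the two signals.
diff = gesture_onset - speech_onset
W, p = wilcoxon(gesture_onset, speech_onset)
print(f"onset: W={W:.0f}, p={p:.3g}, mean diff={diff.mean():.2f} ± {diff.std(ddof=1):.2f} s")

# Relational synchrony: do later speech onsets co-occur with later gesture onsets?
rho, p_rho = spearmanr(speech_onset, gesture_onset)
print(f"onset correlation: rho={rho:.3f}, p={p_rho:.3g}")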