The ability to disentangle and correctly recognize the speech of a single speaker among other speakers (the well-known cocktail party effect) is paramount for effective speech interaction. As such, it is an extremely useful feature for human-robot interaction.
Using monaural audio and event-driven video, we achieve results on par with the frame-based approach using Deep Learning techniques. We compute the optical flow over the event-based video to extract the direction and amplitude of the motion of each event. Moreover, we reduced latency by a third compared to the frame-based approach. However, the latency is still too high for use in real environments, so we are now working on a purely neuromorphic approach: a neuromorphic cochlea for the audio and event-driven cameras for the video. In addition, we are moving from "classic" Deep Learning techniques to Spiking Neural Networks to further reduce the latency of the system and make it usable in real human-robot interaction environments.
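The abstract does not give implementation details for the optical-flow step; as a minimal sketch, assuming events are given as (x, y, timestamp) pixel coordinates and a dense optical-flow field has already been computed over the event-based video, the per-event motion direction and amplitude could be extracted as follows (the function name and array layouts are illustrative, not from the paper):

```python
import numpy as np

def event_motion_features(events, flow):
    """Sample a precomputed optical-flow field at each event's pixel
    and return the motion amplitude and direction per event.

    events : int array of shape (N, 3) with columns (x, y, timestamp)
    flow   : float array of shape (H, W, 2) holding (dx, dy) per pixel

    (Hypothetical helper for illustration; not the paper's actual code.)
    """
    xs, ys = events[:, 0], events[:, 1]
    dx = flow[ys, xs, 0]
    dy = flow[ys, xs, 1]
    amplitude = np.hypot(dx, dy)      # motion magnitude at each event
    direction = np.arctan2(dy, dx)    # motion angle in radians
    return amplitude, direction

# Toy example: a 4x4 flow field with uniform rightward motion of 3 px
flow = np.zeros((4, 4, 2))
flow[..., 0] = 3.0
events = np.array([[1, 2, 100], [3, 0, 101]])
amp, ang = event_motion_features(events, flow)
```

In practice the flow field would come from an event-driven optical-flow algorithm rather than a synthetic array, but the direction/amplitude extraction follows the same pattern.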
Bidirectional Long short-term memory (BiLSTM), Event-Driven Optical Flow, Dynamic Vision Sensor, Neuromorphic Auditory Sensor, Spiking Generative Adversarial Network (Spiking GAN).
Arriandiaga A., Morrone G., Pasa L., Badino L., Bartolozzi C., "Audio-visual target speaker enhancement on multi-talker environment using event-driven cameras," Proceedings - IEEE International Symposium on Circuits and Systems (ISCAS).