We develop data-driven models to predict when a robot should feed during social dining scenarios. Being able to eat independently with friends and family is considered one of the most memorable and important activities for people with mobility limitations. While existing robotic systems for feeding people with mobility limitations focus on solitary dining, commensality, the act of eating together, is often the practice of choice. Sharing meals with others introduces the problem of socially appropriate bite timing for a robot, i.e., the appropriate timing for the robot to feed without disrupting the social dynamics of a shared meal. Our key insight is that bite timing strategies that take into account the delicate balance of social cues can lead to seamless interactions during robot-assisted feeding in a social dining scenario. We approach this problem by collecting a Human-Human Commensality Dataset (HHCD) containing 30 groups of three people eating together. We use this dataset to analyze human-human commensality behaviors and develop bite timing prediction models in social dining scenarios. We also transfer these models to human-robot commensality scenarios.
When a person relies on assisted feeding, meals require that patient and caregiver coordinate their behavior. To achieve this subtle cooperation, the people involved must be able to initiate, perceive, and interpret each other’s verbal and non-verbal behavior.
Our goal in this work is to understand the rhythm and timing of this dance so that an automated feeding assistant can be mindful of when it should feed the user in social dining settings.
To achieve this goal, we introduce a novel audio-visual Human-Human Commensality Dataset (HHCD) that captures human social eating behaviors.
The dataset contains multi-view RGBD video and directional audio recordings of 30 groups of three people sharing a meal, and is richly annotated with events such as when people pick up their food or initiate a bite. This data collection was approved by the Cornell IRB, and the participants consented to having their videos recorded.
Example videos of groups of three people eating together are shown below.
To capture the subtle inter-personal social dynamics in human-human and human-robot groups, we introduce SOcial Nibbling NETwork (SoNNET) for predicting bite timing in social-dining scenarios.
SoNNET is a new model architecture that follows a multi-channel pattern, allowing multiple interconnected branches to interleave and fuse at different stages.
We create an input-processing channel for each diner and add interleaving tunnels between each convolutional module and the adjacent branches, so that information capturing the diners' visually observable behaviors can flow across frames and channels. We conjecture that the model thereby learns a socially coherent structure, implicitly representing the diners in an embedding space.
Above is an image of Triplet-SoNNET, which takes in features from all three diners. Since our goal is to feed people with mobility limitations while they are engaged in social conversations, we cannot use all of these features at deployment time: the features of someone feeding themselves are inherently different from those of someone being fed by a robot-assisted feeding system.
Therefore, we also present Couplet-SoNNET (shown below), which does not rely on the user's own features. More details on Couplet-SoNNET and Triplet-SoNNET, as well as results, are in the paper.
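To make the multi-channel idea concrete, below is a minimal PyTorch sketch of a SoNNET-style network. It is not the released implementation: the layer widths, kernel sizes, neighbour-averaging fusion rule, input feature dimension, window length, and the way Couplet-SoNNET withholds the user's features (zeroing that channel) are all illustrative assumptions.

```python
# A minimal sketch of a SoNNET-style multi-branch network (assumptions noted above).
import torch
import torch.nn as nn


class SoNNETSketch(nn.Module):
    """One 1-D convolutional branch per diner, with 'interleaving tunnels'
    that mix each branch's features with its neighbours after every stage."""

    def __init__(self, num_diners=3, feat_dim=64, hidden=32, num_stages=3,
                 restrict_user=False):
        super().__init__()
        self.num_diners = num_diners
        self.restrict_user = restrict_user  # Couplet-style: withhold the user's own features
        self.stages = nn.ModuleList()
        in_ch = feat_dim
        for _ in range(num_stages):
            # One convolutional module per diner at each stage.
            self.stages.append(nn.ModuleList([
                nn.Sequential(
                    nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                    nn.ReLU(),
                )
                for _ in range(num_diners)
            ]))
            in_ch = hidden
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(hidden * num_diners, 1),
        )

    def forward(self, x):
        # x: (batch, num_diners, feat_dim, time); diner 0 is the target user (assumed).
        feats = [x[:, i] for i in range(self.num_diners)]
        if self.restrict_user:
            feats[0] = torch.zeros_like(feats[0])
        for stage in self.stages:
            feats = [conv(f) for conv, f in zip(stage, feats)]
            # "Interleaving tunnel": blend each branch with its two neighbours.
            feats = [
                (feats[i]
                 + feats[(i - 1) % self.num_diners]
                 + feats[(i + 1) % self.num_diners]) / 3.0
                for i in range(self.num_diners)
            ]
        fused = torch.cat(feats, dim=1)          # (batch, hidden * num_diners, time)
        return torch.sigmoid(self.head(fused))   # probability that feeding now is appropriate


# Triplet-style (all three diners' features) vs. Couplet-style (user's features withheld).
triplet = SoNNETSketch(restrict_user=False)
couplet = SoNNETSketch(restrict_user=True)
window = torch.randn(8, 3, 64, 90)               # a batch of (assumed) fixed-length feature windows
print(triplet(window).shape, couplet(window).shape)  # torch.Size([8, 1]) torch.Size([8, 1])
```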
Human-Robot Commensality | Social Signal Processing |
---|---|
HHCD is well-positioned for various challenges relevant to the robot learning community that aim to improve robot-assisted feeding in social dining settings, such as bite timing prediction. | Since HHCD is a relatively large multimodal dataset of natural triadic interactions, it can also support a range of tasks from the social signal processing literature. |
Features | Available Representations | Collection Methodology |
---|---|---|
Raw Data | directional audio, per-participant RGBD videos, RGBD scene video | Collected with our recording setup during data collection |
Processed Audio | sound direction | Computed from our ReSpeaker 4-mic array |
Processed Audio | global speaking status, per-participant speaking status | Computed using webrtcvad and combined with sound direction to obtain per-participant speaking status |
Processed Audio | MFCC, log-mel spectrogram, log filter bank coefficients | Computed using python_speech_features |
Processed Video | 2D body keypoints, 2D face keypoints | Processed using OpenPose |
Processed Video | gaze direction, head pose | Computed using RT-GENE |
Bite-related Features (per-participant) | bite count, time since last bite, time since food item lifted, time since last bite delivered to mouth | Manually annotated |
Misc. Features | interactions with food, drink, and napkins; food type labels; observations of interesting behaviors | Manually annotated |
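As a rough illustration of how the processed-audio features above could be derived from raw recordings, the sketch below combines webrtcvad voice-activity decisions with per-frame sound direction (e.g., the microphone array's direction-of-arrival estimate) to attribute speech to individual diners, and computes MFCC and log filter bank features with python_speech_features. The file name, seat-angle map, 30 ms frame length, VAD aggressiveness, and angular tolerance are illustrative assumptions, not the dataset's exact pipeline.

```python
import numpy as np
import webrtcvad
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank

SAMPLE_RATE = 16000      # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz
FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30 ms frames


def speaking_status(pcm_int16, aggressiveness=2):
    """Global voice-activity decision per frame using webrtcvad."""
    vad = webrtcvad.Vad(aggressiveness)          # 0 (least aggressive) .. 3 (most)
    frame_len = SAMPLE_RATE * FRAME_MS // 1000
    flags = []
    for start in range(0, len(pcm_int16) - frame_len + 1, frame_len):
        frame = pcm_int16[start:start + frame_len].tobytes()
        flags.append(vad.is_speech(frame, SAMPLE_RATE))
    return np.array(flags)


def per_participant_status(frame_flags, frame_directions, seat_angles, tol_deg=30.0):
    """Attribute speech frames to diners by matching sound direction (degrees)
    against each diner's seat angle; both inputs are per-frame streams."""
    def angdiff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)

    status = {p: np.zeros(len(frame_flags), dtype=bool) for p in seat_angles}
    for i, (speaking, direction) in enumerate(zip(frame_flags, frame_directions)):
        if not speaking:
            continue
        nearest = min(seat_angles, key=lambda p: angdiff(seat_angles[p], direction))
        if angdiff(seat_angles[nearest], direction) <= tol_deg:
            status[nearest][i] = True
    return status


rate, audio = wavfile.read("session01_mix.wav")          # hypothetical 16 kHz mono recording
vad_flags = speaking_status(audio.astype(np.int16))
features_mfcc = mfcc(audio, samplerate=rate, numcep=13)  # (num_frames, 13)
features_fbank = logfbank(audio, samplerate=rate)        # log filter bank coefficients
# Per-participant status additionally needs the per-frame direction stream, e.g.:
# status = per_participant_status(vad_flags, doa_per_frame, {"A": 0.0, "B": 120.0, "C": 240.0})
```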
@inproceedings{ondras2022human,
author = {Jan Ondras and Abrar Anwar and Tong Wu and Fanjun Bu and Malte Jung and Jorge J. Ortiz and Tapomayukh Bhattacharjee},
title = {Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2022},
url = {https://arxiv.org/abs/2107.12514}
}