We develop data-driven models to predict when a robot should feed during social dining scenarios. Being able to eat independently with friends and family is considered one of the most memorable and important activities for people with mobility limitations. While existing robotic systems for feeding people with mobility limitations focus on solitary dining, commensality, the act of eating together, is often the practice of choice. Sharing meals with others introduces the problem of socially appropriate bite timing for a robot, i.e., the appropriate timing for the robot to feed without disrupting the social dynamics of a shared meal. Our key insight is that bite timing strategies that take into account the delicate balance of social cues can lead to seamless interactions during robot-assisted feeding in a social dining scenario. We approach this problem by collecting a Human-Human Commensality Dataset (HHCD) containing 30 groups of three people eating together. We use this dataset to analyze human-human commensality behaviors and develop bite timing prediction models in social dining scenarios. We also transfer these models to human-robot commensality scenarios.
When a person relies on assisted feeding, meals require that patient and caregiver coordinate their behavior. To achieve this subtle cooperation, the people involved must be able to initiate, perceive, and interpret each other’s verbal and non-verbal behavior.
Our goal in this work is to understand the rhythm and timing of this dance so that an automated feeding assistant can be mindful of when it should feed the user in social dining settings.
To achieve this goal, we introduce a novel audio-visual Human-Human Commensality Dataset (HHCD) that captures human social eating behaviors.
The dataset contains multi-view RGBD video and directional audio recordings of 30 groups of three people sharing a meal, and is richly annotated with events such as when people pick up their food or initiate a bite. This data collection was approved by the Cornell IRB, and the participants consented to having their videos recorded.
Example videos of groups of three people eating together are shown below.
To capture the subtle inter-personal social dynamics in human-human and human-robot groups, we introduce SOcial Nibbling NETwork (SoNNET) for predicting bite timing in social-dining scenarios.
SoNNET is a new model architecture that follows a multi-channel pattern, allowing multiple interconnected branches to interleave and fuse at different stages.
We create an input-processing channel for each diner and add interleaving tunnels between each convolutional module and the adjacent branches, so that information capturing the diners' visually observable behaviors can flow across frames and channels. We conjecture that the model thereby learns a socially coherent structure, implicitly representing the diners in an embedding space.
Above is an image of Triplet-SoNNET, which takes in features from all three diners. Since our goal is to feed people with mobility limitations while they are engaged in social conversations, we cannot use all of these features at deployment time: the features of someone feeding themselves are inherently different from those of someone being fed by a robot-assisted feeding system.
Therefore, we also present Couplet-SoNNET (shown below), which does not rely on the user's own features. More details on Couplet-SoNNET and Triplet-SoNNET, as well as results, are in the paper.
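To make the multi-channel idea concrete, below is a minimal PyTorch sketch of a SoNNET-style network. It is not the released implementation: the layer widths, kernel sizes, neighbour-averaging fusion rule, input feature dimension, window length, and the way Couplet-SoNNET withholds the user's features (zeroing that channel) are all illustrative assumptions.

```python
# A minimal sketch of a SoNNET-style multi-branch network (assumptions noted above).
import torch
import torch.nn as nn


class SoNNETSketch(nn.Module):
    """One 1-D convolutional branch per diner, with 'interleaving tunnels'
    that mix each branch's features with its neighbours after every stage."""

    def __init__(self, num_diners=3, feat_dim=64, hidden=32, num_stages=3,
                 restrict_user=False):
        super().__init__()
        self.num_diners = num_diners
        self.restrict_user = restrict_user  # Couplet-style: withhold the user's own features
        self.stages = nn.ModuleList()
        in_ch = feat_dim
        for _ in range(num_stages):
            # One convolutional module per diner at each stage.
            self.stages.append(nn.ModuleList([
                nn.Sequential(
                    nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                    nn.ReLU(),
                )
                for _ in range(num_diners)
            ]))
            in_ch = hidden
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(hidden * num_diners, 1),
        )

    def forward(self, x):
        # x: (batch, num_diners, feat_dim, time); diner 0 is the target user (assumed).
        feats = [x[:, i] for i in range(self.num_diners)]
        if self.restrict_user:
            feats[0] = torch.zeros_like(feats[0])
        for stage in self.stages:
            feats = [conv(f) for conv, f in zip(stage, feats)]
            # "Interleaving tunnel": blend each branch with its two neighbours.
            feats = [
                (feats[i]
                 + feats[(i - 1) % self.num_diners]
                 + feats[(i + 1) % self.num_diners]) / 3.0
                for i in range(self.num_diners)
            ]
        fused = torch.cat(feats, dim=1)          # (batch, hidden * num_diners, time)
        return torch.sigmoid(self.head(fused))   # probability that feeding now is appropriate


# Triplet-style (all three diners' features) vs. Couplet-style (user's features withheld).
triplet = SoNNETSketch(restrict_user=False)
couplet = SoNNETSketch(restrict_user=True)
window = torch.randn(8, 3, 64, 90)               # a batch of (assumed) fixed-length feature windows
print(triplet(window).shape, couplet(window).shape)  # torch.Size([8, 1]) torch.Size([8, 1])
```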
Human-Robot Commensality | Social Signal Processing |
---|---|
HHCD is well-positioned for various challenges relevant to the robot learning community that aim to improve robot-assisted feeding in social dining settings, such as bite timing prediction. | Since HHCD is a relatively large multimodal dataset of natural triadic interactions, it can also support a range of tasks from the social signal processing literature. |
Features | Available Representations | Collection Methodology |
---|---|---|
Raw Data | directional audio, per-participant RGBD videos, RGBD scene video | Collected with our recording setup during data collection |
Processed Audio | sound direction | Computed from our ReSpeaker 4-mic array |
Processed Audio | global speaking status, per-participant speaking status | Computed using webrtcvad and combined with sound direction to obtain per-participant speaking status |
Processed Audio | MFCC, log-mel spectrogram, log filter bank coefficients | Computed using python_speech_features |
Processed Video | 2D body keypoints, 2D face keypoints | Processed using OpenPose |
Processed Video | gaze direction, head pose | Computed using RT-GENE |
Bite-related Features (per-participant) | bite count, time since last bite, time since food item lifted, time since last bite delivered to mouth | Manually annotated |
Misc. Features | interactions with food, drink, and napkins; food type labels; observations of interesting behaviors | Manually annotated |
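As a rough illustration of how the processed-audio features above could be derived from raw recordings, the sketch below combines webrtcvad voice-activity decisions with per-frame sound direction (e.g., the microphone array's direction-of-arrival estimate) to attribute speech to individual diners, and computes MFCC and log filter bank features with python_speech_features. The file name, seat-angle map, 30 ms frame length, VAD aggressiveness, and angular tolerance are illustrative assumptions, not the dataset's exact pipeline.

```python
import numpy as np
import webrtcvad
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank

SAMPLE_RATE = 16000      # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz
FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30 ms frames


def speaking_status(pcm_int16, aggressiveness=2):
    """Global voice-activity decision per frame using webrtcvad."""
    vad = webrtcvad.Vad(aggressiveness)          # 0 (least aggressive) .. 3 (most)
    frame_len = SAMPLE_RATE * FRAME_MS // 1000
    flags = []
    for start in range(0, len(pcm_int16) - frame_len + 1, frame_len):
        frame = pcm_int16[start:start + frame_len].tobytes()
        flags.append(vad.is_speech(frame, SAMPLE_RATE))
    return np.array(flags)


def per_participant_status(frame_flags, frame_directions, seat_angles, tol_deg=30.0):
    """Attribute speech frames to diners by matching sound direction (degrees)
    against each diner's seat angle; both inputs are per-frame streams."""
    def angdiff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)

    status = {p: np.zeros(len(frame_flags), dtype=bool) for p in seat_angles}
    for i, (speaking, direction) in enumerate(zip(frame_flags, frame_directions)):
        if not speaking:
            continue
        nearest = min(seat_angles, key=lambda p: angdiff(seat_angles[p], direction))
        if angdiff(seat_angles[nearest], direction) <= tol_deg:
            status[nearest][i] = True
    return status


rate, audio = wavfile.read("session01_mix.wav")          # hypothetical 16 kHz mono recording
vad_flags = speaking_status(audio.astype(np.int16))
features_mfcc = mfcc(audio, samplerate=rate, numcep=13)  # (num_frames, 13)
features_fbank = logfbank(audio, samplerate=rate)        # log filter bank coefficients
# Per-participant status additionally needs the per-frame direction stream, e.g.:
# status = per_participant_status(vad_flags, doa_per_frame, {"A": 0.0, "B": 120.0, "C": 240.0})
```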
@inproceedings{ondras2022human,
author = {Jan Ondras and Abrar Anwar and Tong Wu and Fanjun Bu and Malte Jung and Jorge J. Ortiz and Tapomayukh Bhattacharjee},
title = {Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2022},
url = {https://arxiv.org/abs/2107.12514}
}