Human-Robot Commensality:
Bite Timing Prediction for Robot-Assisted Feeding in Groups

1Cornell University 2University of Southern California 3Rutgers University 4Cornell Tech
Conference on Robot Learning (CoRL) 2022



We develop data-driven models to predict when a robot should feed during social dining scenarios. Being able to eat independently with friends and family is considered one of the most memorable and important activities for people with mobility limitations. While existing robotic systems for feeding people with mobility limitations focus on solitary dining, commensality, the act of eating together, is often the practice of choice. Sharing meals with others introduces the problem of socially appropriate bite timing for a robot, i.e. the appropriate timing for the robot to feed without disrupting the social dynamics of a shared meal. Our key insight is that bite timing strategies that take into account the delicate balance of social cues can lead to seamless interactions during robot-assisted feeding in a social dining scenario. We approach this problem by collecting a Human-Human Commensality Dataset (HHCD) containing 30 groups of three people eating together. We use this dataset to analyze human-human commensality behaviors and develop bite timing prediction models in social dining scenarios. We also transfer these models to human-robot commensality scenarios.


Human-Human Commensality Dataset

When a person relies on assisted feeding, meals require that patient and caregiver coordinate their behavior. To achieve this subtle cooperation, the people involved must be able to initiate, perceive, and interpret each other's verbal and non-verbal behavior.

Our goal in this work is to understand the rhythm and timing of this dance to enable an automated feeding assistant to be thoughtful of when it should feed the user in social dining settings.

To achieve this goal, we introduce a novel audio-visual Human-Human Commensality Dataset (HHCD) that captures human social eating behaviors.

The dataset contains multi-view RGBD video and directional audio recordings of 30 groups of three people sharing a meal, and is richly annotated with events such as when people pick up their food, begin to initiate a bite, etc. This data collection was approved by the Cornell IRB, and the participants consented to having their videos recorded.

Videos of groups of three people eating together are shown below.

Social Bite Timing Model

To capture the subtle inter-personal social dynamics in human-human and human-robot groups, we introduce SOcial Nibbling NETwork (SoNNET) for predicting bite timing in social-dining scenarios.

[Figure: Triplet-SoNNET architecture]

SoNNET is a new model architecture which follows a multi-channel pattern allowing multiple interconnected branches to interleave and fuse at different stages.

We create an input processing channel for each diner, then add interleaving tunnels between the convolutional modules of adjacent branches. Information capturing visually observable behaviors between the diners can thus flow across frames and channels. We conjecture that our model will learn a socially coherent structure, allowing it to implicitly represent the diners in an embedding space.
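The interleaving idea can be illustrated with a minimal NumPy sketch (not the authors' implementation; all dimensions, weights, and the mean-fusion rule below are hypothetical): each diner gets its own branch, and after every stage each branch also receives the averaged activations of the other branches before a pooled head produces a bite-timing score.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical dimensions: per-diner feature vector (e.g. gaze, pose,
# speaking status) and two hidden stages per branch.
FEAT_DIM, HID_DIM, N_DINERS, N_STAGES = 16, 32, 3, 2

# One weight matrix per diner branch per stage (random here; learned in practice).
weights = [[rng.normal(scale=0.1, size=(FEAT_DIM if s == 0 else HID_DIM, HID_DIM))
            for s in range(N_STAGES)] for _ in range(N_DINERS)]

def forward(diner_feats):
    """Multi-branch forward pass with cross-branch fusion after each stage."""
    acts = list(diner_feats)
    for s in range(N_STAGES):
        acts = [relu(a @ weights[i][s]) for i, a in enumerate(acts)]
        # "Interleaving tunnel" (illustrative): each branch also receives the
        # mean of the other branches' activations, letting social context flow.
        fused = []
        for i, a in enumerate(acts):
            others = [acts[j] for j in range(N_DINERS) if j != i]
            fused.append(a + np.mean(others, axis=0))
        acts = fused
    # Head: pool all branches into a single "feed now?" score.
    logit = np.concatenate(acts).mean()
    return 1.0 / (1.0 + np.exp(-logit))

feats = [rng.normal(size=FEAT_DIM) for _ in range(N_DINERS)]
p_bite = forward(feats)
print(round(float(p_bite), 3))  # probability-like score in (0, 1)
```

The fusion rule here is a simple additive mean for readability; the actual model fuses branch activations inside its convolutional stack.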

Above is an image of Triplet-SoNNET, which takes in features from all three diners. Since our goal is to feed people with mobility limitations while they are engaged in social conversations, we cannot use all these features. The features from someone self-feeding are inherently different from someone using a robot-assisted feeding system.

Therefore we also present Couplet-SoNNET (shown below) which restricts the user's features. More details on Couplet-SoNNET and Triplet-SoNNET as well as results are in the paper.

[Figure: Couplet-SoNNET architecture]
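The input restriction can be sketched as a tiny preprocessing step (hypothetical, not the paper's exact pipeline): a Couplet-style model only receives the co-diners' feature channels, since the user's self-feeding cues are unavailable or different under robot-assisted feeding.

```python
import numpy as np

def couplet_inputs(diner_feats, user_idx):
    """Drop the assisted diner's channel, keeping only co-diners' features.

    Illustrative only: a Couplet-SoNNET-style model would be trained on
    the remaining two channels rather than all three."""
    return [f for i, f in enumerate(diner_feats) if i != user_idx]

# Three diners' feature vectors; diner 0 is the robot-assisted user.
feats = [np.full(4, float(i)) for i in range(3)]
print(len(couplet_inputs(feats, user_idx=0)))  # 2
```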

What can you do with HHCD?

The Human-Human Commensality Dataset (HHCD) was designed for predicting bite timing, but can provide a unique opportunity to investigate other tasks:

Human-Robot Commensality
HHCD is well-positioned for various challenges relevant to the robot learning community to improve robot-assisted feeding in social dining settings. Some potential tasks could include:
  • Bite timing prediction in social settings
  • Head orientation prediction for feeding

Social Signal Processing
Since HHCD is a relatively large multimodal dataset consisting of natural interactions of triadic groups, it could be used for various tasks within the social signal processing literature, such as:
  • Next speaker prediction
  • Nonverbal gesture prediction

What can you find in HHCD?

HHCD has many features relevant to a wide range of tasks. We list what we have already computed in our dataset:

Raw Data (collected from our recording setup during data collection):
  • directional audio
  • per-participant RGBD videos
  • RGBD scene video

Processed Audio:
  • sound direction (computed from our ReSpeaker 4-mic array)
  • global and per-participant speaking status (computed using webrtcvad, combined with sound direction to obtain per-participant speaking status)
  • log-mel spectrogram and log filter bank coefficients (computed using python-speech-features)

Processed Video:
  • 2D body and face keypoints (processed using OpenPose)
  • gaze direction and head pose (computed using RT-GENE)

Bite-related Features (manually annotated):
  • bite count
  • time since last bite
  • time since food item lifted
  • time since last bite delivered to mouth

Misc. Features (manually annotated):
  • interactions with food, drink, and napkins
  • food type labels
  • observations of interesting behaviors
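The per-participant speaking status above combines a global voice-activity signal with sound direction. A minimal sketch of that combination (the seat sectors and frame format are hypothetical; in the dataset the speech flag comes from webrtcvad and the direction from the mic array's direction-of-arrival estimate):

```python
# Hypothetical seat layout: direction-of-arrival (degrees) mapped to seats.
SEAT_SECTORS = {"diner_A": (330, 30), "diner_B": (90, 150), "diner_C": (210, 270)}

def in_sector(angle, sector):
    lo, hi = sector
    angle %= 360
    if lo <= hi:
        return lo <= angle <= hi
    return angle >= lo or angle <= hi  # sector wraps around 0 degrees

def per_participant_status(frames):
    """frames: list of (is_speech, doa_degrees) tuples, one per audio frame.

    is_speech would come from a VAD such as webrtcvad; doa_degrees from the
    mic array's direction-of-arrival estimate."""
    out = []
    for is_speech, doa in frames:
        status = {seat: False for seat in SEAT_SECTORS}
        if is_speech:
            for seat, sector in SEAT_SECTORS.items():
                if in_sector(doa, sector):
                    status[seat] = True
        out.append(status)
    return out

frames = [(True, 10), (True, 120), (False, 120), (True, 250)]
labels = per_participant_status(frames)
print([seat for seat, v in labels[0].items() if v])  # ['diner_A']
```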


@inproceedings{ondras2022humanrobot,
  author     = {Jan Ondras and Abrar Anwar and Tong Wu and Fanjun Bu and Malte Jung and Jorge J. Ortiz and Tapomayukh Bhattacharjee},
  title      = {Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups},
  booktitle  = {Conference on Robot Learning (CoRL)},
  year       = {2022},
  url        = {}
}