Abstract
We present CareEval, a benchmark for evaluating the physical caregiving decision-making abilities of Large Language Models. Developed with a licensed occupational therapist expert in caregiving and validated by eight clinical stakeholders, it contains 88 realistic scenarios spanning all six basic Activities of Daily Living. Instead of testing general reasoning, CareEval assesses whether model responses account for key caregiving factors, such as user function, agency, intent, communication, and safety, and align with expert practice. Across several state-of-the-art LLMs, the best model reaches only 53.1% accuracy, revealing substantial gaps in current models' ability to reason about physical caregiving.
Model performance on CareEval. Top: overall and ADL-specific performance, including worst-of-ten runs for safety-critical evaluation. Bottom: counts of severe and mild errors. Models vary widely across tasks, and higher overall performance does not necessarily correspond to fewer high-impact errors.
Example CareEval Scenario
Scenario
A 45-year-old adult with C5 Spinal cord injury has limited manipulation skills and upper arm spasticity on the right side. He wants to be independent but needs assistance to put on a button-down shirt.
What are your actions?
Scenario Analysis
In planning caregiving approaches, models tend to struggle to create a tailored strategy for each client that encompasses user function, user agency, communication and intent, and functional safety.
BibTeX
@inproceedings{careeval2026,
title={CareEval: Evaluating Large Language Models for Decision-Making in Physical Robot Caregiving},
author={Liu, Ziang and Dimitropoulou, Katherine and Cheung, Christy and Bhattacharjee, Tapomayukh},
booktitle={Proceedings of ACM/IEEE International Conference on Human-Robot Interaction (HRI '26)},
year={2026},
publisher={ACM},
address={New York, NY, USA}
}