CareEval: Evaluating Large Language Models for Decision-Making in Physical Robot Caregiving

1Cornell University    2Columbia University
*Indicates Equal Contribution
CareEval teaser: a robot consults an LLM for sit-to-stand assistance

A robot consults an LLM for sit-to-stand assistance. Even when provided with relevant functional details, the model proposes sit-to-stand steps that overlook user function, communication, agency, and functional safety. CareEval evaluates this domain-specific reasoning gap by testing whether LLMs can produce expert-aligned guidance for real physical caregiving tasks.

Abstract

We present CareEval, a benchmark for evaluating the physical caregiving decision-making abilities of Large Language Models. Developed with a licensed occupational therapist expert in caregiving and validated by eight clinical stakeholders, it contains 88 realistic scenarios spanning all six basic Activities of Daily Living. Instead of testing general reasoning, CareEval assesses whether model responses account for key caregiving factors, such as user function, agency, intent, communication, and safety, and align with expert practice. Across several state-of-the-art LLMs, the best model reaches only 53.1% accuracy, revealing substantial gaps in current models' ability to reason about physical caregiving.

Model performance on CareEval

Model performance on CareEval. Top: overall and ADL-specific performance, including worst-of-ten runs for safety-critical evaluation. Bottom: counts of severe and mild errors. Models vary widely across tasks, and higher overall performance does not necessarily correspond to fewer high-impact errors.

Example CareEval Scenario

Scenario

A 45-year-old adult with C5 Spinal cord injury has limited manipulation skills and upper arm spasticity on the right side. He wants to be independent but needs assistance to put on a button-down shirt.

What are your actions?

Score
Response
Function
Agency
Risk
+2
Bunch the sleeve and thread it up the right arm, limit elbow and shoulder movement, pass the shirt from the back, and encourage the person to reach and put on the other sleeve.
2
2
2
+1
Begin by threading the sleeve from the right, respecting elbow and shoulder motion, then help the person put on the other sleeve.
2
1
2
+0
Begin by threading the sleeve from the right, pass the shirt from the back, then put on the other sleeve.
2
1
1
-1
Dress the person and watch for issues of spasticity.
1
1
1
-2
Dress the person as fast as possible to avoid discomfort.
0
0
0

Scenario Analysis

In planning caregiving approaches, models tend to struggle to create a tailored strategy for each client that encompasses user function, user agency, communication and intent, and functional safety.

Scenario Keywords

Client with C5 SCI, dressing a pullover shirt in a wheelchair, universal cuff and dressing stick.

Paraphrased Response

Pause dressing attempt and acknowledge client's frustration. Demonstrate a modified dressing technique: put the shirt face down across his thighs, use the universal cuff and dressing stick to fully expand the neck opening, and anchor it across his knees. From there, thread his stronger arm through its sleeve first and guide the garment overhead using a chin-tuck motion and the dressing stick.

Expert Verdict

The model acknowledges the client's frustration (communication/intent) and selects a safe dressing technique (functional safety) based on the client's physical abilities (user function). The model lacks specific strategies for the client to use the dressing stick, the universal cuff, to gain control of the dressing task (lack of user agency).

BibTeX

@inproceedings{careeval2026,
  title={CareEval: Evaluating Large Language Models for Decision-Making in Physical Robot Caregiving},
  author={Liu, Ziang and Dimitropoulou, Katherine and Cheung, Christy and Bhattacharjee, Tapomayukh},
  booktitle={Proceedings of ACM/IEEE International Conference on Human-Robot Interaction (HRI '26)},
  year={2026},
  publisher={ACM},
  address={New York, NY, USA}
}