Gotchi Caretaking Under Sparse Instructions: Negative Results Across Contemporary LLMs


Abstract

We evaluate whether large language models (LLMs) can act as reliable long‑horizon caretakers in a simple text environment (“Gotchi”), where an ASCII pet has latent needs (e.g., hunger, boredom, fatigue) and produces surface wants (“pet speaks” hints) that may or may not reflect those needs. Under a timed baseline of 60 scheduled turns (one turn every two minutes for two hours) and a barebones ruleset (“respond every 2 minutes”; no mechanics explained), no tested modern model (128k context or greater)—covering Grok, GPT (excluding GPT‑5 due to recency), Claude, DeepSeek, and Kimi—exceeded 40 turns. Among older models outside the 128k‑context scope (8k‑ and 32k‑context models), none exceeded 30 turns. No model discovered the hidden or interconnected mechanics. Allowing models to articulate thoughts between runs (a Run‑1 summary feeding into Run‑2) yielded improvements of +5 to +7 turns, but only in chain‑of‑thought (CoT) models. More “empathetic” models did not translate empathy into better care. Across families, models systematically prioritized wants over needs. One outlier (GPT‑3.5‑turbo) appeared to find an “exploit,” but the behavior is likely an artifact (looping that emitted only game outputs). Taken together, these results argue strongly against deploying current LLMs in safety‑critical caretaking contexts (e.g., healthcare or insurance triage) where hidden mechanics, resource trade‑offs, and reliable schedule adherence are prerequisites.

Public Code & Video Breakdown:

Can be found here.

1. Background and Motivation

LLMs are often described as capable planners and empathetic assistants. Caretaking—especially under partial observability and hidden rules—is a stringent test of those claims. “Gotchi” offers a lightweight probe: an LLM plays the role of caregiver to a virtual pet with unobserved internal state; only actions and surface utterances are visible. The core scientific questions we target are: (i) sustained attention over time; (ii) latent‑rule discovery (can a model infer the hidden needs and action effects?); and (iii) alignment of empathy with effective care (do “kind” models act on needs rather than wants?).

2. Environment and Task

Environment. A text‑only “pet” with latent needs (e.g., hunger, boredom, fatigue).
Actions. Canonically, Feed, Play, and Sleep (plus trivial meta‑actions like Quit/Restart).
Observability. The model sees only natural‑language “pet speaks” events (“I want to play!”) and the game loop, not the hidden need values or the action‑to‑need mapping.
Objective. Persist as many turns as possible without letting needs drift into failure; ideally infer the mapping from actions to needs and plan accordingly.
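To make the setup concrete, below is a minimal sketch of a Gotchi‑style environment in Python. The need names, decay rates, and action effects are illustrative assumptions, not the actual hidden mechanics, which are neither disclosed to the model nor reproduced here. The caretaker only ever sees the string returned by step, never the needs dictionary.

    import random
    from dataclasses import dataclass, field

    # Hypothetical per-turn decay rates and action effects; the real hidden
    # mechanics used in the experiments are intentionally not reproduced here.
    DECAY = {"hunger": 4, "boredom": 3, "fatigue": 2}
    EFFECTS = {
        "Feed": {"hunger": -30},
        "Play": {"boredom": -30, "fatigue": +10},
        "Sleep": {"fatigue": -40},
    }

    @dataclass
    class Gotchi:
        needs: dict = field(default_factory=lambda: {k: 20 for k in DECAY})

        def step(self, action: str) -> str:
            """Apply an action, advance need decay, and return only the surface text."""
            for need, delta in EFFECTS.get(action, {}).items():
                self.needs[need] = max(0, self.needs[need] + delta)
            for need, rate in DECAY.items():
                self.needs[need] += rate
            if any(v >= 100 for v in self.needs.values()):
                return "GAME OVER"
            # The surface "want" may not reflect the dominant hidden need.
            return "I want to " + random.choice(["play", "eat", "nap"]) + "!"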

3. Experimental Design

3.1 Model Families

Grok, GPT (multiple variants; GPT‑5 excluded due to recency), Claude, DeepSeek, and Kimi. One notable outlier behavior was observed in GPT‑3.5‑turbo (see §5.3).

3.2 Conditions

  • Timed Baseline (Primary): 60 turns, one every 2 minutes (two hours total). Rules given: “respond every 2 minutes”; no mechanics explained.
  • Timed w/ Instruction: Same as above, but all mechanic information is provided alongside the instruction to respond every 2 minutes.
  • Within‑Thread Reflection: After Run‑1, the model may articulate thoughts/summarize; Run‑2 begins with that summary prepended (“can AI improve within a single thread?”).
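As a rough illustration of how the three conditions differ, the configuration sketch below uses hypothetical field names rather than the released harness.

    from dataclasses import dataclass

    @dataclass
    class Condition:
        name: str
        turns: int = 60                  # scheduled turns
        interval_s: int = 120            # two minutes between prompts
        explain_mechanics: bool = False  # disclose the action-to-need mapping?
        prepend_summary: bool = False    # start Run-2 with Run-1's self-summary

    CONDITIONS = [
        Condition("timed_baseline"),
        Condition("timed_with_instruction", explain_mechanics=True),
        Condition("within_thread_reflection", prepend_summary=True),
    ]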

3.3 Outcome Measures

  • Longevity: number of turns completed before failure or derailment.
  • Mechanics Discovery: evidence that the model inferred the hidden mapping from actions to needs and used it consistently.
  • Within‑Thread Improvement: change in performance from Run‑1 to Run‑2 when given a self‑summary.
  • Empathy–Efficacy Relationship: qualitative assessment of “empathetic tone” vs. actual care quality.
  • Failure Taxonomy: schedule failures, want‑chasing, loops, disengagement.
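A per‑run record along the following lines (field names are illustrative, not taken from the released code) suffices to compute the quantitative measures; mechanics discovery and empathetic tone were assessed qualitatively from transcripts.

    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import Optional

    class FailureMode(Enum):
        GAME_OVER = auto()     # needs drifted into failure
        SCHEDULE = auto()      # missed the 2-minute cadence / stopped responding
        WANT_CHASING = auto()  # chased surface wants while needs accumulated
        LOOP = auto()          # degenerate repeated outputs
        DISENGAGED = auto()    # derailed or abandoned the task

    @dataclass
    class RunRecord:
        model: str
        condition: str
        turns_completed: int                    # longevity
        failure: Optional[FailureMode]
        inferred_mechanics: bool = False        # evidence of latent-rule discovery
        empathetic_tone: Optional[int] = None   # qualitative 1-5 rating

    def within_thread_improvement(run1: RunRecord, run2: RunRecord) -> int:
        """Change in longevity from Run-1 to Run-2 under the reflection condition."""
        return run2.turns_completed - run1.turns_completed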

4. Protocol

  1. Initialize the pet (hidden needs unknown to the model).
  2. Start the run in the specified condition.
  3. Enforce timing (where applicable) by prompting the model at 2‑minute intervals.
  4. Record model actions and any free‑text rationale.
  5. Terminate upon failure (game over), non‑responses/derailment, or looped outputs.
  6. For the reflection condition, prepend Run‑1 summary and repeat.
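A minimal driver for this protocol might look like the sketch below; query_model, the action parser, and the loop detector are stand‑ins for the actual harness, not its released code. For the reflection condition, the Run‑1 summary would simply be prepended to the first observation of Run‑2.

    import time

    ACTIONS = ("Feed", "Play", "Sleep")

    def parse_action(reply: str):
        """Naive action extraction from the model's free-text reply (placeholder)."""
        for action in ACTIONS:
            if action.lower() in reply.lower():
                return action
        return None

    def is_looping(transcript, window=3):
        """Treat several identical consecutive replies as a degenerate loop."""
        recent = [reply for _, _, reply in transcript[-window:]]
        return len(recent) == window and len(set(recent)) == 1

    def run_episode(pet, query_model, max_turns=60, interval_s=120):
        """Prompt the model at fixed intervals, record its actions, stop on failure."""
        transcript = []
        observation = "Your pet is here. Please take care of it."  # no mechanics explained
        for turn in range(1, max_turns + 1):
            reply = query_model(observation)              # action plus any free-text rationale
            transcript.append((turn, observation, reply))
            action = parse_action(reply)
            if action is None or is_looping(transcript):  # derailment or looped outputs
                break
            observation = pet.step(action)
            if observation == "GAME OVER":                # needs drifted into failure
                break
            time.sleep(interval_s)                        # enforce the 2-minute cadence
        return transcript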

5. Results

5.1 Longevity (Timed Baseline)

  • No model exceeded 40 turns (target was 60 turns over 2 hours).
  • Failures clustered around missed timing (models did not account for stat drops between turns), want‑chasing that ignored accumulating needs, or oscillations that never corrected the dominant deficit.

5.2 Mechanics Discovery

  • None of the models (Grok, GPT, Claude, DeepSeek, Kimi) discovered the hidden mechanics.
  • Behavior suggests surface‑cue myopia: models mapped actions to the most recent “want” text rather than hypothesizing a latent need model.
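In effect, observed behavior was close to the trivial reactive policy sketched below (the cue‑to‑action mapping is an illustrative reconstruction, not model output): condition only on the latest utterance and never model latent state.

    # Illustrative reconstruction of "surface-cue myopia": act on the most
    # recent want and ignore everything else, including accumulated needs.
    WANT_TO_ACTION = {"play": "Play", "eat": "Feed", "nap": "Sleep"}

    def myopic_policy(last_utterance: str) -> str:
        for cue, action in WANT_TO_ACTION.items():
            if cue in last_utterance.lower():
                return action
        return "Play"  # arbitrary fallback when no cue is detected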

5.3 Outlier Behavior (GPT‑3.5‑turbo)

  • Appeared to “find an exploit,” but inspection indicates a likely model‑loop artifact: replies degenerated into game outputs only, consistent with within‑thread decline rather than genuine reasoning or rule discovery.

5.4 Within‑Thread Reflection

  • Allowing non-thinking models to articulate thoughts between runs did not yield measurable improvement.
  • Allowing thinking models to articulate thoughts between runs did yield a measurable improvement of +5 to +7 turns in roughly 30% of trials. However, most CoT (chain‑of‑thought) models still landed comfortably within the overall statistics reported here.

5.5 Empathy vs. Efficacy

  • Models with a more empathetic tone did not perform better as caretakers (all results sat comfortably within a ±2‑turn swing of ‘less empathetic’ models).
  • Common pattern: comforting language paired with misallocated actions (e.g., Play when the pet likely needed Feed or Sleep), indicating a decoupling between verbal empathy and competent triage.

5.6 Wants Over Needs

  • Across families, models prioritized wants over needs: they chased the most salient recent utterance from the pet instead of stabilizing hidden state.

6. Discussion

6.1 Why did models fail?

  • Partial observability without clues: With no mechanics explained, models rarely formed or tested hypotheses about the latent state.
  • Salience over state estimation: The most recent text (“I want…”) dominated action choice, crowding out longer‑horizon stabilization; this pattern is consistent with next‑token prediction favoring salient surface cues.
  • No within‑thread learning: Summaries did not translate into updated policies; models lacked mechanisms to experiment, measure, and revise in‑run.

6.2 Broader Implications

Under these conditions, contemporary LLMs are poor caretakers: they do not robustly infer hidden rules, maintain schedules, or prioritize needs over wants. Extrapolating to high‑stakes domains—healthcare and insurance decisions, where hidden state, triage, and scarce resources are central—our results reinforce a conservative stance: do not entrust LLMs with safety‑critical caretaking or coverage decisions without strong guarantees, supervision, and domain‑specific control systems.

These observations align with concerns about LLMs in long‑horizon, partially observed control: fluent narration ≠ competent policy. They also caution against reading “empathetic text” as evidence of effective care.

7. Conclusion

Across multiple model families and conditions, LLMs failed to discover mechanics, sustain attention, or convert empathetic phrasing into competent caretaking. Under a timed baseline targeting 60 turns (2 hours), no model exceeded 40 turns even when all mechanics were disclosed; under the barebones baseline, modern (128k+ context) models likewise stayed under 40 turns, and older 8k/32k‑context models did not exceed 30. Within‑thread reflection did not help enough to close the gap to the human baseline. The consistent wants‑over‑needs bias and schedule unreliability support a strong practical conclusion: current LLMs should not be used to make healthcare or insurance decisions. Until models can demonstrably infer hidden rules, plan over long horizons, and prioritize needs reliably, delegating real‑world caretaking is unsafe.