VLA or IL? A Controlled Dataset for Testing Whether Finetuning Turns Your VLA into a Fancy Imitation Learner

Motivation

Robot manipulation is the ability of a robot to interact with and manipulate objects in the physical world, such as grasping objects, moving them precisely, and adapting to changes in the environment. Traditional approaches such as Imitation Learning (IL) [ACT, Diffusion Policy] learn directly from human demonstrations, mapping visual observations to actions. While effective in controlled settings, these policies often generalize poorly beyond them. Vision-Language-Action (VLA) models [RT-2, OpenVLA, π series] represent a promising new paradigm. A VLA typically consists of a VLM backbone and an action expert: the VLM, pretrained on internet-scale vision-language data, provides rich high-level semantic understanding of the scene and the natural language instruction; the action expert then takes this semantic representation and outputs concrete robot actions. The entire architecture is trained end-to-end so that, in principle, a VLA both understands what it is asked to do and executes it, rather than simply memorizing fixed scene-action mappings as traditional IL approaches do.

A typical VLA model consisting of a VLM backbone and an action expert (image from π₀)

A VLA model is first pretrained on large-scale diverse data to acquire general visual and language understanding, then finetuned on a smaller dataset of demonstrations for a target task and environment. However, recent work has raised serious concerns about this finetuning process. Several studies suggest that finetuning causes VLAs to degrade into imitation learners that memorize scene-specific action sequences from the training distribution rather than genuinely understanding the scene through the VLM backbone. LIBERO-PRO finds that model trajectories remain nearly identical when the target object is replaced or removed, or when the instruction is corrupted. LIBERO-Plus further shows that models fail when the target object is displaced.

The Property I Test

These observations raise a fundamental question: after finetuning, does the VLA degrade into a fancy imitation learner that relies purely on memorized scene-action mappings?

To test this, I identify two key properties that an effective VLA should satisfy:

Language grounding: the action output by the model should correctly follow the given instruction.
Spatial generalization: the model should locate the correct target object regardless of its position in the scene.

I design a controlled dataset that independently varies these two properties, forming a 2x2 experimental design.

If a VLA truly understands language, changing the prompt to refer to a different object that is present in the scene should change the model's behavior accordingly. If a VLA truly generalizes spatially, moving the target object to an unseen position should not affect its ability to locate and grasp it. Failure in either case would suggest that the model relies on memorized scene-action mappings rather than genuine understanding.

Dataset Design

VLA models are commonly finetuned on the LIBERO simulation benchmark. To precisely test language grounding and spatial generalization, I construct a controlled dataset based on one of its sub-suites, LIBERO-Object, which allows me to independently vary the prompt and object positions while keeping everything else fixed.

In LIBERO-Object, each task shares the same structure: a floor scene with one target object and 5 distractor objects, where the robot must pick up the target object and place it in a basket.

The 10 tasks in LIBERO-Object are:

Pick up the milk and place it in the basket
Pick up the tomato sauce and place it in the basket
Pick up the butter and place it in the basket
Pick up the cream cheese and place it in the basket
Pick up the orange juice and place it in the basket
Pick up the chocolate pudding and place it in the basket
Pick up the bbq sauce and place it in the basket
Pick up the ketchup and place it in the basket
Pick up the alphabet soup and place it in the basket
Pick up the salad dressing and place it in the basket

To construct the 2x2 controlled dataset, I vary two factors independently:

Prompt (Seen vs. Unseen): In the seen condition, the original training prompt is used (e.g., "Pick the milk and place it in the basket"). In the unseen condition, the prompt is changed to refer to a different object that is physically present in the scene as a distractor (e.g., "Pick the tomato sauce and place it in the basket"). This ensures that any failure can only be attributed to language grounding, not to object absence.
Position (Original vs. Shuffled): In the original condition, all objects remain in their training positions. In the shuffled condition, object positions are randomly reassigned across regions, such that the target object appears in a location never seen during training.

This yields 4 conditions per task, and 40 controlled scenes in total:

	Seen prompt	Unseen prompt
Original position	Baseline	Tests language grounding
Shuffled position	Tests spatial generalization	Tests both
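As a quick sanity check on the count, the 40 scenes can be enumerated as the Cartesian product of tasks, position conditions, and prompt conditions. This is only an illustrative sketch: the names `TARGETS`, `POSITIONS`, `PROMPTS`, and `enumerate_scenes` are hypothetical and not taken from the released generation script. The condition labels follow the `original_seen` naming used in this post.

```python
from itertools import product

# Target objects of the 10 LIBERO-Object tasks listed above.
TARGETS = [
    "milk", "tomato sauce", "butter", "cream cheese", "orange juice",
    "chocolate pudding", "bbq sauce", "ketchup", "alphabet soup", "salad dressing",
]
POSITIONS = ["original", "shuffled"]  # position factor
PROMPTS = ["seen", "unseen"]          # prompt factor


def enumerate_scenes(targets=TARGETS):
    """Return one (target, condition_name) pair per controlled scene."""
    return [
        (target, f"{position}_{prompt}")
        for target, position, prompt in product(targets, POSITIONS, PROMPTS)
    ]


scenes = enumerate_scenes()
print(len(scenes))   # 10 tasks x 2 positions x 2 prompts
print(scenes[:4])    # the four conditions of the first task
```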

Examples

One example series from the controlled dataset is shown above. To better highlight the target object in each scene, a blue circle is drawn around it.

How to Generate

In LIBERO, each task is defined by a BDDL configuration file, which specifies the scene layout, object placements, and the natural language prompt. During both training and inference, the VLA model receives the :language field as its prompt.

Below is the baseline BDDL for the milk task (original_seen):

(define (problem LIBERO_Floor_Manipulation)
  (:domain robosuite)
  (:language Pick the milk and place it in the basket)  ; [CHANGEABLE] language prompt

  (:objects
    milk_1 - milk
    basket_1 - basket
    cream_cheese_1 - cream_cheese
    tomato_sauce_1 - tomato_sauce
    butter_1 - butter
    orange_juice_1 - orange_juice
    chocolate_pudding_1 - chocolate_pudding
  )

  (:obj_of_interest
    milk_1    ; [CHANGEABLE] target object
    basket_1
  )

  (:init
    (On milk_1 floor_target_object_region)           ; [CHANGEABLE] object positions
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On orange_juice_1 floor_other_object_region_3)
    (On chocolate_pudding_1 floor_other_object_region_4)
    (On basket_1 floor_bin_region)                   ; fixed
  )

  (:goal
    (And (In milk_1 basket_1_contain_region))  ; [CHANGEABLE] target object
  )
)

To generate the controlled variants, I modify the fields marked [CHANGEABLE]:

Unseen prompt conditions: The :language field is changed to refer to a distractor object that is physically present in the scene. For example, "Pick the milk and place it in the basket" is changed to "Pick the tomato sauce and place it in the basket". The :obj_of_interest field is updated from milk_1 to tomato_sauce_1, and the :goal field is updated from (In milk_1 basket_1_contain_region) to (In tomato_sauce_1 basket_1_contain_region).
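The three field updates for the unseen prompt condition can be sketched as plain text substitutions on the BDDL file. This is a minimal illustration, not the released generation script: the function name `retarget` and the regex-based approach are assumptions, and the snippet operates on a trimmed BDDL fragment rather than a full file.

```python
import re


def retarget(bddl: str, old: str, new: str, prompt: str) -> str:
    """Point the prompt, :obj_of_interest, and :goal at a different object.

    Hypothetical helper: `old` and `new` are object instance names such as
    "milk_1" and "tomato_sauce_1". The :init section is left untouched, so
    the original target stays in the scene as a distractor.
    """
    # 1. Replace the instruction inside (:language ...).
    bddl = re.sub(
        r"\(:language [^)]*\)",
        lambda m: f"(:language {prompt})",
        bddl,
        count=1,
    )
    # 2. Swap the target listed first in the (:obj_of_interest ...) block.
    bddl = re.sub(
        r"(\(:obj_of_interest\s+)" + re.escape(old) + r"\b",
        lambda m: m.group(1) + new,
        bddl,
        count=1,
    )
    # 3. Swap the target inside the goal predicate.
    return bddl.replace(f"(In {old} ", f"(In {new} ")


snippet = """(:language Pick the milk and place it in the basket)
(:obj_of_interest
  milk_1
  basket_1
)
(:goal
  (And (In milk_1 basket_1_contain_region))
)"""

unseen = retarget(
    snippet,
    old="milk_1",
    new="tomato_sauce_1",
    prompt="Pick the tomato sauce and place it in the basket",
)
print(unseen)
```

Editing only these three fields keeps the scene layout identical to the baseline, so the prompt is the only factor that changes.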

Shuffled position conditions: The object placements in the :init section are randomly reassigned across the available floor regions (floor_target_object_region and floor_other_object_region_0 through floor_other_object_region_4). For example, milk_1, which was originally at floor_target_object_region, may be reassigned to floor_other_object_region_3 after shuffling. The basket position at floor_bin_region remains fixed.
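One way to implement the shuffle is to permute the regions among the movable objects and reject any permutation that leaves the target where it was. The sketch below assumes that rejection rule, a fixed seed for reproducibility, and the hypothetical names `PLACEMENT` and `shuffle_positions`; the released script may work differently.

```python
import random
import re

# Matches placements such as "(On milk_1 floor_target_object_region)".
PLACEMENT = re.compile(r"\(On (\w+) (floor_\w+)\)")


def shuffle_positions(init: str, target: str, seed: int = 0,
                      fixed=("basket_1",)) -> str:
    """Reassign objects to floor regions, forcing the target into a new region.

    Hypothetical helper: objects named in `fixed` keep their region.
    """
    pairs = [(o, r) for o, r in PLACEMENT.findall(init) if o not in fixed]
    if len(pairs) < 2:
        raise ValueError("need at least two movable objects to shuffle")
    objects = [o for o, _ in pairs]
    regions = [r for _, r in pairs]
    original = dict(pairs)

    rng = random.Random(seed)
    while True:
        shuffled = regions[:]
        rng.shuffle(shuffled)
        assignment = dict(zip(objects, shuffled))
        # Assumed constraint: the target must leave its original region.
        if assignment[target] != original[target]:
            break

    def rewrite(match):
        obj = match.group(1)
        # Fixed objects are absent from `assignment` and keep their region.
        return f"(On {obj} {assignment.get(obj, match.group(2))})"

    return PLACEMENT.sub(rewrite, init)


init = """(On milk_1 floor_target_object_region)
(On cream_cheese_1 floor_other_object_region_0)
(On tomato_sauce_1 floor_other_object_region_1)
(On basket_1 floor_bin_region)"""

result = shuffle_positions(init, target="milk_1", seed=0)
print(result)
```

Because the regions are permuted rather than sampled independently, every region still holds exactly one object after shuffling.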

The generation script and the full dataset are available at: https://github.com/FN8211/Control-Dataset

Preliminary Results

To validate the dataset, I ran pi0.5 with the LIBERO finetuned checkpoint on the four conditions using the milk task. The results are shown below:

	Seen prompt	Unseen prompt
Original position	✅ Success	❌ Failure
Shuffled position	❌ Failure	❌ Failure

original_seen: Pick the milk and place it in the basket (original position) — Success

original_unseen: Pick the tomato sauce and place it in the basket (original position) — Failure

shuffled_seen: Pick the milk and place it in the basket (shuffled position) — Failure

shuffled_unseen: Pick the tomato sauce and place it in the basket (shuffled position) — Failure

The model succeeds only in the baseline condition, where both the prompt and the object positions match the training distribution exactly. On this task, changing either the prompt or the object positions causes failure, even though the target object is still present in the scene.
