If you would like to view the project code or the full project report, please feel free to contact me: fanc3@illinois.edu
Abstract
This project develops a language-guided grasp-and-stack framework for a cluttered tabletop environment in Isaac Lab/Isaac Sim. Here, clutter means multiple objects in arbitrary poses with substantial occlusion and overlap. The system accepts natural-language stacking instructions, autonomously removes purple clutter cubes when targets are not fully visible, performs top-down grasps and stacking using differential inverse kinematics, and employs a vision-language model (VLM) to automatically verify stacking success. I demonstrate reliable single-, two-, and three-cube stacking under heavy occlusion, robustness across varying clutter densities, and the feasibility of replacing manual success labeling with lightweight VLM-based judgment.
Introduction
Cluttered settings are common in real-world manipulation, from bins and tabletops to disaster-response environments. A classical perception-to-control pipeline can succeed for closed-set objects, yet the usability gap remains large: end users often wish to specify tasks in natural language, and large-scale evaluations require automated success detection rather than human-in-the-loop labeling.
This project addresses the challenge of grasping in clutter by integrating foundation-model interfaces into an end-to-end grasp–remove–stack loop. The final system not only grasps in clutter but also stacks colored cubes, using LLM/VLM modules to (i) convert free-form language into structured color commands and (ii) judge stacking success automatically.
Framework
At a high level, the system takes a natural-language instruction (e.g., “The first cube in the stack is red, the second is green”), uses an LLM/VLM to generate a sequence of primitive color commands, and executes each command via a perception-driven top-down grasp and a height-aware placement policy. When no fully visible target cube is found within the ROI, the robot grasps a purple clutter cube and discards it off the main table surface, then rechecks for targets. After placement, the system captures a front-facing camera view of the stack and queries a VLM for a binary success decision; failures trigger retries, while successes increment the stack counter and advance to the next language-implied subgoal.
Figure: System overview of language-guided grasp-and-stack with clutter removal.
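As a concrete illustration of this loop, the Python sketch below mirrors the steps above under simplifying assumptions; the five helper callables (parse_instruction, find_visible_target, remove_clutter_cube, grasp_and_stack, vlm_judge_success) are hypothetical placeholders for the project's perception, control, and foundation-model modules, not its actual API.

# Minimal sketch of the grasp-remove-stack loop described above. The five helper
# callables are hypothetical placeholders for the project's perception, control,
# and foundation-model modules; they are passed in rather than implemented here.
from typing import Callable, Optional

MAX_RETRIES = 3

def run_task(
    instruction: str,
    parse_instruction: Callable[[str], list],          # LLM: free-form text -> ordered colors
    find_visible_target: Callable[[str], Optional[object]],
    remove_clutter_cube: Callable[[], None],
    grasp_and_stack: Callable[..., None],
    vlm_judge_success: Callable[..., bool],
) -> bool:
    """Execute a natural-language stacking instruction end to end."""
    colors = parse_instruction(instruction)
    stack_height = 0
    for color in colors:
        for _attempt in range(MAX_RETRIES):
            # 1. Look for a fully visible target cube of the requested color in the ROI.
            target = find_visible_target(color)
            # 2. While no target is visible, remove a purple clutter cube and re-check.
            while target is None:
                remove_clutter_cube()                   # grasp a purple cube, discard it off the table
                target = find_visible_target(color)
            # 3. Top-down grasp and height-aware placement via differential IK.
            grasp_and_stack(target, level=stack_height)
            # 4. Query the VLM for a binary success decision from the front camera view.
            if vlm_judge_success(expected_color=color, expected_level=stack_height):
                stack_height += 1
                break                                   # subgoal done; advance to the next color
        else:
            return False                                # exhausted retries for this subgoal
    return True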
Experiments
All experiments were conducted in NVIDIA Isaac Sim / Isaac Lab with a UR5e arm and a parallel-jaw gripper. The environment is a cluttered tabletop scene containing three target color families (red/green/blue) and a larger number of purple cubes that act as occluders. Each cube has an edge length of 0.05 m. The system uses two RGB-D cameras: an overhead bird's-eye-view camera for perception and 3D localization, and a front-facing camera for stack-success judgment. Six cubes are spawned per target color (18 in total), while the clutter count is randomized depending on the experiment. I evaluate the system under heavy occlusion/overlap, focusing on both task-level autonomy and foundation-model-enabled interaction. I conducted five sets of experiments, each reporting success/failure at three checkpoints:
Checkpoint 1 (Grasp): Successful acquisition and lift of the target cube.
Checkpoint 2 (Place): Successful placement at the intended stack location and height.
Checkpoint 3 (Detect): Correct recognition of the stacking outcome as success or failure (manual or VLM-based).
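For bookkeeping across the tables below, each trial can be summarized against these three checkpoints. The record type below is an illustrative sketch; the field names and the success_rate helper are assumptions, not the project's actual logging code.

from dataclasses import dataclass

# Illustrative per-trial record for the three checkpoints; the field names are
# assumptions for bookkeeping, not the project's actual data schema.
@dataclass
class TrialResult:
    experiment_set: int          # 1..5
    trial_id: int
    target_color: str            # e.g. "red"
    grasp_ok: bool = False       # Checkpoint 1: acquisition and lift of the target cube
    place_ok: bool = False       # Checkpoint 2: placement at the intended location and height
    detect_ok: bool = False      # Checkpoint 3: correct recognition of the stacking outcome
    detector: str = "manual"     # "manual" or the name of a VLM backbone

def success_rate(trials: list[TrialResult], checkpoint: str) -> float:
    """Fraction of trials passing a checkpoint ('grasp_ok', 'place_ok', or 'detect_ok')."""
    return sum(getattr(t, checkpoint) for t in trials) / max(len(trials), 1)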
Experiment Set 1 (Single-Cube Language-Guided Stacking): In the cluttered environment described above (random clutter count between 50 and 60), the user issues a single-cube instruction such as “Stack one red cube.” I repeat 10 trials with varied prompt styles. The result is shown in Table I.
Experiment Set 2 (Two-Cube Language-Guided Stacking): The user gives a two-cube instruction, e.g., “The first cube is red, the second is green.” The system uses the same LLM-driven step-by-step parsing to generate cmd_1 then cmd_2. I record results for each subtask in 10 trials. The result is shown in Table II.
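The step-by-step parsing itself can be sketched as follows; query_llm is a placeholder for whichever LLM endpoint is used, and the prompt and JSON answer format are illustrative assumptions rather than the project's actual implementation.

import json
from typing import Callable

ALLOWED_COLORS = {"red", "green", "blue"}

def parse_instruction(instruction: str, query_llm: Callable[[str], str]) -> list[str]:
    """Sketch of the step-by-step parsing: free-form text -> ordered color commands
    (cmd_1, cmd_2, ...). `query_llm` stands in for the actual LLM call."""
    prompt = (
        "Extract the stacking order of cube colors from the instruction below. "
        "Answer only with a JSON list of colors drawn from [red, green, blue], "
        "bottom cube first.\n"
        f"Instruction: {instruction}"
    )
    colors = json.loads(query_llm(prompt))            # e.g. '["red", "green"]'
    assert all(c in ALLOWED_COLORS for c in colors), "unexpected color from LLM"
    return colors                                     # colors[0] -> cmd_1, colors[1] -> cmd_2, ...

For example, “The first cube is red, the second is green” would be parsed to ["red", "green"], i.e., cmd_1 = red and cmd_2 = green.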
Experiment Set 3 (Three-Cube Language-Guided Stacking): The user provides a three-cube instruction, e.g., “Stack three cubes: first red, second green, third blue.” I record results for each of the three subtasks across 10 trials. The result is shown in Table III.
Experiment Set 4 (Robustness to Clutter Density): I vary the clutter density and use the same prompt (“The first cube in the stack is red.”). Each density setting is evaluated five times to probe robustness under increasing occlusion. The result is shown in Table IV.
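For reference, clutter randomization could look like the sketch below; the table extents and drop heights are assumed values, and this is not the project's actual Isaac Lab spawning code.

import math
import random

# Illustrative clutter randomization (an assumption, not the project's actual
# Isaac Lab spawning code): sample a clutter count for the chosen density band
# and drop purple cubes at random poses above the table so they settle into
# piles, producing the occlusion and overlap described above.

TABLE_X, TABLE_Y = (-0.4, 0.4), (-0.3, 0.3)       # assumed table extents in metres

def sample_clutter_poses(density_band=(50, 60)):
    n = random.randint(*density_band)             # clutter count for this episode
    poses = []
    for _ in range(n):
        x = random.uniform(*TABLE_X)
        y = random.uniform(*TABLE_Y)
        z = random.uniform(0.10, 0.25)            # drop height above the tabletop
        yaw = random.uniform(0.0, 2 * math.pi)    # random orientation about the vertical axis
        poses.append((x, y, z, yaw))
    return poses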
Experiment Set 5 (Feasibility of Different VLMs for Stack-Success Detection): I keep the same clutter configuration (50–60 clutter cubes) and the same prompt (“The first cube in the stack is red.”), while varying the stack-success detection method: manual judgment vs. different VLM backbones. The goal is to validate that the success checkpoint can be swapped across foundation models with minimal pipeline changes. The result is shown in Table V.
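One way to make the success checkpoint pluggable is to hide the judge behind a small interface, as sketched below; the interface, the backbone name, and query_fn are assumptions for illustration, not the project's actual code.

from typing import Protocol

class StackSuccessDetector(Protocol):
    """Interface for Checkpoint 3: judge stacking success from a front-camera image."""
    def judge(self, image, expected_colors: list[str]) -> bool: ...

class ManualDetector:
    def judge(self, image, expected_colors):
        # Baseline: a human inspects the rendered view and types y/n.
        return input(f"Stack {expected_colors} correct? [y/n] ").strip().lower() == "y"

class VLMDetector:
    def __init__(self, backbone: str, query_fn):
        # `backbone` names the VLM; `query_fn(image, prompt) -> str` is a placeholder
        # for whichever VLM API is actually used.
        self.backbone, self.query_fn = backbone, query_fn

    def judge(self, image, expected_colors):
        prompt = (
            f"Does the image show a stack of cubes ordered bottom-to-top as "
            f"{expected_colors}? Answer yes or no."
        )
        return self.query_fn(image, prompt).strip().lower().startswith("yes")

# Swapping detectors requires no other pipeline changes, e.g. (hypothetical backbone and call):
#   detector = VLMDetector("gpt-4o", query_fn=my_vlm_call)
#   success = detector.judge(front_camera_image, ["red"])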
Figure: Representative failure cases: (a) grasp confusion under partial visibility; (b) placement slip during multi-cube sequences; (c) multi-object grasp causing release failure; (d) occasional VLM judgment inconsistency; (e) perception-driven grasp failures.
Note: This page will be continuously updated as the project progresses.