AmpAttention for Multi-view Robot Manipulation

Abstract

Attention-based multi-view robotic manipulation methods have recently made significant progress in both training efficiency and task performance. However, the inherent redundancy, occlusion, and viewpoint dependency of robot camera views often lead to severe attention drift. To address this challenge, we propose AmpAttention, a novel attention mechanism inspired by differential amplifiers in analog circuits. It aims to suppress attention noise and capture signals with a high signal-to-noise ratio for more reliable perception. Building on it, we introduce the RVAF model, which integrates task-guided intra-view and inter-view AmpAttention. Compared with previous state-of-the-art methods, RVAF achieves the highest average success rate across 18 RLBench tasks (249 variations) while reducing training time by 33.3%. RVAF also shows strong potential in real-world high-precision tasks, exemplified by its ability to pick up a dart and accurately insert it into the red bullseye. Furthermore, we extend RVAF to RVAF++ by incorporating the SAM2 image encoder. RVAF++ delivers substantial gains on high-precision tasks, reaching a 91% success rate on the 'insert peg' task.

Introduction

Comparison of attention distributions between standard attention and AmpAttention in robotic view images.

Derivation of the differential amplifier.
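For reference, the textbook input-output relation of a differential amplifier can be written as below. The symbols are assumptions introduced here, since the figure itself is not reproduced in the text: $V_{in}^{+}$ and $V_{in}^{-}$ are the two inputs, $A_d$ is the differential gain, and $A_{cm}$ is the common-mode gain.

```latex
V_{out} = A_d \left( V_{in}^{+} - V_{in}^{-} \right)
        + A_{cm} \, \frac{V_{in}^{+} + V_{in}^{-}}{2},
\qquad
\mathrm{CMRR} = \left| \frac{A_d}{A_{cm}} \right|
```

When $A_{cm}$ is small relative to $A_d$ (a high common-mode rejection ratio), noise that appears equally on both inputs is largely cancelled and only the difference is amplified.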

We propose AmpAttention, a differential-amplifier-inspired mechanism that extracts high signal-to-noise task cues from robotic views.
We design the RVAF model, which integrates intra- and inter-view AmpAttention to enhance manipulation reasoning.
We extend RVAF to RVAF++ by incorporating the SAM2 image encoder, significantly improving high-precision manipulation performance.
We achieve strong results in both simulation and the real world, demonstrating task generalization, efficiency, and scalability.
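One way to picture the differential-amplifier analogy behind AmpAttention is to subtract two softmax attention maps, so that attention mass shared by both maps (the common-mode "noise") cancels. The NumPy sketch below is an illustrative assumption, not the authors' formulation: the function name `differential_attention`, the use of two query/key pairs, and the weight `lam` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Hypothetical differential-style attention (an assumed form).

    q1, q2: (n_queries, d) query matrices for the two branches.
    k1, k2: (n_keys, d) key matrices for the two branches.
    v:      (n_keys, d_v) value matrix shared by both branches.
    lam:    assumed scalar weight on the subtracted branch.
    """
    d = q1.shape[-1]
    a_pos = softmax(q1 @ k1.T / np.sqrt(d))  # "non-inverting" attention map
    a_neg = softmax(q2 @ k2.T / np.sqrt(d))  # "inverting" attention map
    weights = a_pos - lam * a_neg            # shared attention mass cancels
    return weights @ v, weights
```

If both branches produce identical maps and `lam` is 1, the output is exactly zero, which mirrors an ideal amplifier rejecting a purely common-mode input. In a trained model the projections producing `q1`, `k1`, `q2`, `k2` would be learned, which this sketch leaves out.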

Simulation Experiments

Multi-task performance on RLBench. We report success rates (%) for 18 RLBench tasks, along with the mean success rate (%) and training time (days).
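The two summary metrics in this table are simple to compute. The sketch below shows one plausible way to do so; the helper names are hypothetical, and the numbers in the usage note are illustrative placeholders rather than values from the paper.

```python
def mean_success_rate(rates):
    """Average a list of per-task success rates given in percent."""
    if not rates:
        raise ValueError("rates must contain at least one task")
    return sum(rates) / len(rates)


def training_time_reduction(baseline_days, new_days):
    """Relative reduction in training time, in percent of the baseline."""
    if baseline_days <= 0:
        raise ValueError("baseline_days must be positive")
    return 100.0 * (baseline_days - new_days) / baseline_days
```

As an illustration only, a model that trains in 2 days against a 3-day baseline gives `training_time_reduction(3.0, 2.0)`, about 33.3%.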

Success and failure cases of RVAF and RVAF++ on RLBench tasks.

✅ RVAF(RLBench)

✅ RVAF++(RLBench)

❌ RVAF(RLBench)

❌ RVAF++(RLBench)

Analysis Experiments

1. Illustration of attention heatmaps from the front and right views for different models across different task scenarios.

Baseline vs. RVAF: push_buttons, stack_cups

Baseline vs. RVAF++: light_bulb_in, insert_onto_square

2. Ablations of components.

Real-world Experiments

We design five real-robot tasks: four challenging tasks ('press the toy switch', 'pick dart and insert red/green bullseye', and 'cable grasping') and one easier 'pick and place toy bear' task. Each task supports diverse natural language descriptions.

Press the toy switch: The toy is placed at a random position. The robot must accurately press a 12 mm × 6 mm control switch on the toy using its gripper. The task counts as successful only if the toy visibly changes state in response to the interaction.

Pick dart and insert bullseye: The dart is placed at a random location. The robot must grasp the dart and insert it into the bullseye. The red bullseye has a radius of 6 mm.

Real-world videos play at 2x speed by default. More real-world videos are coming.

✅ RVAF: pick dart and insert bullseye