GraspSAM: When Segment Anything Model Meets Grasp Detection [ICRA 2025]

†Corresponding author

¹Gwangju Institute of Science and Technology (GIST)

²Korea Institute of Machinery & Materials (KIMM)

Abstract

Grasp detection requires flexibility to handle objects of various shapes without relying on prior object knowledge, while also offering intuitive, user-guided control. In this paper, we introduce GraspSAM, an innovative extension of the Segment Anything Model (SAM) designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, GraspSAM leverages SAM’s large-scale training and prompt-based segmentation capabilities to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified framework. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate GraspSAM’s flexibility in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications.

GraspSAM Pipeline

GraspSAM Result

Grasp detection with 10 points

Methods	Grasp-Anything [7]			Jacquard [6]
Methods	Base	New	H	Base	New	H
GR-ConvNet* [3]	0.68	0.55	0.61	0.82	0.61	0.70
Det-Seg-Refine* [4]	0.58	0.53	0.55	0.79	0.55	0.65
GG-CNN* [2]	0.65	0.53	0.58	0.73	0.52	0.61
LGD* [8]	0.69	0.57	0.62	0.83	0.64	0.72
GraspSAM-tiny (ours)	0.78	0.75	0.77	0.90	0.81	0.85
GraspSAM-t (ours)	0.83	0.81	0.82	0.87	0.75	0.81
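
The H column is not defined on this page. It appears to be the harmonic mean of the Base and New scores, which is the convention in the Grasp-Anything benchmark; treat that reading as an assumption. A minimal Python sketch for checking a row (the function name is illustrative, and the sample values are copied from the table above):

```python
def harmonic_mean(base: float, new: float) -> float:
    """Harmonic mean of two success rates; returns 0.0 if both are zero."""
    if base + new == 0:
        return 0.0
    return 2.0 * base * new / (base + new)


# (method, Base, New) on Grasp-Anything, copied from the table above.
rows = [
    ("GR-ConvNet*", 0.68, 0.55),
    ("LGD*", 0.69, 0.57),
    ("GraspSAM-t (ours)", 0.83, 0.81),
]

for name, base, new in rows:
    print(f"{name}: H = {harmonic_mean(base, new):.2f}")
```

Because the table entries are already rounded to two decimals, a recomputed H can differ from the listed value by 0.01. For example, 0.78 and 0.75 give 0.76, while the table lists 0.77.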

Category-agnostic Grasp Detection

Methods	Grasp-Anything [7]			Jacquard [6]
Methods	Base	New	H	Base	New	H
GR-ConvNet [3]	0.75	0.61	0.67	0.88	0.66	0.75
Det-Seg-Refine [4]	0.64	0.59	0.61	0.85	0.59	0.70
GG-CNN [2]	0.72	0.59	0.65	0.78	0.56	0.65
LGD [8]	0.77	0.65	0.70	0.89	0.70	0.78
GraspSAM-tiny (ours)	0.79	0.68	0.73	0.88	0.79	0.83
GraspSAM-t (ours)	0.89	0.82	0.85	0.83	0.72	0.77

Grasp detection with language

Methods	Grasp-Anything++ [8]
Methods	Base	New	H
CLIPORT [24]	0.36	0.26	0.29
CLIPGrasp [25]	0.40	0.29	0.33
LGD [8]	0.48	0.42	0.45
GraspSAM w/ Grounding DINO (ours)	0.64	0.62	0.63

GraspSAM Inference on Real-World Video

Prompt: 1 point

Prompt: 10 points

Prompt: Box

Prompt: Language

Prompt: Eye Gaze as Points

Additional Dataset Image Inference

Grasp-Anything (Seen)

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

Grasp-Anything (Unseen)

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map
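
The gallery panels pair a predicted mask with a quality map and a predicted grasp. This page does not describe how a grasp is read out of those maps. Pixel-wise grasp detectors commonly take the highest-quality pixel as the grasp center; restricting that search to the predicted mask is an assumption made here for illustration, and the function name and toy inputs below are hypothetical. A minimal Python sketch:

```python
from typing import Optional, Sequence, Tuple


def best_grasp_center(
    quality: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> Optional[Tuple[int, int, float]]:
    """Return (row, col, score) of the highest-quality pixel inside the mask.

    Returns None when the mask contains no foreground pixel.
    """
    best: Optional[Tuple[int, int, float]] = None
    for r, (q_row, m_row) in enumerate(zip(quality, mask)):
        for c, (q, m) in enumerate(zip(q_row, m_row)):
            # Skip background pixels; keep the first maximum encountered.
            if m and (best is None or q > best[2]):
                best = (r, c, q)
    return best


# Toy 2x2 example: the global maximum (0.9) lies outside the mask,
# so the best in-mask pixel is selected instead.
quality_map = [[0.1, 0.9],
               [0.4, 0.3]]
pred_mask = [[1, 0],
             [1, 1]]
print(best_grasp_center(quality_map, pred_mask))
```

A full grasp also needs an orientation and a gripper width, which this sketch leaves out because the page does not say how they are predicted.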

In the Wild Image Inference

ARMBench

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

GraspNet

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

OCID

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

Real Image Inference

Prompt: 1 Point

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

Prompt: 3 Points

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

Prompt: 5 Points

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

Prompt: 10 Points

Panels (two examples): RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map

Real Image Inference with Grounding DINO

Give me a car toy

I need a shoe

Give me a bus toy

Give me a mayonnaise

Give me a cup

Give me a box

A rigid object

A rigid object

A rigid object

A rigid object

A rigid object

A rigid object
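
For language prompts such as those above, a text-conditioned detector first localizes the object, and the resulting box can then serve as a box prompt. Grounding DINO's reference implementation returns boxes as normalized (cx, cy, w, h), whereas SAM-style box prompts use absolute (x1, y1, x2, y2) pixel coordinates. How GraspSAM wires the two together is not stated on this page, so the following Python sketch shows only the coordinate conversion; the function name and sample values are illustrative:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def cxcywh_norm_to_xyxy_pixels(box: Box, img_w: int, img_h: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2).

    The result is clamped to the image bounds.
    """
    cx, cy, w, h = box
    # Scale the normalized center and size to pixel units.
    cx, w = cx * img_w, w * img_w
    cy, h = cy * img_h, h * img_h
    x1 = max(0.0, cx - w / 2.0)
    y1 = max(0.0, cy - h / 2.0)
    x2 = min(float(img_w), cx + w / 2.0)
    y2 = min(float(img_h), cy + h / 2.0)
    return (x1, y1, x2, y2)


# Hypothetical detection for a prompt such as "Give me a cup" on a 640x480 frame.
print(cxcywh_norm_to_xyxy_pixels((0.5, 0.5, 0.2, 0.4), img_w=640, img_h=480))
```

Clamping keeps a box that slightly overshoots the frame usable as a prompt; whether that is needed depends on the detector's outputs.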

Citation

@article{noh2024graspsam,
  title={GraspSAM: When Segment Anything Model Meets Grasp Detection},
  author={Noh, Sangjun and Kim, Jongwon and Nam, Dongwoo and Back, Seunghyeok and Kang, Raeyoung and Lee, Kyoobin},
  journal={arXiv preprint arXiv:2409.12521},
  year={2024}
}

Acknowledgements

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS2021-II212068, Artificial Intelligence Innovation Hub).

Author Contacts

Contact email to get more information on this project

Address: Dasan Building (C9) 204/206 & Central Research Facilities (C11) 403, 123 Cheomdangwagi-ro, Buk-gu, Gwangju, 61005, Korea