GraspSAM: When Segment Anything Model Meets Grasp Detection [ICRA 2025]


Sangjun Noh¹, Jongwon Kim¹, Dongwoo Nam¹, Seunghyeok Back², Raeyoung Kang¹, Kyoobin Lee¹

Corresponding author

¹ Gwangju Institute of Science and Technology (GIST)

² Korea Institute of Machinery & Materials (KIMM)







Abstract

Grasp detection requires flexibility to handle objects of various shapes without relying on prior object knowledge, while also offering intuitive, user-guided control. In this paper, we introduce GraspSAM, an innovative extension of the Segment Anything Model (SAM) designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, GraspSAM leverages SAM’s large-scale training and prompt-based segmentation capabilities to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified framework. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate GraspSAM’s flexibility in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications.





GraspSAM Pipeline

[Figure: overview of the GraspSAM pipeline]





GraspSAM Result

Grasp detection with 10 points

Methods                 Grasp-Anything [7]       Jacquard [6]
                        Base    New     H        Base    New     H
GR-ConvNet* [3]         0.68    0.55    0.61     0.82    0.61    0.70
Det-Seg-Refine* [4]     0.58    0.53    0.55     0.79    0.55    0.65
GG-CNN* [2]             0.65    0.53    0.58     0.73    0.52    0.61
LGD* [8]                0.69    0.57    0.62     0.83    0.64    0.72
GraspSAM-tiny (ours)    0.78    0.75    0.77     0.90    0.81    0.85
GraspSAM-t (ours)       0.83    0.81    0.82     0.87    0.75    0.81
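
The H column is not defined on this page. The reported values are consistent, to within rounding, with H being the harmonic mean of the Base and New success rates, so the sketch below assumes that definition. The function name `harmonic_mean` and the `rows` dictionary are illustrative; the numbers are copied from the Grasp-Anything columns of the table above.

```python
def harmonic_mean(base: float, new: float) -> float:
    """Harmonic mean of two success rates (assumed definition of H).

    Returns 0.0 when both rates are zero to avoid dividing by zero.
    """
    total = base + new
    return 0.0 if total == 0 else 2.0 * base * new / total


# (Base, New, reported H) on Grasp-Anything, copied from the table above.
rows = {
    "GR-ConvNet*": (0.68, 0.55, 0.61),
    "LGD*": (0.69, 0.57, 0.62),
    "GraspSAM-t": (0.83, 0.81, 0.82),
}

for name, (base, new, reported) in rows.items():
    computed = harmonic_mean(base, new)
    print(f"{name}: computed H = {computed:.2f}, reported H = {reported:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method scores a high H only if it performs well on both Base and New categories. Small mismatches of about 0.01 against the table can occur because the published Base and New values are themselves rounded.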


Category-agnostic Grasp Detection

Methods                 Grasp-Anything [7]       Jacquard [6]
                        Base    New     H        Base    New     H
GR-ConvNet [3]          0.75    0.61    0.67     0.88    0.66    0.75
Det-Seg-Refine [4]      0.64    0.59    0.61     0.85    0.59    0.70
GG-CNN [2]              0.72    0.59    0.65     0.78    0.56    0.65
LGD [8]                 0.77    0.65    0.70     0.89    0.70    0.78
GraspSAM-tiny (ours)    0.79    0.68    0.73     0.88    0.79    0.83
GraspSAM-t (ours)       0.89    0.82    0.85     0.83    0.72    0.77


Grasp detection with language

Methods                   Grasp-Anything++ [8]
                          Base    New     H
CLIPort [24]              0.36    0.26    0.29
CLIPGrasp [25]            0.40    0.29    0.33
LGD [8]                   0.48    0.42    0.45
GraspSAM w/ G.D (ours)    0.64    0.62    0.63





GraspSAM Inference on Real-World Video

Prompt: 1 point

Prompt: 10 points


Prompt: Box

Prompt: Language


Prompt: Eye Gaze as Points





Additional Dataset Image Inference

Grasp-Anything (Seen)

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]

Grasp-Anything (Unseen)

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]





In the Wild Image Inference

ARMBench

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]

GraspNet

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]

OCID

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]





Real Image Inference

Prompt: 1 Point

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]

Prompt: 3 Points

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]

Prompt: 5 Points

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]

Prompt: 10 Points

[Figure: qualitative results. Columns: RGB image, prompt, predicted mask, predicted grasp, quality map.]





Real Image Inference with Grounding DINO

[Figure: additional qualitative results]





Citation

@article{noh2024graspsam,
  title={GraspSAM: When Segment Anything Model Meets Grasp Detection},
  author={Noh, Sangjun and Kim, Jongwon and Nam, Dongwoo and Back, Seunghyeok and Kang, Raeyoung and Lee, Kyoobin},
  journal={arXiv preprint arXiv:2409.12521},
  year={2024}
}





Acknowledgements

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS2021-II212068, Artificial Intelligence Innovation Hub).





Author Contacts

Contact email to get more information on this project



Address: Dasan Building (C9) 204/206 & Central Research Facilities (C11) 403,
123 Cheomdangwagi-ro, Buk-gu, Gwangju, 61005, Korea