Sangjun Noh¹ | Jongwon Kim¹ | Dongwoo Nam¹ | Seunghyeok Back¹ | Raeyoung Kang¹ | Kyoobin Lee¹†

†Corresponding author
Grasp detection requires flexibility to handle objects of various shapes without relying on prior object knowledge, while also offering intuitive, user-guided control. In this paper, we introduce GraspSAM, an innovative extension of the Segment Anything Model (SAM) designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, GraspSAM leverages SAM’s large-scale training and prompt-based segmentation capabilities to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified framework. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate GraspSAM’s flexibility in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications.
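As a rough illustration of the architecture described above, the sketch below wires a frozen SAM backbone, its prompt encoder, and a small grasp head into a single prompt-driven model. Everything here is an assumption made for illustration rather than the released GraspSAM code: the class and attribute names (`GraspSAMSketch`, `grasp_tokens`, `grasp_head`), the 4-channel grasp output, and the use of the `vit_b` SAM variant are all hypothetical, and GraspSAM's modified decoder fuses the learnable grasp tokens internally, which this sketch only hints at.

```python
# Minimal sketch of a prompt-driven grasp head on top of SAM.
# NOT the authors' implementation: GraspSAMSketch, grasp_tokens, grasp_head,
# and the token-fusion details are illustrative assumptions.
# Assumes the `segment_anything` package is installed.
import torch
import torch.nn as nn
from segment_anything import sam_model_registry


class GraspSAMSketch(nn.Module):
    def __init__(self, sam, embed_dim=256, num_grasp_tokens=4):
        super().__init__()
        # SAM components; in the paper the heavy image encoder stays frozen
        # and is adapted with lightweight adapters rather than fine-tuned.
        self.image_encoder = sam.image_encoder
        self.prompt_encoder = sam.prompt_encoder
        self.mask_decoder = sam.mask_decoder  # "lightweight modified decoder" in the paper
        # Learnable token embeddings for grasp prediction (assumed design);
        # not wired into this sketch, shown only to indicate where they would live.
        self.grasp_tokens = nn.Embedding(num_grasp_tokens, embed_dim)
        # Placeholder head mapping image features to per-pixel grasp maps
        # (e.g., quality, cos 2θ, sin 2θ, width).
        self.grasp_head = nn.Conv2d(embed_dim, 4, kernel_size=1)

    def forward(self, image, points=None, boxes=None):
        feats = self.image_encoder(image)  # (B, 256, 64, 64) image embedding
        sparse, dense = self.prompt_encoder(points=points, boxes=boxes, masks=None)
        # Segmentation branch: standard SAM decoding conditioned on the prompt.
        masks, _ = self.mask_decoder(
            image_embeddings=feats,
            image_pe=self.prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=sparse,
            dense_prompt_embeddings=dense,
            multimask_output=False,
        )
        # Grasp branch: in the real model the grasp tokens attend to the image
        # features inside the modified decoder; here a 1x1 conv stands in.
        grasp_maps = self.grasp_head(feats)
        return masks, grasp_maps


if __name__ == "__main__":
    sam = sam_model_registry["vit_b"](checkpoint=None)  # untrained weights, just to trace shapes
    model = GraspSAMSketch(sam)
    image = torch.randn(1, 3, 1024, 1024)
    box = torch.tensor([[100.0, 100.0, 400.0, 400.0]])  # single box prompt (XYXY)
    masks, grasp_maps = model(image, boxes=box)
```

Because only the adapters, learnable tokens, and the small decoder/head would be trainable in such a setup, fine-tuning stays lightweight, which matches how the paper describes adapting SAM to grasp detection with minimal additional training.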
| Methods | Grasp-Anything [7] Base | New | H | Jacquard [6] Base | New | H |
|---|---|---|---|---|---|---|
| GR-ConvNet* [3] | 0.68 | 0.55 | 0.61 | 0.82 | 0.61 | 0.70 |
| Det-Seg-Refine* [4] | 0.58 | 0.53 | 0.55 | 0.79 | 0.55 | 0.65 |
| GG-CNN* [2] | 0.65 | 0.53 | 0.58 | 0.73 | 0.52 | 0.61 |
| LGD* [8] | 0.69 | 0.57 | 0.62 | 0.83 | 0.64 | 0.72 |
| GraspSAM-tiny (ours) | 0.78 | 0.75 | 0.77 | 0.90 | 0.81 | 0.85 |
| GraspSAM-t (ours) | 0.83 | 0.81 | 0.82 | 0.87 | 0.75 | 0.81 |
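In these tables, Base and New follow the Grasp-Anything benchmark protocol of evaluating on base (seen) and new (unseen) object categories; the H column is consistent with the harmonic mean of the two scores (our reading of the reported values, e.g. GR-ConvNet* on Grasp-Anything: 2 · 0.68 · 0.55 / (0.68 + 0.55) ≈ 0.61):

$$H = \frac{2 \cdot \mathrm{Base} \cdot \mathrm{New}}{\mathrm{Base} + \mathrm{New}}$$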
| Methods | Grasp-Anything [7] Base | New | H | Jacquard [6] Base | New | H |
|---|---|---|---|---|---|---|
| GR-ConvNet [3] | 0.75 | 0.61 | 0.67 | 0.88 | 0.66 | 0.75 |
| Det-Seg-Refine [4] | 0.64 | 0.59 | 0.61 | 0.85 | 0.59 | 0.70 |
| GG-CNN [2] | 0.72 | 0.59 | 0.65 | 0.78 | 0.56 | 0.65 |
| LGD [8] | 0.77 | 0.65 | 0.70 | 0.89 | 0.70 | 0.78 |
| GraspSAM-tiny (ours) | 0.79 | 0.68 | 0.73 | 0.88 | 0.79 | 0.83 |
| GraspSAM-t (ours) | 0.89 | 0.82 | 0.85 | 0.83 | 0.72 | 0.77 |
| Methods | Grasp-Anything++ [8] Base | New | H |
|---|---|---|---|
| CLIPORT [24] | 0.36 | 0.26 | 0.29 |
| CLIPGrasp [25] | 0.40 | 0.29 | 0.33 |
| LGD [8] | 0.48 | 0.42 | 0.45 |
| GraspSAM w/ G.D (Ours) | 0.64 | 0.62 | 0.63 |
```bibtex
@article{noh2024graspsam,
  title={GraspSAM: When Segment Anything Model Meets Grasp Detection},
  author={Noh, Sangjun and Kim, Jongwon and Nam, Dongwoo and Back, Seunghyeok and Kang, Raeyoung and Lee, Kyoobin},
  journal={arXiv preprint arXiv:2409.12521},
  year={2024}
}
```
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II212068, Artificial Intelligence Innovation Hub).