GraspSAM: When Segment Anything Model Meets Grasp Detection


Sangjun Noh¹, Jongwon Kim¹, Dongwoo Nam¹, Seunghyeok Back¹, Raeyoung Kang¹, Kyoobin Lee¹

Corresponding author

¹Gwangju Institute of Science and Technology (GIST)







Abstract

Grasp detection requires flexibility to handle objects of various shapes without relying on prior object knowledge, while also offering intuitive, user-guided control. In this paper, we introduce GraspSAM, an innovative extension of the Segment Anything Model (SAM) designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, GraspSAM leverages SAM’s large-scale training and prompt-based segmentation capabilities to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified framework. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate GraspSAM’s flexibility in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications.
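The abstract names adapters as one of the components that keep fine-tuning light. This page does not describe their design, so the following is only a generic bottleneck adapter with a residual connection, written in plain Python for illustration. The class name, the layer sizes, and the zero-initialised up-projection are assumptions and should not be read as GraspSAM's actual implementation.

```python
import random


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class BottleneckAdapter:
    """Generic residual adapter (hypothetical): x + W_up * relu(W_down * x)."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights (dim -> hidden).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(hidden)]
        # Up-projection starts at zero, so the adapter is initially an
        # identity map and does not disturb the frozen backbone features.
        self.w_up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, a) for a in matvec(self.w_down, x)]  # ReLU bottleneck
        delta = matvec(self.w_up, h)
        return [xi + di for xi, di in zip(x, delta)]       # residual add

    def num_trainable(self):
        """Count adapter weights, the only parameters that would be tuned."""
        return (sum(len(r) for r in self.w_down)
                + sum(len(r) for r in self.w_up))


adapter = BottleneckAdapter(dim=8, hidden=2)
features = [float(i) for i in range(8)]
print(adapter(features) == features)  # identity before any training
print(adapter.num_trainable())
```

In a setup like this, only the adapter weights (and any other newly added parameters) would be updated while the pretrained backbone stays frozen, which is the usual reason such modules need little fine-tuning.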





GraspSAM Pipeline

[Figure: overview of the GraspSAM pipeline]





GraspSAM Results

Grasp detection with 10 points

Methods                  Grasp-Anything [7]      Jacquard [6]
                         Base    New     H       Base    New     H
GR-ConvNet* [3]          0.68    0.55    0.61    0.82    0.61    0.70
Det-Seg-Refine* [4]      0.58    0.53    0.55    0.79    0.55    0.65
GG-CNN* [2]              0.65    0.53    0.58    0.73    0.52    0.61
LGD* [8]                 0.69    0.57    0.62    0.83    0.64    0.72
GraspSAM-tiny (ours)     0.78    0.75    0.77    0.90    0.81    0.85
GraspSAM-t (ours)        0.83    0.81    0.82    0.87    0.75    0.81
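The H column is not defined on this page. Its values are consistent with the harmonic mean of the Base and New scores, which is the usual summary in base-to-new evaluation. Assuming that reading, a minimal sketch of the computation looks like this (the function name is illustrative):

```python
def harmonic_mean(base: float, new: float) -> float:
    """Harmonic mean of two success rates; 0.0 if both are zero."""
    if base + new == 0:
        return 0.0
    return 2.0 * base * new / (base + new)


# Two rows from the 10-point table, rounded to two decimals as on the page.
print(round(harmonic_mean(0.68, 0.55), 2))  # GR-ConvNet* on Grasp-Anything
print(round(harmonic_mean(0.83, 0.81), 2))  # GraspSAM-t on Grasp-Anything
```

Because the harmonic mean is pulled toward the smaller of the two values, a method only scores a high H when it does well on both base and new categories. Small differences from the printed H values can occur when the table was computed from unrounded Base and New scores.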


Category-agnostic Grasp Detection

Methods                  Grasp-Anything [7]      Jacquard [6]
                         Base    New     H       Base    New     H
GR-ConvNet [3]           0.75    0.61    0.67    0.88    0.66    0.75
Det-Seg-Refine [4]       0.64    0.59    0.61    0.85    0.59    0.70
GG-CNN [2]               0.72    0.59    0.65    0.78    0.56    0.65
LGD [8]                  0.77    0.65    0.70    0.89    0.70    0.78
GraspSAM-tiny (ours)     0.79    0.68    0.73    0.88    0.79    0.83
GraspSAM-t (ours)        0.89    0.82    0.85    0.83    0.72    0.77


Grasp detection with language

Methods                              Grasp-Anything++ [8]
                                     Base    New     H
CLIPORT [24]                         0.36    0.26    0.29
CLIPGrasp [25]                       0.40    0.29    0.33
LGD [8]                              0.48    0.42    0.45
GraspSAM w/ Grounding DINO (ours)    0.64    0.62    0.63
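The language results pair GraspSAM with Grounding DINO. One plausible way to combine the two, assumed here rather than taken from this page, is to let the detector turn the text into a box and pass the best-scoring box on as a box prompt. The sketch below shows only that glue logic: `detect_boxes` and `grasp_from_box` are hypothetical callables standing in for the detector and the grasp model, and `min_score` is an illustrative threshold.

```python
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), pixels


def grasp_from_text(
    image: Any,
    text: str,
    detect_boxes: Callable[[Any, str], List[Tuple[Box, float]]],
    grasp_from_box: Callable[[Any, Box], Any],
    min_score: float = 0.3,
) -> Optional[Any]:
    """Turn a text prompt into a box prompt, then ask for a grasp.

    detect_boxes(image, text) is assumed to return (box, score) pairs;
    grasp_from_box(image, box) is assumed to return a grasp for that box.
    """
    candidates = [(score, box)
                  for box, score in detect_boxes(image, text)
                  if score >= min_score]
    if not candidates:
        return None  # nothing matched the text confidently enough
    _, best_box = max(candidates, key=lambda pair: pair[0])
    return grasp_from_box(image, best_box)


# Stand-in components so the sketch runs without any model weights.
def fake_detector(image, text):
    return [((0.0, 0.0, 10.0, 10.0), 0.2), ((5.0, 5.0, 20.0, 20.0), 0.9)]


def fake_grasp_model(image, box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)  # box centre


print(grasp_from_text(None, "the red mug", fake_detector, fake_grasp_model))
```

Keeping the detector and the grasp model behind plain callables means the same function works unchanged if either component is swapped for a different model.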





GraspSAM Inference on Real-World Video

Prompt: 1 point

Prompt: 10 points


Prompt: Box

Prompt: Language


Prompt: Eye Gaze as Points





Additional Dataset Image Inference

Grasp-Anything (Seen)

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]
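The result grids on this page show a predicted mask, a quality map, and a predicted grasp for each prompt. This page does not state how the grasp is read from the maps. Pixel-wise methods such as GG-CNN [2] and GR-ConvNet [3] pick the highest-quality pixel and read the angle and width at that location, so the sketch below follows that convention. Restricting the search to the predicted mask is an additional assumption made here for target-object grasping, and all names are illustrative.

```python
import math


def decode_grasp(quality, cos2t, sin2t, width, mask=None):
    """Pick the best grasp pixel from per-pixel maps (nested lists).

    quality: grasp quality per pixel
    cos2t, sin2t: cosine and sine of twice the grasp angle per pixel
    width: gripper opening per pixel
    mask: optional object mask; only truthy pixels are considered
    """
    best_q, best_rc = -math.inf, None
    for r, row in enumerate(quality):
        for c, q in enumerate(row):
            if mask is not None and not mask[r][c]:
                continue  # ignore pixels outside the target object
            if q > best_q:
                best_q, best_rc = q, (r, c)
    if best_rc is None:
        return None  # empty mask: no valid grasp pixel
    r, c = best_rc
    # The angle is encoded as (cos 2t, sin 2t) because a parallel-jaw
    # grasp is symmetric under a 180 degree rotation.
    angle = 0.5 * math.atan2(sin2t[r][c], cos2t[r][c])
    return {"row": r, "col": c, "angle": angle,
            "width": width[r][c], "quality": best_q}


quality = [[0.9, 0.1], [0.2, 0.8]]
cos2t = [[0.0, 0.0], [0.0, 0.0]]
sin2t = [[1.0, 1.0], [1.0, 1.0]]
width = [[10.0, 20.0], [30.0, 40.0]]
target_mask = [[0, 0], [0, 1]]

print(decode_grasp(quality, cos2t, sin2t, width))               # global peak
print(decode_grasp(quality, cos2t, sin2t, width, target_mask))  # masked peak
```

With the mask applied, the globally best pixel at the top-left is skipped and the grasp is taken from inside the target object instead, which is the behaviour a prompt-driven grasp would need.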

Grasp-Anything (Unseen)

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]





In the Wild Image Inference

ARMBench

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]

GraspNet

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]

OCID

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]





Real Image Inference

Prompt: 1 point

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]

Prompt: 3 points

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]

Prompt: 5 points

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]

Prompt: 10 points

[Figure: result grid. Columns, left to right: RGB Image, Prompt, Pred Mask, Pred Grasp, Quality Map]





Real Image Inference with Grounding DINO

[Figure: additional result images]





Citation

@article{noh2024graspsam,
  title={GraspSAM: When Segment Anything Model Meets Grasp Detection},
  author={Noh, Sangjun and Kim, Jongwon and Nam, Dongwoo and Back, Seunghyeok and Kang, Raeyoung and Lee, Kyoobin},
  journal={arXiv preprint arXiv:2409.12521},
  year={2024}
}





Acknowledgements

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS2021-II212068, Artificial Intelligence Innovation Hub).





Author Contacts

Contact email to get more information on this project



[ Address : Dasan Building (C9) 204/206 & Central Research Facilities (C11) 403,
123 Cheomdangwagi-ro, Buk-gu, Gwangju, 61005, Korea ]