Vision-Language Model for Robot Grasping

Abhinav Kaushal Keshari, Purdue University

Abstract

Robot grasping is emerging as an active area of research in robotics as the interest in humanrobot interaction is gaining worldwide because of diverse industrial settings for sharing tasks and workplaces. It mainly focuses on the quality of generated grasps for object manipulation. However, despite advancements, these methods need to consider the human-robot collaboration settings where robots and humans will have to grasp the same objects concurrently. Therefore, generating robot grasps compatible with human preferences of simultaneously holding an object is necessary to ensure a safe and natural collaboration experience. In this work, we propose a novel, deep neural network-based method called CoGrasp that generates human-aware robot grasps by contextualizing human preference models of object grasping into the robot grasp selection process. We validate our approach against existing state-of-the-art robot grasping methods through simulated and real-robot experiments and user studies. In real robot experiments, our method achieves about 88% success rate in producing stable grasps that allow humans to interact and grasp objects simultaneously in a socially compliant manner. Furthermore, our user study with 10 independent participants indicated our approach enables a safe, natural, and socially aware humanrobot objects' co-grasping experience compared to a standard robot grasping technique. To facilitate the grasping process, we also introduce a vision-language model that works as a pre-processing system before the grasping action takes place. In most settings, the robots are equipped with sensors that allow them to capture the scene, on which the vision model is used to do a detection task and objectify the visible contents in the environment. The language model is used to program the robot to make it possible for them to understand and execute the required sequence of tasks. Using the process of object detection, we build a set of object queries from the sensor image and allow the user to provide an input query for a task to be performed. We then perform a similarity score among these queries to localize the object that needs attention, and once identified, we can use a grasping process for the task at hand.

Degree

M.Sc.

Advisors

Pomeranz, Purdue University.

Subject Area

Artificial intelligence|Logic|Robotics

Off-Campus Purdue Users:
To access this dissertation, please log in to our
proxy server
.

Share

COinS