Abstract: Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which benefits many applications, such as robot grasping and ...
Abstract: In this paper, we present a few-shot text-to-video framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU.