Abstract: Vision-Language (VL) alignment across image and text modalities is a challenging task due to the inherent semantic ambiguity of data with multiple possible meanings. Existing methods ...