In many robotic applications, especially those involving humans and the environment, linguistic and visual information must be processed jointly and bound together. Existing works either encode the image or the language into a subsymbolic space, like the CLIP model, or create a symbolic space of extracted information, like the object detection models. In this paper, we propose to describe images by nouns and modifiers and introduce a new embedded binding space where the linguistic and visual cues can effectively be bound. We investigate how state-of-the-art models perform in recognizing nouns and modifiers from images, and propose our method by introducing a dataset and CLIP-like recognition techniques based on transfer learning and metric learning. We show real-world experiments that demonstrate the practical applicability of our approach to robotics applications. Our results indicate that our method can surpass the state-of-the-art in recognizing nouns and modifiers from images. Interestingly, our method exhibits a language characteristic related to context sensitivity.