Everything you need to know about Visual ChatGPT

Microsoft has unveiled a new model called Visual ChatGPT that combines ChatGPT with visual foundation models (VFMs) such as Visual Transformers, ControlNet, and Stable Diffusion.

According to media reports, the model enables interaction beyond language and extends the capabilities of ChatGPT.

ChatGPT has gained interdisciplinary interest due to its exceptional conversational competence and reasoning ability in several areas, making it a top choice for a language interface.

However, because it was trained on language alone, it cannot process or produce images. Meanwhile, visual foundation models such as Visual Transformers or Stable Diffusion excel at single-round tasks with fixed inputs and outputs, demonstrating remarkable visual understanding and generation capabilities. Combining the two kinds of model can yield a system, such as Visual ChatGPT, that can process and generate visual content as well as language.

Microsoft researchers have created a system known as Visual ChatGPT that contains multiple visual foundation models and allows users to interact with ChatGPT through graphical user interfaces. The system's capabilities include:

Visual ChatGPT can send and receive not only text but also images.

Visual ChatGPT can handle complex visual queries or editing instructions that require the collaboration of multiple AI models at different stages.

The researchers created a series of prompts that feed information about the visual models into ChatGPT, so that it can work with models that have multiple inputs and outputs and with models that require visual feedback. Through testing, they found that Visual ChatGPT makes it possible to explore ChatGPT's visual capabilities with the help of visual foundation models.
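The idea of describing visual models to a language model through prompts can be illustrated with a minimal sketch. It is not the researchers' implementation: the `Tool` class, the `build_prompt` and `dispatch` functions, the "Action:" reply format, and the two placeholder tools are all assumptions made for illustration, and no real language model or visual model is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """A hypothetical wrapper around one visual model."""
    name: str
    description: str           # tells the language model when to use the tool
    run: Callable[[str], str]  # takes an input string, returns an output string


def build_prompt(tools: Dict[str, Tool]) -> str:
    """Assemble a system prompt that lists every available visual tool."""
    lines = ["You can use the following visual tools:"]
    for tool in tools.values():
        lines.append(f"- {tool.name}: {tool.description}")
    lines.append(
        "To use a tool, reply with 'Action: <tool name>' "
        "and 'Action Input: <input>'."
    )
    return "\n".join(lines)


def dispatch(reply: str, tools: Dict[str, Tool]) -> str:
    """Run the tool named in a model reply, or pass plain text through."""
    action = None
    action_input = ""
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Action Input:"):
            action_input = line[len("Action Input:"):].strip()
        elif line.startswith("Action:"):
            action = line[len("Action:"):].strip()
    if action is None:
        return reply  # the reply was an ordinary text answer
    if action not in tools:
        return f"Unknown tool: {action}"
    return tools[action].run(action_input)


# Placeholder tools standing in for real visual models (assumed names).
TOOLS = {
    "caption_image": Tool(
        name="caption_image",
        description="Describe the image at the given file path.",
        run=lambda path: f"a placeholder caption for {path}",
    ),
    "detect_edges": Tool(
        name="detect_edges",
        description="Produce an edge map for the image at the given file path.",
        run=lambda path: f"{path}.edges.png",
    ),
}

if __name__ == "__main__":
    print(build_prompt(TOOLS))
    print(dispatch("Action: caption_image\nAction Input: cat.png", TOOLS))
```

In this sketch the language model never sees pixels. It only reads tool descriptions and file names, which is one simple way several visual models could be chained over successive turns of a conversation.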

However, the researchers also identified limitations in their work, including inconsistent generation results caused by failures of the visual foundation models (VFMs) and by variability in the prompts. They concluded that a self-correcting module is needed to check that the output matches human intentions and to make the necessary adjustments. Including such a module could, in turn, increase the model's inference time because of the constant course correction. The team plans to investigate this issue further in a future study.


Joanna Swanson

Joanna Swanson is Europe correspondent at the Thomson Reuters Foundation, based in Brussels, covering politics, culture, business, climate change, society, economies and inclusive tech. With a specific focus on breaking news, she has covered some of the world's most significant stories.