Salesforce releases ‘xGen-MM’ open-source multimodal AI models to advance visual language understanding

Salesforce, the enterprise software giant, has released a new suite of open-source large multimodal AI models that could accelerate research and development of more capable artificial intelligence systems.

The models, dubbed xGen-MM (also known as BLIP-3), represent a significant advance in AI’s ability to understand content that interleaves text and images and to generate text in response.

In a paper published on arXiv, researchers from Salesforce AI Research detailed the xGen-MM framework, which includes pre-trained models, datasets, and code for fine-tuning. The largest model, with 4 billion parameters, performs competitively with similar-sized open-source models across a range of benchmarks.

“We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research,” the authors wrote in the paper. This move marks a departure from the trend of keeping advanced AI models proprietary, potentially democratizing access to cutting-edge multimodal AI technology.

A schematic diagram of the xGen-MM (BLIP-3) framework, showing how it processes interleaved image and text data. The model uses a Vision Transformer to encode images, a token sampler to compress visual information, and a pre-trained large language model to generate text, with losses applied to text tokens. Credit: Salesforce AI Research
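The caption notes that the training losses are applied only to text tokens. As a rough illustration, and not Salesforce’s actual training code, the stdlib-only Python sketch below computes an average next-token cross-entropy while skipping positions flagged as image tokens. The function name and the `is_text` mask are hypothetical choices made for this example.

```python
import math

def masked_next_token_loss(logits, targets, is_text):
    """Average cross-entropy over text positions only (hypothetical helper).

    logits:  list of per-position score lists, one score per vocabulary entry
    targets: list of target token ids, one per position
    is_text: list of bools; False marks image-token positions to be skipped
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, is_text):
        if not keep:
            continue  # image-token positions contribute no loss
        # Numerically stable log-sum-exp over the vocabulary scores
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0

# The first position is an image token with a very poor prediction,
# but it is masked out, so only the uniform text prediction counts.
loss = masked_next_token_loss(
    logits=[[10.0, -10.0], [0.0, 0.0]],
    targets=[1, 0],
    is_text=[False, True],
)
print(loss)  # equals log(2), about 0.693
```

Masking rather than dropping image positions keeps the sequence aligned, so the model still conditions on visual tokens while being graded only on the text it predicts.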

Unleashing AI’s potential: Salesforce’s game-changing open-source models

A key innovation of xGen-MM is its ability to handle “interleaved data” combining multiple images and text, which the researchers describe as “the most natural form of multimodal data.” This capability allows the models to perform complex tasks like answering questions about multiple images simultaneously, a skill that could prove invaluable in real-world applications ranging from medical diagnosis to autonomous vehicles.

The release includes variants of the model optimized for different purposes, including a base pretrained model, an “instruction-tuned” model for following directions, and a “safety-tuned” model designed to reduce harmful outputs. This range of models reflects a growing awareness in the AI community of the need to balance capability with safety and ethical considerations.

Salesforce’s decision to open-source these models could significantly accelerate innovation in the field. By providing researchers and developers with access to high-quality models and datasets, Salesforce is enabling a wider range of participants to contribute to the advancement of multimodal AI. This move stands in contrast to the more closed approaches of some tech giants, who have kept their most advanced models under wraps.

However, the release of such powerful models also raises important questions about the potential risks and societal impacts of increasingly capable AI systems. While Salesforce has included safety tuning to mitigate risks, the broader implications of widespread access to advanced AI models remain a topic of debate in the tech community and beyond.

Beyond text and images: The rise of interleaved multimodal AI

The xGen-MM models were trained on massive datasets curated by the Salesforce team, including a trillion-token-scale dataset of interleaved image and text data called “MINT-1T.” The researchers also created new datasets focused on optical character recognition and visual grounding, areas that are crucial for AI systems to interact more naturally with the visual world.

As AI systems become more advanced and ubiquitous, Salesforce’s open-source release provides valuable tools for researchers to better understand and improve these powerful technologies. It also sets a precedent for transparency in a field often criticized for its lack of openness. The move could pressure other tech giants to be more forthcoming with their own AI research and development.

Democratizing AI: How Salesforce’s xGen-MM could reshape the tech landscape

As the AI arms race continues to heat up, Salesforce’s open approach could prove to be a strategic differentiator. By fostering a collaborative ecosystem around its models, the company may be able to innovate more quickly and build goodwill within the research community. However, it remains to be seen how this strategy will play out in the highly competitive world of enterprise AI solutions.

The code, models, and datasets for xGen-MM are available on Salesforce’s GitHub repository, with additional resources coming soon to the project’s website. As researchers and developers begin to explore and build upon these models, the true impact of Salesforce’s contribution to the field of multimodal AI will become clearer in the months and years to come.
