Voice and image capabilities come to ChatGPT, Amazon invests up to $4B in Anthropic, and the Mistral 7B LLM
Here are your weekly articles, guides, and news about NLP and AI chosen for you by NLPlanet!
😎 News From The Web
- ChatGPT can now see, hear, and speak. OpenAI has introduced new voice and image capabilities to their AI assistant, ChatGPT. Users can now hold natural voice conversations with the assistant, and the image feature lets them show ChatGPT pictures to get help interpreting them.
- Amazon will invest up to $4 billion in Anthropic. Amazon has committed to invest up to $4 billion in Anthropic. The partnership gives Anthropic access to Amazon Web Services (AWS) infrastructure, specifically AWS's Trainium and Inferentia chips, to enhance model training and deployment. Additionally, Anthropic will make Claude available through Amazon Bedrock, including options for fine-tuning.
- Mistral 7B. The Mistral 7B model uses Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at reduced computational cost. According to its authors, it outperforms larger models across a range of benchmarks while remaining strong on both English and coding tasks, making it arguably the best 7B model to date.
- LLM Startup Embraces AMD GPUs, Says ROCm Has ‘Parity’ With Nvidia’s CUDA Platform. A startup called Lamini is using over 100 AMD Instinct MI200 GPUs and found that AMD’s ROCm software platform rivals Nvidia’s CUDA platform. They claim that running a large language model on their platform is 10x cheaper than on Amazon Web Services.
- OpenAI’s ChatGPT Now Searches the Web in Real Time — Again. OpenAI has reintroduced web searching for ChatGPT, allowing users to access recent information. Important updates have been made, including compliance with robots.txt rules and user agent identification, giving websites more control. Currently available to Plus and Enterprise users, expansion plans are in progress.
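The Sliding Window Attention mentioned in the Mistral 7B item above limits each token to attending over only the most recent tokens; information from further back still flows forward through stacked layers. A minimal numpy sketch of the attention mask (illustrative only, not Mistral's implementation):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions j with
    i - window < j <= i (causal, restricted to the last `window` tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# With a window of 3, token 5 attends only to tokens 3, 4, and 5,
# so per-layer attention cost grows linearly in sequence length.
mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Because each row has at most `window` nonzero entries, the memory and compute per layer scale as O(seq_len × window) instead of O(seq_len²).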
📚 Guides From The Web
- First Impressions with GPT-4V(ision). OpenAI has released GPT-4V for Plus users, showcasing its image processing skills, OCR capabilities, and performance in solving mathematical problems. However, it still faces challenges with object detection and struggles with CAPTCHA and grid-based puzzles.
- The Llama Ecosystem: Past, Present, and Future. Llama 2, released by Meta, aims to broaden access to state-of-the-art AI technology. The article reviews how the Llama models have fostered research collaboration and enterprise adoption, and where the ecosystem is headed next.
- Non-engineers guide: Train a LLaMA 2 chatbot. Hugging Face offers a no-code solution for AI practitioners to build, train, and deploy chatbots. With tools like AutoTrain, ChatUI, and Spaces, even non-ML specialists can create advanced ML models, fine-tune LLMs, and easily interact with open-source LLMs. Spaces also simplifies the deployment of pre-configured ML applications and custom ML apps.
- AI, Hardware, and Virtual Reality. This article explores the intersection of AI, hardware, and virtual reality (VR). It discusses how AI takes over tasks that previously required human attention, and highlights Meta's focus on integrating AI into smart glasses to enhance the user experience.
- Student Use Cases for AI. Generative AI tools and Large Language Models (LLMs) have the potential to bring significant changes to education. While they empower students and educators with advanced technology, they also present challenges, such as the need to verify their outputs and guard against potential biases.
🔬 Interesting Papers and Repositories
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models. QA-LoRA, a new method in quantization-aware training, outperforms QLoRA in terms of efficiency and accuracy. It effectively balances the trade-off between quantization and adaptation, leading to minimal accuracy loss. QA-LoRA is especially effective in low-bit quantization scenarios like INT2/INT3, without the need for post-training quantization, and it can be applied to various model sizes and tasks.
- Small-scale proxies for large-scale Transformer training instabilities. A study shows that training instabilities seen in large Transformer-based models can be reproduced in smaller models trained at higher learning rates, and can be anticipated by monitoring activations and gradient norms. Mitigation strategies employed in large-scale settings also work in these small-scale proxies.
- Language models in molecular discovery. Language models are being used in chemistry to accelerate the process of molecule discovery and show potential in early-stage drug research. These models assist in de novo drug design, property prediction, and reaction chemistry, offering a faster and more effective approach to the field. Moreover, open-source software for language modeling is enabling scientists to easily access and advance scientific language modeling, facilitating quicker chemical discoveries.
- Tabby: Self-hosted AI coding assistant. Tabby is a fast, open-source, self-hosted AI coding assistant that works with popular open-source language models and supports both CPU and GPU inference.
- The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. Researchers have identified a phenomenon called the “Reversal Curse” that affects the generalization abilities of auto-regressive large language models (LLMs). A model trained that “A is B” often fails to infer the reverse fact “B is A”, hindering its ability to answer related questions accurately. Even larger models like GPT-3.5 and GPT-4 exhibit this failure, indicating a need for further advances in language modeling.
- RMT: Retentive Networks Meet Vision Transformers. The Retentive Network (RetNet) has gained attention in the NLP community as a potential replacement for Transformers. RMT brings RetNet's retention mechanism to vision, achieving strong results on vision tasks and surpassing existing vision backbones on downstream tasks.
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling. MotionLM casts multi-agent motion forecasting as a language modeling task, predicting the joint future trajectories of road agents in a single autoregressive decoding pass rather than scoring agents independently. It has demonstrated its effectiveness by securing the top spot on the Waymo Open Motion Dataset challenge leaderboard.
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model. AnyMAL is a multimodal model designed to process diverse input signals, including text, image, video, audio, and IMU motion-sensor data. It pairs strong text-based reasoning with pre-trained aligner modules that project each modality into the language space, and it is fine-tuned on a multimodal instruction set that extends beyond traditional question-answer scenarios.
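The core idea behind the QA-LoRA item above, i.e. keeping a low-bit quantized base weight frozen while training a small low-rank adapter on top, can be sketched in numpy. This is a hedged illustration of the general quantized-base-plus-LoRA setup, not the paper's exact algorithm; all dimensions and names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(w: np.ndarray, group: int = 32):
    """Group-wise asymmetric INT4 quantization along flattened groups."""
    w_g = w.reshape(-1, group)
    lo = w_g.min(axis=1, keepdims=True)
    hi = w_g.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)  # 16 levels: 0..15
    q = np.round((w_g - lo) / scale).clip(0, 15)
    return q, scale, lo

def dequantize(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

d_in, d_out, rank = 64, 32, 4
W = rng.normal(size=(d_out, d_in))          # frozen base weight
q, scale, zero = quantize_int4(W)
W_q = dequantize(q, scale, zero, W.shape)   # low-bit approximation of W

# Low-rank adapter trained on top of the frozen quantized weight:
A = rng.normal(size=(rank, d_in)) * 0.01
B = np.zeros((d_out, rank))                 # B starts at zero, as in LoRA

x = rng.normal(size=(d_in,))
y = W_q @ x + B @ (A @ x)                   # adapted forward pass
```

Training updates only `A` and `B` (2 × rank × dim parameters) while the INT4 base stays fixed; QA-LoRA's specific contribution is structuring the adapter so it can be merged back into the quantized weights without a post-training quantization step.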