MobileGPT WhatsApp Based AI Chatbot Adds GPT-4 Vision Features

4 min readJan 15, 2024

GPT4 Vision was released by OpenAI within their API in November 2023. This model performs like the standard GPT4 model, except it can process visual inputs. It can identify image contents and respond in text to questions about image inputs.

GPT-4 Vision and Its Capabilities

GPT-4 Vision, or GPT-4V for short, is a smartly designed part of AI that mixes a special part that can ‘see’ with a complex language part. This mix is a big step forward in AI, as it allows the system to understand both pictures and words. The way it’s built, using deep learning (a kind of AI that mimics how humans learn), lets it handle complicated images.

GPT-4V gets really good at what it does by learning from a huge mix of pictures and text from the internet. This learning happens in two main steps. First, it learns how words and pictures are connected and work together. Then, it gets even better by practicing with a smaller set of really good quality data. This helps it become more accurate and trustworthy in giving the right information.

MobileGPT the WhatApp Based Chatbot

WhatsApp Based Chatbot (MobileGPT) adds GPT-4 Vision

MobileGPT is an independent application that brings ChatGPT to your WhatsApp, utilising the OpenAI API and Stable Diffusion to create an AI Assistant on WhatsApp that can:

perform normal AI conversations,
help users learn about any topic and subject
generate images and documents,
do online shopping, job search,
save notes and reminders on WhatsApp

And now — also respond to image queries

MobileGPT Vision Capabilities: What it can do

These are some of the things that MobileGPT Vision is capable of:

Processing Images with Ease: A standout feature of MobileGPT Vision is its ability to handle various types of visual data. Whether it’s a quick snapshot, a detailed screenshot, or important documents, MobileGPT Vision can analyze and interpret these visuals effectively.
Identifying and Understanding Objects: MobileGPT Vision excels in recognizing and providing insights about different elements in an image. It’s like having an AI assistant that can look at a photo and tell you about the objects and their details.
Analyzing Visual Data: This AI isn’t just limited to objects; it can also interpret complex visual data. MobileGPT Vision can dissect graphs, charts, and other visual data presentations, making it easier to understand complex information.
Deciphering Text in Photos: MobileGPT Vision is equipped to read both printed and handwritten text within images. This feature is particularly useful for quickly extracting and understanding information from photos containing text.
Contextual Understanding of Visual Scenes: Beyond just recognizing objects, MobileGPT Vision can understand the context and setting of an image, offering a deeper level of analysis about what’s happening in a picture.
Real-Time Image Processing: This AI can process images in real-time, making it ideal for applications that require immediate analysis and response, such as in chatbot interactions or instant decision-making scenarios.

Example Vision Screenshot

Limitations of MobileGPT Vision

Not Suitable for Medical Imaging: MobileGPT Vision isn’t equipped to interpret specialized medical images, such as CT scans. It’s important to note that it should never be used for medical advice or diagnoses.
Challenges with Non-Latin Alphabets: The model might face difficulties when dealing with images containing text in non-Latin alphabets, like Japanese or Korean. Its performance could be less optimal in these scenarios.
Difficulty with Small Text: While MobileGPT Vision can enlarge text to make it more readable, it’s not very effective with very small text. Also, be cautious not to crop out vital details when enlarging text.
Misinterpreting Rotated Images: Images or text that are upside down or rotated can confuse MobileGPT Vision, leading to misinterpretation
Struggles with Complex Visual Elements: The model may have trouble understanding images with complex visual elements, particularly in graphs or texts where colors and styles (like solid, dashed, or dotted lines) vary significantly.
Limited Spatial Reasoning: MobileGPT Vision finds it challenging to perform tasks that require precise spatial localization, such as accurately identifying chess positions on a board.
Accuracy Issues in Some Scenarios: There are instances where MobileGPT Vision may generate incorrect descriptions or captions, especially in complex or ambiguous scenarios.
Problems with Unusual Image Shapes: The AI struggles with panoramic and fisheye images due to their distorted perspectives and unusual shapes.
Ignoring Metadata and Resizing Impact: MobileGPT Vision does not process the original file names or metadata of images. Additionally, images are resized before analysis, which could alter their original dimensions and affect the analysis.
Approximate Object Counting: The model is capable of counting objects in images, but these counts may only be approximate and not precise.
Blocking CAPTCHAS for Safety: MobileGPT Vision has a safety feature that blocks the submission and analysis of CAPTCHAs, to prevent potential misuse.

MobileGPT WhatsApp Based AI Chatbot Adds GPT-4 Vision Features

GPT-4 Vision and Its Capabilities

MobileGPT the WhatApp Based Chatbot

MobileGPT Vision Capabilities: What it can do

Limitations of MobileGPT Vision

Written by Skolo Online Learning