AI Image Caption Generator

Free AI image caption generator running offline in your browser via the vit-gpt2-image-captioning model. Get alt text and 3 caption suggestions, no upload.

About AI Image Caption Generator

AI Image Caption Generator looks at a photograph and writes a plain-English sentence describing what it sees, such as "a brown dog running through tall grass" or "a plate of pasta with red sauce on a wooden table". It uses the open-source vit-gpt2-image-captioning vision-language model, which pairs a ViT image encoder with a GPT-2 text decoder. The model was trained on a large collection of image and caption pairs, so it has learned the relationship between visual features (edges, colors, objects, scenes) and human-written descriptions. It runs entirely inside your browser tab through the Hugging Face Transformers.js runtime, using a WebGPU backend when available and a WebAssembly fallback otherwise, so your picture is never uploaded to any server. Typical uses include drafting alt text for accessibility and SEO, naming and tagging large photo libraries, creating starting captions for social posts, helping visually impaired users explore images, and giving content moderators a quick textual hint about what a photo contains. The first call downloads the model weights (around 250 MB) into the browser cache; after that, captions usually take only a few seconds. Quality is best on everyday scenes, animals, food, objects, and outdoor photos; abstract art, charts, and text-heavy images are harder for the model and may produce generic captions. See also our AI keyword extractor and AI translator.

What does the AI Image Caption Generator actually do?

The tool reads an image you select from your device, runs it through a deep neural network that combines a ViT vision encoder with a GPT-2 language decoder (the vit-gpt2-image-captioning model), and returns a one-sentence English description of the picture. The image is processed locally and is not uploaded. The model was trained on a large collection of image and caption pairs, so it has learned common visual concepts (dog, beach, pizza, computer) and the typical sentence patterns humans use to describe scenes ("a X doing Y in/on Z"). The result is short and factual in style, and works well as alt text, an SEO image description, or a starting point for a longer caption. It is not designed to tell stories, and it cannot name specific people or read text inside the image.

What image file types are supported and how large can the file be?

Any format your browser can decode is accepted: JPEG (.jpg or .jpeg), PNG, GIF (first frame only), BMP, WebP, and most HEIC files in Safari on macOS and iOS. There is no fixed server limit because nothing is uploaded, but in practice files above 20 MB or photos larger than 4000 pixels on a side may decode slowly on mobile phones. The model resizes the input to a small fixed square (224 x 224 pixels for the ViT encoder in this model family) before captioning, so a higher-resolution source does not improve caption quality. For best results, keep the image well-lit and in focus, with the main subject occupying at least 20 percent of the frame.
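The size guidance above can be sketched as a small pre-flight check. This is only an illustration: the function and constant names are hypothetical, the 20 MB and 4000 pixel thresholds are the ones quoted in the answer, and the 224 pixel model input is an assumption about this model family rather than something the tool exposes.

```typescript
// Thresholds quoted in the answer above; all names here are hypothetical.
const SLOW_FILE_BYTES = 20 * 1024 * 1024; // "files above 20 MB"
const SLOW_SIDE_PX = 4000;                // "larger than 4000 pixels on a side"
const MODEL_INPUT_PX = 224;               // assumed square input of the encoder

interface ImageInfo {
  bytes: number;
  width: number;
  height: number;
}

// Returns human-readable warnings; an empty array means no concerns.
function checkImage(info: ImageInfo): string[] {
  const warnings: string[] = [];
  if (info.bytes > SLOW_FILE_BYTES) {
    warnings.push("File is over 20 MB and may decode slowly on phones.");
  }
  if (Math.max(info.width, info.height) > SLOW_SIDE_PX) {
    warnings.push("Image is over 4000 px on a side and may decode slowly on phones.");
  }
  return warnings;
}

// Fraction of the source pixels that remain after resizing to the model input.
// A tiny ratio shows why extra source resolution does not help the caption.
function downscaleRatio(info: ImageInfo): number {
  const sourcePixels = info.width * info.height;
  return Math.min(1, (MODEL_INPUT_PX * MODEL_INPUT_PX) / sourcePixels);
}
```

For a 6000 x 4000 photo, the ratio is roughly 0.002, meaning that under these assumptions the model sees about a fifth of one percent of the original pixels.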

Is my image uploaded to a server? How private is this tool?

Your image is never sent to any server. The picture is decoded into a Canvas inside the page, the captioning model weights are downloaded once from a public CDN (jsDelivr / Hugging Face), and inference runs entirely on your CPU or GPU through WebAssembly or WebGPU. You can verify this in the Network tab of your browser's DevTools: after the model files have loaded, generating more captions creates zero new requests. This makes the tool safe for personal photos, medical images, family pictures of children, and confidential corporate screenshots. Once the model is in the browser cache the tool also works fully offline.

How long does the first caption take and why?

The very first time you click Generate, the browser has to download approximately 250 MB of model weights from the CDN and compile them for WebGPU or WebAssembly. On a fast home connection this takes 30 to 90 seconds; on slower mobile networks it can take 2 to 3 minutes. After that the weights live in your browser cache and the model stays loaded in memory, so subsequent captions usually finish in 1 to 4 seconds on a modern laptop with WebGPU and in 5 to 15 seconds on the CPU-only WebAssembly fallback. If you reload the page, the cached weights are reused and only the load-and-compile step repeats; a brand-new browser profile or a cleared cache triggers a fresh download.
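The download times quoted above follow from simple arithmetic: megabytes times eight gives megabits, divided by the link speed in megabits per second. A minimal sketch, assuming decimal megabytes and ignoring protocol overhead and the compile step; the function name is hypothetical.

```typescript
// Approximate size of the model weights, as stated in the answer above.
const MODEL_SIZE_MB = 250;

// Seconds needed to transfer sizeMB megabytes over a link of linkMbps megabits per second.
function estimateDownloadSeconds(sizeMB: number, linkMbps: number): number {
  if (sizeMB < 0 || linkMbps <= 0) {
    throw new RangeError("size must be non-negative and link speed must be positive");
  }
  return (sizeMB * 8) / linkMbps;
}

// Print estimates for a few illustrative link speeds.
for (const mbps of [100, 50, 25, 12]) {
  const seconds = Math.round(estimateDownloadSeconds(MODEL_SIZE_MB, mbps));
  console.log(`${mbps} Mbit/s: about ${seconds} s`);
}
```

At 50 Mbit/s the estimate is 40 seconds and at 25 Mbit/s it is 80 seconds, which matches the 30 to 90 second range; at 12 Mbit/s it is close to three minutes.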

Which browsers and devices work best?

The tool runs in all modern evergreen browsers: Chrome 113+, Edge 113+, Firefox (WebAssembly only for now), and Safari 17+. WebGPU acceleration is currently best supported in Chrome and Edge on desktop and on newer Android phones; in Safari, WebGPU may be experimental and need to be enabled from the Develop menu (Experimental Features, or Feature Flags in newer versions). On iOS and on older Android devices, the tool falls back to WebAssembly, which still works but is slower. A laptop or desktop with at least 8 GB of RAM gives the smoothest experience because the model and intermediate tensors together use around 1 GB. Older phones with limited RAM may not be able to load the model at all.
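The WebGPU-or-WebAssembly choice described above can be sketched as a simple feature check. This is a hedged illustration, not the tool's actual code: the type and function names are hypothetical, and the only browser fact relied on is that WebGPU-capable browsers expose a gpu property on navigator.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal stand-in for the browser's navigator object, so the logic
// can be exercised outside a browser. Hypothetical type name.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes navigator.gpu, otherwise use WebAssembly.
// Note: the property being present does not guarantee a usable adapter,
// so a real page should still be ready to fall back if initialization fails.
function pickBackend(nav: NavigatorLike): Backend {
  return nav.gpu !== undefined && nav.gpu !== null ? "webgpu" : "wasm";
}

// In a page this would be called with the real navigator object.
console.log(pickBackend({ gpu: {} })); // browser with WebGPU exposed
console.log(pickBackend({}));          // browser without WebGPU
```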

Why did I get a vague caption, and what can I do about it?

Captioning models work best on common, well-photographed scenes: outdoor shots, food, animals, sports, vehicles, and people doing everyday activities. They struggle with abstract art, screenshots of charts or text, heavily edited collages, and unusual angles. If you get a generic caption like "a picture of something", try a tighter crop where the main subject fills the frame, use a better-lit version of the photo, or remove visual clutter. The model also cannot read words inside an image (for that, use the OCR / Image-to-Text tool), and it cannot identify specific named people or brands, which is by design for privacy. For multilingual captions, run the English output through a translator; the underlying vit-gpt2-image-captioning weights are English-only.

How accurate is it, and when should I edit the caption before using it?

The on-device vit-gpt2-image-captioning model produces a single short, generic English sentence that is correct often but not always. Treat its output as a draft, not a final answer. Concrete limitations: it does not perform OCR, so it cannot transcribe text, signs, logos, or numbers in the image; it does not identify named people, brands, or places; it is English-only and tends to produce one plain descriptive sentence rather than rich, context-aware copy. For accessibility and compliance work (WCAG alt text, government or e-commerce requirements), always review and edit the suggestion: add the purpose of the image, any text it contains, and context the model cannot see. The tool speeds up writing alt text and SEO descriptions, but it is not a substitute for a human in regulated or high-stakes contexts.

Can I get multiple caption suggestions or control the caption length?

Yes. Before clicking Generate you can choose how many suggestions to produce (1, 3, or 5) and a length preset: Short for compact alt text, Medium for a balanced caption, or Long for a more descriptive sentence. Requesting more than one suggestion runs beam search on the same model, returns several distinct phrasings, and lists them as clickable rows; clicking any row loads it into the editable caption box so you can copy, download, or refine it. This is ideal for professionals tagging photo libraries or writing alt text who want to pick the best wording in a single pass instead of re-running. Everything still runs locally on the in-browser vit-gpt2 model, so generating extra suggestions downloads no additional weights and sends nothing to a server.
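One way the suggestion count and length preset could translate into generation settings is sketched below. The option names follow the Hugging Face generation config convention (max_new_tokens, num_beams, num_return_sequences); the token budgets per preset, the helper names, and the idea that the tool uses exactly these values are all assumptions made for illustration.

```typescript
type SuggestionCount = 1 | 3 | 5;
type LengthPreset = "short" | "medium" | "long";

interface GenerationOptions {
  max_new_tokens: number;
  num_beams: number;
  num_return_sequences: number;
}

// Hypothetical token budgets for each preset; the real tool may use other values.
const MAX_TOKENS: Record<LengthPreset, number> = { short: 12, medium: 20, long: 32 };

// A single suggestion needs no beam search; several suggestions need at least
// as many beams as sequences to return.
function buildGenerationOptions(count: SuggestionCount, length: LengthPreset): GenerationOptions {
  return {
    max_new_tokens: MAX_TOKENS[length],
    num_beams: count,
    num_return_sequences: count,
  };
}

// Trim whitespace and drop duplicate phrasings while keeping the original order.
function distinctCaptions(outputs: { generated_text: string }[]): string[] {
  const trimmed = outputs.map((o) => o.generated_text.trim()).filter((t) => t.length > 0);
  return Array.from(new Set(trimmed));
}
```

The deduplication step matters because beam search can return candidates that differ only in whitespace or not at all, and showing those as separate rows would add no value.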