Building an AI Image Caption Generator with C#

Published on: March 18, 2025

Today, I’m excited to share a practical project: a console app that generates captions for images using LLaVA, a vision-language model powered by Ollama.

We’ll use dependency injection with Microsoft.Extensions.AI to keep things modern and clean.

This app will preload images from a project folder, let users pick one, and spit out a caption. All in a few lines of code.

Whether you’re exploring local AI tools or just want a fun C# challenge, let’s dive in and build it together!

Setting Up Ollama

First, we need Ollama to run LLaVA locally.

Ollama is a lightweight tool that makes hosting models like this a breeze. We’ll start with llava:7b, a smaller version that’s friendly to most machines, and I’ll point out how to level up later.

First things first: install Ollama

1. Download Ollama

Visit ollama.com and grab the installer for your OS: Windows, Mac, or Linux.

For Linux folks, you can use this terminal command:

curl -fsSL https://ollama.com/install.sh | sh

2. Verify It’s Working

Open a terminal (Command Prompt, PowerShell, or your shell of choice) and run:

ollama --version

You should see a version number. If not, double-check the installation.

3. Pull LLaVA 7B

In your terminal, type:

ollama pull llava:7b

This downloads the 7-billion-parameter version of LLaVA. At around 4GB, it’s lighter than the full model. Depending on your internet connection, the download can take a few minutes.

4. Launch LLaVA

Start it with:

ollama run llava:7b

This fires up a local server at http://localhost:11434. Keep the terminal running while we work.

To make sure everything is set, open http://localhost:11434 in your browser. You should see the message: Ollama is running
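If you prefer the terminal, a quick curl check against the default port does the same job:

```shell
# Query the Ollama server's root endpoint; it replies with a plain-text status line.
curl http://localhost:11434
# Expected response: Ollama is running
```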

Pro Tip: The llava:7b model works well with modest hardware (8GB RAM minimum, GPU optional). For better accuracy and detail, you can install the full llava model with ollama pull llava. It's heftier (7GB+), so you’ll want 16GB RAM and ideally a GPU with 8GB+ VRAM.

Creating the C# Project

You’re all seasoned C# pros, so I won’t micromanage your setup. Whether you’re rocking Visual Studio, Rider, or VSCode, here’s the gist of what you need to do.

Project Setup

  • New Console App

Create a new console app in your IDE of choice. I’m targeting .NET 8 for this; .NET 9 works too, and older runtimes may work depending on the packages’ supported target frameworks.

I'll name this project: ImageAICaptioner.

  • Add Dependencies

We’ll need three NuGet packages (make sure you have the "Prerelease" flag on):

Microsoft.Extensions.AI (preview version, e.g., 9.3.0-preview.1.25161.3)

Microsoft.Extensions.AI.Ollama (same version)

Microsoft.Extensions.Hosting (to configure the chat client)

Install these via your preferred method: NuGet Package Manager, the CLI (dotnet add package), or however you roll.
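If you go the CLI route, the commands look roughly like this (the version numbers are the ones current as I write this; newer previews may exist by the time you read it):

```shell
# Add the preview AI abstractions and the Ollama provider, pinned to a known version
dotnet add package Microsoft.Extensions.AI --version 9.3.0-preview.1.25161.3
dotnet add package Microsoft.Extensions.AI.Ollama --version 9.3.0-preview.1.25161.3

# Hosting is stable, so the latest release is fine
dotnet add package Microsoft.Extensions.Hosting
```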

  • Set Up an Images Folder

Add a folder named images to your project root.

Toss in a few test images (e.g., dog.jpg, cat.png, car.png).

Make sure they copy to the output directory: set their properties to “Copy if newer” in Visual Studio, or tweak your .csproj if you’re on another IDE.
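If you’re not in Visual Studio, here’s a sketch of the .csproj entry that copies the folder on build (add it inside the `<Project>` element; the glob assumes your folder is literally named images):

```xml
<ItemGroup>
  <!-- Copy every file under images/ to the output directory when it changes -->
  <None Update="images\**\*.*">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
```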

Writing the Code

Let’s build the app step by step.

We’ll set up dependency injection, preload images from the images folder, let the user choose one, and generate a caption with LLaVA.

I’ll break it down so you can follow along easily.

Step 1: Set Up Dependency Injection

We’ll use Microsoft.Extensions.Hosting to configure a chat client for LLaVA.

Add this at the top of Program.cs:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.AddChatClient(
            new OllamaChatClient(new Uri("http://localhost:11434"), "llava:7b"));
    })
    .Build();

var chatClient = host.Services.GetRequiredService<IChatClient>();

This sets up a host, registers an IChatClient for llava:7b, and resolves it from the DI container.

Clean and reusable!

Step 2: Preload and List Images

Next, we’ll scan the images folder and show the user their options:

var imagesFolder = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "images");
if (!Directory.Exists(imagesFolder))
{
    Console.WriteLine("No 'images' folder found! Add one with some images.");
    return;
}

var jpgFiles = Directory.GetFiles(imagesFolder, "*.jpg", SearchOption.TopDirectoryOnly);
var pngFiles = Directory.GetFiles(imagesFolder, "*.png", SearchOption.TopDirectoryOnly);
var imageFiles = jpgFiles.Concat(pngFiles).ToArray();

if (imageFiles.Length == 0)
{
    Console.WriteLine("No images in the 'images' folder. Add some and try again!");
    return;
}

Console.WriteLine("Welcome to the Image AI Caption Generator!");
Console.WriteLine("\nChoose an image to caption:");

for (var i = 0; i < imageFiles.Length; i++)
{
    Console.WriteLine($"{i + 1}. {Path.GetFileName(imageFiles[i])}");
}

This checks for the folder, grabs all .jpg and .png files, and lists them with numbers (e.g., 1. dog.jpg) using System.IO.
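If you later want more formats, one option (a sketch, using a hypothetical FilterImages helper) is a single pass over the folder filtered by a set of allowed extensions, instead of one GetFiles call per extension:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static string[] FilterImages(IEnumerable<string> paths)
{
    // Extensions we accept; add more (e.g., ".webp") as needed.
    var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        ".jpg", ".jpeg", ".png"
    };

    // Keep only files whose extension is in the set, sorted by file name.
    return paths
        .Where(p => allowed.Contains(Path.GetExtension(p)))
        .OrderBy(Path.GetFileName, StringComparer.OrdinalIgnoreCase)
        .ToArray();
}

// Usage against the images folder from the article:
// var imageFiles = FilterImages(Directory.EnumerateFiles(imagesFolder));
```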

Step 3: Handle User Input

Let’s get the user’s choice and validate it:

Console.Write("\nEnter the number of your choice: ");

if (!int.TryParse(Console.ReadLine(), out var choice)
    || choice < 1 || choice > imageFiles.Length)
{
    Console.WriteLine("Invalid choice. Exiting!");
    return;
}

var selectedImage = imageFiles[choice - 1];

Simple error checking ensures the input is a valid number within range, then picks the corresponding file.

Step 4: Generate the Caption

Now, we’ll send the image as a byte array along with the type, and ask LLaVA for a caption:

try
{
    var prompt = new ChatMessage(ChatRole.User, "Describe this image in one sentence.");

    prompt.Contents.Add(
        new DataContent(
            File.ReadAllBytes(selectedImage),
            Path.GetExtension(selectedImage).ToLower() == ".png" ? "image/png" : "image/jpeg"));

    var response = await chatClient.GetResponseAsync(prompt);
    Console.WriteLine($"\nCaption: {response.Messages[0].Text}");
}
catch (Exception ex)
{
    Console.WriteLine($"Oops, something failed: {ex.Message}");
}

Here's where the magic happens 🪄

We're sending the image to LLaVA and asking it to describe what it sees.

First, we create a message with the instruction "Describe this image in one sentence.", sent as if from the user (ChatRole.User).

Then, we grab the raw bytes of the selected image file and attach them to the message, letting LLaVA know whether it’s a PNG or JPEG based on the file extension.

After that, we send this whole package off to the chat client, which talks to LLaVA and waits for a response.

Once we get it back, we pull out the first message's text (our caption) and print it to the console.

If anything goes wrong (like a network hiccup or LLaVA choking on the image), we catch the error and let the user know something failed.
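The inline ternary only distinguishes .png from everything else. If you add more formats, a small helper (hypothetical, not part of the repo code) keeps the mapping in one place:

```csharp
using System.IO;

// Map a file extension to its image MIME type; defaults to JPEG for
// unknown extensions, mirroring the inline ternary's behavior.
static string GetMimeType(string path) =>
    Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".png"  => "image/png",
        ".gif"  => "image/gif",
        ".webp" => "image/webp",
        _       => "image/jpeg"
    };
```

The DataContent line then becomes `new DataContent(File.ReadAllBytes(selectedImage), GetMimeType(selectedImage))`.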

Step 5: Wrap It Up

Finish with a prompt to exit:

Console.WriteLine("Press any key to exit...");
Console.ReadKey();

Full Code

Rather than pasting it all here again, I’ve uploaded the complete project to a GitHub repo.

Grab it here

Running the App

Let’s test it out!

Steps to Run

  1. Start Ollama

Ensure llava:7b is running with ollama run llava:7b in a terminal.

Leave it open.

  2. Launch the App

Build and run your project.

The images folder should copy to your output directory.

  3. Try It

You’ll see something like:

Welcome to the Image AI Caption Generator!

Choose an image to caption:
1. dog.jpg
2. rabbit.jpg
3. car.jpg
4. llama.jpg
5. cat.png

Enter the number of your choice:

Enter 4, and you might get:

Caption: A large llama stands alone in a green field, looking directly at the camera on a clear day with a blue sky.

You can play with it: LLaVA may produce a slightly different caption each time you run it.

Troubleshooting

  • “No images”: Verify your images folder exists and that its files are set to copy to the output directory. This sample reads only .jpg and .png images, though you can extend it to other formats.
  • Connection issues: Confirm ollama run llava:7b is active and http://localhost:11434 responds (test in a browser).
  • Performance woes: If it’s sluggish, your hardware might need the lighter llava:7b, or upgrade to llava if you’ve got the specs.
  • Inaccurate captions: Remember that this is a small model running locally with a limited token budget. Depending on the picture’s clarity or size, you might get uneven results. If you have the hardware, try upgrading to the full llava model.

A Note on the Preview Package

Before we wrap up, a quick heads-up: the Microsoft.Extensions.AI and Microsoft.Extensions.AI.Ollama packages we're using are currently in preview, specifically version 9.3.0-preview.1.25161.3 as of this post.

That means the code I've shared here works with that version, but it's provided as-is. Since it's a preview, the API might shift in future releases. Methods like GetResponseAsync or the way we handle image data with DataContent could change as the package matures.

If you're reading this later and things don't quite line up, check the latest docs or NuGet updates for adjustments.

For now, this setup gets us up and running with LLaVA, and I’m excited to see where this library takes us as it evolves!

Why This Matters

Using Microsoft.Extensions.AI with dependency injection isn’t just for show.

It makes swapping tools or scaling up dead simple.

Preloading images keeps it user-friendly without extra hassle.

Let me know how this works for you or if you’ve got ideas to tweak it. There’s a lot of room to grow!


Happy Coding! ⚡


AI · Ollama · C# · LLaVA · Software Engineering