Static Images Are Dead. Google's New AI Just Taught Itself to Zoom, Annotate, and Calculate On the Fly

Liang Wei

Updated: January 30, 2026

Google has introduced a new capability for its Gemini 3 Flash model called Agentic Vision. This feature aims to shift image understanding from a single glance to an active, step-by-step investigation by combining visual analysis with code execution.

The core idea is that instead of making a best guess from a static image, the model can now formulate a plan to manipulate the image to find answers. It follows a Think, Act, Observe loop:

1. Think: Analyzes the query and image to create a multi-step plan.

2. Act: Generates and runs Python code to, for example, zoom in on a specific area, annotate parts of the image, or perform calculations.

3. Observe: Reviews the newly transformed or analyzed image before providing a final, grounded response.
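The loop above can be pictured with a toy sketch. Everything here is hypothetical and illustrative: the `think` planner is a hard-coded stand-in for the model's planning, the "image" is a list of text rows, and none of these names come from Google's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Action = Callable[[State], State]


def think(query: str) -> List[Tuple[str, Action]]:
    """Think: return an ordered plan. Hard-coded here; a real model would derive it from the query."""

    def zoom_top_half(state: State) -> State:
        rows = state["image"]
        return {**state, "image": rows[: len(rows) // 2]}

    def count_marks(state: State) -> State:
        marks = sum(row.count("X") for row in state["image"])
        return {**state, "count": marks}

    return [
        ("zoom into the top half", zoom_top_half),
        ("count marked cells", count_marks),
    ]


def run(query: str, image: List[str]) -> Tuple[State, List[str]]:
    """Run the plan step by step, recording what is observed after each action."""
    state: State = {"image": image}
    log: List[str] = []
    for description, action in think(query):
        state = action(state)  # Act: execute the generated step
        # Observe: inspect the transformed state before moving on
        log.append(f"{description}: {len(state['image'])} rows remain")
    return state, log


if __name__ == "__main__":
    final_state, observations = run(
        "How many marks are in the top half?",
        ["X..X", "..X.", "XXXX", "...."],
    )
    print(final_state["count"])
    for line in observations:
        print(line)
```

The point of the sketch is the ordering: each action changes the working image, and the answer is read from the transformed state rather than guessed from the original input.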

Google states that enabling this code execution provides a consistent 5-10% quality improvement across most vision benchmarks for Gemini 3 Flash.

Highlighted Use Cases:

· Zooming and Inspecting: For tasks requiring fine detail, the model can write code to crop and analyze high-resolution sections of an image. A building plan validation platform, PlanCheckSolver.com, reportedly improved its accuracy by 5% by using this capability to inspect specific structural elements.

· Image Annotation: The model can execute code to draw bounding boxes or labels directly on an image to ground its reasoning, such as counting items by visually marking each one.

· Visual Math and Plotting: It can extract data from charts or tables and use Python to perform calculations or generate new visualizations, aiming to replace estimation with verifiable computation.
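To make the three use cases concrete, here is a minimal sketch of the kind of helper code such a model might generate. It is an assumption for illustration only: it uses plain Python lists as a grayscale image rather than any real imaging library, and the function names (`crop`, `draw_box`, `percent_change`) are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in row-major order


def crop(image: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Zoom: keep only the region [left, right) x [top, bottom)."""
    return [row[left:right] for row in image[top:bottom]]


def draw_box(image: Image, left: int, top: int, right: int, bottom: int,
             value: int = 255) -> Image:
    """Annotate: return a copy with a one-pixel box outline drawn on it."""
    out = [row[:] for row in image]  # copy so the input stays untouched
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


def percent_change(old: float, new: float) -> float:
    """Visual math: compute a value exactly instead of estimating it from a chart."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    blank = [[0] * 6 for _ in range(6)]
    boxed = draw_box(blank, 1, 1, 5, 5)
    for row in crop(boxed, 0, 0, 6, 3):
        print(row)
    # Values assumed to have been read off a chart by the model.
    print(percent_change(80.0, 100.0))
```

In practice the generated code would more likely rely on an imaging or plotting library; the list-based version just keeps the example self-contained.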

Availability and Future:

Agentic Vision is available now via the Gemini API in Google AI Studio and Vertex AI, and is starting to roll out in the Gemini app when the "Thinking" mode is selected. Google says it is working to make more image manipulation behaviors implicit (triggered without an explicit user prompt) and plans to extend the capability to model sizes beyond Gemini 3 Flash.
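For developers, enabling the feature through the API amounts to sending an image alongside a prompt with the code execution tool turned on. The sketch below only builds the JSON body for a `generateContent` REST call and does not send it. The model ID `gemini-3-flash-preview` is an assumption, and the field names should be checked against the current Gemini API documentation before use.

```python
import base64
import json

# Assumed model ID; confirm the exact name in the Gemini API model list.
MODEL = "gemini-3-flash-preview"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)


def build_request(image_bytes: bytes, prompt: str,
                  mime_type: str = "image/png") -> dict:
    """Build a request body with one inline image, one text prompt,
    and the code execution tool enabled."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                    {"text": prompt},
                ]
            }
        ],
        # An empty object switches the tool on with default settings.
        "tools": [{"code_execution": {}}],
    }


if __name__ == "__main__":
    body = build_request(b"fake image bytes", "Count the windows on the facade.")
    print(URL)
    print(json.dumps(body, indent=2))
```

An API key and an HTTP client would still be needed to submit the request; those are left out so the example runs without credentials.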

About the Author

Liang Wei

Liang Wei is our AI correspondent from China.