2 Apr 2026

Do You Still Need ControlNet? Testing Next-Gen Models on Viewer Scenes


Introduction

In our previous blog post, we showed how to use Stable Diffusion with ControlNet in ComfyUI to turn Autodesk Viewer scenes into AI-rendered images. The key finding was clear: without depth and normal maps from the Viewer, the AI hallucinates — windows fill with nonsense, stairs vanish, and structural details get lost. ControlNet, guided by those maps, keeps the output faithful to your design.

The pipeline hasn't changed — but the models have. A new generation of image models (Flux, Qwen, and Google's Nano Banana/Gemini) has arrived, and they're significantly more capable than what we used before.

The question is: do these newer models still need ControlNet and depth maps, or are they good enough on their own?

There's also Comfy Cloud now, which is what we use to run these workflows.

This post puts all three through the same Viewer-to-render pipeline to find out.

Background: Why Depth Maps Matter

If you're new to this workflow, here's the short version: we extract a screenshot, depth map, and normal map from an Autodesk Viewer scene (using the sample at aps-viewer-ai-render). The depth and normal maps feed into ControlNet, which constrains the AI model so it preserves the geometry and spatial layout of your design.

Without ControlNet, you're handing the model a flat screenshot and a text prompt — it will produce something visually appealing but structurally unreliable. With ControlNet, the model knows where walls are, how deep a room is, and where furniture sits.
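The extraction itself happens in the Viewer (see the aps-viewer-ai-render sample). To make "encodes spatial distance" concrete, here is a minimal Python sketch of how a raw depth buffer becomes the grayscale map a ControlNet depth input consumes. The function name and the near-is-bright convention are our assumptions for illustration, not part of the sample:

```python
import numpy as np

def depth_to_controlnet_map(depth: np.ndarray, invert: bool = True) -> np.ndarray:
    """Normalize a raw linear depth buffer to the 8-bit grayscale image
    that depth-style ControlNets expect (near = bright, far = dark)."""
    d = depth.astype(np.float32)
    d_min, d_max = d.min(), d.max()
    if d_max == d_min:                    # flat buffer: avoid divide-by-zero
        return np.zeros_like(d, dtype=np.uint8)
    norm = (d - d_min) / (d_max - d_min)  # scale to [0, 1]
    if invert:
        norm = 1.0 - norm                 # near objects become bright
    return (norm * 255).round().astype(np.uint8)

# Example: a tiny 2x2 depth buffer (distances from the camera)
depth = np.array([[1.0, 2.0], [3.0, 5.0]])
print(depth_to_controlnet_map(depth))  # nearest pixel -> 255, farthest -> 0
```

The same normalization is what lets ControlNet anchor walls and furniture: pixel brightness is a direct stand-in for distance, so relative depth survives even though absolute units are lost.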

The question with the newer models is: "does this still hold true, and how do the models compare?" 🤔

Prerequisites

  1. An active Comfy Cloud account (the free tier works for initial testing).

  2. Access to your designs through a Viewer scene, with depth and normal map extraction set up per the aps-viewer-ai-render sample.

Here's a walkthrough of the setup:

Setup walkthrough

The Models

Flux (with ControlNet)

Flux is the open-weight image model from Black Forest Labs. It supports ControlNet, so we feed it Viewer depth maps to guide generation.

What to expect: Strong structural fidelity. The depth map keeps walls, floors, and furniture where they belong. Flux handles realistic materials and lighting well, making it a solid default choice when you need the output to respect your design intent.

Flux demo
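If you want to drive a workflow like this programmatically, ComfyUI exposes an HTTP API whose /prompt endpoint accepts an API-format workflow graph. The sketch below assumes a local ComfyUI instance at 127.0.0.1:8188, and the node ids, wiring, and parameter values in the fragment are placeholders for illustration (on Comfy Cloud the platform manages execution for you, so treat this as background on what the workflow JSON looks like):

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # assumption: local ComfyUI; adjust for your host

def queue_workflow(workflow: dict) -> bytes:
    """Queue an API-format workflow graph on ComfyUI's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Illustrative fragment of an API-format workflow: a ControlNet node wired
# between the text conditioning and the sampler (node ids are placeholders).
workflow_fragment = {
    "10": {
        "class_type": "ControlNetApply",
        "inputs": {
            "conditioning": ["6", 0],  # positive prompt conditioning
            "control_net": ["11", 0],  # loaded depth ControlNet
            "image": ["12", 0],        # the Viewer depth map, loaded as an image
            "strength": 0.8,           # how strictly geometry is enforced
        },
    },
}
```

The strength input is the main lever: lower it and the model gains creative freedom; raise it and the output hugs the Viewer geometry more tightly.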


Nano Banana — Gemini (no ControlNet)

Nano Banana is Google's entry, and it's become popular for good reason — the output quality from just an input image and a text prompt is impressive.

The catch: It doesn't support ControlNet. That means no depth map guidance. The model works entirely from the screenshot and your description. On visually simple scenes, this can still produce great results. On structurally complex designs — tight corridors, multi-level spaces, detailed furniture layouts — you may see the model take liberties with geometry.

What to expect: High visual quality, but less structural control. Best suited for early-stage mood boards or scenes where exact spatial accuracy isn't critical.

Nano Banana demo


Qwen (with ControlNet)

Released by Alibaba's Qwen team, this model supports ControlNet and can be integrated into a workflow similar to Flux. Qwen is particularly strong at text rendering and image editing tasks.

What to expect: Comparable structural fidelity to Flux when using depth maps. Qwen's text rendering capability is a differentiator — if your scene includes signage, labels, or annotations, Qwen handles them more cleanly than the other models.

Qwen demo

Side-by-Side: Same Scene, Four Outputs

To see how these models actually differ, we ran the same Viewer scene — a multi-level interior with stairs, mezzanine, and railings — through each workflow. Here are the inputs and outputs.

Inputs

The two inputs extracted from the Viewer scene: the diffuse screenshot (what the camera sees) and the depth map (which encodes spatial distance for ControlNet).

Diffuse Screenshot Depth Map
viewer screenshot depth map

Outputs

All four outputs from the same scene. Flux was run twice with different settings to show variation. Notice how the ControlNet-guided models (Flux, Qwen) preserve the staircase geometry and railing positions, while Nano Banana reinterprets them more freely.

Flux (variation 1) Flux (variation 2)
flux variation 1 flux variation 2
Nano Banana (Gemini) Qwen
nano banana qwen

What to notice:

  • Flux & Qwen (with ControlNet): The staircase, mezzanine railing, and hallway corridor closely match the original layout. The depth map keeps everything anchored.

  • Nano Banana (no ControlNet): The most photorealistic-looking result — it adds a potted plant and polished materials — but the stair geometry shifts, railing styles change, and the spatial proportions are reinterpreted. It looks great as a standalone image, but it's no longer your design.

  • Flux variation 1 vs 2: Same model, different parameters. Shows how much tuning affects the output even with identical ControlNet guidance — one leans warmer with more wood detail, the other is more minimal and neutral.
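The parameter differences behind such variations can be captured in a small settings block. The sketch below is hypothetical: the keys mirror common ComfyUI sampler parameters, and every value is illustrative rather than the exact configuration used for the renders above.

```python
# Illustrative sampler settings for two Flux variations.
# All values are hypothetical, chosen to show which knobs matter.
flux_variations = {
    "variation_1": {                 # warmer, more wood detail
        "steps": 28,                 # more denoising steps, finer detail
        "cfg": 3.5,                  # stronger prompt adherence
        "denoise": 0.75,             # how far the model may drift from the input
        "controlnet_strength": 0.9,
    },
    "variation_2": {                 # more minimal and neutral
        "steps": 20,
        "cfg": 2.0,
        "denoise": 0.6,
        "controlnet_strength": 0.9,  # same geometry constraint in both runs
    },
}
```

Note that the ControlNet strength is held constant across both: the depth map anchors the geometry identically, and only the sampling parameters change the look.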

How They Compare

                    | Flux                             | Nano Banana (Gemini)                   | Qwen
ControlNet support  | Yes                              | No                                     | Yes
Depth map guidance  | Yes                              | No — image + prompt only               | Yes
Structural fidelity | High                             | Variable — depends on scene complexity | High
Visual quality      | Realistic materials and lighting | Strong overall aesthetics              | Good, with superior text rendering
Best for            | Design-accurate renders          | Quick mood boards, simple scenes       | Scenes with text/signage, editing tasks
Tradeoff            | Requires depth map extraction    | No structural guarantees               | Similar setup complexity to Flux

Conclusion

Short answer: yes, you still need ControlNet — if the output needs to respect the geometry of your design. The newer models are better, but "better" without depth map constraints still means the model is free to reimagine your floor plan.

Use Flux or Qwen with ControlNet when accuracy matters. Use Nano Banana when you want a quick, visually impressive render and don't need spatial precision.

All three workflows are linked above — clone them, swap in your own Viewer scenes, and tweak the parameters. The best way to evaluate these models is on your own data.


The work shared in this blog post has been presented at:

 
