timhigins
4 days ago
The coolest thing here might be the speed: for a given scene RenderFormer takes 0.0760 seconds while Blender Cycles takes 3.97 seconds (or 12.05 secs at a higher setting), while retaining a 0.9526 Structural Similarity Index Measure (0-1 where 1 is an identical image). See tables 2 and 1 in the paper.
This could possibly enable higher quality instant render previews for 3D designers in web or native apps using on-device transformer models.
Note the timings above were on an A100 with an unoptimized PyTorch version of the model. Obviously the average user's GPU is much less powerful, but for 3D designers it might still be powerful enough to see significant speedups over traditional rendering. Or a web-based system could even connect to A100s on the backend and stream the images to the browser.
The limitation is that it's not fully accurate, especially as scene complexity scales, e.g. with shadows of complex shapes (plus, I imagine, particles or strands), so final renders will probably still be done traditionally to avoid the nasty visual artifacts common in many AI-generated images/videos today. But who knows, it might be "good enough" and bring enough of a speed increase to justify use by big animation studios that need to render full movie-length previews for music, story review, etc.
OtherShrezzing
4 days ago
I don’t think the authors are being wilfully deceptive in any way, but Blender Cycles on a GPU of that quality could absolutely render every scene in this paper in less than 4s per frame. These are very modest tech-demo scenes with low complexity, and they’ve set Blender to 4,096 samples per pixel, which seems nonsensical: Blender would get close to its final output after a couple of hundred samples, then burn GPU time on the remaining ~3,800 samples making no visible improvement.
I think they’ve inadvertently included Blender’s instantiation phase in the overall rendering time, while not including the transformer instantiation.
I’d be interested to see the time to render the second frame for each system. My hunch is that Blender would be a lot more performant.
I do think the paper's results are fascinating in general, but there’s some nuance in the way they’ve configured and timed Blender.
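If anyone wants to check, a quick script along these lines (untested sketch of mine, not the paper's setup; run with blender --background scene.blend --python time_frames.py) should separate the warm-up cost from the steady-state frame time:

    import time
    import bpy

    scene = bpy.context.scene
    scene.render.engine = 'CYCLES'
    scene.cycles.device = 'GPU'
    scene.render.resolution_x = 512
    scene.render.resolution_y = 512

    # The first render pays for kernel compilation, scene sync and BVH build;
    # later frames should be much closer to the pure path-tracing time.
    for i in range(3):
        t0 = time.perf_counter()
        bpy.ops.render.render(write_still=False)
        print(f"frame {i}: {time.perf_counter() - t0:.2f}s")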
jsheard
4 days ago
Also of note is that the RenderFormer tests and Blender tests were done on the same Nvidia A100, which sounds sensible at first glance, but doesn't really make sense because Nvidia's big-iron compute cards (like the A100) lack the raytracing acceleration units present on the rest of their range. The A100 is just the wrong tool for the job here, you'd get vastly better Blender-performance-per-dollar from an Nvidia RTX card.
Blender's benchmark database doesn't have any results for the A100, but even the newer H100 gets smoked by (relatively) cheap consumer hardware:
Nvidia H100 NVL - 5,597.13
GeForce RTX 3090 Ti - 5,604.69
Apple M3 Ultra (80C) - 7,319.21
GeForce RTX 4090 - 11,082.51
GeForce RTX 5090 - 15,022.02
RTX PRO 6000 Blackwell - 16,336.54
rcxdude
4 days ago
Yeah, you would generally set Blender to a low sample count, maybe with an adaptive noise target, and use a denoising model, especially for preview or draft renders.
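Roughly this, via the Python API (a sketch with made-up values, not what the paper used):

    import bpy

    cy = bpy.context.scene.cycles
    cy.samples = 128                  # hard cap instead of 4,096
    cy.use_adaptive_sampling = True   # stop sampling pixels that hit the noise target
    cy.adaptive_threshold = 0.01
    cy.use_denoising = True           # denoiser cleans up whatever noise is left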
ttoinou
4 days ago
But rendering engines have been optimized for years, and this is a research paper. This technique will probably also be optimized for years and provide another 10x speedup.
qayxc
3 days ago
Sure, but algorithmic complexity beats linear factors, so unless they somehow manage to get from O(N²) to O(log N) for the triangle count, this technique can never come close to established traditional approaches, no matter the linear improvements.
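Back-of-envelope with toy cost models (my own numbers, not from the paper), attention over triangles vs. BVH traversal, relative to the current 4,096-triangle cap:

    import math

    base = 4_096  # RenderFormer's current triangle cap
    for n in (100_000, 1_000_000, 10_000_000):
        quad = (n / base) ** 2                  # O(N^2) cost relative to the cap
        logn = math.log2(n) / math.log2(base)   # O(log N) cost relative to the cap
        print(f"{n:>10,} tris: quadratic x{quad:>12,.0f}, logarithmic x{logn:.1f}")

Even a 10x linear speedup per year doesn't dent that gap.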
buildartefact
4 days ago
For the scenes that they’re showing, 76 ms is an eternity. Granted, it will get (a lot) faster, but it being better than traditional rendering is still a way off.
jsheard
4 days ago
Yeah, and the big caveat with this approach is that it scales quadratically with scene complexity, as opposed to the usual methods which are logarithmic. Their examples only have 4096 triangles at most for that reason. It's a cool potential direction for future research but there's a long way to go before it can wrangle real production scenes with hundreds of millions of triangles.
monster_truck
4 days ago
I'd sooner expect them to use this to 'feed' a larger neural path-tracing engine where you can get away with 1 sample every x frames. Those already do a pretty good job of generating great-looking images from what seems like noise.
I don't think the conventional similarity metric in the paper is all that important to them.
leloctai
4 days ago
Timing comparison with the reference is very disingenuous.
In ray tracing, error falls off with the square root of the sample count. While it is typical to use a very high sample count for the reference, real-world sample counts for offline renderers are about 1-2 orders of magnitude lower than in this paper.
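To put numbers on it (my arithmetic, not the paper's), production-ish sample counts land surprisingly close to the 4,096 spp reference for a fraction of the cost:

    # Monte Carlo noise falls off as 1/sqrt(spp)
    for spp in (128, 256, 1024, 4096):
        print(f"{spp:>4} spp: noise x{(4096 / spp) ** 0.5:.1f} vs reference, cost x{spp / 4096:.3f}")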
I call it disingenuous because it is very common for a graphics paper to include a very-high-sample-count reference image for quality comparison, but nobody ever does a timing comparison against it.
Since the result is approximate, a fair comparison would be with other approximate rendering algorithms. A modern realtime path tracer + denoiser can render much more complex scenes on a consumer GPU in less than 16 ms.
That's "much more complex scenes" part is the crucial part. Using transformer mean quadratic scaling on both number of triangles and number of output pixels. I'm not up to date with the latest ML research, so maybe it is improved now? But I don't think it will ever beat O(log n_triangles) and O(n_pixels) theoretical scaling of a typical path tracer. (Practical scaling wrt pixel count is sub linear due to high coherency of adjacent pixels)
cubefox
4 days ago
Modern optimized path tracers in games (probably not Blender) also use rasterization for primary visibility, which is O(n_triangles), but is somehow even faster than doing pure path tracing. I guess because it reduces the number of samples required to resolve high-frequency texture detail. Global illumination by itself tends to produce very soft (low-frequency) shadows and highlights, so in theory not a lot of samples are required, as long as the denoiser can avoid artifacts at low sample counts.
But yeah, no way RenderFormer in its current state can compete with modern ray tracing algorithms. Though the machine learning approach to rendering is still in its infancy.
cubefox
4 days ago
> The runtime-complexity of attention layers scales quadratically with the number of tokens, and thus triangles in our case. As a result, we limit the total number of triangles in our scenes to 4,096;
kilpikaarna
4 days ago
> The coolest thing here might be the speed: for a given scene RenderFormer takes 0.0760 seconds while Blender Cycles takes 3.97 seconds (or 12.05 secs at a higher setting), while retaining a 0.9526 Structural Similarity Index Measure (0-1 where 1 is an identical image). See tables 2 and 1 in the paper.
This sounds pretty wild to me. Scanned through it quickly but I couldn't find any details on how they set this up. Do they use the CPU or the CUDA kernel on an A100 for Cycles? Also, if this is doing single frames, an appreciable fraction of the 3.97 s might go into firing up the renderer. Time per frame would drop off when rendering a sequence.
And the complexity scaling per triangle mentioned in a sibling comment. Ouch!
fulafel
4 days ago
This reads like they used the GPU with Cycles:
"Table 2 compares the timings on the four scenes in Figure 1 of our
unoptimized RenderFormer (pure PyTorch implementation without
DNN compilation, but with pre-caching of kernels) and Blender Cy-
cles with 4,096 samples per pixel (matching RenderFormer’s training
data) at 512 × 512 resolution on a single NVIDIA A100 GPU."
esperent
4 days ago
> Blender Cycles with 4,096 samples per pixel (matching RenderFormer's training data)
This seems like an unfair comparison. It would be a lot more useful to know how long it takes Blender to also reach a 0.9526 Structural Similarity Index Measure against the training data. My guess is that with the denoiser turned on, something like 128 samples would be enough, or maybe even less for some images. At that point, on an A100 GPU, Blender would be close to, if not beating, the times here for these scenes.
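That would be easy to measure with something like this (filenames are placeholders for renders you'd produce yourself):

    import imageio.v3 as iio
    from skimage.metrics import structural_similarity as ssim

    ref = iio.imread("cycles_4096spp_reference.png")   # placeholder filenames
    test = iio.imread("cycles_128spp_denoised.png")
    print(f"SSIM vs reference: {ssim(ref, test, channel_axis=-1, data_range=255):.4f}")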
Kubuxu
4 days ago
Nobody runs 4,096 samples per pixel. In many cases 100-200 (or even fewer with denoising) are enough. You might run up into the low thousands if you want to resolve caustics.
jiggawatts
4 days ago
I wonder if the model could be refined on the fly by rendering small test patches using traditional methods and using that as the feedback for a LoRA tuning layer or some such.
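Roughly the standard low-rank adapter recipe, with the path-traced patches as ground truth (pure sketch; the refinement loop and any RenderFormer hooks are hypothetical):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base linear layer plus a small trainable low-rank update (B @ A).
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

    # Hypothetical on-the-fly refinement: path-trace a few small patches of the
    # current scene, then minimise the model's error on just those pixels.
    # loss = (model_output[patch] - path_traced_patch).abs().mean()
    # loss.backward(); optimizer.step()   # only the A/B adapters receive gradients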
timhigins
3 days ago
Thanks for these comments! Seems their measurement of Blender is off and we need some more in-depth benchmarks.