We present a method that computes an interpretable representation of material appearance within a highly compact, disentangled latent space. The representation is learned in a self-supervised fashion using a VAE-based model. We train our model on a carefully designed unlabeled dataset, avoiding the biases that human-generated labels can induce. Despite the absence of explicit supervision, our model demonstrates strong disentanglement and interpretability, effectively encoding material appearance and illumination. To showcase the capabilities of this representation, we leverage it in two proof-of-concept applications: image-based appearance transfer and editing. Our representation conditions a diffusion pipeline that transfers the appearance of one or more images onto a target geometry, and allows the user to further edit the resulting appearance. This approach offers fine-grained control over the generated results: thanks to the well-structured, compact latent space, users can intuitively manipulate attributes such as hue or glossiness in image space to achieve the desired final appearance.
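As a concrete illustration of this kind of editing, the sketch below shifts a 6D appearance code along a single latent dimension before it is passed on to the generator. Note that `encoder`, the dimension index, and the edit magnitude are hypothetical placeholders for illustration, not the released interface; which dimension corresponds to which attribute would be read off the model's prior traversals.

```python
import torch

# Hypothetical setup: assume encode(image) -> (1, 6) appearance code, and that
# inspecting prior traversals showed dimension 3 to correlate with glossiness.
GLOSSINESS_DIM = 3  # assumed index, not part of the actual release

def edit_appearance(code: torch.Tensor, dim: int, delta: float) -> torch.Tensor:
    """Shift one latent coordinate; because the space is disentangled,
    the remaining appearance attributes stay untouched."""
    edited = code.clone()
    edited[..., dim] += delta
    return edited

# Usage (assuming `encoder` maps an image tensor to a (1, 6) code):
# code = encoder(image)
# glossier = edit_appearance(code, GLOSSINESS_DIM, +1.5)
```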
We present an adapted version of the FactorVAE architecture [Kim2018] for appearance-geometry disentanglement, together with a modified training loss that helps learn an interpretable latent space via self-supervised learning. Left: diagram of the modified architecture. Right: prior-traversal plot of the proposed Appearance Encoder.
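For readers unfamiliar with the base objective, the sketch below reproduces the standard FactorVAE total-correlation penalty [Kim2018] that our modified loss builds on: a discriminator estimates the density ratio between the aggregate posterior and the product of its marginals, and the resulting total-correlation estimate is added to the ELBO. The γ weight and function signatures are illustrative, and our actual modifications to this loss are not reflected here.

```python
import torch
import torch.nn.functional as F

def permute_dims(z: torch.Tensor) -> torch.Tensor:
    # Shuffle each latent dimension independently across the batch so the
    # result approximates samples from the product of marginals q_bar(z).
    B, D = z.shape
    return torch.stack([z[torch.randperm(B), d] for d in range(D)], dim=1)

def factorvae_step(x, x_hat, mu, logvar, z, disc, gamma=6.4):
    # Standard ELBO terms: reconstruction + KL to the isotropic Gaussian prior.
    recon = F.mse_loss(x_hat, x, reduction="none").flatten(1).sum(1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
    # Density-ratio trick: disc outputs 2 logits (class 0 = joint q(z),
    # class 1 = product of marginals); the logit difference estimates
    # log q(z)/q_bar(z), i.e. the total correlation of the code.
    tc_logits = disc(z)
    tc = (tc_logits[:, 0] - tc_logits[:, 1]).mean()
    vae_loss = recon + kl + gamma * tc
    # The discriminator is trained separately to tell real codes (class 0)
    # from dimension-permuted ones (class 1); gradients do not reach the VAE.
    zeros = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    ones = torch.ones_like(zeros)
    disc_loss = 0.5 * (F.cross_entropy(disc(z.detach()), zeros)
                       + F.cross_entropy(disc(permute_dims(z.detach())), ones))
    return vae_loss, disc_loss
```

Minimizing the γ-weighted term pushes the encoder toward codes the discriminator cannot distinguish from factorized samples, which is what encourages one attribute per dimension in the prior traversals shown on the right.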
Diffusion-based pipeline for proof-of-concept applications of our space. Our pipeline uses two branches to condition the generative process of Stable Diffusion XL (SDXL). The appearance branch leverages our encoder to produce a 6D feature vector representing the desired appearance; this representation can be further edited along each of its six dimensions, providing fine-grained control over the final appearance. The geometry branch uses ControlNet to condition generation on Canny edges and depth information. We show appearance transfer from an input image (bunny) to a target image (David), as well as editing along different dimensions of the latent space (right). Other uses, such as direct editing or selective transfer, are also possible.
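A minimal sketch of the geometry branch with off-the-shelf diffusers components is shown below, assuming the public SDXL ControlNet checkpoints named in the code; file names and the text prompt are placeholders. The appearance branch, which injects the 6D code into SDXL's generative process, relies on our custom conditioning and is only stubbed out as a comment here.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Geometry branch: two ControlNets (Canny edges + depth), as in the figure.
controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0",
                                    torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0",
                                    torch_dtype=torch.float16),
]
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnets, torch_dtype=torch.float16,
).to("cuda")

def canny_map(image: Image.Image) -> Image.Image:
    # Extract Canny edges and replicate them to three channels for ControlNet.
    edges = cv2.Canny(np.array(image), 100, 200)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

target = Image.open("david.png").convert("RGB").resize((1024, 1024))
depth = Image.open("david_depth.png").convert("RGB").resize((1024, 1024))

# Appearance branch (not shown): the 6D code from the appearance encoder,
# optionally edited along individual dimensions, would condition SDXL here;
# that injection mechanism is the paper's contribution and is omitted.
result = pipe(
    prompt="a marble statue",  # placeholder text prompt
    image=[canny_map(target), depth],
    controlnet_conditioning_scale=[0.5, 0.5],
).images[0]
result.save("transfer.png")
```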
BibTeX code to be updated when notified