We present a method that computes an interpretable representation of material appearance within a highly compact, disentangled latent space. The representation is learned in a self-supervised fashion using a VAE-based model. We train our model on a carefully designed unlabeled dataset, avoiding the biases that human-generated labels can induce. Despite the absence of explicit supervision, our model demonstrates strong disentanglement and interpretability, effectively encoding material appearance and illumination. To showcase the capabilities of this representation, we leverage it in two proof-of-concept applications: image-based appearance transfer and editing. Our representation conditions a diffusion pipeline that transfers the appearance of one or more images onto a target geometry, and allows the user to further edit the resulting appearance. This approach offers fine-grained control over the generated results: thanks to the well-structured, compact latent space, users can intuitively manipulate attributes such as hue or glossiness in image space to achieve the desired final appearance.
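As a concrete illustration of this kind of editing, the sketch below shifts a 6D appearance code along a single latent dimension before it is passed on to the generator. Note that `encoder`, the dimension index, and the edit magnitude are hypothetical placeholders for illustration, not the released interface; which dimension corresponds to which attribute would be read off the model's prior traversals.

```python
import torch

# Hypothetical setup: assume encode(image) -> (1, 6) appearance code, and that
# inspecting prior traversals showed dimension 3 to correlate with glossiness.
GLOSSINESS_DIM = 3  # assumed index, not part of the actual release

def edit_appearance(code: torch.Tensor, dim: int, delta: float) -> torch.Tensor:
    """Shift one latent coordinate; because the space is disentangled,
    the remaining appearance attributes stay untouched."""
    edited = code.clone()
    edited[..., dim] += delta
    return edited

# Usage (assuming `encoder` maps an image tensor to a (1, 6) code):
# code = encoder(image)
# glossier = edit_appearance(code, GLOSSINESS_DIM, +1.5)
```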
We present an adapted version of the FactorVAE architecture [Kim2018] for appearance-geometry disentanglement, together with a modified training loss that helps learn an interpretable latent space via self-supervised learning. Left: diagram of the modified architecture. Right: prior-traversal plot of the proposed Appearance Encoder.
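For readers unfamiliar with the base objective, the sketch below reproduces the standard FactorVAE total-correlation penalty [Kim2018] that our modified loss builds on: a discriminator estimates the density ratio between the aggregate posterior and the product of its marginals, and the resulting total-correlation estimate is added to the ELBO. The γ weight and function signatures are illustrative, and our actual modifications to this loss are not reflected here.

```python
import torch
import torch.nn.functional as F

def permute_dims(z: torch.Tensor) -> torch.Tensor:
    # Shuffle each latent dimension independently across the batch so the
    # result approximates samples from the product of marginals q_bar(z).
    B, D = z.shape
    return torch.stack([z[torch.randperm(B), d] for d in range(D)], dim=1)

def factorvae_step(x, x_hat, mu, logvar, z, disc, gamma=6.4):
    # Standard ELBO terms: reconstruction + KL to the isotropic Gaussian prior.
    recon = F.mse_loss(x_hat, x, reduction="none").flatten(1).sum(1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
    # Density-ratio trick: disc outputs 2 logits (class 0 = joint q(z),
    # class 1 = product of marginals); the logit difference estimates
    # log q(z)/q_bar(z), i.e. the total correlation of the code.
    tc_logits = disc(z)
    tc = (tc_logits[:, 0] - tc_logits[:, 1]).mean()
    vae_loss = recon + kl + gamma * tc
    # The discriminator is trained separately to tell real codes (class 0)
    # from dimension-permuted ones (class 1); gradients do not reach the VAE.
    zeros = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    ones = torch.ones_like(zeros)
    disc_loss = 0.5 * (F.cross_entropy(disc(z.detach()), zeros)
                       + F.cross_entropy(disc(permute_dims(z.detach())), ones))
    return vae_loss, disc_loss
```

Minimizing the γ-weighted term pushes the encoder toward codes the discriminator cannot distinguish from factorized samples, which is what encourages one attribute per dimension in the prior traversals shown on the right.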
Diffusion-based pipeline for proof-of-concept applications of our space. Our pipeline uses two branches to condition the generative process of Stable Diffusion XL (SDXL). The appearance branch leverages our encoder to produce a 6D feature vector representing the desired appearance; this representation can be further edited along each of its six dimensions, providing fine-grained control over the final appearance. The geometry branch uses ControlNet to condition generation on Canny edges and depth information. We show appearance transfer from an input image (bunny) to a target image (David), as well as editing along different dimensions of the latent space (right). Other uses, such as direct editing or selective transfer, are also possible.
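A minimal sketch of the geometry branch with off-the-shelf diffusers components is shown below, assuming the public SDXL ControlNet checkpoints named in the code; file names and the text prompt are placeholders. The appearance branch, which injects the 6D code into SDXL's generative process, relies on our custom conditioning and is only stubbed out as a comment here.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Geometry branch: two ControlNets (Canny edges + depth), as in the figure.
controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0",
                                    torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0",
                                    torch_dtype=torch.float16),
]
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnets, torch_dtype=torch.float16,
).to("cuda")

def canny_map(image: Image.Image) -> Image.Image:
    # Extract Canny edges and replicate them to three channels for ControlNet.
    edges = cv2.Canny(np.array(image), 100, 200)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

target = Image.open("david.png").convert("RGB").resize((1024, 1024))
depth = Image.open("david_depth.png").convert("RGB").resize((1024, 1024))

# Appearance branch (not shown): the 6D code from the appearance encoder,
# optionally edited along individual dimensions, would condition SDXL here;
# that injection mechanism is the paper's contribution and is omitted.
result = pipe(
    prompt="a marble statue",  # placeholder text prompt
    image=[canny_map(target), depth],
    controlnet_conditioning_scale=[0.5, 0.5],
).images[0]
result.save("transfer.png")
```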
BibTeX code to be updated when notified