Computer-generated imagery is now ubiquitous in our society, spanning fields such as games and movies, architecture, engineering, or virtual prototyping, while also helping create novel ones such as computational materials. With the increase in computational power and the improvement of acquisition techniques, there has been a paradigm shift in the field towards data-driven techniques, which has yielded an unprecedented level of realism in visual appearance.
Unfortunately, this leads to a series of problems: First, there is a disconnect between the mathematical representation of the data and any meaningful parameters that humans understand; the captured data is machine-friendly, but not human-friendly. Second, the many different acquisition systems lead to heterogeneous formats and very large datasets. And third, real-world appearance functions are usually nonlinear and high-dimensional. As a result, visual appearance datasets are increasingly unfit for editing operations, which limits the creative process for scientists, engineers, artists and practitioners in general. There is an immense gap between the complexity, realism and richness of the captured data, and the flexibility to edit such data. This line of research plans to bridge this gap, putting the user at the core. Achieving our goals will finally enable us to reach the true potential of real-world captured datasets in many aspects of society.
Abstract: Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
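The slider idea in this abstract can be illustrated with a minimal sketch. It assumes that embeddings for the source and target prompts have already been sampled (here they are toy 3D vectors, not real CLIP embeddings), that the editing direction is the normalized difference of their means, and that the identity-changing axes to suppress are known and orthonormal. All function names are hypothetical, and this is not the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def editing_direction(source_embs, target_embs):
    """Unit direction from the mean source embedding to the mean target one."""
    src = mean_vector(source_embs)
    tgt = mean_vector(target_embs)
    return normalize([t - s for s, t in zip(src, tgt)])


def remove_components(direction, nuisance_axes):
    """Project out orthonormal axes assumed to change texture identity.

    Assumes the direction is not entirely contained in those axes.
    """
    d = list(direction)
    for axis in nuisance_axes:
        c = dot(d, axis)
        d = [x - c * a for x, a in zip(d, axis)]
    return normalize(d)


def apply_slider(embedding, direction, strength):
    """Move an embedding along the editing direction by a slider amount."""
    return [e + strength * d for e, d in zip(embedding, direction)]


# Toy usage: two sampled embeddings per prompt, one nuisance axis (z).
aged = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.2]]
new = [[1.0, 1.0, 0.5], [1.0, 1.0, 0.7]]
direction = remove_components(editing_direction(aged, new), [[0.0, 0.0, 1.0]])
edited = apply_slider([1.0, 0.0, 0.0], direction, 0.5)
```

In a real pipeline the edited embedding would condition the diffusion model; that step is outside the scope of this sketch.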
Abstract: This work presents a perceptually-motivated manifold for translucent appearance, designed for intuitive editing of translucent materials by navigating through the manifold. Classic tools for editing translucent appearance, based on the use of sliders to tune a number of parameters, are challenging for non-expert users: these parameters have a highly non-linear effect on appearance, and exhibit complex interplay and similarity relations between them. Instead, we pose editing as a navigation task in a low-dimensional space of appearances, which abstracts the user from the underlying optical parameters. To achieve this, we build a low-dimensional continuous manifold of translucent appearances that correlates with how humans perceive these types of materials. We first analyze the correlation of different distance metrics with human perception. We select the best-performing metric to build a low-dimensional manifold, which can be used to navigate the space of translucent appearance. To evaluate the validity of our proposed manifold within its intended application scenario, we build an editing interface that leverages the manifold, and relies on image navigation plus a fine-tuning step to edit appearance. We compare our intuitive interface to a traditional, slider-based one in a user study, demonstrating its effectiveness and superior performance when editing translucent objects.
Abstract: Estimating perceptual attributes of materials directly from images is a challenging task due to their complex, not fully-understood interactions with external factors, such as geometry and lighting. Supervised deep learning models have recently been shown to outperform traditional approaches, but rely on large datasets of human-annotated images for accurate perception predictions. Obtaining reliable annotations is a costly endeavor, aggravated by the limited ability of these models to generalise to different aspects of appearance. In this work, we show how a much smaller set of human annotations (strong labels) can be effectively augmented with automatically derived weak labels in the context of learning a low-dimensional image-computable gloss metric. We evaluate three alternative weak labels for predicting human gloss perception from limited annotated data. Incorporating weak labels enhances our gloss prediction beyond the current state of the art. Moreover, it enables a substantial reduction in human annotation costs without sacrificing accuracy, whether working with rendered images or real photographs.
Abstract: We introduce text2fabric, a novel dataset that links free-text descriptions to various fabric materials. The dataset comprises 15,000 natural language descriptions associated to 3,000 corresponding images of fabric materials. Traditionally, material descriptions come in the form of tags/keywords, which limits their expressivity, induces pre-existing knowledge of the appropriate vocabulary, and ultimately leads to a chopped description system. Therefore, we study the use of free-text as a more appropriate way to describe material appearance, taking the use case of fabrics as a common item that non-experts may often deal with. Based on the analysis of the dataset, we identify a compact lexicon, set of attributes and key structure that emerge from the descriptions. This allows us to accurately understand how people describe fabrics and draw directions for generalization to other types of materials. We also show that our dataset enables specializing large vision-language models such as CLIP, creating a meaningful latent space for fabric appearance, and significantly improving applications such as fine-grained material retrieval and automatic captioning.
Abstract: In everyday photography, physical limitations of camera sensors and lenses frequently lead to a variety of degradations in captured images such as saturation or defocus blur. A common approach to overcome these limitations is to resort to image stack fusion, which involves capturing multiple images with different focal distances or exposures. For instance, to obtain an all-in-focus image, a set of multi-focus images is captured. Similarly, capturing multiple exposures allows for the reconstruction of high dynamic range. In this paper, we present a novel approach that combines neural fields with an expressive camera model to achieve a unified reconstruction of an all-in-focus high-dynamic-range image from an image stack. Our approach is composed of a set of specialized implicit neural representations tailored to address specific sub-problems along our pipeline: We use neural implicits to predict flow to overcome misalignments arising from lens breathing, depth, and all-in-focus images to account for depth of field, as well as tonemapping to deal with sensor responses and saturation -- all trained using a physically inspired supervision structure with a differentiable thin lens model at its core. An important benefit of our approach is its ability to handle these tasks simultaneously or independently, providing flexible post-editing capabilities such as refocusing and exposure adjustment. By sampling the three primary factors in photography within our framework (focal distance, aperture, and exposure time), we conduct a thorough exploration to gain valuable insights into their significance and impact on overall reconstruction quality. Through extensive validation, we demonstrate that our method outperforms existing approaches in both depth-from-defocus and all-in-focus image reconstruction tasks. 
Moreover, our approach exhibits promising results in each of these three dimensions, showcasing its potential to enhance captured image quality and provide greater control in post-processing.
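As a rough illustration of the thin lens model mentioned in this abstract, the sketch below computes the textbook circle-of-confusion diameter for a point at a given depth. The function name and parameterization are assumptions made for illustration; the paper's differentiable formulation may differ.

```python
def circle_of_confusion(depth, focus_distance, focal_length, f_number):
    """Blur-circle diameter on the sensor under the thin lens model.

    All distances share one unit (e.g. metres). The aperture diameter is
    focal_length / f_number; points at the focus distance get zero blur.
    """
    if focus_distance <= focal_length:
        raise ValueError("focus distance must exceed the focal length")
    if depth <= 0:
        raise ValueError("depth must be positive")
    aperture = focal_length / f_number
    return (aperture * focal_length * abs(depth - focus_distance)
            / (depth * (focus_distance - focal_length)))


# Usage: a 50 mm lens at f/2 focused at 2 m, looking at a point 4 m away.
blur = circle_of_confusion(depth=4.0, focus_distance=2.0,
                           focal_length=0.05, f_number=2.0)
```

Because the expression is built from differentiable operations everywhere except exactly at the focus distance, the same formula can be written in an autodiff framework to supervise depth and all-in-focus estimates from a focal stack.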
Abstract: A faithful reproduction of gloss is inherently difficult because of the limited dynamic range, peak luminance, and 3D capabilities of display devices. This work investigates how the display capabilities affect gloss appearance with respect to a real-world reference object. To this end, we employ an accurate imaging pipeline to achieve a perceptual gloss match between a virtual and real object presented side-by-side on an augmented-reality high-dynamic-range (HDR) stereoscopic display, which has not been previously attained to this extent. Based on this precise gloss reproduction, we conduct a series of gloss matching experiments to study how gloss perception degrades based on individual factors: object albedo, display luminance, dynamic range, stereopsis, and tone mapping. We support the study with a detailed analysis of individual factors, followed by an in-depth discussion on the observed perceptual effects. Our experiments demonstrate that stereoscopic presentation has a limited effect on the gloss matching task on our HDR display. However, both reduced luminance and dynamic range of the display reduce the perceived gloss. This means that the visual system cannot compensate for the changes in gloss appearance across luminance (lack of gloss constancy), and the tone mapping operator should be carefully selected when reproducing gloss on a low dynamic range (LDR) display.
Abstract: Intuitively editing the appearance of materials from a single image is a challenging task given the complexity of the interactions between light and matter, and the ambivalence of human perception. This problem has been traditionally addressed by estimating additional factors of the scene like geometry or illumination, thus solving an inverse rendering problem and subordinating the final quality of the results to the quality of these estimations. We present a single-image appearance editing framework that allows us to intuitively modify the material appearance of an object by increasing or decreasing high-level perceptual attributes describing such appearance (e.g., glossy or metallic). Our framework takes as input an in-the-wild image of a single object, where geometry, material, and illumination are not controlled, and inverse rendering is not required. We rely on generative models and devise a novel architecture with Selective Transfer Unit (STU) cells that allow preserving the high-frequency details from the input image in the edited one. To train our framework we leverage a dataset with pairs of synthetic images rendered with physically-based algorithms, and the corresponding crowd-sourced ratings of high-level perceptual attributes. We show that our material editing framework outperforms the state of the art, and showcase its applicability on synthetic images, in-the-wild real-world photographs, and video sequences.
Abstract: Most in-the-wild images are stored in Low Dynamic Range (LDR) form, serving as a partial observation of the High Dynamic Range (HDR) visual world. Despite limited dynamic range, these LDR images are often captured with different exposures, implicitly containing information about the underlying HDR image distribution. Inspired by this intuition, in this work we present, to the best of our knowledge, the first method for learning a generative model of HDR images from in-the-wild LDR image collections in a fully unsupervised manner. The key idea is to train a generative adversarial network (GAN) to generate HDR images which, when projected to LDR under various exposures, are indistinguishable from real LDR images. The projection from HDR to LDR is achieved via a camera model that captures the stochasticity in exposure and camera response function. Experiments show that our method GlowGAN can synthesize photorealistic HDR images in many challenging cases such as landscapes, lightning, or windows, where previous supervised generative models produce overexposed images. With the assistance of GlowGAN, we showcase the novel application of unsupervised inverse tone mapping (GlowGAN-ITM) that sets a new paradigm in this field. Unlike previous methods that gradually complete information from LDR input, GlowGAN-ITM searches the entire HDR image manifold modeled by GlowGAN for the HDR images which can be mapped back to the LDR input. GlowGAN-ITM achieves more realistic reconstruction of overexposed regions compared to state-of-the-art supervised learning models, despite not requiring HDR images or paired multi-exposure images for training.
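The HDR-to-LDR projection described in this abstract can be sketched as follows. This is a simplified stand-in, not GlowGAN's camera model: it assumes a scalar exposure drawn from a log-normal distribution, a plain gamma curve in place of a learned or sampled camera response function, and 8-bit quantization. All names and default values are hypothetical.

```python
import random


def random_exposure(rng, log2_mean=0.0, log2_std=1.0):
    """Draw a positive exposure scale, Gaussian in stops (log2 units)."""
    return 2.0 ** rng.gauss(log2_mean, log2_std)


def hdr_to_ldr(hdr_pixels, exposure, gamma=2.2):
    """Project linear HDR values to 8-bit LDR codes.

    Steps: scale by exposure, clip to the sensor range [0, 1] (this is
    where highlights saturate), apply a gamma response, then quantize.
    """
    ldr = []
    for value in hdr_pixels:
        exposed = max(0.0, min(1.0, value * exposure))
        encoded = exposed ** (1.0 / gamma)
        ldr.append(round(encoded * 255))
    return ldr


# Usage: the same HDR pixels observed under two random exposures.
rng = random.Random(0)
hdr = [0.0, 0.5, 4.0]
views = [hdr_to_ldr(hdr, random_exposure(rng)) for _ in range(2)]
```

In an adversarial setup, a discriminator would only ever see such projected images, so the generator has to produce HDR content that looks plausible under every sampled exposure.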
Abstract: Our visual perception of the world is strongly influenced by material appearance. Humans can easily recognize and discriminate materials, despite the influence on their final appearance of confounding factors such as illumination or surface geometry. However, understanding material appearance and perceived properties such as glossiness remains challenging. Recent literature has shown how unsupervised generative neural networks can spontaneously learn perceptually-meaningful latent representations from simple stimuli renderings of bumpy surfaces, and cluster them according to glossiness despite receiving no explicit information about it. Furthermore, those representations correlate better with human perception of gloss than the physical parameters of the materials, suggesting that our brains may decipher glossiness by learning the statistical structure of images. In this work, we analyze the performance of such unsupervised learning models on a wider variety of complex real-world images, including realistic object geometries, real environment maps, and measured materials. We train a PixelVAE generative network in an unsupervised manner on a dataset containing three different geometries under three different illuminations, using more than 300 materials. We study the latent representations found by our model without receiving any prior knowledge. Our results show that the model clusters the stimuli hierarchically, suggesting that geometry could be the most relevant appearance factor, followed by illumination. This is different from previous experiments using abstract bumpy surfaces, where the role of geometry was less prominent due to the randomness of the bumps. Finally, we analyze how our (unsupervised) learned latent representations correlate with human ratings of glossiness perception, showing a reasonable organization despite the complex interactions with geometry and lightness. 
In conclusion, our results suggest that unsupervised learning representations may help to understand human visual perception of material appearance even in the presence of complex stimuli.
Abstract: A good match of material appearance between real-world objects and their digital on-screen representations is critical for many applications such as fabrication, design, and e-commerce. However, faithful appearance reproduction is challenging, especially for complex phenomena, such as gloss. In most cases, the view-dependent nature of gloss and the range of luminance values required for reproducing glossy materials exceeds the current capabilities of display devices. As a result, appearance reproduction poses significant problems even with accurately rendered images. This paper studies the gap between the gloss perceived from real-world objects and their digital counterparts. Based on our psychophysical experiments on a wide range of 3D printed samples and their corresponding photographs, we derive insights on the influence of geometry, illumination, and the display's brightness and measure the change in gloss appearance due to the display limitations. Our evaluation experiments demonstrate that using the prediction to correct material parameters in a rendering system improves the match of gloss appearance between real objects and their visualization on a display device.
Abstract: High Dynamic Range (HDR) content is becoming ubiquitous due to the rapid development of capture technologies. Nevertheless, the dynamic range of common display devices is still limited, therefore tone mapping (TM) remains a key challenge for image visualization. Recent work has demonstrated that neural networks can achieve remarkable performance in this task when compared to traditional methods, however, the quality of the results of these learning-based methods is limited by the training data. Most existing works use as training set a curated selection of best-performing results from existing traditional tone mapping operators (often guided by a quality metric), therefore, the quality of newly generated results is fundamentally limited by the performance of such operators. This quality might be even further limited by the pool of HDR content that is used for training. In this work we propose a learning-based self-supervised tone mapping operator that is trained at test time specifically for each HDR image and does not need any data labeling. The key novelty of our approach is a carefully designed loss function built upon fundamental knowledge on contrast perception that allows for directly comparing the content in the HDR and tone mapped images. We achieve this goal by reformulating classic VGG feature maps into feature contrast maps that normalize local feature differences by their average magnitude in a local neighborhood, allowing our loss to account for contrast masking effects. We perform extensive ablation studies and exploration of parameters and demonstrate that our solution outperforms existing approaches with a single set of fixed parameters, as confirmed by both objective and subjective metrics.
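To make the notion of a feature contrast map more concrete, here is a minimal sketch on a single 2D feature map. It assumes a square box neighborhood and a small stabilizing constant; the actual window shape, the VGG layers used, and the loss built on top of the maps are not specified here, so the names and parameters below are hypothetical.

```python
def local_mean(fmap, radius):
    """Box-filtered mean of a 2D map, shrinking the window at the borders."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            rows = range(max(0, y - radius), min(h, y + radius + 1))
            cols = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [fmap[j][i] for j in rows for i in cols]
            out[y][x] = sum(vals) / len(vals)
    return out


def feature_contrast(fmap, radius=1, eps=1e-6):
    """Local feature difference divided by local average magnitude.

    The numerator is the deviation from the neighborhood mean; the
    denominator is the mean absolute feature value in that neighborhood.
    """
    h, w = len(fmap), len(fmap[0])
    mean = local_mean(fmap, radius)
    mean_abs = local_mean([[abs(v) for v in row] for row in fmap], radius)
    return [[(fmap[y][x] - mean[y][x]) / (mean_abs[y][x] + eps)
             for x in range(w)] for y in range(h)]


# Usage: a toy 3x3 feature map with one strong response in the centre.
features = [[1.0, 1.0, 1.0],
            [1.0, 5.0, 1.0],
            [1.0, 1.0, 1.0]]
contrast = feature_contrast(features)
```

Because of the normalization, multiplying the whole map by a constant leaves the contrast map essentially unchanged, which is what makes features of an HDR image and of its tone mapped version comparable despite very different magnitudes.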
Abstract: Despite advances in display technology, many existing applications rely on psychophysical datasets of human perception gathered using older, sometimes outdated displays. As a result, there exists the underlying assumption that such measurements can be carried over to the new viewing conditions of more modern technology. We have conducted a series of psychophysical experiments to explore contrast sensitivity using a state-of-the-art HDR display, taking into account not only the spatial frequency and luminance of the stimuli but also their surrounding luminance levels. From our data, we have derived a novel surround-aware contrast sensitivity function (CSF), which predicts human contrast sensitivity more accurately. We additionally provide a practical version that retains the benefits of our full model, while enabling easy backward compatibility and consistently producing good results across many existing applications that make use of CSF models. We show examples of effective HDR video compression using a transfer function derived from our CSF, tone-mapping, and improved accuracy in visual difference prediction.
Abstract: Single-image appearance editing is a challenging task, traditionally requiring the estimation of additional scene properties such as geometry or illumination. Moreover, the exact interaction of light, shape and material reflectance that elicits a given perceptual impression is still not well understood. We present an image-based editing method that allows modifying the material appearance of an object by increasing or decreasing high-level perceptual attributes, using a single image as input. Our framework relies on a two-step generative network, where the first step drives the change in appearance and the second produces an image with high-frequency details. For training, we augment an existing material appearance dataset with perceptual judgements of high-level attributes, collected through crowd-sourced experiments, and build upon training strategies that circumvent the cumbersome need for original-edited image pairs. We demonstrate the editing capabilities of our framework on a variety of inputs, both synthetic and real, using two common perceptual attributes (Glossy and Metallic), and validate the perception of appearance in our edited images through a user study.
Abstract: Translucent materials are ubiquitous in the real world, from organic materials such as food or human skin, to synthetic materials like plastic or rubber. While multiple models for translucent materials exist, understanding how we perceive translucent appearance, and how it is affected by illumination and geometry, remains an open problem. In this work, we analyze how well human observers estimate the density of translucent objects for static and dynamic illumination scenarios. Interestingly, our results suggest that dynamic illumination may not be critical to assess the nature of translucent materials.
Abstract: Material appearance hinges on material reflectance properties but also surface geometry and illumination. The unlimited number of potential combinations between these factors makes understanding and predicting material appearance a very challenging task. In this work, we collect a large-scale dataset of perceptual ratings of appearance attributes with more than 215,680 responses for 42,120 distinct combinations of material, shape, and illumination. The goal of this dataset is twofold. First, we analyze for the first time the effects of illumination and geometry in material perception across such a large collection of varied appearances. We connect our findings to those of the literature, discussing how previous knowledge generalizes across very diverse materials, shapes, and illuminations. Second, we use the collected dataset to train a deep learning architecture for predicting perceptual attributes that correlate with human judgments. We demonstrate the consistent and robust behavior of our predictor in various challenging scenarios, which, for the first time, enables estimating perceived material attributes from general 2D images. Since our predictor relies on the final appearance in an image, it can compare appearance properties across different geometries and illumination conditions. Finally, we demonstrate several applications that use our predictor, including appearance reproduction using 3D printing, BRDF editing by integrating our predictor in a differentiable renderer, illumination design, or material recommendations for scene design.
Abstract: We present a single-image data-driven method to automatically relight images with full-body humans in them. Our framework is based on a realistic scene leveraging precomputed radiance transfer (PRT) and spherical harmonics (SH) lighting. In contrast to previous work, we lift the assumptions on Lambertian materials and explicitly model diffuse and specular reflectance in our data. Moreover, we introduce an additional light-dependent residual term that accounts for errors in the PRT-based image reconstruction. We propose a new deep learning architecture, tailored to the decomposition performed in PRT, that is trained using a combination of L1, logarithmic, and rendering losses. Our model outperforms the state of the art for full-body human relighting both with synthetic images and photographs.
Abstract: Painters are masters in replicating the visual appearance of materials. While the perception of material appearance is not yet fully understood, painters seem to have acquired an implicit understanding of the key visual cues that we need to accurately perceive material properties. In this study, we directly compare the perception of material properties in paintings and in renderings, by collecting professional realistic paintings of rendered materials. From both types of images, we collect human judgments of material properties and compute a variety of image features that are known to reflect material properties. Our study reveals that, despite important visual differences between the two types of depiction, material properties in paintings and renderings are perceived very similarly and are linked to the same image features. This suggests that we use similar visual cues independently of the medium and that the presence of such cues is sufficient to provide a good appearance perception of the materials.
Abstract: Observing and recognizing materials is a fundamental part of our daily life. Under typical viewing conditions, we are capable of effortlessly identifying the objects that surround us and recognizing the materials they are made of. Nevertheless, understanding the underlying perceptual processes that take place to accurately discern the visual properties of an object is a long-standing problem. In this work, we perform a comprehensive and systematic analysis of how the interplay of geometry, illumination, and their spatial frequencies affect human performance on material recognition tasks. We carry out large-scale behavioral experiments where participants are asked to recognize different reference materials among a pool of candidate samples. In the different experiments, we carefully sample the information in the frequency domain of the stimuli. From our analysis, we find significant first-order interactions between the geometry and the illumination, of both the reference and the candidates. In addition, we observe that simple image statistics and higher-order image histograms do not correlate with human performance, therefore, we perform a high-level comparison of highly non-linear statistics by training a deep neural network on material recognition tasks. Our results show that such models can accurately classify materials, which suggests that they are capable of defining a meaningful representation of material appearance from labeled proximal image data. Last, we find preliminary evidence that these highly non-linear models and humans may use similar high-level factors for material recognition tasks.
Abstract: Establishing a robust measure for material similarity that correlates well with human perception is a long-standing problem. A recent work presented a deep learning model trained to produce a feature space that aligns with human perception by gathering human subjective measures. The resulting metric outperforms existing objective ones. In this work, we aim to understand whether this increased performance is a result of using human perceptual data or is due to the nature of the features learnt by deep learning models. We train similar networks with objective measures (BRDF similarity or classification task) and show that these networks can predict human judgements as well, suggesting that the non-linear features learnt by convolutional networks might be a key to modeling material perception.
Abstract: We present a model to measure the similarity in appearance between different materials, which correlates with human similarity judgments. We first create a database of 9,000 rendered images depicting objects with varying materials, shape and illumination. We then gather data on perceived similarity from crowdsourced experiments; our analysis of over 114,840 answers suggests that indeed a shared perception of appearance similarity exists. We feed this data to a deep learning architecture with a novel loss function, which learns a feature space for materials that correlates with such perceived appearance similarity. Our evaluation shows that our model outperforms existing metrics. Last, we demonstrate several applications enabled by our metric, including appearance-based search for material suggestions, database visualization, clustering and summarization, and gamut mapping.
Abstract: We analyze the effect of motion in the perception of material appearance. First, we create a set of stimuli containing 72 realistic materials, rendered with varying degrees of linear motion blur. Then we launch a large-scale study on Mechanical Turk to rate a given set of perceptual attributes, such as brightness, roughness, or the perceived strength of reflections. Our statistical analysis shows that certain attributes undergo a significant change, varying appearance perception under motion. In addition, we further investigate the perception of brightness, for the particular cases of rubber and plastic materials. We create new stimuli, with ten different luminance levels and seven motion degrees. We launch a new user study to retrieve their perceived brightness. From the users’ judgements, we build two-dimensional maps showing how perceived brightness varies as a function of the luminance and motion of the material.
Abstract: Accurately modeling how light interacts with cloth is challenging, due to the volumetric nature of cloth appearance and its multiscale structure, where microstructures play a major role in the overall appearance at higher scales. Recently, significant effort has been put into developing better microscopic models for cloth structure, which have allowed rendering fabrics with unprecedented fidelity. However, these highly-detailed representations still make severe simplifications on the scattering by individual fibers forming the cloth, ignoring the impact of fibers' shape, and failing to establish connections between the fibers' appearance and their optical and fabrication parameters. In this work we focus on the scattering of individual cloth fibers; we introduce a physically-based scattering model for fibers based on their low-level optical and geometric properties, relying on the extensive textile literature for accurate data. We demonstrate that scattering from cloth fibers exhibits much more complexity than current fiber models, showing important differences between cloth type, even in averaged conditions due to longer views. Our model can be plugged in any framework for cloth rendering, matches scattering measurements from real yarns, and is based on actual parameters used in the textile industry, allowing predictive bottom-up definition of cloth appearance.
Abstract: Reproducing the appearance of real-world materials using current printing technology is problematic. The reduced number of inks available defines the printer’s limited gamut, creating distortions in the printed appearance that are hard to control. Gamut mapping refers to the process of bringing an out-of-gamut material appearance into the printer’s gamut, while minimizing such distortions as much as possible. We present a novel two-step gamut mapping algorithm that allows users to specify which perceptual attribute of the original material they want to preserve (such as brightness, or roughness). In the first step, we work in the low-dimensional intuitive appearance space recently proposed by Serrano et al., and adjust achromatic reflectance via an objective function that strives to preserve certain attributes. From such intermediate representation, we then perform an image-based optimization including color information, to bring the BRDF into gamut. We show, both objectively and through a user study, how our method yields superior results compared to the state of the art, with the additional advantage that the user can specify which visual attributes need to be preserved. Moreover, we show how this approach can also be used for attribute-preserving material editing.
Abstract: During the last few years, many different techniques for measuring material appearance have arisen. These advances have allowed the creation of large public datasets, and new methods for editing BRDFs of captured appearance have been proposed. However, these methods lack intuitiveness and are hard to use for novice users. In order to overcome these limitations, Serrano et al. recently proposed an intuitive space for editing captured appearance. They make use of a representation of the BRDF based on a combination of principal components (PCA) to reduce dimensionality, and then map these components to perceptual attributes. This PCA representation is biased towards specular materials and fails to represent very diffuse BRDFs, therefore producing unpleasant artifacts when editing. In this paper, we build on top of their work and propose to use two separate PCA bases for representing specular and diffuse BRDFs, and map each of these bases to the perceptual attributes. This allows us to avoid artifacts when editing towards diffuse BRDFs. We then propose a new method to effectively navigate between both bases while editing, based on a new measurement of the specularity of measured materials. Finally, we integrate our proposed method in an intuitive BRDF editing framework and show how some of the limitations of the previous model have been overcome with our representation. Moreover, our new measure of specularity can be used on any measured BRDF, as it is not limited only to MERL BRDFs.
Abstract: Many different techniques for measuring material appearance have been proposed in the last few years. These have produced large public datasets, which have been used for accurate, data-driven appearance modeling. However, although these datasets have allowed us to reach an unprecedented level of realism in visual appearance, editing the captured data remains a challenge. In this paper, we present an intuitive control space for predictable editing of captured BRDF data, which allows for artistic creation of plausible novel material appearances, bypassing the difficulty of acquiring novel samples. We first synthesize novel materials, extending the existing MERL dataset up to 400 mathematically valid BRDFs. We then design a large-scale experiment, gathering 56,000 subjective ratings on the high-level perceptual attributes that best describe our extended dataset of materials. Using these ratings, we build and train networks of radial basis functions to act as functionals mapping the perceptual attributes to an underlying PCA-based representation of BRDFs. We show that our functionals are excellent predictors of the perceived attributes of appearance. Our control space enables many applications, including intuitive material editing of a wide range of visual properties, guidance for gamut mapping, analysis of the correlation between perceptual attributes, or novel appearance similarity metrics. Moreover, our methodology can be used to derive functionals applicable to classic analytic BRDF representations. We release our code and dataset publicly, in order to support and encourage further research in this direction.
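The role of the radial basis function networks in this abstract can be illustrated with a small sketch that fits a Gaussian RBF interpolant from low-dimensional coefficients (standing in for PCA coordinates of BRDFs) to one perceptual attribute rating. The kernel choice, the ridge term, the toy data, and every function name are assumptions for illustration; the released code and dataset define the actual method.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian radial basis function of the distance between two points."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_rbf(centers, ratings, sigma=1.0, ridge=1e-8):
    """Weights of one RBF per training material, fitted to its rating."""
    n = len(centers)
    kernel = [[gaussian_kernel(centers[i], centers[j], sigma)
               + (ridge if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return solve_linear(kernel, ratings)


def predict_rbf(x, centers, weights, sigma=1.0):
    """Predicted attribute value for new coefficients x."""
    return sum(w * gaussian_kernel(x, c, sigma)
               for w, c in zip(weights, centers))


# Toy usage: three materials in a 2D coefficient space, one attribute.
coeffs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
glossiness = [0.1, 0.9, 0.4]
weights = fit_rbf(coeffs, glossiness)
estimate = predict_rbf([0.5, 0.5], coeffs, weights)
```

Once such a functional exists for each attribute, editing can be posed as moving in coefficient space in the direction that changes the predicted attribute, which is one plausible reading of the control space described above.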