SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control

1Adobe Research   2The Australian National University


SmartMask for multi-object insertion. We observe that prior state-of-the-art image inpainting methods either 1) produce incorrect objects (e.g., a missing dog) or visual artifacts (e.g., people facing the back of the bench) when adding all objects at once (row-2, row-4), or 2) introduce inconsistency artifacts (e.g., the woman and dog in front of the bench in row-3, the dog and woman not sitting on the same bench in row-5) when adding objects sequentially. Furthermore, objects generated sequentially (man, woman, and dog in row-5) can appear non-interacting. SmartMask addresses this by allowing the user to first add a coherent sequence of context-aware object masks, and then use an SDXL-based ControlNet-Inpaint model to perform precise insertion of multiple objects.



Abstract

The field of generative image inpainting and object insertion has made significant progress with the recent advent of latent diffusion models. Utilizing a precise object mask can greatly enhance these applications. However, because users find it difficult to create high-fidelity masks, these methods tend to rely on coarser masks (e.g., bounding boxes). This results in limited control and compromised preservation of background content. To overcome these limitations, we introduce SmartMask, which allows any novice user to create detailed masks for precise object insertion. Combined with a ControlNet-Inpaint model, our experiments demonstrate that SmartMask achieves superior object insertion quality, preserving the background content more effectively than previous methods. Notably, unlike prior works, the proposed approach can also be used without user-mask guidance, which allows it to perform mask-free object insertion at diverse positions and scales. Furthermore, we find that when used iteratively with a novel instruction-tuning based planning model, SmartMask can be used to design detailed layouts from scratch. Compared with user-scribble based layout design, we observe that SmartMask yields better quality outputs with layout-to-image generation methods.

SmartMask for Mask-free Object Insertion


Method Overview. A key idea behind SmartMask is to leverage semantic amodal segmentation data in order to obtain high-quality paired training annotations for mask-free single or multi-step object insertion. During training (top), given a training image I with caption C, we stack k ordered instance maps {A_1, A_2, ..., A_k} to obtain an intermediate semantic map S_k. The diffusion model is then trained to predict the instance map A_{k+1}, conditioned on the semantic map S_k, the object description T_obj ← O_{k+1}, and the scene context T_context ← C. During inference (bottom), given a real image I, we first use a panoptic segmentation model to compute the semantic map S_I. This semantic map is then used directly as input to the trained diffusion model in order to predict the fine-grained mask for the inserted object. The predicted object mask is then used with an SDXL-based ControlNet-Inpaint model to insert the target object into the original input image.
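The training-pair construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the dictionary keys, the integer class ids, the background id of 0, and the choice to start from an empty map (k = 0) are all assumptions made for illustration. It only shows how stacking the first k ordered instance maps yields an intermediate semantic map, with the next (amodal) instance map as the prediction target.

```python
import numpy as np

def stack_instance_maps(instance_maps, class_ids, height, width, background_id=0):
    """Paint ordered binary instance maps onto one semantic map.

    Later maps overwrite earlier ones, so the list order is assumed to be
    the order in which objects are layered in the scene.
    """
    semantic = np.full((height, width), background_id, dtype=np.int32)
    for mask, cls in zip(instance_maps, class_ids):
        semantic[mask.astype(bool)] = cls
    return semantic

def make_training_pairs(instance_maps, class_ids, class_names, caption):
    """Build one training example per object in the ordered list.

    Example k pairs the semantic map made from the first k instance maps
    with the (k+1)-th instance map as the target. Starting at k = 0 (an
    all-background map) is an assumption of this sketch.
    """
    height, width = instance_maps[0].shape
    pairs = []
    for k in range(len(instance_maps)):
        pairs.append({
            "semantic_map": stack_instance_maps(
                instance_maps[:k], class_ids[:k], height, width),
            "t_obj": class_names[class_ids[k]],   # description of the next object
            "t_context": caption,                 # scene context from the caption
            "target_mask": instance_maps[k].astype(np.uint8),  # full amodal mask
        })
    return pairs
```

Because the target is the full amodal mask rather than only its visible part, a model trained on such pairs sees complete object shapes even where later objects would occlude them.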

Single Object Insertion

Comparison with prior works. We observe that, compared with state-of-the-art image inpainting methods, SmartMask allows the user to perform object insertion while better preserving the background around the inserted object.

Multiple Object Insertion

SmartMask for multi-object insertion. We observe that prior state-of-the-art image inpainting methods either produce incorrect objects (row-2, row-4) when adding all objects at once, or introduce artifacts (e.g., the blurred woman in row-3, the man's face and the woman's dress in row-5) when adding objects sequentially. SmartMask addresses this by allowing the user to first add a sequence of context-aware object masks, and then use an SDXL-based ControlNet-Inpaint model to perform precise insertion of multiple objects.
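The idea of adding a sequence of context-aware masks before inpainting can be sketched as a simple loop. This is a hedged illustration, not the released implementation: `add_masks_sequentially`, the `predict_mask` callable, and `toy_predictor` are hypothetical names, and the toy predictor is only a stand-in for the trained mask model so that the loop can run. The point it shows is that each predicted mask is written back into the semantic map, so the next prediction is made with the earlier objects in view.

```python
import numpy as np

def add_masks_sequentially(semantic_map, object_names, object_ids,
                           scene_context, predict_mask):
    """Insert object masks one at a time into a copy of the semantic map.

    predict_mask(semantic_map, object_name, scene_context) is a hypothetical
    callable standing in for the trained mask model; it must return a boolean
    array with the same shape as semantic_map.
    """
    current = semantic_map.copy()
    masks = []
    for name, obj_id in zip(object_names, object_ids):
        mask = predict_mask(current, name, scene_context).astype(bool)
        masks.append(mask)
        current[mask] = obj_id  # later predictions see this object
    return current, masks

def toy_predictor(semantic_map, object_name, scene_context):
    """Stand-in for the real model: fill the first all-background column."""
    mask = np.zeros(semantic_map.shape, dtype=bool)
    free_cols = np.flatnonzero((semantic_map == 0).all(axis=0))
    if free_cols.size:
        mask[:, free_cols[0]] = True
    return mask

# Usage with the toy stand-in: the second object avoids the first one
# because the updated map is passed to the second prediction.
layout = np.zeros((3, 4), dtype=np.int32)
final_map, object_masks = add_masks_sequentially(
    layout, ["man", "dog"], [5, 6], "people in a park", toy_predictor)
```

In this sketch the final masks would then be handed, one per object, to the inpainting model; that step is left out because it depends on the specific ControlNet-Inpaint setup.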

Mask-free Insertion

SmartMask for mask-free object insertion. We observe that, unlike prior image inpainting methods, which rely on coarse user-provided masks for object location and scale, SmartMask also allows for mask-free object insertion. This allows the user to generate diverse suggestions for placing the target object (e.g., a ship) in the input image at different positions and scales. Note that the masks are generated in a scene-aware manner, and can therefore account for existing scene elements (e.g., man lying on bed in row-5, car riding down the road in row-3, etc.). Also notice that the insertion suggestions are generated at different scales: objects close to the camera are larger and objects farther from the camera are smaller (e.g., motorbike parked beside a car in row-4).

BibTeX

If you find our work useful in your research, please consider citing:
@article{singh2023smartmask,
  title={SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control},
  author={Singh, Jaskirat and Zhang, Jianming and Liu, Qing and Smith, Cameron and Lin, Zhe and Zheng, Liang},
  journal={arXiv preprint arXiv:2312.05039},
  year={2023}
}