SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout

1Computer Vision Center, UAB; 2IIIT Guwahati; 3CVPR Unit, ISI Kolkata; 4University of Surrey

A brief overview of SVGCraft: our method synthesizes vector sketches conditioned on an input text prompt, guided by a layout generated via an LLM.


Concise Overview

Generating VectorArt from text prompts is a challenging vision task, requiring diverse yet realistic depictions of both seen and unseen entities. However, existing research has mostly been limited to the generation of single objects rather than comprehensive scenes comprising multiple elements. In response, this work introduces SVGCraft, a novel end-to-end framework for creating vector graphics that depict entire scenes from textual descriptions. Using a pre-trained LLM to generate layouts from text prompts, the framework introduces a technique for producing masked latents in specified bounding boxes for accurate object placement. It further employs a fusion mechanism for integrating attention maps and a diffusion U-Net for coherent composition, speeding up the drawing process. The resulting SVG is optimized using a pre-trained encoder and an LPIPS loss with opacity modulation to maximize similarity. Additionally, this work explores the potential of primitive shapes for canvas completion in constrained environments. Through both qualitative and quantitative assessments, SVGCraft is shown to surpass prior work in abstraction, recognizability, and detail, as evidenced by its performance metrics (CLIP-T: 0.4563, Cosine Similarity: 0.6342, Confusion: 0.66, Aesthetic: 6.7832).


From Prose to Portraits: Mechanism of Sketch Synthesis

SVGCraft employs an LLM as a layout generator to create layouts comprising a "background prompt" and "grounding objects" with their corresponding bounding boxes, and it generates a masked latent for each box with controlled attention for accurate object placement (an illustrative example of such a layout is sketched below). These latents are fused to initialize the SVG canvas and are used by a diffusion U-Net to generate a coherent image \( \mathcal{I}_m \) that aligns with the layout. The final canvas is obtained by maximizing the similarity between \( \mathcal{I}_r \) and \( \mathcal{R}_d(\theta) \) using an LPIPS loss and opacity control.
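For intuition, the snippet below shows the kind of layout such an LLM might return for the forest prompt used in the figure further down. The field names, coordinate convention (x, y, width, height), and values are illustrative assumptions, not the paper's exact schema.

```python
# Hypothetical layout an LLM layout generator might return for
# "Two giraffes and three elephants in a forest". Field names and the
# (x, y, w, h) pixel convention are illustrative assumptions.
layout = {
    "background_prompt": "a lush forest",
    "grounding_objects": [
        {"caption": "a giraffe",   "bbox": [ 20,  40, 120, 260]},
        {"caption": "a giraffe",   "bbox": [160,  50, 120, 250]},
        {"caption": "an elephant", "bbox": [300, 180, 150, 130]},
        {"caption": "an elephant", "bbox": [460, 170, 150, 140]},
        {"caption": "an elephant", "bbox": [300, 330, 150, 130]},
    ],
}
# One masked latent is produced per bounding box so that attention for each
# caption is confined to its region, enforcing correct object placement.
```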
The opacity control aims to replicate the iterative approach of traditional human sketching by starting with a low opacity value for the Bézier curves. The system then increases the opacity of selected strokes according to the relevance and significance provided by the semantic guidance, through a loss function \( \mathcal{L}_\text{sop} \) defined as \( \mathcal{L}_\text{sop} = \left|1 - \dfrac{\max(A_s \odot \mathcal{R}_d(\theta))}{\max(A_s \odot \mathcal{I}_r)}\right| \). The opacity values of the primitives are updated in each backward pass through gradient descent, mirroring the artistic practice of intensifying strokes. The final optimization objective is defined in the following equation:
$$\mathcal{L}_\text{synth} = \text{LPIPS}(\mathcal{I}_r, \mathcal{R}_d(\theta)) + \lambda_\text{sop} \mathcal{L}_\text{sop}$$
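As a concrete reading of the two equations above, the following is a minimal sketch of the objective, assuming a differentiable rasterizer that renders the canvas parameters \( \theta \) into \( \mathcal{R}_d(\theta) \); the default weight and tensor shapes are assumptions, not the authors' implementation.

```python
import torch
import lpips  # pip install lpips; provides the pre-trained perceptual encoder

perceptual = lpips.LPIPS(net="vgg")  # LPIPS term of L_synth

def sop_loss(rendered, target, attn):
    """L_sop = |1 - max(A_s . R_d(theta)) / max(A_s . I_r)| (elementwise products)."""
    return torch.abs(1.0 - (attn * rendered).max() / (attn * target).max())

def synthesis_loss(rendered, target, attn, lambda_sop=0.5):  # lambda_sop: assumed default
    """L_synth = LPIPS(I_r, R_d(theta)) + lambda_sop * L_sop.
    `rendered` and `target` are (1, 3, H, W) tensors scaled to [-1, 1];
    `attn` is the semantic attention map A_s, broadcastable to the same shape."""
    return perceptual(rendered, target).mean() + lambda_sop * sop_loss(rendered, target, attn)
```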


From Doodles to Details: Evolution of Sketch Synthesis

SVGCraft synthesizes aesthetic vector graphics from Bézier curves and simple primitives via iterative optimization, exploring their evolution under projective/affine transformations to obtain the final SVG, as demonstrated in the following figure. Densely initialized strokes with per-box masked-latent attention (see iteration 1 in the figure) undergo iterative optimization of the canvas \( \mathcal{C} \) via \( \mathcal{L}_\text{synth} \) to obtain the final SVG; a minimal sketch of this optimization loop is given after the figure.


Figure: evolution of the canvas across optimization iterations for the example prompts “motorcycle” and “Two giraffes and three elephants in a forest”.
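The loop below is a minimal sketch of this iterative optimization, assuming a hypothetical differentiable rasterizer `render(points, opacity)` (e.g. one built on a library such as diffvg) and the `synthesis_loss` sketch from the previous section; learning rates and step count are illustrative.

```python
import torch

def optimize_canvas(points, opacity, target, attn, render, steps=1000):
    """Jointly optimize Bezier control points and per-stroke opacities.
    Opacities start low (faint strokes) and are intensified by gradient
    descent wherever the semantic guidance deems a stroke relevant."""
    points = points.clone().requires_grad_(True)
    opacity = opacity.clone().requires_grad_(True)
    opt = torch.optim.Adam([
        {"params": [points], "lr": 1.0},
        {"params": [opacity], "lr": 0.01},
    ])
    for _ in range(steps):
        opt.zero_grad()
        rendered = render(points, opacity.clamp(0.0, 1.0))  # R_d(theta), shape (1, 3, H, W)
        loss = synthesis_loss(rendered, target, attn)       # LPIPS + lambda_sop * L_sop
        loss.backward()                                     # gradients reach geometry and opacity
        opt.step()
    return points.detach(), opacity.detach()
```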

Final Takeaways


We introduced the notion of SVG synthesis through optimization that reflects the enumeration of, and spatial relationships between, multiple objects described in a text prompt. To this end, we proposed SVGCraft, an architecture that synthesizes free-hand vector graphics via projective transformations of Bézier curves or affine transformations of primitive shapes, leveraging novel techniques such as layout correction, per-box masked-latent canvas initialization, and semantic-aware canvas initialization. Together, these techniques enhance the model's efficiency and output quality by replicating the drawing style of a human. Extensive experiments and ablation studies confirm the model's superiority over existing methods, demonstrating its ability to generate aesthetically pleasing and semantically rich SVGs while maintaining their spatial relationships. However, it is unable to draw human faces properly (see the supplementary material), which we will address in future work.

BibTeX


      @article{banerjee2024svgcraft,
        title={SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout},
        author={Banerjee, Ayan and Mathur, Nityanand and Llad{\'o}s, Josep and Pal, Umapada and Dutta, Anjan},
        journal={arXiv preprint arXiv:2404.00412},
        year={2024}
      }