Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

2412.01819

Anton Voronov et el.

🤗 2024-12-03

TL;DR
#

Current text-to-image generation methods face challenges in balancing speed, image quality, and model complexity. Autoregressive models, while offering a theoretically unified framework for vision and language, often lag behind diffusion models in both speed and visual quality. Existing scale-wise models, while promising, exhibit convergence issues and suboptimal performance.

This paper introduces SWITTI, a novel scale-wise transformer designed to overcome these limitations. The researchers improved model stability and convergence through architectural modifications. By analyzing self-attention maps, they discovered a weak dependence on preceding scales in the pretrained model, leading to a non-autoregressive version (SWITTI) that improves speed and image quality. Further optimizations involved disabling classifier-free guidance at high-resolution scales. Extensive evaluation, including human preference studies, demonstrated SWITTI’s superior performance and efficiency compared to state-of-the-art models.

Key Takeaways
#

Why does it matter?
#

This paper is important because it significantly advances text-to-image generation, offering a faster and potentially higher-quality alternative to existing methods. Its focus on scale-wise transformers presents a novel approach with implications for other generative AI tasks. The findings on classifier-free guidance and autoregressive components provide valuable insights for optimizing model efficiency and quality, opening new avenues of research in model architecture and training strategies. The work’s open-source nature promotes further development and collaboration within the AI community.

Visual Insights
#

🔼 This figure showcases the high-quality and aesthetically pleasing 512x512 pixel images generated by the SWITTI model. The remarkable speed of image generation, approximately 0.13 seconds per image, is a key highlight of the model’s capabilities. The images demonstrate the model’s ability to translate text prompts into detailed and visually appealing artwork.
read the caption
Figure 1: Switti produces high quality and aesthetic 512×512512512512{\times}512512 × 512 image samples in around 0.130.130.130.13 seconds.

Model	PSNR ↑	SSIM ↑	LPIPS ↓	CLIP-IQA ↑
Original	21.603	0.634	0.200	0.727
Fine-tuned	22.267	0.653	0.188	0.772

🔼 This table presents a quantitative comparison of the performance of an original RQ-VAE (Residual Quantization Variational Autoencoder) and a fine-tuned version. The RQ-VAE is a crucial component of the SWITTI model, responsible for encoding and decoding images into a latent representation. The fine-tuning process aimed to optimize the decoder for better performance at a resolution of 512x512 pixels. The comparison uses four metrics: PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and CLIP-IQA (CLIP Image Quality Assessment). Higher values are generally better for PSNR and SSIM, while lower values are better for LPIPS. CLIP-IQA is a no-reference metric, where higher values are preferred. This comparison helps showcase the effectiveness of fine-tuning the RQ-VAE for improved image reconstruction quality within the SWITTI architecture.
read the caption
Table 1: Comparison of a fine-tuned and an original RQ-VAE.

In-depth insights
#

Scale-Wise AR
#

Scale-wise autoregressive (AR) models offer a compelling approach to image generation by predicting image scales progressively, starting from low resolution and iteratively refining details. This contrasts with traditional pixel-by-pixel AR methods or fixed-resolution diffusion models. The hierarchical nature inherently mirrors the human visual system’s coarse-to-fine processing, potentially leading to faster convergence and improved generation quality. However, a key challenge is the computational cost of attending to all previously generated scales. Efficient attention mechanisms are crucial; if self-attention maps reveal weak dependence on earlier scales, simplifying the architecture by removing autoregressive connections may significantly improve efficiency without sacrificing performance. Careful consideration of classifier-free guidance (CFG) application across scales is also vital, potentially allowing for further speedups by selectively disabling CFG at higher resolutions where its impact is minimal. Overall, scale-wise AR models present a promising direction in image generation, requiring further investigation into optimized architectures and training strategies to fully realize their potential.

SWITTI’s Design
#

SWITTI’s design is a novel approach to text-to-image synthesis that cleverly addresses limitations of traditional autoregressive models. The core innovation lies in its scale-wise transformer architecture, which generates images progressively at increasing resolutions. This differs from traditional methods by predicting higher-resolution images based on lower-resolution ones. This efficient strategy not only speeds up image generation but also enables better handling of fine details. Removing the autoregressive component further enhances efficiency, as it reduces computational overhead. The study also reveals that high-resolution scales minimally benefit from classifier-free guidance, resulting in another performance optimization by disabling it. Therefore, SWITTI strikes a balance between high-quality image generation and speed, outperforming existing autoregressive models and achieving comparable performance to state-of-the-art diffusion models while being up to 7x faster. The architecture’s success underscores the potential of optimizing autoregressive approaches for large-scale image synthesis.

CFG Ablation
#

The study of CFG (classifier-free guidance) ablation in the context of text-to-image synthesis is crucial for understanding its impact on generation quality and efficiency. Ablation experiments systematically remove or alter CFG at different resolution scales, allowing researchers to isolate its effects. Results likely show that CFG’s benefit diminishes at higher resolutions where fine details are already well-defined by the model; removing CFG in these later stages could significantly improve inference speed without sacrificing significant image quality. This would be a strong argument for using CFG strategically, applying it primarily to lower resolutions where the text’s influence on the overall image structure is stronger. The balance between enhanced image quality due to CFG and computational cost is an essential trade-off to analyze. Therefore, the findings likely present a compelling case for adaptive CFG application, adjusting its intensity and presence across scales to optimize both speed and image quality. This approach would make the model more efficient and potentially cost-effective for deploying in real-world applications.

Human Eval
#

In a research paper, a section dedicated to ‘Human Evaluation’ would hold significant weight, offering crucial insights beyond automated metrics. It would involve human assessors judging various aspects of the generated images (e.g., aesthetics, relevance to prompts, presence of defects, and complexity). This human-centric approach is vital because it addresses the limitations of automated metrics, which often fail to capture subjective qualities like artistic merit or the nuanced details perceived by humans. The results would likely be presented as statistical summaries, possibly including error bars reflecting the variability in human judgments. Such a study would provide a more holistic understanding of the model’s performance, particularly in comparison to other models, possibly showcasing whether the model excels in specific areas (e.g., generating highly realistic images versus aesthetically pleasing ones). A well-executed human evaluation is crucial for determining the true value and practical utility of a text-to-image synthesis model, and it complements the quantitative results found using automated metrics.

Future Work
#

Future research directions for scale-wise transformers in text-to-image synthesis are promising. Improving hierarchical tokenizers is crucial, as current methods lag behind continuous or single-level alternatives, leading to artifacts in high-frequency image details. Developing more effective RQ-VAEs for higher resolutions (e.g., 1024x1024) is necessary to fully leverage the speed and scalability advantages of this approach. Exploring alternative architectures beyond the transformer framework might unlock further performance gains. Investigating the influence of various training strategies (e.g., different loss functions, training data) and the impact on both speed and quality remains an open area. Finally, rigorous comparative studies across a wider range of state-of-the-art models are needed to fully assess the capabilities and limitations of scale-wise transformers compared to diffusion models. This includes a focus on robust evaluation metrics that encompass both quantitative and qualitative aspects of image quality.

More visual insights
#

More on figures

🔼 This figure details the architecture of a transformer block within the SWITTI model. It shows the flow of information through multiple components, including multi-head self-attention and cross-attention mechanisms, feed-forward networks (FFNs) with SwiGLU activation, layer normalization (LN), and RMSNorm layers. The figure highlights how textual information from CLIP ViT-L and ViT-bigG encoders, along with positional embeddings and cropping parameters, is integrated into the model’s processing. This detailed illustration helps to understand how the model processes input and generates the output image.
read the caption
Figure 2: Transformer block in the Switti model.

🔼 This figure shows the activation norms of the last transformer block over the course of training. It illustrates how these norms change across various model architectures. Specifically, the plot tracks the evolution of activation norms to reveal stability issues and the effectiveness of modifications made to address these issues. Different colored lines represent different model variations (e.g., using BF16 or FP32 precision, different normalization techniques, and different activation functions). The x-axis represents the training iteration, and the y-axis shows the activation norm (log scale).
read the caption
Figure 3: Last transformer block activation norms over training.

🔼 This figure shows the training curves for different model architectures, comparing metrics such as CLIP score, FID, and PickScore over training iterations. It helps to visualize and analyze the performance and convergence of different architectural choices during the training process.
read the caption
Figure 4: Evaluation metrics over training for various architectures.

🔼 This figure compares the self-attention masks used in the VAR (Visual Autoregressive) and SWITTI models. The left panel shows the block-wise causal mask of VAR, illustrating that the model attends only to previously generated tokens within a block. This is a standard approach in autoregressive models for image generation where the model predicts the next token(s) based on previous ones. The right panel displays the block-wise non-causal mask of SWITTI, showcasing that, unlike VAR, SWITTI’s self-attention is not restricted to previously generated tokens within a block. This change indicates that SWITTI predicts tokens across all scales concurrently, rather than sequentially. This difference in the attention mechanisms reflects the core architectural distinction between VAR and SWITTI—autoregressive prediction versus non-autoregressive parallel prediction.
read the caption
Figure 5: Visualization of the block-wise self-attention masks in VAR (Left) and Switti (Right).

🔼 This figure visualizes the self-attention maps of the scale-wise autoregressive transformer model, SWITTI (AR), at different scales. Each map shows the attention weights between different image tokens across various scales. The key observation highlighted is that the model’s attention is predominantly focused on the current scale being processed, with significantly weaker attention given to preceding scales. This indicates a reduced reliance on previously generated scales during the image generation process.
read the caption
Figure 6: Visualization of self-attention maps at different scales for Switti (AR). The model attends mostly to the current scale.

🔼 This figure visualizes the cross-attention maps across different scales of the SWITTI model for a randomly selected text prompt. The heatmaps show the attention weights between text embeddings and image tokens at each scale. Darker colors indicate stronger attention. Analyzing these maps reveals how the model’s attention shifts across scales in response to the text prompt, offering insights into how the model integrates textual information across different levels of image detail during generation.
read the caption
Figure 7: Visualization of cross-attention maps at different scales for a random text prompt.

🔼 This figure shows the results of an experiment where the textual prompt was switched during the image generation process at different scales. The results demonstrate that the model’s reliance on the text decreases as the resolution increases. This behavior suggests that higher-resolution scales are primarily determined by lower resolution scales rather than the textual input alone.
read the caption
Figure 8: Textual prompt switching during image generation.

🔼 This figure displays the results of a human evaluation study comparing the performance of SWITTI with other leading autoregressive (AR) and diffusion-based text-to-image models. The models are evaluated across four key aspects: relevance, aesthetics, complexity, and the presence of defects. The chart shows the average scores for each model and aspect, with error bars representing 95% confidence intervals, providing a visual representation of the statistical significance of the results. This allows for a direct comparison of SWITTI’s performance against established methods in terms of human perception of image quality.
read the caption
Figure 9: Human study comparing Switti with competing AR, diffusion-based models. Error bars correspond to a 95% confidence interval.

🔼 This figure presents a qualitative comparison of images generated by SWITTI and other state-of-the-art models, including HART, DMD2, SDXL, and SD3. For each model, four different image samples are shown, illustrating the models’ capabilities in generating diverse and detailed images from varied text prompts. The prompts cover a range of styles and subject matter, allowing for a visual assessment of each model’s strengths and weaknesses in terms of image quality, detail, and adherence to the prompt.
read the caption
Figure 10: Qualitative comparison of Switti against the baselines.

🔼 This figure shows a comparison of image generation results with and without classifier-free guidance (CFG) enabled at the final scales of the generation process. The top row displays images generated with CFG enabled, showing some artifacts in the fine details. The bottom row displays images generated with CFG disabled at the last scales, demonstrating that disabling CFG can mitigate these artifacts and improve the quality of fine details.
read the caption
Figure 11: Illustrative examples when disabled CFG at the last scales (Bottom) mitigates the artifacts in fine-grained details (Top).

🔼 This figure details the architecture of a transformer block used in the SWITTI model. It shows the flow of information through multiple components, including multi-head self-attention and cross-attention mechanisms, feed-forward networks (FFNs), layer normalization (LN), and RMSNorm layers. The figure highlights how text embeddings from CLIP ViT-L and ViT-bigG are incorporated, along with positional embeddings. The use of SwiGLU activation within the FFN is also depicted. This block is a fundamental building block repeated within the larger SWITTI model, and the diagram helps clarify the internal processing and information flow within each block.
read the caption
Figure 12: Transformer block of the basic architecture.

🔼 This figure shows a visual comparison of the image reconstruction quality between the original RQ-VAE model (left) and the RQ-VAE model with a fine-tuned decoder (right). The fine-tuning process improved the quality of image reconstruction in terms of color accuracy and the reduction of artifacts.
read the caption
Figure 13: Visual comparison between original RQ-VAE (left) and with fine-tuned decoder (right).

🔼 This figure shows a qualitative comparison of images generated by SWITTI and several baseline models on different prompts. Each row represents a different prompt, and each column displays images generated by a different model (SWITTI, HART, DMD2, SDXL, and SD3). This allows for a visual comparison of the image quality and style produced by each model in response to various prompts.
read the caption
Figure 14: Qualitative comparison of Switti against the baselines.

🔼 This figure demonstrates the effect of classifier-free guidance (CFG) at different resolution levels during image generation. The experiment uses 10 resolution levels, with higher indices representing higher resolutions. Each row shows the results for a different image prompt, and each column represents the results obtained with CFG applied up to a specific level. The guidance scale is kept constant at 6 for all experiments. By comparing the images across columns in each row, one can observe how the level of CFG application affects the quality and detail of the generated images. This visualization helps to assess the impact of CFG at various resolutions, showing whether higher-resolution CFG application is necessary or even beneficial.
read the caption
Figure 15: Impact of CFG at different resolution levels. There are 10101010 levels: higher index indicates higher resolution. The guidance scale is 6666.

🔼 This figure shows the results of an experiment where the text prompt is changed during image generation. For each prompt, images are generated with the prompt switching at different scales (2nd, 4th, 5th, 6th, 8th, and 9th). The images showcase how the change in prompt affects the final image at different stages of the generation process, highlighting the impact of prompt switching at various resolutions.
read the caption
Figure 16: Impact of the prompt switching during image generation.

🔼 This figure shows the results of an experiment where the text prompt was changed during the image generation process at different scales. It demonstrates how the model’s response varies depending on when the prompt switch occurs. Specifically, it shows that changes made in earlier scales have a larger impact on the final image than changes introduced in later scales, suggesting that the model’s reliance on the text prompt diminishes as the resolution increases. Two example prompts are used: one for an image of a cat and one for an image of a forest.
read the caption
Figure 17: Impact of the prompt switching during image generation.

🔼 This figure shows the interface used in a human evaluation study to assess the aesthetic quality of images generated by different models. The evaluators were presented with pairs of images and asked to compare them based on multiple aspects of visual aesthetics, including brightness and contrast, presence of unnatural colors, glow effects, the visibility and number of main and secondary objects, background and environment, and the overall level of detail. The evaluators could provide a nuanced rating for each aspect by selecting options such as ‘Image 1 is better,’ ‘Image 2 is better,’ or ‘The images are equal in this aspect.’
read the caption
Figure 18: Human evaluation interface for aesthetics.

🔼 This figure shows the user interface used for the human evaluation of image defects. The interface presents two images side-by-side and asks evaluators to assess various aspects of image quality, such as defects in the composition, presence of watermarks, extra objects, and defects in the main or secondary objects. Evaluators select an option indicating which image is better or whether they are comparable. The final decision is based on the aggregated responses across several criteria.
read the caption
Figure 19: Human evaluation interface for defects.

🔼 The figure shows the user interface used for human evaluation of the relevance of generated images to the given text prompt. The interface presents two images side-by-side, allowing evaluators to compare them. Evaluators are asked to select which image is more relevant to the text prompt, indicating their preference through a simple selection mechanism. The interface also allows evaluators to indicate if the images are of comparable quality or if they cannot make a decision. This helps gather fine-grained feedback on model performance regarding text-image alignment.
read the caption
Figure 20: Human evaluation interface for relevance.

🔼 This figure shows the user interface used for the human evaluation of image complexity in the SWITTI paper. The interface presents two images side-by-side to assessors, who are asked to judge which image is more complex based on the given prompt. Assessors can choose between ‘Image 1 is better’, ‘Image 2 is better’, ‘Can’t decide’, or ‘The images are uncomparable’ as options to select and submit their responses.
read the caption
Figure 21: Human evaluation interface for image complexity.

Model	Latency, s/image	Parameters count, B	COCO 30K eval prompts				MJHQ 30K eval prompts				GenEval
Distilled Diffusion Models			Pickscore ↑	CLIP ↑	IR ↑	FID ↓	Pickscore ↑	CLIP ↑	IR ↑	FID ↓	↑
SDXL-Turbo [60]	0.251	2.6B	0.229	0.355	0.83	17.6	0.216	0.365	0.84	15.7	0.55
DMD2 [84]	0.251	2.6B	0.231	0.356	0.87	14.3	0.219	0.374	0.87	7.2	0.58
Diffusion Models
SDXL [51]	0.867	2.6B	0.226	0.360	0.77	14.4	0.217	0.384	0.78	7.6	0.55
SD3 [15]	0.934	2.0B	0.227	0.354	1.01	19.5	0.215	0.363	0.91	13.1	0.65
Lumina-Next [92]	5.812	2.0B	0.224	0.329	0.55	18.4	0.216	0.353	0.75	5.9	0.47
Autoregressive Models
LlamaGen [68]	3.821	0.8B	0.208	0.274	-0.25	44.8	0.194	0.288	-0.45	26.9	0.32
HART [69]	0.063	0.7B	0.223	0.341	0.75	20.9	0.216	0.366	0.84	5.8	0.55
Switti (AR) (ours)	0.139	2.5B	0.227	0.354	0.91	18.1	0.217	0.378	0.88	9.5	0.61
Switti (ours)	0.127	2.5B	0.227	0.356	0.95	17.6	0.217	0.381	0.91	9.5	0.63

Model	Generator size, B	N steps	1 step, ms/image	Full, s/image
Distilled Diffusion Models
SDXL-Turbo	2.6	4	12.4	0.251
DMD2	2.6	4	12.4	0.251
Diffusion Models
SDXL	2.6	25	12.4	0.867
SD3	2.0	28	16.8	0.934
Autoregressive Models
Lumina-mGPT	7.0	1024	—	224.2
LlamaGen	0.8	1024	—	3.821
HART	0.7	10	4.7*	0.063
Switti (AR)	2.5	10	11.2*	0.139
Switti	2.5	10	9.5*	0.127

TL;DR#

Key Takeaways#

Why does it matter?#

Visual Insights#

In-depth insights#

Scale-Wise AR#

SWITTI’s Design#

CFG Ablation#

Human Eval#

Future Work#

More visual insights#

Full paper#