Factorized Visual Tokenization and Generation

2519 words · 12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Amazon

2411.16681
Zechen Bai et al.
🤗 2024-11-26

↗ arXiv ↗ Hugging Face ↗ Papers with Code

TL;DR

Visual tokenizers are fundamental to image generation, converting images into discrete tokens for processing by transformer-based models. Existing VQ-based methods, however, struggle with large codebooks, leading to instability and performance limitations. Simply increasing codebook size often worsens the issue.

FQGAN introduces a novel solution, Factorized Quantization (FQ), which decomposes a large codebook into smaller, independent sub-codebooks. This reduces complexity and improves stability. FQGAN further incorporates disentanglement regularization to ensure sub-codebooks learn diverse and complementary features, and leverages pretrained vision models for semantic richness. The results show significant improvements in reconstruction quality, exceeding the performance of existing methods, and demonstrate effective adaptation to auto-regressive image generation.

Key Takeaways

Why does it matter?

This paper is important because it significantly advances visual tokenization, a crucial step in image generation. Its novel Factorized Quantization (FQ) method addresses limitations of existing VQ-based approaches, improving scalability and stability. The results are impactful, achieving state-of-the-art performance in image reconstruction and generation, and opening new research avenues in autoregressive models and semantic feature learning.


Visual Insights

🔼 The figure showcases a comparison of reconstruction performance (measured by rFID) across several popular visual tokenizers as their codebook size varies. The tokenizers compared are VQ (from the Taming-Transformers paper), VQ (from LlamaGen), VQ-LC, LFQ (from OpenMAGVIT2), and the novel FQGAN model introduced in this paper. Lower rFID values represent better performance, indicating a higher quality of image reconstruction from the discrete tokens produced by the tokenizer. The graph illustrates how the reconstruction quality changes as the codebook size increases, highlighting the relative strengths and weaknesses of each tokenizer in terms of scalability and accuracy.

Figure 1: Performance comparison of popular tokenizers at various codebook sizes, including VQ (Taming) [6], VQ (LlamaGen) [25], VQ-LC [41], LFQ (OpenMAGVIT2) [15], and FQGAN. Lower rFID values indicate better performance.
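For readers who want to reproduce this kind of comparison, the sketch below shows one way an rFID-style score can be computed between original and reconstructed images using the `torchmetrics` FID implementation. The tensors here are random placeholders, and the paper's exact FID implementation and preprocessing may differ.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Placeholder batches of original and reconstructed images (uint8, NCHW).
# In practice these would be ImageNet validation images and their
# reconstructions from a tokenizer's encode-quantize-decode pipeline.
real_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
recon_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pooled features
fid.update(real_images, real=True)
fid.update(recon_images, real=False)
print(f"rFID: {fid.compute().item():.2f}")
```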
| Method | Downsample | Codebook Size | Code Dim | rFID ↓ | PSNR ↑ |
| --- | --- | --- | --- | --- | --- |
| VQGAN [6] | 16 | 16384 | 256 | 4.98 | - |
| SD-VQGAN [23] | 16 | 16384 | 4 | 5.15 | - |
| RQ-VAE [12] | 16 | 16384 | 256 | 3.20 | - |
| LlamaGen [25] | 16 | 16384 | 8 | 2.19 | 20.79 |
| Titok-B [36] | - | 4096 | 12 | 1.70 | - |
| VQGAN-LC [41] | 16 | 100000 | 8 | 2.62 | 23.80 |
| VQ-KD [30] | 16 | 8192 | 32 | 3.41 | - |
| VILA-U [31] | 16 | 16384 | 256 | 1.80 | - |
| Open-MAGVIT2 [15] | 16 | 262144 | 1 | 1.17 | 21.90 |
| FQGAN-Dual | 16 | 16384 × 2 | 8 | 0.94 | 22.02 |
| FQGAN-Triple | 16 | 16384 × 3 | 8 | 0.76 | 22.73 |
| SD-VAE † [23] | 8 | - | 4 | 0.74 | 25.68 |
| SDXL-VAE † [19] | 8 | - | 4 | 0.68 | 26.04 |
| ViT-VQGAN [33] | 8 | 8192 | 32 | 1.28 | - |
| VQGAN * [6] | 8 | 16384 | 4 | 1.19 | 23.38 |
| SD-VQGAN * [23] | 8 | 16384 | 4 | 1.14 | - |
| OmniTokenizer [29] | 8 | 8192 | 8 | 1.11 | - |
| LlamaGen [25] | 8 | 16384 | 8 | 0.59 | 25.45 |
| Open-MAGVIT2 [15] | 8 | 262144 | 1 | 0.34 | 26.19 |
| FQGAN-Dual | 8 | 16384 × 2 | 8 | 0.32 | 26.27 |
| FQGAN-Triple | 8 | 16384 × 3 | 8 | 0.24 | 27.58 |

🔼 This table compares the reconstruction performance of various visual tokenizers on the ImageNet 50k validation set, evaluated on 256×256 images. For each model it reports the reconstruction metrics (rFID, PSNR) along with relevant hyperparameters such as downsampling ratio, codebook size, and code dimensionality. Note that some models were trained on datasets other than ImageNet (indicated by * and †), and the best and second-best results are highlighted in bold and underlined, respectively.

Table 1: Comparisons with other image tokenizers. Reconstruction performance of different tokenizers on the 256×256 ImageNet 50k validation set. All models are trained on ImageNet, except "*" on OpenImages and "†" on unknown training data. Bold denotes the best scores; underline denotes the second place.

In-depth insights

Factorized Quantization

The core concept of “Factorized Quantization” revolves around decomposing a large codebook into multiple smaller, independent sub-codebooks. This addresses the instability and computational challenges associated with using massive codebooks in vector quantization (VQ) for image tokenization. By factorizing, the authors aim to reduce the complexity of the lookup process, making visual tokenization more efficient and scalable. A key innovation is the introduction of disentanglement regularization, which explicitly reduces redundancy between sub-codebooks, ensuring each captures unique and complementary information. The integration of representation learning, using pretrained vision models like CLIP and DINO, further enhances the quality of the learned representations. This combined approach improves reconstruction quality, leading to more expressive and disentangled visual tokens, ultimately benefiting both image reconstruction and autoregressive image generation tasks. The effectiveness of this method is demonstrated through experimental results, showcasing superior performance compared to existing state-of-the-art visual tokenizers.
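To make the factorization concrete, here is a minimal PyTorch-style sketch of the core lookup step, assuming k independent sub-codebooks whose quantized outputs are concatenated before decoding. The class name, shapes, and the omission of the usual codebook/commitment losses are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FactorizedQuantizer(nn.Module):
    """Illustrative sketch: quantize with k small, independent sub-codebooks
    instead of one very large codebook."""

    def __init__(self, num_codebooks=2, codebook_size=16384, code_dim=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, code_dim) for _ in range(num_codebooks)
        )

    def forward(self, features):
        # features: (B, L, k * code_dim) -- one slice per sub-codebook
        chunks = features.chunk(len(self.codebooks), dim=-1)
        quantized, indices = [], []
        for z, codebook in zip(chunks, self.codebooks):
            # nearest-neighbour lookup within this sub-codebook only
            dist = torch.cdist(z, codebook.weight.unsqueeze(0))  # (B, L, N)
            idx = dist.argmin(dim=-1)                            # (B, L)
            z_q = codebook(idx)                                  # (B, L, code_dim)
            # straight-through estimator so gradients reach the encoder
            quantized.append(z + (z_q - z).detach())
            indices.append(idx)
        # concatenate sub-codes before the decoder
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```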

Disentanglement Reg.

The concept of ‘Disentanglement Reg.’ within the context of factorized visual tokenization is crucial for achieving high-quality image generation. The core idea is to prevent the sub-codebooks from learning redundant or overlapping information, ensuring that each sub-codebook specializes in capturing unique aspects of an image. Without disentanglement, the sub-codebooks might learn similar features, leading to a less expressive and less efficient representation. This regularization encourages diversity and promotes complementary feature learning across the sub-codebooks, enabling the model to capture a richer set of visual information. The method’s effectiveness hinges on the careful design of a regularization mechanism that explicitly measures and minimizes redundancy between the sub-codebooks. This might involve techniques like orthogonality constraints or other similarity metrics to ensure independence. The success of ‘Disentanglement Reg.’ is also interconnected with the representation learning objective; both work together to produce semantically meaningful features, resulting in improved reconstruction quality and generalization capabilities. Overall, the effectiveness of ‘Disentanglement Reg.’ is not just about preventing redundancy; it’s about guiding the learning process to produce a more comprehensive, balanced, and interpretable representation of the visual data.
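As one illustration of what such a regularizer could look like, the sketch below penalizes the average absolute cosine similarity between features quantized by different sub-codebooks, pushing them toward complementary content. The paper's actual disentanglement loss may be formulated differently.

```python
import torch
import torch.nn.functional as F

def disentanglement_penalty(sub_features):
    """Illustrative redundancy penalty. `sub_features` is a list of (B, L, D)
    tensors, one per sub-codebook; lower values mean less overlap between
    what the sub-codebooks encode at each position."""
    loss, num_pairs = 0.0, 0
    for i in range(len(sub_features)):
        for j in range(i + 1, len(sub_features)):
            zi = F.normalize(sub_features[i], dim=-1)
            zj = F.normalize(sub_features[j], dim=-1)
            # mean absolute cosine similarity between corresponding positions
            loss = loss + (zi * zj).sum(dim=-1).abs().mean()
            num_pairs += 1
    return loss / max(num_pairs, 1)
```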

Rep. Learning

The heading ‘Rep. Learning’, likely short for ‘Representation Learning’, highlights a crucial aspect of the research. It addresses the inherent limitation of traditional visual tokenizers, which primarily focus on pixel-level reconstruction and often fail to capture the semantic meaning of images. By integrating representation learning, the model is guided to learn richer, more semantically meaningful features. This is achieved by leveraging pre-trained vision models like CLIP and DINOv2, which are already equipped with substantial semantic understanding. This integration ensures the tokenizer learns features beyond superficial details, capturing diverse semantic levels from low-level structures to high-level concepts. This approach not only improves reconstruction quality but also enhances the tokenizer’s ability to generalize and perform well on downstream tasks like image generation. The inclusion of representation learning is key to disentangling the features learned by different sub-codebooks, ensuring each specializes in unique aspects of the image, thus creating a comprehensive and diverse representation. This multifaceted approach addresses the challenge of representing the complex and nuanced nature of visual data.
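A hedged sketch of how such representation learning could be wired in: one sub-codebook's features are projected and aligned with frozen features from a pretrained model such as CLIP or DINOv2 via a cosine objective. The projector, feature dimensions, and loss form here are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def representation_alignment_loss(sub_tokens, teacher_features, projector):
    """Align one sub-codebook's quantized features (B, L, D) with frozen
    features from a pretrained vision model (B, L, D_teacher)."""
    student = F.normalize(projector(sub_tokens), dim=-1)
    teacher = F.normalize(teacher_features.detach(), dim=-1)
    # maximise cosine similarity <=> minimise (1 - cos)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()

# usage sketch (shapes only; the teacher model itself is assumed frozen)
projector = nn.Linear(8, 768)
loss = representation_alignment_loss(
    torch.randn(2, 256, 8), torch.randn(2, 256, 768), projector
)
```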

Autoregressive Gen.

Autoregressive generative models, as discussed in the context of image generation, represent a powerful approach that sequentially predicts tokens to construct images or videos. Unlike other methods that produce complete images in one go, autoregressive models build the output step-by-step, offering finer control and often leading to high-quality results. This approach typically involves a transformer network that learns the dependencies between tokens, predicting the next token based on previously generated ones. However, a key challenge lies in efficiently handling a very large codebook, as the size of this codebook directly impacts computational cost and the stability of training. The paper highlights the significant impact of the chosen visual tokenizer on the success of the overall autoregressive generation process. Effective tokenization is crucial because it directly influences the quality and efficiency of the subsequent generation task. The methods proposed in the paper aim to improve both quality and scalability by addressing the limitations of existing visual tokenizers, ultimately enhancing the overall performance of autoregressive image generation models.
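For context, the generic next-token sampling loop used by autoregressive image generators looks roughly like the sketch below; the transformer interface is assumed, and the paper's FAR model additionally predicts several factorized sub-tokens per position (see the head sketch later in this review).

```python
import torch

@torch.no_grad()
def autoregressive_sample(transformer, prefix_tokens, num_steps, temperature=1.0):
    """Generic next-token sampling loop. `transformer` is assumed to map a
    token sequence (B, T) to logits (B, T, vocab); this is a sketch of the
    general procedure, not the paper's exact generator."""
    tokens = prefix_tokens  # e.g. a class token
    for _ in range(num_steps):
        logits = transformer(tokens)[:, -1, :] / temperature  # next-position logits
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (B, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    # the resulting token grid is reshaped and decoded by the tokenizer's decoder
    return tokens
```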

VQGAN Enhancements

The core of this research lies in enhancing VQGAN (Vector Quantized Generative Adversarial Network), a foundational model for image generation. VQGAN’s limitation stems from its reliance on a fixed-size codebook, which restricts its capacity to represent diverse image features. This paper tackles this challenge by introducing Factorized Quantization (FQ), a novel technique to decompose a large codebook into multiple smaller, independent sub-codebooks. This approach cleverly reduces computational complexity while enhancing the expressiveness of the model. Further enhancing VQGAN, the research integrates disentanglement regularization to reduce redundancy and promote diversity between sub-codebooks. Representation learning is incorporated by leveraging pre-trained models like CLIP and DINO to infuse semantic richness into the generated features, leading to improved reconstruction quality and downstream application capabilities. Overall, the improvements described significantly enhance the scalability, stability, and semantic expressiveness of VQGAN.

More visual insights

More on figures

🔼 Figure 2 illustrates the architecture of the proposed Factorized Quantization (FQ) method. The left side shows the FQGAN-Dual model, a specific example of FQ with two sub-codebooks (k=2). It demonstrates how a large codebook is broken down into smaller, independent sub-codebooks, reducing the computational complexity of quantization and improving stability. Each sub-codebook uses its own encoder and quantizer to process the input features and generate sub-tokens. These sub-tokens are concatenated before being fed to the decoder for image reconstruction. The right side of the figure shows how the factorized tokens are handled in an autoregressive (AR) image generation model. A factorized AR head is added that predicts multiple sub-tokens simultaneously for each image patch, utilizing the output of an AR transformer backbone.

Figure 2: Illustration of our method. The left part shows FQGAN-Dual, the factorized tokenizer design in an example scenario when k=2. This framework is extendable to factorization into more codebooks. The right part demonstrates how we leverage an additional AR head to accommodate the factorized sub-codes on top of a standard AR generative transformer.
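To illustrate the factorized-head idea, the following sketch predicts k sub-tokens per image position from the backbone's hidden state, conditioning each on the previous ground-truth sub-token during training (teacher forcing). The class, layer names, and conditioning scheme are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FactorizedARHead(nn.Module):
    """Illustrative head that emits k sub-token logits per position on top of
    an AR transformer backbone (hypothetical design, not the paper's code)."""

    def __init__(self, hidden_dim=1024, codebook_size=16384, num_codebooks=2):
        super().__init__()
        self.sub_embed = nn.ModuleList(
            nn.Embedding(codebook_size, hidden_dim) for _ in range(num_codebooks - 1)
        )
        self.classifiers = nn.ModuleList(
            nn.Linear(hidden_dim, codebook_size) for _ in range(num_codebooks)
        )

    def forward(self, hidden, target_sub_tokens):
        # hidden: (B, L, hidden_dim) backbone output per image position
        # target_sub_tokens: (B, L, k) ground-truth sub-tokens (teacher forcing)
        logits = [self.classifiers[0](hidden)]
        state = hidden
        for i, embed in enumerate(self.sub_embed):
            # condition the next sub-token prediction on the previous sub-token
            state = state + embed(target_sub_tokens[..., i])
            logits.append(self.classifiers[i + 1](state))
        return torch.stack(logits, dim=2)  # (B, L, k, codebook_size)
```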

🔼 This figure shows a comparison of image reconstructions. The leftmost image is the original image. The next three images show reconstructions using the FQGAN-Dual model, which uses two sub-codebooks. The first reconstruction uses both sub-codebooks. The second and third use only one sub-codebook each, highlighting the individual contributions of each sub-codebook to the overall reconstruction quality. The results visually demonstrate the effectiveness of the factorized approach in capturing diverse visual features.

Figure 3: Visualization of standard reconstruction by FQGAN-Dual and reconstruction using only a single sub-codebook.

🔼 This figure visualizes the distribution of vector quantized (VQ) codes learned by the two sub-codebooks in the Factorized Quantization Generative Adversarial Network (FQGAN-Dual) model. t-SNE (t-distributed Stochastic Neighbor Embedding) is used to reduce the dimensionality of the VQ code representations for visualization in a 2D space. The plot shows how the VQ codes from each sub-codebook cluster based on their semantic content. One sub-codebook focuses on low-level visual features while the other, guided by CLIP embeddings, focuses on higher-level semantic information. This visualization demonstrates the effectiveness of the disentanglement regularization in FQGAN-Dual, showing that each sub-codebook learns distinct and complementary image features.

Figure 4: T-SNE visualization of VQ codes from different sub-codebooks in FQGAN-Dual.
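For readers who want to reproduce this style of analysis, a minimal sketch of such a t-SNE plot is shown below; the codebook arrays are random placeholders standing in for FQGAN-Dual's learned embedding weights.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for the two learned sub-codebooks (codebook_size x code_dim);
# in practice these would be the quantizers' embedding weights.
codebook_a = np.random.randn(16384, 8)
codebook_b = np.random.randn(16384, 8)

codes = np.concatenate([codebook_a, codebook_b], axis=0)
labels = np.array([0] * len(codebook_a) + [1] * len(codebook_b))

# Subsample for speed before running t-SNE.
idx = np.random.choice(len(codes), size=4000, replace=False)
points = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(codes[idx])

plt.scatter(points[:, 0], points[:, 1], c=labels[idx], s=2, cmap="coolwarm")
plt.title("t-SNE of codes from the two sub-codebooks")
plt.show()
```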

🔼 Figure 5 showcases several example images generated by the Factorized Auto-Regressive (FAR) model. These examples highlight the model’s ability to produce high-quality and diverse images, demonstrating its effectiveness in autoregressive visual generation tasks. The images likely represent a range of different classes or objects, showcasing the breadth of the model’s capabilities.

Figure 5: Qualitative examples generated by our FAR model.
More on tables
| Type | Model | #Para. | FID ↓ | IS ↑ | Precision ↑ | Recall ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Diffusion | ADM [5] | 554M | 10.94 | 101.0 | 0.69 | 0.63 |
| | CDM [10] | - | 4.88 | 158.7 | - | - |
| | LDM-4 [23] | 400M | 3.60 | 247.7 | - | - |
| | DiT-XL/2 [18] | 675M | 2.27 | 278.2 | 0.83 | 0.57 |
| LFQ AR | Open-MAGVIT2-B [15] | 343M | 3.08 | 258.26 | 0.85 | 0.51 |
| | Open-MAGVIT2-L [15] | 804M | 2.51 | 271.70 | 0.84 | 0.54 |
| VQ AR | VQGAN [6] | 227M | 18.65 | 80.4 | 0.78 | 0.26 |
| | VQGAN [6] | 1.4B | 15.78 | 74.3 | - | - |
| | VQGAN-re [6] | 1.4B | 5.20 | 280.3 | - | - |
| | ViT-VQGAN [33] | 1.7B | 4.17 | 175.1 | - | - |
| | ViT-VQGAN-re [33] | 1.7B | 3.04 | 227.4 | - | - |
| | RQTran. [12] | 3.8B | 7.55 | 134.0 | - | - |
| | RQTran.-re [12] | 3.8B | 3.80 | 323.7 | - | - |
| | LlamaGen-L [25] | 343M | 3.80 | 248.28 | 0.83 | 0.51 |
| | LlamaGen-XL [25] | 775M | 3.39 | 227.08 | 0.81 | 0.54 |
| | FAR-Base | 415M | 3.38 | 248.26 | 0.81 | 0.54 |
| | FAR-Large | 898M | 3.08 | 272.52 | 0.82 | 0.54 |

🔼 Table 2 presents a comparison of class-conditional image generation results on the 256x256 ImageNet dataset. It shows various metrics, including FID (Fréchet Inception Distance) which measures the quality of generated images by comparing them to real images, IS (Inception Score) indicating the diversity and quality of generated samples, Precision, and Recall. Models are categorized based on their architecture and whether or not they utilize rejection sampling, indicated by the “-re” suffix. The evaluation methodology is consistent with the ADM [5] approach and a cfg-scale of 2.0 is used for our model.

Table 2: Class-conditional generation on 256×256 ImageNet. Models with the suffix "-re" use rejection sampling. The evaluation protocol and implementation follow ADM [5]. Our model employs a cfg-scale of 2.0.
| Model | Codebook Size | Dis. | Rep. | rFID ↓ | IS ↑ | PSNR ↑ | Usage ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQGAN | 16384 | - | - | 3.71 | 50.05 | 20.56 | 98% |
| | 32768 | - | - | 3.60 | 50.60 | 20.56 | 84% |
| FQGAN | 16384 × 2 | | | 2.00 | 54.72 | 22.21 | 97% |
| | 16384 × 2 | ✓ | | 1.84 | 55.04 | 22.04 | 98% |
| | 16384 × 2 | | ✓ | 1.73 | 55.00 | 21.61 | 98% |
| | 16384 × 2 | ✓ | ✓ | 1.66 | 55.21 | 21.62 | 98% |

🔼 This table presents an ablation study analyzing the impact of different components of the proposed factorized quantization method on the performance of the FQGAN-Dual model. It compares the reconstruction quality (measured by rFID, IS, PSNR) and codebook usage rate across various configurations, including the use of a single codebook (as in standard VQGAN), a factorized codebook without disentanglement or representation learning, and finally the full FQGAN-Dual model which incorporates both disentanglement regularization and representation learning. This allows for a quantitative assessment of the contribution of each proposed technique to the overall improvement in performance.

Table 3: Ablation study on different components of the proposed factorized quantization, using the FQGAN-Dual variant.
| Generation Model Head | Top-k Sampling | gFID ↓ |
| --- | --- | --- |
| k Linear Classifiers | 4096 | 5.19 |
| | 8192 | 6.90 |
| k MLP Classifiers | 4096 | 5.59 |
| | 8192 | 8.88 |
| Factorized AR Head | 4096 | 4.37 |
| | 8192 | 3.74 |

🔼 This table presents an ablation study on different design choices for the autoregressive (AR) head in the context of the proposed Factorized Quantization Generative Adversarial Network (FQGAN). Specifically, it investigates the impact of various AR head architectures on the quality of image generation. The study uses the FAR-Large model with a classifier-free guidance (CFG) scale of 1.75, comparing different head designs: multiple linear classifiers, multiple multi-layer perceptrons (MLPs), and the proposed factorized AR head. The results are evaluated using the Fréchet Inception Distance (FID) metric, a lower FID indicating better image generation quality. The experiment helps determine the optimal design for the AR head that effectively handles multiple sub-tokens produced by the factorized tokenizer in FQGAN.

Table 4: Ablation study on the generation model head design with the proposed FQGAN tokenizer. We use the FAR-Large model with cfg-scale=1.75 in this study.
