VGBench

Evaluating Large Language Models on Vector Graphics Understanding and Generation

EMNLP 2024

Bocheng Zou^*, Mu Cai^*, Jianrui Zhang, Yong Jae Lee

^*Equal Contribution

University of Wisconsin-Madison

🔥[NEW!] We propose VGBench, the first dataset to comprehensively evaluate LLMs' vector graphics processing capabilities.

🔥[NEW!] We quantitatively show that LLMs demonstrate decent vector graphics understanding and generation capabilities in TikZ, Graphviz, and SVG, with a particular strength in understanding vector graphics code with higher-level semantics. Advanced prompting techniques such as Chain-of-Thought also improve performance.

🔥[NEW!] Open-source models such as Llama-3-70b show competitive performance in understanding compared to GPT-4, while GPT-4 outperforms others in generation.

Abstract

In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced.

Introduction: Vector Graphics

Vector graphics are an alternative way to depict the visual world, widely used in content like cartoons, sketches and scientific figures.

Major vector graphics formats include SVG, TikZ and Graphviz, the three formats covered in this benchmark.
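To make the three formats concrete, the sketch below stores one tiny hand-written snippet per format as a Python string. The snippets are illustrative assumptions and are not taken from the VGBench data. The helper `svg_shape_tags` is a hypothetical name used only to show that SVG is XML with explicit shape primitives, whereas a Graphviz description only names nodes and edges and leaves layout to the tool.

```python
import xml.etree.ElementTree as ET

# Hand-written toy snippets, one per format (not drawn from the benchmark).
EXAMPLES = {
    # SVG: low-level XML with explicit coordinates and shape primitives.
    "SVG": (
        '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
        '<circle cx="50" cy="50" r="40" fill="red"/>'
        "</svg>"
    ),
    # TikZ: drawing commands embedded in LaTeX.
    "TikZ": r"\begin{tikzpicture} \draw[fill=red] (0,0) circle (1); \end{tikzpicture}",
    # Graphviz DOT: only nodes and edges; positions are computed by the layout tool.
    "Graphviz": "digraph G { start -> end; }",
}


def svg_shape_tags(svg_code):
    """Return the tag names of the direct children of the <svg> root."""
    root = ET.fromstring(svg_code)
    # ElementTree prefixes tags with "{namespace}", so keep only the local name.
    return [child.tag.split("}")[-1] for child in root]


for name, code in EXAMPLES.items():
    print(f"{name}: {len(code)} characters")
print(svg_shape_tags(EXAMPLES["SVG"]))
```

Because every format is plain text, each snippet can be passed to an LLM directly as part of a prompt, which is what makes text-only evaluation of vector graphics possible.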

In our benchmark, we divide our tasks into two categories: understanding and generation.

Task Overview: Understanding

For understanding, we designed three question types for each vector graphics format, chosen according to that format's individual strengths.

Task Overview: Generation

First, we obtain a caption for each vector graphics sample by prompting GPT-4V with its rasterized image. Then we prompt the LLM to generate the vector graphics code corresponding to the caption. Finally, we render the generated vector graphics into rasterized images and use CLIP Score and Fréchet Inception Distance (FID) to evaluate their quality. The scores of the ground-truth images are used as the upper bound to objectively evaluate the quality of the generated images.
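The three steps above can be sketched as a small function. This is a minimal sketch under stated assumptions and not the authors' evaluation code: `generate_code`, `rasterize`, `embed_image` and `embed_text` are hypothetical callables standing in for the LLM, the renderer and a CLIP encoder, and the `scale` factor of 100 with clipping at zero is one common convention for reporting a CLIP-style score. FID is omitted because it is computed over whole sets of images.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def clip_style_score(image_emb, text_emb, scale=100.0):
    """Image-text similarity, clipped at zero and rescaled (assumed convention)."""
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)


def evaluate_generation(caption, generate_code, rasterize, embed_image, embed_text):
    """Caption -> vector graphics code -> raster image -> similarity to the caption."""
    code = generate_code(caption)   # hypothetical LLM call
    image = rasterize(code)         # hypothetical renderer (e.g. to PNG)
    return clip_style_score(embed_image(image), embed_text(caption))


# Usage with stub callables, only to show the data flow.
score = evaluate_generation(
    "a red circle",
    generate_code=lambda caption: "<svg/>",
    rasterize=lambda code: "fake-image",
    embed_image=lambda image: [1.0, 0.0],
    embed_text=lambda text: [1.0, 0.0],
)
print(score)
```

Scoring the ground-truth image against the same caption with `clip_style_score` would give the reference value that the generated images are compared to.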

Data Curation Pipeline

Vector graphics are converted into PNG format, then GPT-4V is used to generate question-and-answer (QA) candidates. Finally, human annotators filter the QA pairs to obtain a high-quality QA dataset.
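The filtering stage of this pipeline can be pictured as follows. This is only a sketch: the `QACandidate` fields and the `approve` callback are hypothetical, chosen for illustration, and do not reflect the actual schema or annotation tooling used for VGBench.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QACandidate:
    """One model-proposed question/answer pair for a rasterized sample (assumed fields)."""
    sample_id: str
    question: str
    answer: str


def curate(candidates: Iterable[QACandidate],
           approve: Callable[[QACandidate], bool]) -> List[QACandidate]:
    """Keep only the candidates that a human annotator approves."""
    return [candidate for candidate in candidates if approve(candidate)]


# Usage with a stand-in for the human decision.
proposed = [
    QACandidate("sample-1", "What color is the circle?", "Red"),
    QACandidate("sample-2", "", "Blue"),  # malformed: empty question
]
kept = curate(proposed, approve=lambda c: bool(c.question.strip()))
print([c.sample_id for c in kept])
```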

Performance

Evaluation: Understanding

GPT-4 shows stronger performance on high-level vector graphics languages (e.g., TikZ, Graphviz) than on the low-level vector graphics language SVG.

Performance also differs by question type across vector graphics formats: the results demonstrate that GPT-4 performs worse on low-level vector graphics tasks, especially those related to reasoning.

Our ablation study also shows that GPT-4's performance varies significantly across different ranges of vector graphics code length.

	SVG				TikZ				Graphviz
Model	Category	Color	Usage	Avg	Concept	Counting	Relation	Avg	Domain	Layout	Relation	Avg
GPT-4o	52.5	80.4	60.3	64.4	87.0	75.0	77.3	79.8	83.6	75.0	83.7	80.8
GPT-4	41.2	72.8	50.6	54.9	89.4	77.5	76.0	81.0	84.6	82.3	86.6	84.5
GPT-3.5-Turbo	33.4	50.5	47.1	43.7	76.7	56.8	54.4	62.6	83.6	62.5	63.5	69.9
Gemini-1.5-Pro	39.2	73.2	47.9	53.4	86.7	74.9	71.8	77.8	79.5	66.8	86.0	77.4
Llama-3-8B	32.3	39.8	48.0	40.0	64.6	53.0	45.9	54.5	68.0	52.5	55.8	58.8
Llama-3-70B	46.3	58.7	55.3	53.4	78.5	68.2	66.7	71.1	72.8	61.4	74.4	69.5
Qwen2-7B	33.3	48.7	46.3	42.8	79.4	64.7	58.3	67.5	81.8	57.3	68.6	69.2
Qwen2-72B	43.4	62.4	55.9	53.9	88.6	74.6	72.5	78.6	86.5	71.5	80.8	79.6
Phi-3-Mini-128K	34.1	29.8	49.7	37.9	70.6	52.5	50.7	57.9	74.7	58.9	68.6	67.4
Phi-3-Medium-128k	43.6	44.7	60.6	49.6	80.4	59.7	62.8	67.6	81.4	66.5	72.7	73.5
LLaVA-1.5-13b	83.0	85.2	84.0	84.1	64.3	34.3	44.8	47.8	46.7	53.9	49.7	50.1

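Model groups in the table: proprietary LLMs (GPT-4o, GPT-4, GPT-3.5-Turbo, Gemini-1.5-Pro), open-source LLMs (the Llama-3, Qwen2 and Phi-3 models), and large multimodal models (LLaVA-1.5-13b).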
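The Avg columns in the table above are consistent with an unweighted mean of the three question-type scores for each format, rounded to one decimal place. This is inferred from the reported numbers and is not stated on this page; the function name `format_average` is hypothetical. A quick check against a few rows:

```python
def format_average(scores):
    """Unweighted mean of per-question-type scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# (Category, Color, Usage) scores on SVG and the reported Avg, copied from the table.
rows = {
    "GPT-4o": ([52.5, 80.4, 60.3], 64.4),
    "GPT-4": ([41.2, 72.8, 50.6], 54.9),
    "Llama-3-70B": ([46.3, 58.7, 55.3], 53.4),
}

for model, (scores, reported) in rows.items():
    computed = format_average(scores)
    print(f"{model}: computed {computed}, reported {reported}")
```

Under this reading, each question type contributes equally to a format's average regardless of how many questions it contains.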

Evaluation: Generation

Both GPT-3.5 and GPT-4 show strong vector graphics generation capability. GPT-4 shows better performance than GPT-3.5 on CLIP score. Qualitative examples including the heart shape and flowchart generation also demonstrate the promising capability of VG generation using LLMs.

BibTeX

        
@inproceedings{zou-etal-2024-vgbench,
  title = "{VGB}ench: Evaluating Large Language Models on Vector Graphics Understanding and Generation",
  author = "Zou, Bocheng  and
    Cai, Mu  and
    Zhang, Jianrui  and
    Lee, Yong Jae",
  editor = "Al-Onaizan, Yaser  and
    Bansal, Mohit  and
    Chen, Yun-Nung",
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, Florida, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.emnlp-main.213",
  pages = "3647--3659",
  abstract = "In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced.",
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.

Usage and License Notices: The data, code and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [Instruction Tuning with GPT-4]