VGBench

Evaluating Large Language Models on Vector Graphics Understanding and Generation

arXiv 2024
*Equal Contribution
University of Wisconsin-Madison

🔥[NEW!] We propose VGBench, the first dataset to comprehensively evaluate LLMs' vector graphics processing capabilities.

🔥[NEW!] We quantitatively show that LLMs demonstrate decent vector graphics understanding and generation capabilities in TikZ, Graphviz, and SVG, with a particular strength in understanding vector graphics code with higher-level semantics. Advanced prompting techniques such as Chain-of-Thought also improve performance.

🔥[NEW!] Open-source models such as Llama-3-70b show competitive performance in understanding compared to GPT-4, while GPT-4 outperforms the others in generation.

Abstract

In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or the only way to represent visual content, especially for designers and artists who depict the world using geometric primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons or sketches. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) a wide range of prompting techniques, and (e) evaluation under multiple LLMs. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG).

Task Overview: Understanding

In our benchmark, we focus on three different types of vector graphics: SVG, TikZ, and Graphviz. We divide our tasks into two categories: understanding and generation. For understanding, we designed three different question types for each vector graphics format, based on their individual strengths.

Task Overview: Generation

First, we obtain a caption for each vector graphics image by applying GPT-4V to its rasterized image. Then we prompt the LLM to generate the vector graphics code corresponding to the caption. Finally, we render the generated vector graphics into rasterized images and use CLIP Score and Fréchet Inception Distance (FID) Score to evaluate their quality. The scores of the ground truth images are used as the upper bound to objectively evaluate the quality of the generated images.
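The CLIP-based part of this evaluation can be illustrated with a minimal sketch. It assumes the image and caption embeddings have already been produced by a CLIP model, and it uses the common convention of a scaled, zero-clipped cosine similarity; the function names and the scale of 100 are assumptions for illustration, not the exact implementation used in VGBench.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_style_score(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    scale: float = 100.0,  # assumed scaling; conventions differ between libraries
) -> float:
    """Scaled cosine similarity between image and caption, clipped at zero."""
    return max(scale * cosine_similarity(image_emb, text_emb), 0.0)
```

In this setup, the same function would be applied to the rasterized ground truth image and to the rasterized generated image, each paired with the caption, so that the ground truth score can serve as the reference upper bound.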

Data Curation Pipeline

Vector graphics are first converted into PNG format, then GPT-4V is used to generate question-answer (QA) candidates from the rendered images. Finally, human annotators filter the candidates to obtain a high-quality QA dataset.
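The three stages can be sketched as a small pipeline. This is only an illustrative outline: `rasterize`, `propose_qa`, and `accept` are hypothetical callables standing in for the renderer, the GPT-4V call, and the human annotator's keep-or-discard decision, and the record layout is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    question: str
    answer: str


def curate(
    sources: Iterable[str],
    rasterize: Callable[[str], bytes],            # hypothetical: code -> PNG bytes
    propose_qa: Callable[[bytes], List[QAPair]],  # hypothetical: PNG -> QA candidates
    accept: Callable[[str, QAPair], bool],        # hypothetical: annotator decision
) -> List[dict]:
    """Run the three stages: rasterize, propose QA candidates, filter."""
    dataset = []
    for code in sources:
        png = rasterize(code)          # stage 1: vector graphics to PNG
        for qa in propose_qa(png):     # stage 2: model-proposed QA candidates
            if accept(code, qa):       # stage 3: keep only approved pairs
                dataset.append(
                    {"code": code, "question": qa.question, "answer": qa.answer}
                )
    return dataset
```

Passing the stages in as callables keeps the sketch independent of any particular renderer or model API, which are not specified on this page.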

QA examples

We systematically design a range of tasks based on the nature of each vector graphics category, aiming at a comprehensive evaluation across different semantic levels. For SVG, we design three types of questions: color, category, and usage; for TikZ, we use concept, counting, and relations as types of questions; while for Graphviz, we design layout, domain, and relations.
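The question taxonomy above maps naturally onto a small lookup table, which also makes it easy to report accuracy per format and question type. The sketch below is illustrative only: the record keys (`format`, `question_type`, `prediction`, `answer`) and the case-insensitive string comparison are assumptions, not the benchmark's actual data schema or grading rule.

```python
from collections import defaultdict

# Question types per format, as listed in the paragraph above.
QUESTION_TYPES = {
    "SVG": ("color", "category", "usage"),
    "TikZ": ("concept", "counting", "relations"),
    "Graphviz": ("layout", "domain", "relations"),
}


def accuracy_by_type(records):
    """Return {(format, question_type): accuracy} for graded QA records.

    Each record is a dict with the hypothetical keys "format",
    "question_type", "prediction", and "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        fmt, qtype = rec["format"], rec["question_type"]
        if qtype not in QUESTION_TYPES.get(fmt, ()):
            raise ValueError(f"unknown question type {qtype!r} for format {fmt!r}")
        total[(fmt, qtype)] += 1
        # Assumed grading rule: case-insensitive exact match after trimming.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[(fmt, qtype)] += 1
    return {key: correct[key] / total[key] for key in total}
```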

Performance

Evaluation: Understanding

GPT-4 shows stronger performance on high-level vector graphics languages (e.g., TikZ, Graphviz) than on the low-level vector graphics language SVG.

Different vector graphics formats behave differently across question types. The results demonstrate that GPT-4 performs worse on low-level vector graphics tasks, especially on tasks related to reasoning.

Our ablation study also shows that GPT-4's performance varies significantly across vector graphics code ranges.

Evaluation: Generation

Both GPT-3.5 and GPT-4 show strong vector graphics generation capability, with GPT-4 outperforming GPT-3.5 on CLIP Score. Qualitative examples, including heart shape and flowchart generation, also demonstrate the promising capability of LLMs for VG generation.

BibTeX


@article{zou2024vgbench,
  title={VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation},
  author={Zou, Bocheng and Cai, Mu and Zhang, Jianrui and Lee, Yong Jae},
  journal={arXiv},
  year={2024},
  eprint={2407.10972},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.10972},
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [Instruction Tuning with GPT-4]