VGBench

Evaluating Large Language Models on Vector Graphics Understanding and Generation

EMNLP 2024
*Equal Contribution
University of Wisconsin-Madison

🔥[NEW!] We propose VGBench, the first dataset to comprehensively evaluate LLMs' vector graphics processing capabilities.

🔥[NEW!] We quantitatively show that LLMs demonstrate decent vector graphics understanding and generation capabilities in TikZ, Graphviz, and SVG, with a particular strength in understanding vector graphics code with higher-level semantics. Advanced prompting techniques such as Chain-of-Thought also improve performance.

🔥[NEW!] Open-source models such as Llama-3-70B show understanding performance competitive with GPT-4, while GPT-4 outperforms the others in generation.

Abstract

In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced.

Introduction: Vector Graphics

Vector graphics are an alternative way to depict the visual world, widely used in content like cartoons, sketches and scientific figures.

[Figure: a raster graphic compared with the same content rendered from a vector graphics file, alongside the underlying SVG source (XML).]

There are three major types of vector graphics: SVG, TikZ, and Graphviz.

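As a concrete illustration (our own minimal examples, not samples from the benchmark), a red circle, or for Graphviz a single circular node, can be written in each format:

    <!-- SVG: low-level shape and path primitives -->
    <svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
      <circle cx="50" cy="50" r="40" fill="red"/>
    </svg>

    % TikZ: higher-level LaTeX drawing commands
    \begin{tikzpicture}
      \draw[fill=red] (0,0) circle (1);
    \end{tikzpicture}

    // Graphviz (DOT): declarative node-and-edge descriptions
    digraph G { node [shape=circle, style=filled, fillcolor=red]; A; }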

In our benchmark, we divide our tasks into two categories: understanding and generation.

Task Overview: Understanding

For understanding, we design three question types for each vector graphics format, based on each format's individual strengths.

(a) SVG: (i) Color, (ii) Category, (iii) Usage
- Color: "In the SVG image, what is the color of the zigzag shape?" (A: Yellow, B: Red, C: Blue, D: Green)
- Category: "What does this SVG image depict?" (A: Airplane, B: Rocket, C: Bullet, D: Tank)
- Usage: "Which function might this icon represent in a software application?" (A: Email sending, B: Document printing, C: Music playing, D: Picture editing)

(b) TikZ: (i) Concept, (ii) Counting, (iii) Relation
- Concept: "What is the general theme of this image?" (A: Musical score, B: Organizational chart, C: Math concept, D: Architectural blueprint)
- Counting: "How many circles are there in this image?" (A: 1, B: 2, C: 3, D: 4)
- Relation: "What is the position of B relative to A?" (A: B is to the right of A, B: B is to the left of A, C: B is above A, D: B is below A)

(c) GraphViz: (i) Domain, (ii) Layout, (iii) Relation
- Domain: "What is the main concept?" (A: Organizational Chart, B: Software Architecture, C: Flowchart, D: Family Tree)
- Layout: "What is the orientation of the text inside the box?" (A: horizontal, B: vertical, C: diagonal, D: circular)
- Relation: "What is connected to both 'Client' blocks?" (A: MySQL Server, B: MySQL Proxy, C: app, D: network core)
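For illustration, a single understanding sample could be represented as below; the field names are hypothetical and need not match the released data format:

    # Hypothetical record for one multiple-choice understanding sample.
    sample = {
        "format": "svg",               # one of: "svg", "tikz", "graphviz"
        "question_type": "color",      # e.g., color / category / usage for SVG
        "code": "<svg ...> ... </svg>",  # raw vector graphics source given to the LLM
        "question": "In the SVG image, what is the color of the zigzag shape?",
        "choices": ["Yellow", "Red", "Blue", "Green"],
        "answer": "B",                 # letter of the correct choice
    }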

Task Overview: Generation

First, we obtain a caption for each vector graphics image by running GPT-4V on its rasterized rendering. Then we prompt the LLM to generate the vector graphics code corresponding to the caption. Finally, we render the generated vector graphics into rasterized images and use CLIP score and Fréchet Inception Distance (FID) to evaluate the quality of the generated vector graphics. The scores of the ground-truth images are used as the upper bound to objectively assess the generated images.
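A minimal sketch of the CLIP-score step, assuming the generated graphic has already been rendered to PNG; the checkpoint and function name are our choices, not necessarily the paper's exact tooling:

    # CLIP score between a rendered image and its caption (illustrative sketch).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image_path: str, caption: str) -> float:
        """Cosine similarity between CLIP embeddings of the image and caption."""
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return (img * txt).sum().item()

    # The score of the rendered ground-truth image serves as the upper bound.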

Data Curation Pipeline

Vector graphics are first converted into PNG format; GPT-4V is then used to generate candidate question-answer (QA) pairs. Finally, human annotators filter the candidates to obtain a high-quality QA dataset.
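A sketch of this pipeline for SVG inputs (TikZ and Graphviz need their own renderers, e.g., pdflatex and dot); the prompt wording and model name here are illustrative assumptions:

    import base64
    import cairosvg                      # rasterizes SVG to PNG
    from openai import OpenAI

    def svg_to_png(svg_path: str, png_path: str) -> None:
        cairosvg.svg2png(url=svg_path, write_to=png_path)

    def generate_qa_candidates(png_path: str) -> str:
        """Ask GPT-4V for candidate QA pairs; human annotators filter the output."""
        client = OpenAI()
        with open(png_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",   # illustrative model choice
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Write multiple-choice QA pairs about this image."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content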

Performance

Evaluation: Understanding

GPT-4 shows stronger performance on high-level vector graphics languages (e.g., TikZ, Graphviz) than on the low-level language SVG.

Different vector graphics formats behave differently across question types: the results demonstrate that GPT-4 performs worse on low-level vector graphics tasks, especially those involving reasoning.

Our ablation study also shows that GPT-4's performance varies significantly across vector graphics code-length ranges.

Understanding accuracy (%) for each format and question type:

                        SVG                                      TikZ                                     Graphviz
Model                   Category  Color     Usage     Avg       Concept   Counting  Relation  Avg       Domain    Layout    Relation  Avg

Proprietary LLMs
GPT-4o                  52.5      80.4      60.3      64.4      87.0      75.0      77.3      79.8      83.6      75.0      83.7      80.8
GPT-4                   41.2      72.8      50.6      54.9      89.4      77.5      76.0      81.0      84.6      82.3      86.6      84.5
GPT-3.5-Turbo           33.4      50.5      47.1      43.7      76.7      56.8      54.4      62.6      83.6      62.5      63.5      69.9
Gemini-1.5-Pro          39.2      73.2      47.9      53.4      86.7      74.9      71.8      77.8      79.5      66.8      86.0      77.4

Open-sourced LLMs
Llama-3-8B              32.3      39.8      48.0      40.0      64.6      53.0      45.9      54.5      68.0      52.5      55.8      58.8
Llama-3-70B             46.3      58.7      55.3      53.4      78.5      68.2      66.7      71.1      72.8      61.4      74.4      69.5
Qwen2-7B                33.3      48.7      46.3      42.8      79.4      64.7      58.3      67.5      81.8      57.3      68.6      69.2
Qwen2-72B               43.4      62.4      55.9      53.9      88.6      74.6      72.5      78.6      86.5      71.5      80.8      79.6
Phi-3-Mini-128K         34.1      29.8      49.7      37.9      70.6      52.5      50.7      57.9      74.7      58.9      68.6      67.4
Phi-3-Medium-128K       43.6      44.7      60.6      49.6      80.4      59.7      62.8      67.6      81.4      66.5      72.7      73.5

Large Multimodal Models (on rasterized images)
LLaVA-1.5-13B           83.0      85.2      84.0      84.1      64.3      34.3      44.8      47.8      46.7      53.9      49.7      50.1

Evaluation: Generation

Both GPT-3.5 and GPT-4 show strong vector graphics generation capability, with GPT-4 achieving a better CLIP score than GPT-3.5. Qualitative examples, including heart-shape and flowchart generation, further demonstrate the promising VG generation capability of LLMs.
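For completeness, a hedged sketch of the FID side of the evaluation, using torchmetrics (a library choice of ours, not necessarily the paper's tooling; the random tensors stand in for batches of rendered ground-truth and generated images):

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    fid = FrechetInceptionDistance(feature=2048)
    # uint8 image batches of shape (N, 3, H, W); placeholders for rendered images
    real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
    fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
    fid.update(real, real=True)
    fid.update(fake, real=False)
    print(float(fid.compute()))  # lower is better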

BibTeX

@inproceedings{zou-etal-2024-vgbench,
  title = "{VGB}ench: Evaluating Large Language Models on Vector Graphics Understanding and Generation",
  author = "Zou, Bocheng  and
    Cai, Mu  and
    Zhang, Jianrui  and
    Lee, Yong Jae",
  editor = "Al-Onaizan, Yaser  and
    Bansal, Mohit  and
    Chen, Yun-Nung",
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, Florida, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.emnlp-main.213",
  pages = "3647--3659",
  abstract = "In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced.",
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [Instruction Tuning with GPT-4]