SEVA: Leveraging Sketches to Evaluate Alignment between Human and Machine Visual Abstraction

1University of Wisconsin, Madison 2University of California, San Diego 3Tel-Aviv University 4Reichman University 5Stanford University
*Equal Contribution

We introduce SEVA, a new benchmark dataset containing 90K human-generated sketches that systematically vary in their level of detail, and evaluate a suite of state-of-the-art vision models in their ability to understand them.


Sketching is a powerful tool for creating abstract images that are sparse but meaningful. Sketch understanding poses fundamental challenges for general-purpose vision algorithms because it requires robustness to the sparsity of sketches relative to natural visual inputs and because it demands tolerance for semantic ambiguity, as sketches can reliably evoke multiple meanings. While current vision algorithms have achieved high performance on a variety of visual tasks, it remains unclear to what extent they understand sketches in a human-like way.

Here we introduce SEVA, a new benchmark dataset containing 90K human-generated sketches of 128 object concepts produced under different time constraints, and thus systematically varying in sparsity. We evaluated a suite of state-of-the-art vision algorithms on their ability to correctly identify the target concept depicted in these sketches and to generate responses that are strongly aligned with human response patterns on the same sketch recognition task. We found that vision algorithms that better predicted human sketch recognition performance also better approximated human uncertainty about sketch meaning, but there remains a sizable gap between model and human response patterns.

To explore the potential of models that emulate human visual abstraction in generative tasks, we conducted further evaluations of a recently developed sketch generation algorithm (Vinker et al., 2022) capable of generating sketches that vary in sparsity. We hope that public release of this dataset and evaluation protocol will catalyze progress towards algorithms with enhanced capacities for human-like visual abstraction.


Visual abstraction as a key target for AI systems

Sketching is a powerful tool for understanding human visual abstraction—how we distill semantically relevant information from our experiences of a visually complex world. It is also one of the most prolific and enduring visualization techniques in human history: drawings dating back 40,000-60,000 years have been found across most cultures. Perhaps this is because, even without specialized training, most people, including children, can make and interpret sketches.

Take, for example, Picasso’s famous sketch series “The Bull” (1945). Despite some sketches being extremely sparse in detail (right side), each one is unmistakably a bull. What makes them so easily recognizable to us? Although this ability to recognize visual concepts across levels of abstraction comes to us with such ease, it remains unclear to what extent current state-of-the-art vision algorithms are sensitive to visual abstraction in human-like ways.

With the goal of developing robust and generalized AI systems that can recognize sketches, diagrams, and glyphs in human-like ways, here we tackle two critical challenges:

SEVA: A novel Sketch benchmark for evaluating visual abstraction in humans and machines

First, we needed to benchmark how humans generate sketches at varying levels of abstraction and across a diverse set of visual object concepts. To do this, we created the SEVA dataset! SEVA is a large-scale dataset of over 90,000 sketches that span 2,048 object instances belonging to 128 object categories. Because we wanted to test machine recognition of human visual abstraction, ~85,000 of these sketches were generated by people. But we were also curious about machine recognition of machine visual abstraction, so ~5,000 sketches were generated by CLIPasso (Vinker et al., 2022).

To vary the abstraction level of the sketches, we asked people (N=5,563 participants) to produce their drawings under different timing constraints: 4 seconds, 8 seconds, 16 seconds, and 32 seconds. We asked CLIPasso to generate its sketches under different stroke constraints: 4 strokes, 8 strokes, 16 strokes, and 32 strokes.

Our work leverages a comprehensive set of object categories from the THINGS initiative, an international research effort by multiple labs to develop large-scale datasets spanning the same diverse set of object categories.

Beyond accuracy: Evaluating alignment between human and machine visual abstraction

Second, we wanted to evaluate how well state-of-the-art vision algorithms could recognize these sketches (i.e., correctly classify the target object category) relative to human recognition performance, and how well the response patterns of AI models aligned with human response patterns. To do this, we assessed human-model alignment across a diverse suite of 17 state-of-the-art vision models. Building on prior research, we evaluated the models based on:

(1) Top-1 classification accuracy, reflecting raw sketch recognition performance.

However, since our goal was to go beyond simple accuracy-based metrics, we also benchmarked models on two additional metrics:

(2) Shannon entropy of the response distribution, reflecting the degree of uncertainty about the target object category; and

(3) Semantic neighbor preference, reflecting the degree to which models and humans generated off-target responses that were semantically related to the target object category.

By comparing the patterns of these metrics between models and humans, we empirically evaluated the extent to which machine comprehension of visual abstraction differs from that of humans.
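As a concrete sketch of how these three metrics can be computed, the snippet below operates on a list of hypothetical category responses to a single sketch. The category names, response counts, and the semantic neighbor set are illustrative assumptions, not values from the SEVA dataset.

```python
from collections import Counter
import math

def top1_accuracy(responses, target):
    """Fraction of responses that exactly match the target category."""
    return sum(r == target for r in responses) / len(responses)

def response_entropy(responses):
    """Shannon entropy (in bits) of the empirical response distribution;
    lower entropy means a tighter, more confident distribution."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def neighbor_preference(responses, target, neighbors):
    """Among off-target responses, the fraction that fall within the
    target's semantic neighborhood (e.g., nearby categories in an
    embedding space)."""
    off_target = [r for r in responses if r != target]
    if not off_target:
        return float("nan")  # undefined when every response is correct
    return sum(r in neighbors for r in off_target) / len(off_target)

# Hypothetical responses to a sparse sketch whose target category is "dog"
responses = ["dog", "dog", "wolf", "cat", "dog", "fox", "car"]
print(top1_accuracy(responses, "dog"))   # 3/7 ≈ 0.43
print(response_entropy(responses))
print(neighbor_preference(responses, "dog", {"wolf", "cat", "fox"}))  # 3/4 = 0.75
```

On this toy distribution, a sparser sketch would typically shift mass away from "dog" (lower top-1 accuracy, higher entropy), while human-like errors would stay concentrated on semantic neighbors like "wolf" rather than unrelated categories like "car".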

Measuring human and machine sketch understanding at multiple levels of abstraction

We observe certain similarities between human and model response patterns, but a large gap remains in their understanding of sketches. Our findings reveal that sparser sketches exhibit higher levels of semantic ambiguity for both models and humans. In contrast, more detailed sketches are associated with higher top-1 classification performance, a tighter distribution of responses (lower entropy), and greater semantic neighbor preference.

Even among high-performing models, there are reliable differences in response patterns across the three metrics, revealing systematic differences in how models extract semantic information from sketches. For now, though, state-of-the-art computer vision models exhibit only a limited degree of alignment with humans in sketch understanding: across all metrics, a significant gap persists between even the most closely aligned models and a baseline representing human-human consistency.

Do machine-generated sketches elicit similar responses to human sketches?

While sketch understanding is a critical aspect of visual abstraction, the ability to produce sketches spanning different levels of abstraction is no less important. In our last experiment, we evaluated CLIPasso, a sketch generation algorithm built on CLIP, which was among the most performant and best human-aligned models in our recognition evaluations. CLIPasso represents each sketch as a set of Bézier curves and optimizes their parameters with respect to a CLIP-based perceptual loss, modulating the degree of abstraction by varying the number of strokes used. Our analysis reveals that human and CLIPasso sketches are least divergent in perceived meaning when they are detailed and less abstract, and diverge more as abstraction increases. This observation underscores the contrast between how CLIPasso and human participants attempt to preserve sketch meaning under more severe production constraints.
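CLIPasso itself backpropagates a CLIP-based perceptual loss through a differentiable rasterizer; as a self-contained toy illustration of the same optimization scheme, the sketch below fits cubic Bézier stroke control points to a hypothetical target point set using a simple mean-squared-error loss and finite-difference gradients. Everything here (the loss, the target geometry, the hyperparameters) is a stand-in for illustration, not CLIPasso's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bezier_points(ctrl, n=16):
    """Sample n points along a cubic Bézier curve given its 4 control
    points (a 4x2 array)."""
    t = np.linspace(0, 1, n)[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

def loss(strokes, target):
    """Toy stand-in for a perceptual loss: mean squared distance between
    sampled stroke points and a fixed target point set."""
    pts = np.concatenate([bezier_points(s) for s in strokes])
    return float(np.mean((pts - target) ** 2))

def optimize(strokes, target, steps=200, lr=0.1, eps=1e-4):
    """Gradient descent on stroke control points, with gradients
    estimated by forward finite differences."""
    strokes = strokes.copy()
    for _ in range(steps):
        base = loss(strokes, target)
        grad = np.zeros_like(strokes)
        for idx in np.ndindex(strokes.shape):
            bump = strokes.copy()
            bump[idx] += eps
            grad[idx] = (loss(bump, target) - base) / eps
        strokes -= lr * grad
    return strokes

n_strokes = 4  # fewer strokes -> sparser, more abstract sketch
strokes = rng.normal(size=(n_strokes, 4, 2))          # initial control points
target = rng.normal(size=(n_strokes * 16, 2))         # hypothetical target geometry
before = loss(strokes, target)
after = loss(optimize(strokes, target), target)
```

Decreasing `n_strokes` mirrors how CLIPasso trades detail for abstraction: fewer curves must account for the same target geometry, forcing each remaining stroke to carry more of the sketch's meaning.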

Sketch Gallery


@inproceedings{mukherjee2023seva,
  author    = {Mukherjee, Kushin and Huey, Holly and Lu, Xuanchen and Vinker, Yael and Aguina-Kang, Rio and Shamir, Ariel and Fan, Judith},
  title     = {SEVA: Leveraging sketches to evaluate alignment between human and machine visual abstraction},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023}
}