clean & unify tags for packages #226

@iamYannC

ggplot2 gallery: tags summary

2026-04-30

After misusing tags myself as a new ggplot2 extender, I decided
to take a deeper dive into the tags in the ggplot2 extension gallery.
The full Quarto doc can be found here.

_config.yml lists every extension with a free-text tags field. Because the field is unconstrained, it is nearly useless for meaningful discovery.

TL;DR

  • 158 packages in the extension gallery.
  • 193 unique tags after lower-casing (the one case-only collision was
    Visualization vs visualization).
  • On average, each package has 3.15 tags (median = 3, sd = 1.58).
    ggDNAvis leads with 12 unique tags.
  • visualization is the number one tag with 145 appearances;
    general follows with 62. Together they account for 42% of all
    tag uses but carry essentially no information.
  • The tag distribution is extremely long-tailed: 148 of 193 tags (77%)
    are used by exactly one package, and only 20 tags reach n ≥ 3.
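For readers who want to recompute figures like these outside the .qmd, here is a minimal Python sketch. It is not the script behind this issue: the `packages` dictionary is a hypothetical three-package stand-in for the parsed _config.yml, and `tag_summary` is an illustrative name. It also assumes a tag is counted at most once per package after lower-casing, which may differ from how the original analysis counts appearances.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical miniature of the gallery data: package name -> free-text tags.
# The real input would be the entries parsed from _config.yml.
packages = {
    "pkgA": ["Visualization", "general"],
    "pkgB": ["visualization", "geom", "time series"],
    "pkgC": ["visualization", "general", "theme"],
}


def tag_summary(packages):
    """Summarise tag usage after trimming and lower-casing each tag."""
    # Assumption: a tag counts once per package, so each entry becomes a set.
    lowered = {
        name: {tag.strip().lower() for tag in tags}
        for name, tags in packages.items()
    }
    counts = Counter(tag for tags in lowered.values() for tag in tags)
    per_package = [len(tags) for tags in lowered.values()]
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "n_packages": len(lowered),
        "n_unique_tags": len(counts),
        "mean_tags_per_package": mean(per_package),
        "median_tags_per_package": median(per_package),
        "top_tags": counts.most_common(2),
        # Share of tags used by exactly one package (the long tail).
        "singleton_share": singletons / len(counts),
    }


summary = tag_summary(packages)
print(summary)
```

On the toy data, three of the five distinct tags are used by a single package, which is the same long-tail pattern in miniature.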

What is wrong with the current tags

The raw .qmd file contains an object, tags_in_pkgs, with the summary
information. Download and explore it for a better understanding.

  1. Case / spelling / number variants that should clearly merge:
    Visualization vs visualization vs visualisation vs
    visualizations; geom vs geoms; time series vs time-series;
    theme vs themes; facet vs facets; outlier vs outliers;
    customisable vs customizable; algorithm vs algorithms.
  2. Filler tags that carry no signal. general is applied by 62
    packages and tells a reader nothing. visualization is applied by
    145 packages (i.e., almost all of them), so it is not a discriminator
    either. This IS a ggplot2 gallery, after all.
  3. Specialized niche vocabulary. Several packages invent their own
    private taxonomy. Examples:
  • gganatogram: anatograms, tissue, anatomy, expression, pharmacology
  • ggDNAvis: DNA, RNA, customisable, customizable, medicine, methylation, sequence, FASTQ, ...
    (12 tags, all unique to this package).
  • ggblend: blending, affine transformation, layer algebra, compositing (4 tags, all unique).

These are accurate descriptions but they cannot help anyone find the package, because no other entry uses the same words.

  4. Redundant tags within a single entry. ggoutlierscatterplot lists
    both outlier and outliers, and both algorithm and algorithms.
    ggDNAvis lists both customisable and customizable.
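Variants like the ones listed above can be flagged mechanically. The sketch below is one crude way to do it in Python, not the method used in the .qmd: `variant_key` and `redundant_groups` are hypothetical helpers, and the key (lower-case, hyphens to spaces, British to American -ise spellings, one trailing "s" dropped) is a heuristic that will miss some variants and may over-merge others, so its output should be reviewed by hand.

```python
from collections import defaultdict


def variant_key(tag: str) -> str:
    """Crude comparison key: an assumption-laden heuristic, not a stemmer."""
    # Lower-case, treat hyphens as spaces, and squeeze whitespace.
    key = " ".join(tag.lower().replace("-", " ").split())
    # Fold two common British spellings into the American ones.
    key = key.replace("isation", "ization").replace("isable", "izable")
    # Drop a single plural "s", but leave words such as "class" alone.
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def redundant_groups(tags):
    """Return groups of distinct tags in one entry that share a key."""
    groups = defaultdict(list)
    for tag in tags:
        groups[variant_key(tag)].append(tag)
    return [sorted(set(group)) for group in groups.values() if len(set(group)) > 1]


# Tags taken from the ggoutlierscatterplot example above, plus one
# made-up unrelated tag to show that non-variants are left alone.
example = ["outlier", "outliers", "algorithm", "algorithms", "scatterplot"]
print(redundant_groups(example))
```

The same key can be applied across the whole gallery to surface candidate merges such as time series vs time-series.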

Proposed unification

The core strategy is to reduce and unify tags to clusters that provide
actual information:

  • Normalise all tags to lower-case and trim whitespace.
  • Collapse spelling / case / number variants to a single canonical form.
  • Group small related tags (n < 3) under a topical umbrella when a
    clear cluster exists.
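The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the proposed implementation: the `CANONICAL` and `UMBRELLA` tables are hypothetical examples that a maintainer would curate by hand, the choice of canonical forms (for instance time-series over time series) is arbitrary here, and `unify_gallery` is an invented name.

```python
from collections import Counter

# Hypothetical variant -> canonical form table, seeded from the issue text.
CANONICAL = {
    "visualisation": "visualization",
    "visualizations": "visualization",
    "geoms": "geom",
    "time series": "time-series",
    "themes": "theme",
    "facets": "facet",
    "outliers": "outlier",
    "customisable": "customizable",
    "algorithms": "algorithm",
}

# Hypothetical rare tag -> umbrella topic table (illustrative only).
UMBRELLA = {
    "dna": "life-sciences",
    "rna": "life-sciences",
    "methylation": "life-sciences",
    "anatomy": "life-sciences",
    "tissue": "life-sciences",
}


def canonicalise(tag: str) -> str:
    """Steps 1 and 2: lower-case, trim, then collapse known variants."""
    tag = " ".join(tag.lower().split())
    return CANONICAL.get(tag, tag)


def unify_gallery(packages, min_n=3):
    """Step 3: fold tags used by fewer than min_n packages into an umbrella."""
    cleaned = {name: [canonicalise(t) for t in tags] for name, tags in packages.items()}
    # Count each tag once per package.
    counts = Counter(tag for tags in cleaned.values() for tag in set(tags))
    unified = {}
    for name, tags in cleaned.items():
        result = []
        for tag in tags:
            if counts[tag] < min_n and tag in UMBRELLA:
                tag = UMBRELLA[tag]
            if tag not in result:  # drop duplicates, keep first-seen order
                result.append(tag)
        unified[name] = result
    return unified


sample = {
    "ggDNAvis": ["DNA", "RNA", "customisable", "customizable", "methylation"],
    "gganatogram": ["anatomy", "tissue"],
}
print(unify_gallery(sample))
```

Rare tags without a clear cluster are left untouched, so nothing is merged unless someone has explicitly added it to one of the two tables.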

Better control in the future

  1. Drop or rename general and visualization. Both are applied
    broadly enough that they fail to discriminate between packages and
    add little to no signal.
  2. Encourage informative, topic-based tagging. Publish a suggested
    vocabulary in the contributing guide, organized by topic
    (life-sciences, spatial, time-series, distributions, etc.). Help us
    help the potential future user.
  3. Tidymodels has a reactive table. Consider using one for the gallery too?
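If a suggested vocabulary is published, new submissions could be checked against it. The Python sketch below shows one possible shape for such a check; nothing like it exists in the gallery today as far as this issue states. `SUGGESTED_VOCABULARY` holds only the four topics named above, and `FILLER` and `review_tags` are hypothetical names.

```python
# Hypothetical vocabulary, limited to the topics named in this issue.
SUGGESTED_VOCABULARY = {"life-sciences", "spatial", "time-series", "distributions"}

# Tags this issue proposes to drop or rename.
FILLER = {"general", "visualization"}


def review_tags(tags):
    """Sort an entry's tags into keep / filler / unknown buckets."""
    report = {"keep": [], "filler": [], "unknown": []}
    for raw in tags:
        tag = " ".join(raw.lower().split())  # lower-case and trim whitespace
        if tag in FILLER:
            bucket = "filler"
        elif tag in SUGGESTED_VOCABULARY:
            bucket = "keep"
        else:
            # Not an error: a prompt for the contributor or a reviewer.
            bucket = "unknown"
        report[bucket].append(tag)
    return report


print(review_tags(["Visualization", "spatial", "blending"]))
```

Reporting unknown tags rather than rejecting them keeps the vocabulary a suggestion, which matches the "encourage" wording above.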

Reproducing the analysis

This analysis relies on _config.yml
from commit ed45cf8 and is fully parameterized in the script.
Download the full script here: ggplot2-gallery-tags-summary.qmd

Yann
