Data Imaginist

A bunch of giraffes, all bundled up

Wed, 28 Feb 2024 00:00:00 GMT

My return to ggraph development was not supposed to end up with something deserving of a blog post. It was supposed to be a quick triage of bugs to quench my bad conscience for not having looked at the package for quite some time (well, the instigator was the new ggplot2 release which required some changes in ggraph). Yet, one thing led to another and now I’m sitting here, writing a release post indicating that it turned out to be more than just a series of bug fixes.

So, while this is certainly not a monumental release, let’s celebrate the fact that some very welcome additions managed to lure me into a proper update of the package.

What is ggraph? If you came to this blog post not knowing what this is all about (and made it past the two rambling top paragraphs) you have shown an impressive tenacity towards my R package work. ggraph is a ggplot2 extension for visualising relational data (networks, graphs, hierarchies, etc.). It is one of the most versatile frameworks for creating network visualisations and all around a great package. You can learn more about it on its webpage, which also includes extensive documentation of its features.

If the above is old news to you, you are probably sitting patiently waiting for me to tell you what is inside this new release. Wait no more…

Spatial layouts

Some time ago sfnetworks was developed on top of tidygraph to handle spatial networks with the tidygraph API. Thanks to a PR from Lorena Crespo, ggraph now works natively with this class. The layout itself is pretty simple as it takes the node locations already stored in the object and uses them as is. But the layout is also accompanied by a new node geom and a new edge geom that ensure that the correct CRS is used during plotting etc. The basic use goes something like this:

library(ggraph)
library(tidygraph)

gr <- sfnetworks::as_sfnetwork(sfnetworks::roxel)

ggraph(gr, 'sf') + 
  geom_edge_sf(aes(color = type)) + 
  geom_node_sf(size = 0.3)

This can of course be used together with other sf layers for decorations and such (e.g. city boundaries) using geom_sf() from ggplot2.

While you are often going for as correct a representation of the location data as possible when working with spatial data, there are situations where a more stylized look is wanted. One such situation is for railroad and metro maps where the standard has long been to prefer legibility over correctness. ggraph now has a layout that places nodes in a manner akin to what we expect for these types of maps. It is, as many of the layouts in ggraph, provided through the graphlayouts package by David Schoch and, while it is a bit finicky, it can provide a great starting point for a grid-like graph layout.

gr <- as_tbl_graph(graphlayouts::metro_berlin) |> 
  convert(to_simple)

ggraph(gr, 'metro', y = lat, x = lon, grid_space = 0.005) + 
  geom_edge_link(width = 1) + 
  geom_node_point(size = 2) + 
  geom_node_point(size = 0.5, color = 'white') + 
  coord_fixed()

Hierarchical layouts

ggraph already has ample layout choices if your data is hierarchical, and now you are spoiled with even more.

Cactustree is a layout that, if you squint your eyes and are a bit imaginative, resembles a cactus. While that sounds a bit odd at first, it makes pretty good sense once you see it. The layout was developed with hierarchical edge bundling in mind, so while it can certainly be used to show hierarchical relations there are probably better layouts for that if that is your only concern.

gr <- tbl_graph(flare$vertices, flare$edges) |> 
  mutate(class = stringr::str_match(name, "flare\\.(\\w+)")[,2])
from <- match(flare$imports$from, flare$vertices$name)
to <- match(flare$imports$to, flare$vertices$name)

ggraph(gr, 'cactustree', scale_factor = 0.5) + 
  geom_node_circle(aes(fill = class), colour = NA, alpha = 0.3, show.legend = FALSE) + 
  geom_conn_bundle(aes(alpha = after_stat(index)), data = get_con(from, to)) + 
  scale_edge_alpha(range = c(0.1, 0.5), guide = 'none') + 
  coord_fixed()

While the above layout is certainly flashy, the next one is not. The H tree layout is a space filling layout that can only be used for binary trees, so its application is quite limited. But, if you have a binary tree you need to show, this is your friend:

gr <- create_tree(1023, 2)

ggraph(gr, "htree") + 
  geom_edge_link() + 
  geom_node_point(aes(filter = leaf))

Other layout goodies

Some of the existing layouts have been updated with new features, worthy of a mention.

The linear layout now has a weight argument that can control the spacing between points. The layout also now outputs enough information for use with rect and arc nodes, and together these changes open up some new possibilities:

gr <- create_notable('Meredith') |> 
  convert(to_directed) |> 
  mutate(class = sample(letters[1:6], n(), replace = TRUE),
         size = pmax(0.1, 2 + rnorm(n())),
         amount = runif(n()))

ggraph(gr, "linear", circular = TRUE, weight = size) + 
  geom_edge_arc() + 
  geom_node_arc_bar(aes(r = 1 + amount/10, fill = class)) + 
  coord_fixed()

The other updates come courtesy of new functionality in the graphlayouts package and bring the layouts provided by ggraph up to speed with the implementations in graphlayouts. This means that the focus and centrality layouts get a group argument that allows grouping of kindred nodes in these two layouts. Further, the stress layout (the default layout in ggraph) gains an x and a y argument which can be used to fix some (or all) nodes in one or two dimensions. If either is given, then NA values indicate that a node should be placed by the layout algorithm, given the constraints of the fixed nodes.

All them bundles

We talked about hierarchical edge bundling back when I showed the cactustree layout. While that was the first (I believe) type of edge bundling, it did suffer from the fact that it needed an underlying hierarchical structure for the bundles to work. This created a disconnect between the graph the layout was created on and the edges that were shown (which is why they are drawn with geom_conn_*() not geom_edge_*() functions), a disconnect that later work has sought to remove. This has led to a bunch of different generalised edge bundling techniques and ggraph now supports a few, thanks mainly to David Schoch (again).

The force bundling technique treats edges as springs that attract each other if they run in parallel (it’s a bit more involved but that is the main gist). It was one of the first techniques to be developed and suffers from two main problems. First, it is computationally expensive. In ggraph it is implemented with memoisation so that you don’t recalculate it again and again, but the first pass can be taxing for larger networks. Second, the bundling doesn’t really use any topological information, and unrelated edges can thus end up in bundles together, indicating interaction where none exists.

gr <- as_tbl_graph(edgebundle::us_flights)
states <- map_data("state")

ggraph(gr, x = longitude, y = latitude) + 
  geom_polygon(aes(long, lat, group = group), states, color = 'white', linewidth = 0.2) + 
  coord_sf(crs = 'NAD83', default_crs = sf::st_crs(4326)) + 
  geom_edge_bundle_force(color = 'white', width = 0.05)

If the above stated caveats have made you skeptical, ggraph also provides an alternative bundling technique that tackles both of them. The edge path bundling algorithm doesn’t use any attracting forces when bundling. Instead it directs edges through their shortest path on an increasingly sparse version of the input graph. This, again, results in bundling, but this time the topology of the graph is being used so the bundles should to a larger degree make sense. It is also much faster to compute.

ggraph(gr, x = longitude, y = latitude) + 
  geom_polygon(aes(long, lat, group = group), states, color = 'white', linewidth = 0.2) + 
  coord_sf(crs = 'NAD83', default_crs = sf::st_crs(4326)) + 
  geom_edge_bundle_path(color = 'white', width = 0.05)

In every way an improvement. However, remember that, just like with layouts, there is no single right answer when it comes to edge bundling. You are introducing a bias to the representation and trying out different approaches is always a good idea.

The last bundling technique is very quick and dirty and a home invention of mine. It works much like the edge path bundling but instead of gradually removing edges from the graph where the shortest path is searched for, the paths are all found in the minimal spanning tree so it can be done in one go. This makes it the most performant of the three, but it suffers from forcing a tree-like structure onto the topology that the edges follow. It usually also requires a higher max_distortion setting since the minimal spanning tree forces edges on a larger detour.

ggraph(gr, x = longitude, y = latitude) + 
  geom_polygon(aes(long, lat, group = group), states, color = 'white', linewidth = 0.2) + 
  coord_sf(crs = 'NAD83', default_crs = sf::st_crs(4326)) + 
  geom_edge_bundle_minimal(color = 'white', width = 0.05, max_distortion = 10)

All in all, the edge bundling support has been greatly enhanced. I’d still like to add a technique that better splits out edges going in opposite directions but that will be for another release. Edge path bundling does treat directed graphs differently since the shortest path is direction dependent, but there are also other techniques that are worth exploring.

# This network doesn't really make sense to view as directed but we do it anyway
# to show the difference in output
gr <- gr |> convert(to_directed)
ggraph(gr, x = longitude, y = latitude) + 
  geom_polygon(aes(long, lat, group = group), states, color = 'white', linewidth = 0.2) + 
  coord_sf(crs = 'NAD83', default_crs = sf::st_crs(4326)) + 
  geom_edge_bundle_path(color = 'white', width = 0.05)

Wrapping up

That’s about it. The release of course also includes numerous bug fixes, which were the whole reason why I started working on it in the first place. A lot of the new features presented couldn’t have happened without the work of David Schoch who has made great contributions to the network support in R and in tidygraph and ggraph in particular. Also a big thanks to the people working on sfnetworks and Lorena Crespo in particular for adding support in ggraph.

A small patch of free features

Mon, 08 Jan 2024 00:00:00 GMT

What is that? Another blog post not even a month after the last? This feels like 2017. Maybe I’m a bit extra attentive because I’ve had fun porting over my blog to quarto and also finally building a proper site for my generative art rather than lumping it into my R/OSS blog. Or maybe I just finally have something interesting to share for the first time in a while…

That interesting thing today is a new release of patchwork — my package for easily combining multiple plots into complex and well-aligned compositions. It is not the grandest of releases — after all the package does what it does well — but it does provide two new features that I’ve been looking forward to:

There can be only one (axis)

One of the features in patchwork I’m particularly fond of is its ability to collect and de-duplicate legends. It is one of those touches that makes the final composition feel like a whole. Missing from this has been a similar function for axes. This has been even more glaring because we are used to de-duplicated axes from faceted plots and not having that in patchwork felt wrong. I always intended to add this but never got around to it. Thankfully Teun van den Brand took a stab at it and filled the gap.

This new functionality is two-fold as it is split up in axes and axis titles (though the setting for axis titles defaults to that for axes so you can usually get by only setting it for axes).

Consider these two plots:

library(patchwork)
library(ggplot2)

p1 <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp)) + 
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) + 
  geom_boxplot(aes(gear, disp, group = gear)) + 
  ggtitle('Plot 2')

p1 + p2

As we can see they share the exact same y-axis and you might want to avoid the visual clutter of keeping the axis of the rightmost plot. Of course you could remove it through theming, setting the relevant theme elements to element_blank(). But that is such a hassle! Using the axis collecting is much easier:

p1 + p2 + plot_layout(axes = "collect")
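For comparison, the manual theming route mentioned above could look roughly like the sketch below. The name `p2_bare` is made up for the illustration, and note that this approach blanks the axis regardless of whether the two y-axes actually match.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars) +
  geom_point(aes(mpg, disp)) +
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) +
  geom_boxplot(aes(gear, disp, group = gear)) +
  ggtitle('Plot 2')

# Blank out every element that makes up the y-axis of the right-hand plot
p2_bare <- p2 + theme(
  axis.title.y = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks.y = element_blank()
)

manual <- p1 + p2_bare
manual
```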

If you like the clarity of the axis but prefer to not keep the title, you use the axis_titles argument instead:

p1 + p2 + plot_layout(axis_titles = "collect")
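Since axis_titles only defaults to the value of axes, the two can also be set independently. A small sketch of the opposite combination, merging the duplicated axis but leaving every title in place (the variable name `kept` is just for illustration):

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars) +
  geom_point(aes(mpg, disp)) +
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) +
  geom_boxplot(aes(gear, disp, group = gear)) +
  ggtitle('Plot 2')

# Collect identical axes, but explicitly opt out of title collection
kept <- p1 + p2 + plot_layout(axes = "collect", axis_titles = "keep")
kept
```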

Titles are collected if they are identical and the same is true for axes. This means that if you have two plots showing the same variable on the y-axis but with different ranges, you can collect the titles but not the axes:

p1 + p2 + coord_cartesian(ylim = c(100, 300)) + plot_layout(axes = "collect")

There is no facility to align the range of axes across plots so you’d still need to keep an eye on that. Still, you can always use & to apply the same coordinate system or scale to all plots in a patchwork so it should be relatively easy to line up plots.
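As a sketch of that last point, the same coordinate system can be pushed onto every plot with & before the axes are compared. The limits used here are an assumption picked to cover the range of disp in mtcars.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars) +
  geom_point(aes(mpg, disp)) +
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) +
  geom_boxplot(aes(gear, disp, group = gear)) +
  ggtitle('Plot 2')

# `+` binds tighter than `&`, so the coord is applied to the whole patchwork
aligned <- p1 + p2 + plot_layout(axes = "collect") &
  coord_cartesian(ylim = c(0, 500))
aligned
```

With both plots forced onto the same y range, their y-axes should be identical and thus eligible for collection.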

One difference from the legend collection is that collecting axes only works for plots in the same nesting level. There are reasons for this, mainly my sanity level and capacity to sleep at night. Still, it means that one should be aware of the “hidden” nesting that can occur when using / and | for composition:

p1 + (p1 | p2) + plot_layout(axes = "collect")

A better approach for this would be to keep the same nesting level but use the widths argument to get the same look

p1 + p1 + p2 + plot_layout(widths = c(2, 1, 1), axes = "collect")

The attentive reader will observe that apart from “fixing” the problem at hand, something else happened to the plot. The middle plot suddenly lost its x-axis title and the x-axis title of the left plot got moved somewhat to the right. This is because axis title collecting works in both directions, i.e. if adjacent axis titles are identical they will get merged and the final title will occupy the full area of the merged ones. The effect may be more clear in a simpler layout:

p1 / p2 + plot_layout(axis_titles = "collect")

For the prior plot, if we would like to avoid this behavior because it is not obvious which x-axis title the middle plot relates to, we can set the collecting to only happen in one direction

p1 + p1 + p2 + plot_layout(widths = c(2, 1, 1), axes = "collect_y")

Being free from constraint

The other feature I’ll discuss will probably make a lot of people happy. Questions about how to not align plots are numerous and usually come down to plots with excessively long y-axis labels (sorry for keeping with the mtcars dataset; I know we got it figured out quite well at this point):

p3 <- ggplot(mtcars) +
  geom_bar(aes(y = factor(gear), fill = factor(gear))) +
  scale_y_discrete(
    "",
    labels = c("3 gears are often enough",
               "But, you know, 4 is a nice number",
               "I would def go with 5 gears in a modern car")
  )
p3

We can see how such a plot could mess up a composition

p1 / p3

My answer to these questions/issues has always been to use wrap_elements() which, to be fair, gets the job done OK’ish

p1 / wrap_elements(plot = p3)

However, there are some shortcomings to this approach. First, it is pretty verbose and not very descriptive of what it does/what your intent is. This is not the end of the world, but the API of patchwork is pretty great (IMHO) so it feels like a bad concession to give all that up here. Second, using wrap_elements() “freezes” the plot inside it, so you can no longer modify it, e.g. with & or through guide collecting:

p1 / wrap_elements(plot = p3) + plot_layout(guides = "collect") & theme_dark()

Another thing is that the plot margin is part of the plot that gets inserted into the plot region. If we remove the legend and increase the margin we can see an annoying misalignment between the right edges of the plots:

p1 / wrap_elements(plot = p3 + theme(plot.margin = margin(20, 20, 20, 20), legend.position = "none"))

That was a lot of dunking on wrap_elements(). This is mainly because it was the wrong tool for the job, not because there is anything particularly wrong with it as is. No matter, we now have the right tool:

p1 / free(p3) + plot_layout(guides = "collect") & theme_dark()

There is not much more to it. Wrap a plot in free() if you want to forego the alignment that patchwork performs and it will do exactly that without getting in the way of the other functionality in the patchwork.

And now it is time to leave mtcars alone. Happy plotting!

A new focus on tidygraph

Mon, 18 Dec 2023 00:00:00 GMT

I’m pleased to announce a new release of tidygraph. It has been a while since something major has happened to the package, reflecting the stable nature of it, but this time I felt like doing a bit more than just brush it off for the occasional upstream dependency change. So, while it is in no way a grandiose release, it does contain enough new stuff to warrant a small blog post. If you are a tidygraph user you should definitely read on, otherwise perhaps explore the project website first and become a user.

Let us focus on the news

One new feature I’m particularly excited about is the inclusion of a new focus()/unfocus() pair of verbs. Part of my excitement is that this was one of my original ideas for the package but was scrapped prior to release and then left to linger. The other reason is of course that it is super useful. So what does it do?

Let’s start with the why. For classic tabular data you generally expect all data to be equally important during computations. Each row is an observation that needs to be treated with the same care. You perhaps do some filtering but for the resulting subset, it again holds that each observation is equally important. For such data the vectorised approach of R (and thus dplyr) makes perfect sense. We tend to want to calculate stuff for each row. The same is not always true for graph data. We might have nodes that are the main focus of our attention and nodes that are simply auxiliary. But performing a filter will alter our graph, and that might change our calculations due to the connectedness of our data. For many calculations this is of little concern as the algorithms are so performant that the vectorised paradigm of tidygraph is fine and we simply ignore the issue. But, what if we have a huge graph and an algorithm that scales exponentially with the number of edges and we really are only interested in the result of a few nodes or edges?

Enter the focus() verb. It allows you to perform a temporary filtering of the nodes or edges you are working on without removing the underlying graph structure. In practice it means that any tidygraph algorithms will only be called on the nodes or edges that are in focus but the algorithms will have access to the full graph and will thus return the same result for the focused nodes/edges irrespective of whether the focus was applied or not.

library(tidygraph)
graph <- play_forestfire(1e5, 0.1) |> 
  mutate(important = dplyr::row_number() <= 5) |> 
  focus(important) |> 
  mutate(efficiency = node_efficiency()) |> 
  unfocus()

graph |> 
  as_tibble() |> 
  slice(1:10)

# A tibble: 10 × 2
   important efficiency
   <lgl>          <dbl>
 1 TRUE          0.0253
 2 TRUE          0.0274
 3 TRUE          0.0417
 4 TRUE          0.0306
 5 TRUE          0.0368
 6 FALSE        NA     
 7 FALSE        NA     
 8 FALSE        NA     
 9 FALSE        NA     
10 FALSE        NA

In the above code we calculate the local efficiency around each node, but since we are only interested in this measure for the first 5 nodes we focus on these and avoid computing it for the remaining 99995 nodes, gaining quite a speed boost. One (huge) caveat is that it is algorithm-dependent whether focusing on a subset provides a performance gain. Some algorithms work in a way where everything is calculated together, e.g. those that rely on convolutions of the distance matrix etc. In these cases no performance gain will be seen.

Focus can be applied both to nodes and edges depending on which one is activated. The focus is the weakest of all graph states and a graph will be unfocused if you either activate, group, or morph a graph so think of it as the most temporary state of them all.
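A smaller sketch of the same mechanics, on a toy star graph where the result is easy to verify by eye. The `hub` column and the variable names are made up for the illustration; the second half shows the focus being dropped simply by activating the edges.

```r
library(tidygraph)

# Node 1 is the centre of the star, the other five nodes are leaves
star <- create_star(6) |>
  mutate(hub = dplyr::row_number() == 1) |>
  focus(hub) |>
  mutate(degree = centrality_degree()) |>
  unfocus()

# Only the focused node gets a value, the rest are filled with NA
as_tibble(star)

# Switching the active context silently removes the focus again
refocused <- star |>
  focus(hub) |>
  activate(edges)
refocused
```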

Iterating on old ideas

Another old feature idea of mine that finally materialized is a set of iterate_*() verbs. Those are quite a bit simpler but useful nonetheless if you want to encode simple simulations on graphs using tidygraph syntax. You can think of these as functional equivalents of while () {} and for () {} so you can incorporate them into a pipe. As an example let’s consider a simulation that removes an edge unless it isolates one of its nodes:

unwire <- function(graph) {
  edge <- graph |> 
    activate(nodes) |> 
    mutate(well_connected = centrality_degree() > 1) |> 
    activate(edges) |> 
    mutate(can_remove = .N()$well_connected[from] & .N()$well_connected[to],
           will_remove = dplyr::row_number() == sample(dplyr::row_number(), 1L, prob = can_remove)) |> 
    pull(will_remove)
  graph |> 
    activate(edges) |> 
    filter(!edge)
}

We can use this function 20 times on our graph with the iterate_n() verb like so:

create_notable('meredith') |> 
  iterate_n(20, unwire)

# A tbl_graph: 70 nodes and 120 edges
#
# An undirected simple graph with 1 component
#
# Node Data: 70 × 0 (active)
#
# Edge Data: 120 × 2
   from    to
  <int> <int>
1     1     5
2     1     6
3     1     7
# ℹ 117 more rows

Alternatively we can set up a condition to test for after each iteration that determines if iteration continues. Below we run the unwire() function until the graph has been split up into two components.

create_notable('meredith') |> 
  iterate_while(graph_component_count() == 1, unwire) |> 
  ggraph::autograph()
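To make the loop analogy concrete, here is a minimal sketch comparing iterate_n() with a plain for loop. The step function `tick()` and the `ticks` column are hypothetical and only serve to make the number of iterations visible.

```r
library(tidygraph)

# Hypothetical step function: bump a per-node counter by one
tick <- function(graph) {
  graph |> mutate(ticks = ticks + 1)
}

start <- create_ring(5) |>
  mutate(ticks = 0)

# Pipe-friendly version
piped <- start |> iterate_n(3, tick)

# The same thing written as an explicit loop
looped <- start
for (i in seq_len(3)) {
  looped <- tick(looped)
}

as_tibble(piped)
as_tibble(looped)
```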

Catching up

It’s been a while since tidygraph has been updated with interfaces into new features from igraph. This release fixes that somewhat by providing the following new functions:

edge_is_bridge() will test for whether edges are bridges (their removal will result in splitting up a component into two)
edge_is_feedback_arc() queries whether edges are part of the feedback arc set
graph_is_eulerian() and edge_rank_eulerian() provide access to eulerian path and cycle calculations
graph_efficiency() and node_efficiency() provide access to global and local efficiency calculations
group_leiden() and group_fluid() provide access to the new cluster_leiden() and cluster_fluid_communities() community detection algorithms
group_color() provides an interface to graph coloring. While not really a clustering algorithm the output matches closely with those as it provides a single id to each node
centrality_harmonic() supersedes centrality_closeness_harmonic() using an efficient C implementation over the flexible but slower implementation from the netrankr package
random_walk_rank() provides access to random walks on both edges and nodes
to_largest_component() and to_random_spanning_tree() are two new morphers
node_is_connected() tests whether nodes are connected to all or any of the nodes in a given set
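A quick sketch of two of these on a hand-made graph, so the expected answer is obvious. The edge list is invented for the illustration: nodes 1 to 4 form a path (every edge is a bridge) and nodes 5 to 7 form a triangle (no edge is a bridge).

```r
library(tidygraph)

edges <- data.frame(
  from = c(1, 2, 3, 5, 6, 7),
  to   = c(2, 3, 4, 6, 7, 5)
)

gr <- tbl_graph(edges = edges, directed = FALSE) |>
  activate(edges) |>
  mutate(bridge = edge_is_bridge())

# The three path edges are flagged, the three triangle edges are not
as_tibble(gr)

# Keep only the largest connected component (the four-node path)
largest <- gr |> convert(to_largest_component)
largest
```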

Apart from changes in igraph, tidygraph also needs to stay somewhat current with another package, namely dplyr. In this release we have added support for the various slice_*() types so that you can now use e.g. slice_min() or slice_sample() on tbl_graph objects. And while not directly dplyr (but tidyr) you can now use replace_na() and drop_na() with tbl_graph objects as well.

Wrapping up

Mature packages are a weird thing as a developer. You seldom spend much time with them as they are working as intended, even if they are a cornerstone of some of your work. Tidygraph definitely falls into this spot. It was nice to get to relearn it a bit as I prepared this release and I hope the new additions will spark joy. Take care.

Say Goodbye to “Good Taste”

Wed, 31 Mar 2021 00:00:00 GMT

I’m excited to announce the first release of the ggfx package, a package that brings R native filtering to grid and ggplot2 for the first time. You can install ggfx with:

install.packages('ggfx')

The purpose of ggfx is to give you access to effects that would otherwise require you to do some heavy post processing in programs such as Photoshop/Gimp or Illustrator/Inkscape, all from within R and as part of your reproducible workflow.

What is a filter?

A filter, in the context of image/photo editing is a function that takes in raster data (i.e. an image rasterised to pixel values) and modifies these pixels somehow, before returning a new image. As such, the idea has seen a lot of traction with apps such as Instagram which allows you to change the look of your photo by applying different filters to it.
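To make the definition concrete, here is a minimal base R sketch (not how ggfx implements its filters). An image is represented as a matrix of grey values between 0 and 1, and a filter is any function that hands back a matrix of the same shape. The names `pixels`, `invert_filter()` and `threshold_filter()` are made up for the illustration.

```r
# A tiny 2 x 3 greyscale "image": one value in [0, 1] per pixel
pixels <- matrix(c(0, 0.2, 0.4, 0.6, 0.8, 1), nrow = 2)

# A filter takes pixel values in and returns modified pixel values
invert_filter <- function(img) 1 - img
threshold_filter <- function(img, cutoff = 0.5) ifelse(img > cutoff, 1, 0)

inverted <- invert_filter(pixels)
posterised <- threshold_filter(pixels)

# Both results can be turned into raster objects for drawing
as.raster(inverted)
as.raster(posterised)
```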

So, a filter works with pixels. That provides some complications for vector based graphics such as the R graphics engine. Here you really don’t care about pixels, but simply instruct the engine to draw e.g. a circle at a specific position and with a certain radius and colour. The engine never comes in contact with the concept of pixels as it delegates the rendering to a graphics device which may, or may not, render it as a raster. In many ways this is parallel to how SVG works. SVG also just records instructions which need to be executed by a renderer (often a browser). Still, SVG has access to a limited number of filters as part of its specification, so how does that work? Usually when an SVG is rendered and it includes a filter, the filtered part will be rasterised off-screen, and the filter will be applied before it is all composed together.

This is a concept that can be transferred to R, and it is exactly what ggfx does!

Meet the filters!

ggfx contains quite a lot of filters - some are pure fun, others will shock you, a few will prove useful. All filters are prefixed with with_ to indicate that some graphic element should be rendered with the filter. To show this off, let’s reach for one of the most easy to understand filters: blur!

library(ggplot2)
library(ggfx)

p <- ggplot(mpg) + 
  geom_point(aes(x = hwy, y = displ))

with_blur(p, sigma = 3)

We can see that the filter takes a graphic object, along with some filter specific settings, such as sigma which controls the amount of blur applied (specifically the size of the Gaussian kernel being used).

Now, it is not that common that you want to apply a filter to the full plot - thankfully, ggfx supports a range of different graphic objects and filters can thus equally be applied to layers:

ggplot(mpg) + 
  with_blur(
    geom_point(aes(x = hwy, y = displ)),
    sigma = 3
  )

Other graphic objects that can be filtered are theme elements and guides:

ggplot(mpg) + 
  geom_point(aes(x = hwy, y = displ)) + 
  guides(
    x = with_blur(
      guide_axis(),
      sigma = 2
    )
  ) + 
  theme(
    panel.grid.major = with_blur(
      element_line(),
      sigma = 2
    )
  )

With the basic API in mind we can take a look at the different filters:

Blur type filters

Blur is central to a lot of effects and thus part of many filters:

with_blur() as we have already seen, adds a constant blur to everything in its layer
with_variable_blur() allows you to control the amount and angle of blur at each location based on channel values in another layer
with_motion_blur() adds directional blur in a manner that simulates moving a camera/moving the subject
with_inner_glow() adds an inner glow effect to all objects in the layer (basically a coloured blur of the surroundings that is only visible on top of the objects)
with_outer_glow() adds an outer glow effect (a coloured blur of the objects that is only visible in the surroundings)
with_drop_shadow() adds a coloured blur underneath the layer with a specific offset
with_bloom() adds a specific blur effect to all light parts of the layer that simulates strong light spilling out into the surroundings

Blend type filters

Users of Photoshop and similar programs know of the power of blending layers. Usually layers are just placed on top of each other, but that is just one possibility.

with_blend() allows you to blend two layers together based on both standard Porter-Duff alpha composition types, as well as others known from image editing programs such as Multiply, Overlay, and Linear Dodge
with_custom_blend() allows you to specify your own blend operation based on a standard formula coefficient setup
with_mask() allows you to set a mask on a layer, i.e. specify in which areas the layer is visible
with_interpolate() interpolates between two layers, fading them together
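The arithmetic behind a few common blend modes is simple enough to sketch in base R. This is only an illustration on grey values between 0 and 1 using the commonly cited formulas, not the code ggfx runs, and the function names are hypothetical.

```r
top <- matrix(c(0.2, 0.5, 0.8, 1.0), nrow = 2)
bottom <- matrix(0.5, nrow = 2, ncol = 2)

# Multiply can only darken, screen can only lighten
blend_multiply <- function(a, b) a * b
blend_screen <- function(a, b) 1 - (1 - a) * (1 - b)

# Porter-Duff "over" for a semi-transparent top layer on an opaque bottom layer
blend_over <- function(a, b, alpha) alpha * a + (1 - alpha) * b

darker <- blend_multiply(top, bottom)
lighter <- blend_screen(top, bottom)
faded <- blend_over(top, bottom, alpha = 0.25)
```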

Dithering type filters

Dithering is the act of reducing the number of colours used in an image, while retaining the look of the original colour fidelity. This has had uses in both image size reduction and screen printing, but is now mostly used for the particular visual effect it provides.

with_dither() applies error diffusion dithering using the Floyd-Steinberg algorithm
with_ordered_dither() uses a threshold map of a certain size to create dithering (also called Bayer dithering)
with_halftone_dither() uses another type of threshold map that simulates halftone/offset printing
with_circle_dither() uses an alternative threshold map to the above to create more circular shapes
with_custom_dither() allows you to use a custom threshold map you’ve created for ImageMagick
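The threshold map idea can be sketched in a few lines of base R. This is a toy version of ordered dithering with a 2 x 2 Bayer matrix on a greyscale matrix, meant to show the principle rather than what ggfx does internally; `ordered_dither()` is a made-up helper.

```r
# 2 x 2 Bayer matrix, rescaled to thresholds strictly between 0 and 1
bayer <- matrix(c(0, 3, 2, 1), nrow = 2)
thresholds <- (bayer + 0.5) / 4

ordered_dither <- function(img, map) {
  # Tile the threshold map across the image by wrapping row/column indices
  i <- (row(img) - 1) %% nrow(map) + 1
  j <- (col(img) - 1) %% ncol(map) + 1
  tiled <- matrix(map[cbind(as.vector(i), as.vector(j))], nrow = nrow(img))
  # A pixel is switched on when it is brighter than its local threshold
  (img > tiled) * 1
}

# A flat mid-grey image is turned into a pattern of pure black and white pixels
grey <- matrix(0.5, nrow = 4, ncol = 4)
dithered <- ordered_dither(grey, thresholds)
dithered
```

Only two "colours" remain, but the average brightness of the pattern still matches the original grey, which is the whole trick.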

Other filter types

There’s also a range of filters that defies grouping:

with_shade() allows you to shade a layer based on a given heightmap
with_kernel() allows you to apply a custom kernel convolution to the layer
with_displacement() allows you to displace and distort your layer based on relative displacement values given in another layer
with_raster() simply rasterises your layer and displays that

Combining layers

As may be apparent from the descriptions above, filters sometimes work with multiple layers at the same time. To facilitate this ggfx can create layer references and layer group references which can then be used in another filter. We can showcase this with a blend filter. Below we create a reference to a text layer and blend it together with a polygon layer (through geom_circle() from ggforce) to achieve an effect that would be pretty difficult to have without using filters.

library(ggforce)

ggplot() + 
  as_reference(
    geom_text(aes(x = 0, y = 0, label = 'Blend Modes!'), size = 20, family = 'Fontania'),
    id = 'text_layer'
  ) + 
  with_blend(
    geom_circle(aes(x0 = 0, y0 = 0, r = seq_len(5)), fill = NA, size = 8),
    bg_layer = 'text_layer',
    blend_type = 'xor'
  ) + 
  coord_fixed()

Filters themselves can also be turned into references by assigning an id to them, which allows the result of a filter to be used in another filter:

ggplot() + 
  as_reference(
    geom_text(aes(x = 0, y = 0, label = 'Blend Modes!'), size = 20, family = 'Fontania'),
    id = 'text_layer'
  ) + 
  with_blend(
    geom_circle(aes(x0 = 0, y0 = 0, r = seq_len(5)), fill = NA, size = 8),
    bg_layer = 'text_layer',
    blend_type = 'xor',
    id = 'blended'
  ) + 
  with_inner_glow(
    'blended',
    colour = 'white',
    sigma = 5
  ) +
  coord_fixed()

Above we also see that filters can take references as their main graphic object instead of layers.

Some filters use other layers but only to extract variable parameters, e.g. seen in with_variable_blur() and with_displacement(). Here we are only interested in the values in a single channel as it can be converted to a single integer value for each pixel. ggfx gives you plenty of choice as to which channel to use with the set of ch_ functions which can be applied to the reference. If none is given then the luminosity is used as default. To illustrate this we create a raster layer with the volcano data and apply a rainbow colour scale to it (😱) and then use the red and blue channels to displace a circle:

volcano_long <- data.frame(
  x = as.vector(col(volcano)),
  y = as.vector(row(volcano)),
  z = as.vector(volcano)
)
ggplot() + 
  as_reference(
    geom_raster(aes(x = y, y = x, fill = z), volcano_long, interpolate = TRUE, show.legend = FALSE),
    id = 'volcano'
  ) + 
  scale_fill_gradientn(colours = rainbow(15)) + 
  with_displacement(
    geom_circle(aes(x0 = 44, y0 = 31, r = 20), size = 10),
    x_map = ch_red('volcano'),
    y_map = ch_blue('volcano'), 
    x_scale = 5,
    y_scale = 5
  )
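To make "a single value per pixel from one channel" concrete, here is a small base R sketch. It is only a conceptual illustration and says nothing about how ggfx implements its ch_ functions; the luminosity weights (Rec. 601) are an assumption chosen for the example.

```r
# Conceptual sketch only: not how ggfx implements its ch_*() helpers.
# A few colours from the same rainbow palette used for the volcano raster.
cols <- rainbow(15)

# col2rgb() returns a 3 x n integer matrix with rows "red", "green", "blue".
rgb_values <- col2rgb(cols)

# A single channel gives exactly one number (0-255) per colour, i.e. per pixel.
red_channel  <- rgb_values["red", ]
blue_channel <- rgb_values["blue", ]

# A simple luminosity estimate (Rec. 601 weights, an assumption made here).
luminosity <- drop(c(0.299, 0.587, 0.114) %*% rgb_values)

head(data.frame(colour = cols, red = red_channel, blue = blue_channel,
                luminosity = round(luminosity)))
```

The point is simply that a colour image holds several candidate "maps", and picking a channel decides which one drives the displacement.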

A last wrinkle to all this is that you don’t need to use other layers as references. You can use raster objects directly, or even a function that takes the width and height of the plot in pixels and generates a raster.

When you are using raster objects you can control how they are placed using an assortment of ras_ functions:

ggfx_logo <- as.raster(magick::image_read(
  system.file('help', 'figures', 'logo.png', package = 'ggfx')
))

ggplot(mpg) + 
  with_blend(
    geom_point(aes(x = hwy, y = displ), size = 5),
    bg_layer = ras_fit(ggfx_logo, 'viewport'),
    blend_type = 'xor'
  )

ggplot(mpg) + 
  with_blend(
    geom_point(aes(x = hwy, y = displ), size = 5),
    bg_layer = ras_tile(ggfx_logo, 'viewport', anchor = 'center', flip = TRUE),
    blend_type = 'xor'
  )

Why, oh why?

Having had a glimpse at what ggfx can do you might sit back, horror struck, asking yourself why I would launch such a full on attack on the purity and simplicity of data visualisation. Surely, this can only be used to impede understanding and, to use a popular term by Edward Tufte, create chart junk.

While there is some truth to the idea that data visualisations should communicate their content as clearly as possible, it is only one side of the coin and mainly applies to statistical charts. Data visualisation is also a device for storytelling, and here the visual appearance of the chart can serve to underline the story and make the conclusions memorable. Having the artistic means to do that directly in R, in a reproducible manner, instead of being forced to manually edit your chart afterwards, is a huge boon for the graphic ecosystem in R and will set the creativity free in some data visualisation practitioners. If you doubt me, have a look at how ggfx has been used to great effect in the Tidy Tuesday project, even before it has been properly released.

Wrapping up

I’ve only shown a little glimpse at what ggfx can do — if I have piqued your interest I invite you to browse the package website. There you can see examples of all the different filters along with articles helping you to implement your own filters from scratch for the ultimate freedom.

Now, go out into the world and make some memorable charts!

Insetting a new patchwork version

Mon, 09 Nov 2020 00:00:00 GMT

I’m delighted to announce that a new version of patchwork has been released on CRAN. This new version contains both a bunch of small bug fixes as well as some prominent features which will be showcased below.

If you are unaware of patchwork, it is a package that allows easy composition of graphics, primarily aimed at ggplot2, but with support for base graphics as well. You can read more about the package on its website.

For the remainder of this post we’ll use the following plots as examples:

library(ggplot2)
library(patchwork)
p1 <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp)) + 
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) + 
  geom_boxplot(aes(gear, disp, group = gear)) + 
  ggtitle('Plot 2')

p3 <- ggplot(mtcars) + 
  geom_point(aes(hp, wt, colour = mpg)) + 
  ggtitle('Plot 3')

Support for insets

At its inception patchwork was mainly designed to deal with alignment of plots displayed in a grid. This focus left out a small, but for some important, piece of functionality: placing plots on top of each other. While it was possible to create a design with overlapping plots by combining different plotting areas:

design <- c(area(1, 1, 2, 2), area(2, 2, 3, 3), area(1, 3, 2, 4))
plot(design)

…this would still enforce an underlying grid, something that would be at odds with freely positioning insets. To make up for this patchwork has now gained an inset_element() function, which marks the given graphics as an inset to be added to the preceding plot. The function allows you to specify the exact location of the edges of the inset in any grid unit you want, thus giving you full freedom of the placement:

p1 + inset_element(p2, left = 0.5, bottom = 0.4, right = 0.9, top = 0.8)

By default the positions use npc units, which go from 0 to 1 in the chosen area. Other units can be used as well by giving them explicitly:

p1 + inset_element(p2, left = unit(1, 'cm'), bottom = unit(30, 'pt'), right = unit(3, 'in'),
                   top = 0.8)

The default is to position the inset relative to the panel, but this can be changed with the align_to argument:

p1 + inset_element(p2, left = 0.5, bottom = 0.4, right = 1, top = 1, align_to = 'full')

When it comes to all other functionality in patchwork, insets behave as regular plots. This means that they are amenable to change after the composition:

p_all <- p1 + inset_element(p2, left = 0.5, bottom = 0.4, right = 1, top = 1) + p3
p_all[[2]] <- p_all[[2]] + theme_classic()
p_all

p_all & theme_dark()

They can also get tagged automatically:

p_all + plot_annotation(tag_levels = 'A')

which can be turned off in the same manner as for wrap_elements():

p_all <- p1 + 
  inset_element(p2, left = 0.5, bottom = 0.4, right = 1, top = 1, ignore_tag = TRUE) + 
  p3
p_all + plot_annotation(tag_levels = 'A')

Arbitrary tagging sequences

While we’re discussing tagging, patchwork now allows you to provide your own sequence to use, instead of relying on the Latin characters, Roman numerals, or Arabic numerals that patchwork understands. This can be done by supplying a list of character vectors to the tag_levels argument instead of a single vector:

p_all <- p1 | (p2 / p3)
p_all + plot_annotation(tag_levels = list(c('one', 'two', 'three')))

When working with multiple tagging levels, custom sequences can be mixed with the automatic ones:

p_all[[2]] <- p_all[[2]] + plot_layout(tag_level = 'new')
p_all + plot_annotation(tag_levels = list(c('one', 'two', 'three'), 'a'), tag_sep = '-')

Raster support

While patchwork was designed with ggplot2 in mind it has always supported additional graphic types such as grobs and base graphics (by using formula notation). This release adds support for an additional type: raster. The raster class (and nativeRaster class) are bitmap representations of images and they are now recognized both directly and through the wrap_elements() function:

logo <- system.file('help', 'figures', 'logo.png', package = 'patchwork')
logo <- png::readPNG(logo, native = TRUE)

p1 + logo

Since they are implemented as wrapped elements they can still be titled etc:

p1 + logo + ggtitle('Made with this:') + theme(plot.background = element_rect('grey'))

They can of course also be used with the new inset feature to easily add watermarks etc.

p1 + inset_element(logo, 0.9, 0.8, 1, 1, align_to = 'full') + theme_void()

The future

That’s it for this release. There is no shortage of feature requests for patchwork and I’ll not make any promises, but I hope the next release will focus on adding support for gganimate as well as improvements to the annotation feature so that global axis labels can be added as well and annotations are kept in nested plots.

Stay safe!

A noisy start

Wed, 18 Mar 2020 00:00:00 GMT

I was sure I had released this… Honestly, I thought the new version of ambient had landed on CRAN a year ago. What does that say about me as a developer? Probably not something very positive. One reason is probably that ambient is one of my smaller packages mostly made for myself. It generates noise patterns which is something I use extensively in my generative art. And the version of ambient I’m now announcing has been available on my own computer for a long time, so I haven’t noticed the lack of a real CRAN release.

What is noise

Anyway, what is this package really about? It is a package that facilitates the generation of multidimensional noise of different kinds. Noise should not be equated with completely random values; R has extensive support for generating these through the different distribution sampling functions. The noise that ambient is capable of producing is random, but spatially correlated… what on earth is that? Let’s have a look!

library(ambient)
library(dplyr)

image(noise_perlin(dim = c(300, 400)))

We see in the above example that the pattern is sort of random, but it remains structured so the value at each point is highly correlated to its neighbors. While we have looked at a 2D example, this principle can be expanded to 3 or even 4 dimensions.

The example above used the old interface which is already available on CRAN. That interface simply returns matrices or arrays with the x and y (and z and t) values corresponding to the indices of each cell. This is fast, but super limiting, and the new and promoted interface that you’ll see in a second adds much more control and power.

A new API

The limitation of the old API was mainly that you were bound to only retrieve values at integer coordinates. This in turn limited the amount of weird operations you might want to do to the coordinates before using them to calculate a noise value. Further, it simply felt clunky and didn’t fit in very well with any type of function composition.

The new API (the old still exists) is centered around a long-format grid representation that you create with long_grid(). It basically creates an adorned data frame with coordinates for each row, but provides additional functionality for converting back to matrices/arrays and raster objects:

grid <- long_grid(x = seq(0, 1, length.out = 1000),
                  y = seq(0, 1, length.out = 1000))

grid

## # A tibble: 1,000,000 x 2
##        x       y
##    <dbl>   <dbl>
##  1     0 0      
##  2     0 0.00100
##  3     0 0.00200
##  4     0 0.00300
##  5     0 0.00400
##  6     0 0.00501
##  7     0 0.00601
##  8     0 0.00701
##  9     0 0.00801
## 10     0 0.00901
## # … with 999,990 more rows

You can create higher dimensions by simply providing z and t arguments to long_grid() as well. This is all kind of boring of course since we haven’t added any noise yet (which is kinda the point of all this). Don’t worry - it will come.
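For intuition, the long-format idea can be mimicked in base R. This is only a stand-in sketch and not ambient's long_grid(): the names long and wide are made up for the example, and the orientation of the matrix here is my own choice, not a statement about what ambient's conversion methods return.

```r
# Base R stand-in for a long-format grid (not ambient's long_grid() itself).
xs <- seq(0, 1, length.out = 4)
ys <- seq(0, 1, length.out = 3)

# Listing y first makes y vary fastest, matching the printout above.
long <- expand.grid(y = ys, x = xs)[, c("x", "y")]

# Any vectorised expression of the coordinates becomes a new column.
long$value <- long$x + 10 * long$y

# Because y varies fastest, the column folds back into a matrix whose rows
# follow y and whose columns follow x.
wide <- matrix(long$value, nrow = length(ys), ncol = length(xs))
wide
```

The long form is what makes it natural to compute on coordinates with ordinary column operations before going back to a matrix or raster.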

The generators

There are many different types of noise that can be generated with ambient. Perlin noise is perhaps the most well-known (it did land the creator an Oscar after all), but many others exist with different characteristics. All of these can be sampled with the new family of gen_*() functions (generator functions). These all take coordinates along with other arguments such as frequency and seed. As an example let’s calculate some worley noise:

grid <- grid %>% 
  mutate(
    noise = gen_worley(x, y, frequency = 5, value = 'distance')
  )
grid

## # A tibble: 1,000,000 x 3
##        x       y noise
##    <dbl>   <dbl> <dbl>
##  1     0 0       0.203
##  2     0 0.00100 0.207
##  3     0 0.00200 0.211
##  4     0 0.00300 0.215
##  5     0 0.00400 0.219
##  6     0 0.00501 0.223
##  7     0 0.00601 0.228
##  8     0 0.00701 0.232
##  9     0 0.00801 0.236
## 10     0 0.00901 0.241
## # … with 999,990 more rows

We have now created a new column with the respective worley noise value for each cell. It is usually easier to understand by looking at it:

grid %>% 
  plot(noise)

We see that plot() takes an expression that defines which value should be drawn. The as.raster() method takes the same kind of expression; when using it directly the value must be normalized so that it lies between 0 and 1 (a requirement of the raster class) before the plot method provided for the raster class is used.

There are a bunch of these gen_*() functions. Further, there are also a bunch of gen_*() functions for creating non-noise patterns, e.g.

grid %>% 
  mutate(
    pattern = gen_waves(x, y, frequency = 5)
  ) %>%  
  plot(pattern)

You may feel at this point that the old interface was much nicer, but the great thing about the generators is that they don’t care about whether the coordinates you feed into them lie in a grid. This means that they can be used to directly look up noise values for particles in a simulation, or to modify the grid coordinates before they are passed into the generator. The latter is what is known as noise perturbation and was only available in a very limited form in the old API.

grid %>% 
  mutate(
    pertube = gen_simplex(x, y, frequency = 5) / 10,
    noise = gen_worley(x + pertube, y + pertube, value = 'distance', frequency = 5)
  ) %>% 
  plot(noise)

Funky, right? Just to explain what is really going on, each cell in the grid gets a simplex based value, which it then uses to offset its own coordinates before looking up its worley noise value. As simplex noise has a smooth gradient we get these wavy distortions of the worley noise.
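The mechanics of perturbation do not depend on ambient at all, so they can be sketched with stand-in generators in base R. The functions stripes() and wobble() below are hypothetical helpers invented for this illustration; they are not noise functions and not part of any package.

```r
# Stand-in generators (assumptions for illustration, not ambient functions).
stripes <- function(x, frequency = 5) sin(2 * pi * frequency * x)
wobble  <- function(x, y, frequency = 5) {
  sin(2 * pi * frequency * x) * cos(2 * pi * frequency * y)
}

coords <- expand.grid(x = seq(0, 1, length.out = 200),
                      y = seq(0, 1, length.out = 200))

# Plain pattern: evaluated at the grid coordinates themselves.
plain <- stripes(coords$x)

# Perturbed pattern: each point first gets a small smooth offset, then the
# same generator is evaluated at the shifted, off-grid coordinates.
offset    <- wobble(coords$x, coords$y) / 10
perturbed <- stripes(coords$x + offset)
```

Something like image(matrix(perturbed, nrow = 200)) shows the straight stripes bending, which is the same effect as above in a much plainer setting.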

Fractured noise

The output of e.g. gen_perlin() does not look like what you’d expect if you are used to working with perlin noise (I’d guess). This is because perlin noise is most often used in its fractal form. Fractal noise simply means calculating multiple values for each coordinate at different frequencies and somehow combining them. The most well known is fractal brownian motion (fbm) that simply adds each value together with decreasing intensity, but any combination scheme is possible and ambient comes with a few. To create fractal noise with the new interface we use the fracture() method and pass in a generator and a fractal function along with the different arguments to it:

# Classic perlin noise (combining 4 different frequencies)
grid %>% 
  mutate(
    noise = fracture(gen_perlin, fbm, octaves = 4, x = x, y = y, freq_init = 5)
  ) %>% 
  plot(noise)
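To show what "adding values together with decreasing intensity" amounts to, here is a minimal base R sketch. It assumes that each octave doubles the frequency and halves the strength, which is a common choice but not necessarily what ambient does by default; gen_sine() and fbm_sketch() are hypothetical names used only here.

```r
# Stand-in generator (an assumption; any gen_*()-like function would do).
gen_sine <- function(x, y, frequency = 1) {
  sin(2 * pi * frequency * x) * sin(2 * pi * frequency * y)
}

# Minimal fbm-style combination: every octave doubles the frequency and
# halves the strength. These factors are assumed here for illustration.
fbm_sketch <- function(generator, x, y, octaves = 4, freq_init = 1) {
  total <- numeric(length(x))
  for (i in seq_len(octaves) - 1) {
    total <- total + generator(x, y, frequency = freq_init * 2^i) * 0.5^i
  }
  total
}

coords <- expand.grid(x = seq(0, 1, length.out = 100),
                      y = seq(0, 1, length.out = 100))
noise <- fbm_sketch(gen_sine, coords$x, coords$y, octaves = 4, freq_init = 5)
```

The loop only runs once per octave, while all the per-coordinate work happens inside the vectorised generator.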

ambient comes with a handful of different fractal functions and you can create your own as well:

# clamp noise before adding them together
grid %>% 
  mutate(
    noise = fracture(gen_perlin, clamped, octaves = 4, x = x, y = y, freq_init = 5)
  ) %>% 
  plot(noise)

There are a few other functions as part of this release for e.g. blending values together and calculating derived values from noise fields (e.g. curl and gradient). I will leave it up to you to explore these on your own.

Vectorising like a (semi)pro

Sun, 15 Mar 2020 00:00:00 GMT

This is a short practical post about programming with R. Take it for what it is and nothing more…

R is slow! That is what they keep telling us (they being someone who “knows” about “real” programming and has another language that they for some reason fail to be critical about).

R is a weird thing. Especially for people who have been trained in a classical programming language. One of the main reasons for this is its vectorised nature, which is not just about the fact that vectors are prevalent in the language, but is an underlying principle that should guide the design of efficient algorithms in the language. If you write R like you write C (or Python), then sure it is slow, but really, you are just using it wrong.

This post will take you through the design of a vectorised function. The genesis of the function comes from my generative art, but I thought it was so nice and self-contained that it would make a good blog post. If that seems like something that could take your mind off the pandemic, then buckle up!

The problem

I have a height-map, that is, a matrix of numeric values. You know what? Let’s make this concrete and create one:

library(ambient)
library(dplyr)

z <- long_grid(1:100, 1:100) %>% 
  mutate(val = gen_simplex(x, y, frequency = 0.02)) %>% 
  as.matrix(val)

image(z, useRaster = TRUE)

This is just some simplex noise of course, but it fits our purpose…

Anyway, we have a height-map and we want to find the local extrema, that is, the local minima and maxima. That’s it. Quite a simple and understandable challenge, right?

Vectorised, smecktorised

Now, had you been a trained C-programmer you would probably have solved this with a loop. This is the way it should be done in C, but applying this to R will result in a very annoyed programmer who will tell anyone who cares to listen that R is slow.

We already knew this. We want something vectorised, right? But what is vectorised anyway? All over the internet the recommendation is to use the apply()-family of functions to vectorise your code, but I have some bad news for you: This is absolutely the wrong way to vectorise. There are a lot of good reasons to use the functional approach to looping instead of the for-loop, but when it comes to R, performance is not one of them.

Shit…

To figure this out, we need to be a bit more clear about what we mean by a vectorised function. There are some different ways to think about it:

1. The broad and lazy definition is a function that operates on the elements of a vector. This is where apply() (and friends) based functions reside.
2. The narrow and performant definition is a function that operates on the elements of a vector in compiled code. This is where many of R’s base functions live along with properly designed functions implemented in C or C++.
3. The middle ground is a function that is composed of calls to 2. to avoid explicit loops, thus deferring most element-wise operations to compiled code.

We want to talk about 3. Simply implementing this in compiled code would be cheating, and we wouldn’t learn anything.
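The three flavours can be put side by side in a tiny sketch. The function names are made up for the example, and the timings will of course vary by machine, so none are quoted here.

```r
set.seed(1)
x <- runif(1e4)

# 1. Broad definition: the loop over elements still runs at the R level,
#    it is merely hidden inside vapply().
square_plus_one_apply <- function(v) vapply(v, function(e) e^2 + 1, numeric(1))

# 2. Narrow definition: `^` and `+` each loop over the vector in compiled code.
# 3. Middle ground: a function composed only of such calls.
square_plus_one <- function(v) v^2 + 1

# Same result either way ...
all.equal(square_plus_one_apply(x), square_plus_one(x))

# ... but the work is done in very different places.
system.time(for (i in 1:20) square_plus_one_apply(x))
system.time(for (i in 1:20) square_plus_one(x))
```

The composed version never touches individual elements in R code, which is the property the rest of the post is after.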

Thinking with vectors

R comes with a lot of batteries included. Some of the more high-level functions are not implemented with performance in mind (sadly), but a lot of the basic stuff is, e.g. indexing, arithmetic, summations, etc. It turns out that these are often enough to implement pretty complex functions in an efficient vectorised manner.

Going back to our initial problem of finding extrema: What we effectively are asking for is a moving window function where each cell is evaluated on whether it is the largest or smallest value in its respective window. If you think a bit about this, this is mainly an issue of indexing. For each element in the matrix, we want the indices of all the cells within its window. Once we have that, it is pretty easy to extract all the relevant values and use the vectorised pmin() and pmax() functions to figure out the minimum and maximum value in the window and use the (vectorised) == to see if the extremum is equal to the value of the cell.
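The idea is easiest to see in one dimension first. The sketch below is an illustration with a window radius of 1 and a made-up input vector; it is a simplification of the problem, not the function developed in this post.

```r
# 1D sketch of the windowing idea (radius 1), using only vectorised calls.
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
n <- length(x)

# Shifted copies of x: the left and right neighbour of every element,
# with NA where the neighbour would fall outside the vector.
left  <- c(NA, x[-n])
right <- c(x[-1], NA)

# Element-wise maximum and minimum across the window, ignoring the NAs.
window_max <- pmax(left, x, right, na.rm = TRUE)
window_min <- pmin(left, x, right, na.rm = TRUE)

which(x == window_max)  # 1 3 6 8
which(x == window_min)  # 2 4 7
```

Each shifted copy plays the role of one cell in the window, so the number of vectors grows with the window size and not with the length of the input.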

That’s a lot of talk, here is the final function:

extrema <- function(z, neighbors = 2) {
  ind <- seq_along(z)
  rows <- row(z)
  cols <- col(z)
  n_rows <- nrow(z)
  n_cols <- ncol(z)
  window_offsets <- seq(-neighbors, neighbors)
  window <- outer(window_offsets, window_offsets * n_rows, `+`)
  window_row <- rep(window_offsets, length(window_offsets))
  window_col <- rep(window_offsets, each = length(window_offsets))
  windows <- mapply(function(i, row, col) {
    row <- rows + row
    col <- cols + col
    new_ind <- ind + i
    new_ind[row < 1 | row > n_rows | col < 1 | col > n_cols] <- NA
    z[new_ind]
  }, i = window, row = window_row, col = window_col, SIMPLIFY = FALSE)
  windows <- c(windows, list(na.rm = TRUE))
  minima <- do.call(pmin, windows) == z
  maxima <- do.call(pmax, windows) == z
  extremes <- matrix(0, ncol = n_cols, nrow = n_rows)
  extremes[minima] <- -1
  extremes[maxima] <- 1
  extremes
}

(don’t worry, we’ll go through it in a bit)

This function takes a matrix, and a neighborhood radius and returns a new matrix of the same dimensions as the input, with 1 in the local maxima, -1 in the local minima, and 0 everywhere else.

Let’s go through it:

# ...
  ind <- seq_along(z)
  rows <- row(z)
  cols <- col(z)
  n_rows <- nrow(z)
  n_cols <- ncol(z)
# ...

Here we are simply doing some quick calculations upfront for reuse later. The ind variable is simply the index for each cell in the matrix. Matrices are simply vectors underneath, so they can be indexed like that as well. rows and cols holds the row and column index of each cell, and n_rows and n_cols are pretty self-explanatory.
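A small example, using a made-up 2 x 3 matrix, shows how the linear index relates to the row and column index:

```r
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2)

# Linear indices run down the columns (column-major order).
ind <- seq_along(m)
m[ind]        # same values as as.vector(m)

# row() and col() give the row and column index of every cell.
row(m)
col(m)

# The linear index can be rebuilt from the row and column index.
rebuilt <- (col(m) - 1) * nrow(m) + row(m)
all(rebuilt == ind)

# Moving one column to the right is therefore the same as adding nrow(m).
m[1, 2] == m[1 + nrow(m)]
```

That last line is the whole trick: a step between columns is just a fixed jump in the linear index.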

# ...
  window_offsets <- seq(-neighbors, neighbors)
  window <- outer(window_offsets, window_offsets * n_rows, `+`)
  window_row <- rep(window_offsets, length(window_offsets))
  window_col <- rep(window_offsets, each = length(window_offsets))
# ...

Most of the magic happens here, but it is not that apparent. What we do is that we use the outer() function to construct a matrix, the size of our window, holding the index offset from the center for each of the cells in the window. We also construct vectors holding the row and column offsets for each cell.
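Running these lines on their own for a small case makes the offsets visible. The values neighbors = 1 and n_rows = 4 are picked only for illustration.

```r
neighbors <- 1
n_rows <- 4   # a hypothetical matrix with four rows

window_offsets <- seq(-neighbors, neighbors)
window <- outer(window_offsets, window_offsets * n_rows, `+`)
window
#>      [,1] [,2] [,3]
#> [1,]   -5   -1    3
#> [2,]   -4    0    4
#> [3,]   -3    1    5

window_row <- rep(window_offsets, length(window_offsets))
window_col <- rep(window_offsets, each = length(window_offsets))

# Flattened column by column, all three line up cell for cell.
cbind(index = as.vector(window), row = window_row, col = window_col)
all(as.vector(window) == window_row + window_col * n_rows)
```

The 0 in the middle is the cell itself, one step down a row changes the index by 1, and one step across a column changes it by n_rows.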

# ...
  windows <- mapply(function(i, row, col) {
    row <- rows + row
    col <- cols + col
    new_ind <- ind + i
    new_ind[row < 1 | row > n_rows | col < 1 | col > n_cols] <- NA
    z[new_ind]
  }, i = window, row = window_row, col = window_col, SIMPLIFY = FALSE)
# ...

This is where all the magic appears to happen. For each cell in the window, we are calculating its respective value for each cell in the input matrix. I can already hear you scream about me using an apply()-like function, but the key thing is that I’m not using it to loop over the elements of the input vector (or matrix), but over a much smaller (and often fixed) number of elements.

If you want to leave now because I’m moving the goal-posts, be my guest.

Anyway, what is happening inside the mapply() call? Inside the function we figure out which row and column the offset cell is part of. Then we calculate the index of the cells for the offset. In order to guard against out-of-bounds errors we set all the indices that are out of bounds to NA, and then we simply index into our matrix. The crucial part is that all of the operations here are vectorised (indexing, arithmetic, and comparisons). In the end we get a list holding vectors of values for each cell in the window.
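The NA guard is worth seeing in isolation. The sketch below handles a single window cell, the neighbour one column to the left, on a made-up 3 x 4 matrix:

```r
z <- matrix(1:12, nrow = 3)
ind <- seq_along(z)

# Look one column to the left: subtract nrow(z) from every linear index.
new_ind <- ind - nrow(z)
new_col <- col(z) - 1

# Without a guard, the negative indices here would raise an error, and for
# row offsets an index could silently wrap into the neighbouring column.
new_ind[new_col < 1] <- NA

# Indexing with NA yields NA, so the result keeps the length of the input.
left_neighbour <- z[new_ind]
matrix(left_neighbour, nrow = 3)
```

Because every window cell yields a vector of the same length as the input, the vectors can later be compared element by element.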

# ..
  windows <- c(windows, list(na.rm = TRUE))
  minima <- do.call(pmin, windows) == z
  maxima <- do.call(pmax, windows) == z
  extremes <- matrix(0, ncol = n_cols, nrow = n_rows)
  extremes[minima] <- -1
  extremes[maxima] <- 1
  extremes
# ..

This is really just wrapping up, even though the actual computations are happening here. We use pmin() and pmax() to find the minimum and maximum across each window, and compare it to the value in our input matrix (again, all proper vectorised functions). In the end we construct a matrix holding 0s and use the calculated positions to set 1 or -1 at the location of local extremes.
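The do.call() step may look a little cryptic, so here it is on three short made-up vectors standing in for three window cells:

```r
# Three "window cells", each holding one value per element of the input.
windows <- list(c(1, 5, NA), c(2, NA, 7), c(0, 6, 8))

# Appending a named element turns it into the na.rm argument of the call, so
# this equals pmin(windows[[1]], windows[[2]], windows[[3]], na.rm = TRUE).
args <- c(windows, list(na.rm = TRUE))

do.call(pmin, args)   # 0 5 7
do.call(pmax, args)   # 2 6 8
```

With na.rm = TRUE the NAs that mark out-of-bounds neighbours are simply skipped, so cells at the edge are compared against the neighbours they actually have.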

Does it work?

I guess that is the million dollar question, closely followed by “is it faster?”. I don’t really care enough to implement a “dumb” vectorisation, so I’ll just put my head on the block with the last question and insist that, yes, it is much faster. You can try to beat me with an apply() based solution and I’ll eat a sticker if you succeed (unless you cheat).

As for the first question, let’s have a look:

extremes <- extrema(z)
extremes[extremes == 0] <- NA

image(z, useRaster = TRUE)
image(extremes, col = c('black', 'white'), add = TRUE)

Lo and behold, it appears as if we succeeded.
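For those who prefer a check over a picture, the result can be compared against a deliberately naive double loop. This is a sketch for verification only: extrema_loop() is a hypothetical reference implementation written for this comparison, the test matrix is random, and extrema() is repeated so that the block runs on its own.

```r
# Copy of the vectorised function from the post, so this block runs on its own.
extrema <- function(z, neighbors = 2) {
  ind <- seq_along(z)
  rows <- row(z)
  cols <- col(z)
  n_rows <- nrow(z)
  n_cols <- ncol(z)
  window_offsets <- seq(-neighbors, neighbors)
  window <- outer(window_offsets, window_offsets * n_rows, `+`)
  window_row <- rep(window_offsets, length(window_offsets))
  window_col <- rep(window_offsets, each = length(window_offsets))
  windows <- mapply(function(i, row, col) {
    row <- rows + row
    col <- cols + col
    new_ind <- ind + i
    new_ind[row < 1 | row > n_rows | col < 1 | col > n_cols] <- NA
    z[new_ind]
  }, i = window, row = window_row, col = window_col, SIMPLIFY = FALSE)
  windows <- c(windows, list(na.rm = TRUE))
  minima <- do.call(pmin, windows) == z
  maxima <- do.call(pmax, windows) == z
  extremes <- matrix(0, ncol = n_cols, nrow = n_rows)
  extremes[minima] <- -1
  extremes[maxima] <- 1
  extremes
}

# Deliberately naive reference: an explicit loop over every cell.
extrema_loop <- function(z, neighbors = 2) {
  out <- matrix(0, nrow = nrow(z), ncol = ncol(z))
  for (i in seq_len(nrow(z))) {
    for (j in seq_len(ncol(z))) {
      r <- max(1, i - neighbors):min(nrow(z), i + neighbors)
      k <- max(1, j - neighbors):min(ncol(z), j + neighbors)
      win <- z[r, k]
      if (z[i, j] == min(win)) out[i, j] <- -1
      if (z[i, j] == max(win)) out[i, j] <- 1
    }
  }
  out
}

set.seed(42)
z_test <- matrix(rnorm(60 * 40), nrow = 60)

# Both versions should mark exactly the same cells.
identical(extrema(z_test), extrema_loop(z_test))

# Timings depend on the machine, so none are quoted here.
system.time(extrema(z_test))
system.time(extrema_loop(z_test))
```

The loop version is arguably easier to read, which is a fair trade-off to keep in mind when the input is small.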

Can vectorisation save the world?

No…

More to the point, not every problem has a nice vectorised solution. Further, the big downside with proper vectorisation is that it often requires expanding a lot of variables to the size of the input vector. In our case we needed to hold all windows in memory simultaneously, and it does not take too much imagination to think up scenarios where that may make our computer explode. Still, more often than not it is possible to write super performant R code, and usually the crucial part is to figure out how to do some intelligent indexing.
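A back-of-the-envelope calculation shows how quickly the windows add up. It assumes 8 bytes per value (a double) and ignores any temporary copies R makes along the way; the 10,000 x 10,000 matrix is a hypothetical example.

```r
neighbors <- 2
cells_in_window <- (2 * neighbors + 1)^2      # 25 cells for the default radius

# One double (8 bytes) per window cell and per element of the input.
n_elements <- 100 * 100                       # the example height-map
mib_small <- cells_in_window * n_elements * 8 / 1024^2
mib_small                                     # roughly 1.9 MiB

# The same window on a hypothetical 10,000 x 10,000 matrix, in GiB.
gib_large <- cells_in_window * 1e8 * 8 / 1024^3
gib_large                                     # roughly 18.6 GiB
```

The memory grows with the square of the window radius times the size of the input, so large windows on large matrices call for a different strategy, for example processing the matrix in tiles.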

If you are still not convinced then read through Brodie Gaslam’s blog. He has a penchant for implementing ridiculously complicated stuff in highly efficient R code. It goes without saying that his posts are often more involved than this, but if you have kept reading until this point, I think you are ready…

Don’t be a Dick

Fri, 13 Dec 2019 00:00:00 GMT

As the year reaches its final conclusion I usually take a look back at the year that passed and do some navel gazing… I released this and that, I did a talk or two, etc.

But honestly… I’ve made new releases, I have a great job that allows me to work on open source software and get paid for doing my hobby. Life is good!

The world is shit though…

A Plea

I’ve spent most of my adult life developing software and giving it away for free, no strings attached. I try to be a welcoming and helpful part of the community. Lines have to be drawn though…

If you support fascism, racism, misogyny, or any of the other ugly heads that bigotry has, either openly or indirectly by voting for the likes of Trump or Johnson, I have a plea for you:

don’t use my code.

don’t open issues.

don’t ask for help.

This is not an addendum to any license I provide, nor is it in any way legally binding (I wouldn’t know how to achieve that). This is simply a plea from one person to another. You choose to support movements that run counter to everything open source software stands for, and the least you could do is to not stand on our shoulders as you fight us.

It goes without saying that this plea can only extend to the code I create in my spare time and release as a private person.

It also goes without saying that, should this plea apply to you, you’ll probably ignore it because you have already cast aside decency. If you do ignore it just know: I actively despise you as a user…

To Everyone Else

As the world goes to shit I’d like to up my commitment to support those hurt the most by it. Do you need feedback, help, or otherwise, with anything I might be able to chip in on (mostly R and generative art), and are you a minority in any way, I invite you to reach out, and I’ll do what I can.

Merry Christmas

Addendum 16/12/19

Thankfully people have mostly reacted positively to this post. This was kind of expected as I think the R community is by and large on the right side of history. A few people have taken issue with me naming political leaders and their supporters directly, thinking this is about political disagreement. This is not the case.

Bigotry is not politics!

You can be a republican and not support Trump. You can be a tory and not support Johnson, and if you do so I both applaud you for your conviction and welcome you to my small sphere of R packages. If you are a republican/tory and feel uneasy about the current leadership, but still choose to vote for them, you have put your moral values up for sale. This is entirely your choice. I will call you out on it.

Others have indicated that any type of disagreement should simply not have a bearing in the open source world. First, OS is activistic in its very nature, and second, good job on living a privileged life if you think this is the first time people take a stand in the R world…

Patch it up and send it out

Sun, 01 Dec 2019 00:00:00 GMT

I am super, super thrilled to finally be able to announce that patchwork has been released on CRAN. Patchwork has, without a doubt, been my most popular unreleased package and it is great to finally make it available to everyone.

Patchwork is a package for composing plots, i.e. placing multiple plots together in the same figure. It is not the only package that tries to solve this. grid.arrange() from gridExtra, and plot_grid() from cowplot are two popular choices while some will claim that all you need is base graphics and layout() (they would be wrong, though). Do we really need another package for this? I personally feel that patchwork brings enough innovation to the table to justify its existence, but if you are a happy user of cowplot::plot_grid() I’m not here to force you away from that joy.

The claim to fame of patchwork is mainly two things: A very intuitive API, and a layout engine that promises to keep your plots aligned no matter how complex a layout you concoct.

library(ggplot2)
library(patchwork)

p1 <- ggplot(mpg) + 
  geom_point(aes(hwy, displ))
p2 <- ggplot(mpg) + 
  geom_bar(aes(manufacturer, fill = stat(count))) + 
  coord_flip()

# patchwork allows you to add plots together
p1 + p2

If you find this intriguing, you should at least give patchwork a passing glance. I’ve already written at length about all of its features at its webpage, so if you don’t want to entertain my ramblings more than necessary, make haste to the Getting Started guide, or one of the in-depth guides covering:

The Patch that Worked

If you are still here, I’ll tell you a bit more about the package, and round up with some examples of my favorite features in patchwork. As I described in my look back at 2017 patchwork helped me out of burn-out fueled by increasing maintenance burdens of old packages. At that time I don’t think I expected two years to pass before it got its proper release, but here we are… What I don’t really go into is why I started on the package. The truth is that I was beginning to think about the new gganimate API, but was unsure whether it was possible to add completely foreign objects to ggplots, alter how it behaves, while still allowing normal ggplot2 objects to be added afterwards. I was not prepared to create a POC of gganimate to test it out at this point, so I came up with the idea of trying to allow plots to be added together. The new behavior was that the two plots would be placed beside each other, and the last plot would still be able to receive new ggplot objects. It worked, obviously, and I began to explore this idea a bit more, adding more capabilities. I consciously didn’t advertise this package at all. I was still burned out and didn’t want to do anything for anyone but myself, but someone picked it up from my github and made a moderately viral tweet about it, so it quickly became popular despite my intentions. I often joke that patchwork is my most elaborate tech-demo to date.

All that being said, I was in search for a better way to compose plots (I think most R users have cursed about misaligned axes and butchered facet_wrap() into a layout engine) and I now had a blurry vision of a solution, so I had to take it out of tech-demo land, and begin to treat it as a real package. But, along came gganimate and swallowed up all my development time. Further, I had hit a snag in how nested layouts worked that meant backgrounds and other elements were lost. This snag was due to a fundamental part of why patchwork otherwise worked so well, so I was honestly in no rush to get back to fixing it.

So patchwork lingered, unreleased…

At the start of 2019 I decided that the year should be dedicated to finishing off updates and unreleased packages, and by November only patchwork remained. I was still not feeling super excited about getting back to the aforementioned snag, but I saw no way out so I dived in. After having explored uncharted areas of grid in search of something that could reconcile the layout engine implementation with not removing backgrounds etc. I was ready to throw it all out, but I decided to see how hard it would be to simply rewrite a subset of the layout engine. One day later I had a solution… There is a moral in there somewhere, I’m sure. Feel free to use it.

The Golden Patches

I don’t want to repeat what I’ve written about at length in the guides I linked to in the beginning of the post, so instead I’ll end with simply a few of my favorite parts of patchwork. There will be little explanation about the code (again, check out the guides), so consider this a blindfolded tasting menu.

# A few more plots to play with
p3 <- ggplot(mpg) + 
  geom_smooth(aes(hwy, cty)) + 
  facet_wrap(~year)
p4 <- ggplot(mpg) + 
  geom_tile(aes(factor(cyl), drv, fill = stat(count)), stat = 'bin2d')

Human-Centered API

Patchwork implements a few API innovations to make plot composition both quick and readable. Consider this code:

(p1 | p2) /
   p3

It is not too difficult to envision what kind of composition comes out of this and, lo and behold, it does exactly what is expected:

As layout complexity increases, the use of operators gets less and less readable. Patchwork allows you to provide a textual representation of the layout instead, which scales much better:

layout <- '
ABB
CCD
'
p1 + p2 + p3 + p4 + plot_layout(design = layout)
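To make it concrete what such a design string encodes, here is a minimal base R sketch that parses it into the grid cells spanned by each letter. The helper design_areas() is made up for this illustration and is not part of patchwork; it simply assumes that each letter marks one rectangular area.

```r
# Hypothetical helper (not part of patchwork): turn a textual design into
# the grid cells spanned by each letter
design_areas <- function(design) {
  rows <- strsplit(trimws(design), "\n", fixed = TRUE)[[1]]
  cells <- do.call(rbind, strsplit(rows, ""))
  letters_used <- sort(unique(as.vector(cells)))
  areas <- lapply(letters_used, function(l) {
    idx <- which(cells == l, arr.ind = TRUE)
    data.frame(
      area = l,
      top = min(idx[, "row"]), left = min(idx[, "col"]),
      bottom = max(idx[, "row"]), right = max(idx[, "col"])
    )
  })
  do.call(rbind, areas)
}

layout <- '
ABB
CCD
'
areas <- design_areas(layout)
areas
```

Each row of the result is the bounding box of one plot: B spans two columns in the first row and C spans two columns in the second, which is exactly what the string looks like.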

Capable auto-tagging

When plot compositions are used in scientific literature, the subplots are often enumerated so they can be referred to in the figure caption and text. While you could do that manually, it is much easier to let patchwork do it for you.

patchwork <- (p4 | p2) /
                p1
patchwork + plot_annotation(tag_levels = 'A')

If you have a nested layout, as in the above, you can even tell patchwork to create a new tagging level for it:

patchwork <- ((p4 | p2) + plot_layout(tag_level = 'new')) /
                 p1
patchwork + plot_annotation(tag_levels = c('A', '1'))

It allows you to modify subplots all at once

What if you want to play around with the theme? Do you begin to change the theme of all of your subplots? No, you use the & operator that allows you to add ggplot elements to all your subplots:

patchwork & theme_minimal()

It shepherds the guides

Look at the plot above. The guides are annoying, right? Let’s put them together:

patchwork + plot_layout(guides = 'collect')

That is, visually, better but really we only want a single guide for the fill. patchwork will remove duplicates, but only if they are alike. If we give them the same range, we get what we want:

patchwork <- patchwork & scale_fill_continuous(limits = c(0, 60))
patchwork + plot_layout(guides = 'collect')

Pretty nice, right?

This is not a grammar

I’ll finish this post off with something that has been rummaging inside my head for a while, and this is as good a place as any to put it. It seems obvious to call patchwork a grammar of plot composition, after all it expands on ggplot2 which has a grammar of graphics. I think that would be wrong. A grammar is not an API, but a theoretical construct that describes the structure of something in a consistent way. An API can be based on a grammar (as is the case for ggplot2 and dplyr) which will guide its design, or a grammar can be developed in close concert with an API as I tried to do with gganimate. Not everything lends itself well to being described by a grammar, and an API is not necessarily bad if it is not based on one (conversely, it may be bad even if it is). Using operators to combine plots is hardly a reflection of an underlying coherent theory of plot composition, much less a reflection of a grammar. It is still a nice API though.

Why do I need to say this? It seems like the programming world has been taken over by grammars and you may feel bad about just solving a problem with a nice API. Don’t feel bad — “grammar” has just been conflated with “cohesive API” lately.

Towards some new packages

As mentioned in the beginning, I set out to mainly finish off stuff in 2019. tidygraph, ggforce, and ggraph have seen some huge updates, and with patchwork finally released I’ve reached my goal for the year with time to spare. I’ll be looking forward to creating something new again, but hopefully I’ll find a good rhythm where I don’t need to take a year off to update forgotten projects.

The Colour of Everything

Wed, 13 Nov 2019 00:00:00 GMT

I’m happy to announce that farver 2.0 has landed on CRAN. This is a big release comprising a rewrite of much of the internals along with a range of new functions and improvements. Read on to find out what this is all about.

The case for farver

The first version of farver really came out of necessity as I identified a major performance bottleneck in gganimate related to converting colours into Lab colour space and back when tweening them. This was a result of grDevices::convertColor() not being meant for use with millions of colour values. I built farver in order to address this very specific need, which in turn made Brodie Gaslam look into speeding up the grDevices function. The bottom line is that, while farver is still the fastest at converting between colour spaces, grDevices is now so fast that I probably wouldn’t have bothered to build farver in the first place had it been like this all along. I find this a prime example of fruitful open source competition and couldn’t be happier that Brodie took it upon himself.

So why a shiny new version? As part of removing compiled code from scales, we decided to adopt farver for colour interpolation, and the code could use a brush-up. I’ve become much more trained in writing compiled code, and further there were some shortcomings in the original implementation that needed to be addressed if scales (and thus ggplot2) were to depend on it. Further, I usually work on larger frameworks and there is a certain joy in picking a niche area that you care about and going ridiculously overboard in tooling without worrying about whether it benefits anyone other than yourself (ambient is another example of such indulgence).

The new old

The former version of farver was quite limited in functionality. It had two functions: convert_colour() and compare_colour() that did colour space conversion and colour distance calculations respectively. No outward changes have been made to these functions, but internally a lot has happened. The old versions had no input validation, so passing in colours with NA, NaN, Inf, and -Inf would give you some truly weird results back. Further, the input and output were not capped to the range of the given colour space, so you could in theory end up with negative RGB values if you converted from a colour space with a larger gamut than sRGB. Both of these issues have been rectified in the new version. Any non-finite value in any channel will result in NA in all channels in the output (for conversion) or an NA distance (for comparison).

library(farver)
colours <- cbind(r = c(0, NA, 255), g = c(55, 165, 20), b = c(-Inf, 120, 200))
colours

##        r   g    b
## [1,]   0  55 -Inf
## [2,]  NA 165  120
## [3,] 255  20  200

convert_colour(colours, from = 'rgb', to = 'yxy')

##            y1        x        y2
## [1,]       NA       NA        NA
## [2,]       NA       NA        NA
## [3,] 25.93626 0.385264 0.1924651

Further, input is now capped to the channel range (if any) before conversion, and output is capped again before returning the result. The latter means that convert_colour() is only symmetric (ignoring rounding errors) if the colours are within gamut in both colour spaces.

# Equal output because values are capped between 0 and 255
colours <- cbind(r = c(1000, 255), g = 55, b = 120)
convert_colour(colours, 'rgb', 'lab')

##             l        a        b
## [1,] 57.41976 76.10097 12.44826
## [2,] 57.41976 76.10097 12.44826

Lastly, a new colour space has been added: CIELch(uv) (hcl in farver) joins CIELch(ab) (lch) as its cousin. Both are polar transformations, but the former is based on luv values and the latter on lab. The two colour spaces are often used interchangeably (though they are not equivalent), and as the grDevices::hcl() function is based on the luv space it made sense to provide an equivalent in farver.
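The polar transformation itself is simple enough to sketch in base R. The helper names below are made up for the example and this is not farver's internal code: chroma is the length of the (a, b) vector and hue is its angle in degrees. The same idea applies to (u, v) for the luv-based variant.

```r
# Hypothetical helpers illustrating the polar transformation behind lch/hcl
# (not farver's actual implementation)
to_polar <- function(l, a, b) {
  chroma <- sqrt(a^2 + b^2)
  hue <- (atan2(b, a) * 180 / pi) %% 360  # angle in degrees, in [0, 360)
  cbind(l = l, c = chroma, h = hue)
}
from_polar <- function(l, chroma, hue) {
  cbind(l = l, a = chroma * cos(hue * pi / 180), b = chroma * sin(hue * pi / 180))
}

# The Lab values from the conversion example above
lab <- c(l = 57.41976, a = 76.10097, b = 12.44826)
lch <- to_polar(lab[["l"]], lab[["a"]], lab[["b"]])
back <- from_polar(lch[, "l"], lch[, "c"], lch[, "h"])
lch
back
```

Going to polar coordinates and back returns the original Lab values, which is why the two representations carry the same information; only the base space (lab versus luv) makes lch and hcl differ.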

The new new

The new functionality mainly revolves around the encoding of colour in text strings. In many programming languages colour can be encoded into strings as #RRGGBB where each channel is given in hexadecimal digits. This is also how colours are mostly passed around in R (R also has a list of recognized colour names that can be given as aliases instead of the hex string - see grDevices::colours() for a list). The encoding is convenient as it allows colours to be encoded into vectors, and thus into data frame columns or arrays, but means that if you need to perform operations on it you’d have to first decode the string into channels, potentially convert it into the required colour space, do the manipulation, convert back to sRGB, and encode it into strings. Encoding and decoding has been supported in grDevices with rgb() and col2rgb() respectively, both of which are pretty fast. col2rgb() has a quirk in that the output has the channels in the rows instead of the columns, contrary to how decoded colours are presented everywhere else:

grDevices::col2rgb(c('#56fec2', 'red'))

##       [,1] [,2]
## red     86  255
## green  254    0
## blue   194    0
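For reference, the string format and the row/column quirk can both be seen in a few lines of base R. This is only a sketch to make the encoding concrete: it decodes the hex digits by hand and compares the result with the transposed col2rgb() output.

```r
# Decode '#RRGGBB' strings by hand: each channel is two hexadecimal digits
hex <- c('#56fec2', '#ff0000')
manual <- cbind(
  r = strtoi(substr(hex, 2, 3), base = 16L),
  g = strtoi(substr(hex, 4, 5), base = 16L),
  b = strtoi(substr(hex, 6, 7), base = 16L)
)

# Transposing the col2rgb() output gives the same one-colour-per-row layout
transposed <- t(grDevices::col2rgb(c('#56fec2', 'red')))
manual
transposed
```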

farver sports two new functions that, besides providing consistency in the output format, also eliminate some steps in the workflow described above:

# Decode strings with decode_colour
colours <- decode_colour(c('#56fec2', 'red'))
colours

##        r   g   b
## [1,]  86 254 194
## [2,] 255   0   0

# Encode with encode_colour
encode_colour(colours)

## [1] "#56FEC2" "#FF0000"

Besides the basic use shown above, both functions allow input/output from other colour spaces than sRGB. That means that if you need to manipulate some colour in Lab space, you can simply decode directly into that, do the manipulation and encode directly back. The functionality is baked into the compiled code, meaning that a lot of memory allocation is spared, making this substantially faster than a grDevices-based workflow:

library(ggplot2)

# Create some random colour strings
colour_strings <- sample(grDevices::colours(), 5000, replace = TRUE)

# Get Lab values from a string
timing <- bench::mark(
  farver = decode_colour(colour_strings, to = 'lab'),
  grDevices = convertColor(t(col2rgb(colour_strings)), 'sRGB', 'Lab', scale.in = 255), 
  check = FALSE,
  min_iterations = 100
)
plot(timing, type = 'ridge') +
  theme_minimal() + 
  labs(x = NULL, y = NULL)

Can we do better than this? If the purpose is simply to manipulate a single channel in a colour encoded as a string, we may forego the encoding and decoding completely and do it all in compiled code. farver provides a family of functions for doing channel manipulation in string encoded colours. The channels can be any channel in any colour space supported by farver, and the decoding, manipulation and encoding is done in one pass. If you have a lot of colours and need to increase e.g. darkness, this can save a lot of memory allocation:

# a lot of colours
colour_strings <- sample(grDevices::colours(), 500000, replace = TRUE)

darken <- function(colour, by) {
  colour <- t(col2rgb(colour))
  colour <- convertColor(colour, from = 'sRGB', 'Lab', scale.in = 255)
  colour[, 'L'] <- colour[, 'L'] * by
  colour <- convertColor(colour, from = 'Lab', to = 'sRGB')
  rgb(colour)
}
timing <- bench::mark(
  farver = multiply_channel(colour_strings, channel = 'l', value = 1.2, space = 'lab'),
  grDevices = darken(colour_strings, 1.2),
  check = FALSE,
  min_iterations = 100
)
plot(timing, type = 'ridge') + 
  theme_minimal() + 
  labs(x = NULL, y = NULL)

The bottom line

The new release of farver provides invisible improvements to the existing functions and a range of new functionality for working efficiently with string encoded colours. You will be using it indirectly following the next release of scales if you are plotting with ggplot2, but you shouldn’t be able to tell. If you somehow end up having to manipulate millions of colours, then farver is still the king of the hill by a large margin when it comes to performance, but I personally believe that it also provides a much cleaner API than any of the alternatives.

1 giraffe, 2 giraffe, GO!

Mon, 02 Sep 2019 00:00:00 GMT

I am beyond excited to finally be able to announce a new version of ggraph. This release, like the ggforce 0.3.0 release, has been many years in the making, lying dormant for long periods, first waiting for ggplot2 to get updated and then waiting for me to have time to finally finish it off. All that is in the past now as ggraph 2.0.0 has finally landed on CRAN, filled with numerous new features, a massive amount of bug fixes, and a slew of breaking changes.

If you are new to ggraph, a short description follows: It is an extension of ggplot2 that implements an extended grammar for relational data (e.g. trees and networks). It provides a huge variety of geoms for drawing nodes and edges, along with an assortment of layouts making it possible to produce a very wide range of network visualization types. It is to my knowledge the most feature packed network visualization framework available in R (and potentially in other languages as well), all building on top of the familiar ggplot2 API. If you want to learn more I invite you to browse the new pkgdown website that has been made available.

New looks

Before we begin with the exciting new stuff, there’s a small change that may or may not greet you as you make your first new plot with ggraph v2.0.0. The default look of a ggplot is often not a good fit for network visualisations as the positional scales are irrelevant. Because of this ggraph has since its release offered a theme_graph() that removed a lot of the useless clutter such as axes and grid lines. You had to use it deliberately though as I didn’t want to overwrite any defaults you may have had. In the new release I’ve relaxed on this a bit. When you construct a ggraph plot it will still use the default theme as a base, but it will remove axes and gridlines from it. This makes it easier to use it together with corporate templates and the like right out of the box. You can still use theme_graph(), or potentially set it as a default using set_graph_style() if you so wish.

library(ggraph)

# The new default look:
ggraph(highschool) + 
  geom_edge_link() + 
  geom_node_point()

# Using theme_graph for the remainder of this post
set_graph_style(size = 11, plot_margin = margin(0, 0, 0, 0))

The broken giraffe

Let us start proper with what this release breaks, because it does it for some very good reasons and you’ll all be happy about it shortly as you read on. The 1.x.x versions of ggraph worked with two different types of network representations: igraph objects and dendrogram objects. Some further types such as hclust and network objects were supported by automatic conversion, but that was it. Further, the internal architecture meant that certain layouts and geoms could only be used with certain objects. This was obviously an imperfect situation and one that reflected that tidygraph was developed after ggraph. In ggraph 2.0.0 the internals have been rewritten to only be based on tidygraph. This means that all layouts and geoms will always be available (as long as the topology supports it). This doesn’t mean that igraph, dendrogram, network, and hclust objects are no longer supported, though. ggraph will attempt to coerce every input to a tbl_graph object, and as tidygraph supports a wealth of network representations, ggraph can now be used with an even wider selection of objects, all completely without any need for change from the user.

While this change was completely internal and thus didn’t break anything, it did call into question the API of the ggraph() function, which had been designed before tidy evaluation and tidygraph came into existence. Prior to 2.0.0 all layout arguments passed into ggraph() (and create_layout()) would be passed as strings if they referenced any node or edge property, e.g.

library(tidygraph)

graph <- as_tbl_graph(
  data.frame(
    from = sample(5, 20, TRUE),
    to = sample(5, 20, TRUE),
    weight = runif(20)
  )
)

ggraph(graph, layout = 'fr', weights = "weight") + 
  geom_edge_link() + 
  geom_node_point()

With the new API, edge and node parameters are passed along as unquoted expressions that will be evaluated in the context of the edge or node data respectively. The example above will thus be:

ggraph(graph, layout = 'fr', weights = weight) + 
  geom_edge_link() + 
  geom_node_point()

This change might seem superficial and unnecessary until you realize that this means the network object doesn’t have to be updated every time you want to try new edge and node parameters for the layout:

ggraph(graph, layout = 'fr', weights = sqrt(weight)) + 
  geom_edge_link() + 
  geom_node_point()
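The idea of evaluating an unquoted expression in the context of the edge data can be sketched in a few lines of base R. The helper eval_in_edges() below is hypothetical and only a stand-in for the tidy evaluation machinery ggraph builds on; it is there to show why no new column is needed for sqrt(weight).

```r
# Hypothetical helper: evaluate an unquoted expression with the columns of
# the edge data in scope (a base R stand-in, not ggraph's implementation)
eval_in_edges <- function(edges, expr) {
  eval(substitute(expr), envir = edges, enclos = parent.frame())
}

edges <- data.frame(from = c(1, 2, 3), to = c(2, 3, 1), weight = c(1, 4, 9))

w_raw <- eval_in_edges(edges, weight)         # the column as is
w_sqrt <- eval_in_edges(edges, sqrt(weight))  # a transformation, no new column needed
w_sqrt
```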

So, that’s the extent of the breakage… Now what does this change allow..?

Tidygraph inside

The use of tidygraph runs much deeper than simply being used as the internal network representation. ggraph will also register the network object during creation and rendering of the plot, meaning that all tidygraph algorithms are available as input to layout specs and aesthetic mappings:

graph <- as_tbl_graph(highschool)

ggraph(graph, layout = 'fr', weights = centrality_edge_betweenness()) + 
  geom_edge_link() + 
  geom_node_point(aes(size = centrality_pagerank(), colour = node_is_center()))

It is obvious (at least to me) that this new-found capability will make it much easier to experiment and iterate on the visualization, hopefully inspiring users to try out different settings before settling on a plot.

As discussed above, the tidygraph integration also makes it easy to plot a wide variety of data types directly. Above we first create a tbl_graph from the highschool edge-list, but that is not strictly necessary:

head(highschool)

##   from to year
## 1    1 14 1957
## 2    1 15 1957
## 3    1 21 1957
## 4    1 54 1957
## 5    1 55 1957
## 6    2 21 1957

ggraph(highschool, layout = 'kk') + 
  geom_edge_link() + 
  geom_node_point()

Note that even though the input is not a tbl_graph it will be converted to one so all the tidygraph algorithms are still available during plotting.

To further make it easy to quickly gain an overview of your network data, ggraph gains a qgraph() function that inspects your input and automatically picks a layout and combination of edge and node geoms. While the return type is a standard ggraph/ggplot object it should not really be used as the basis for a more complicated plot as you have no influence over how the layout and first couple of layers are chosen.

iris_clust <- hclust(dist(iris[, 1:4]))

qgraph(iris_clust)

Layout galore

ggraph 2.0.0 comes with a huge selection of new layouts, from new algorithms for the classic node-edge diagram to completely new types such as matrix and (bio)fabric layouts. The biggest addition comes from the integration of the graphlayouts package by David Schoch who has done a tremendous job in bringing new, high quality, layout algorithms to R. The ‘stress’ layout is the new default as it does a much better job than Fruchterman-Reingold (‘fr’). It also includes a sparse version ‘sparse_stress’ for large graphs that is much faster than any of the ones provided by igraph.

# Defaults to stress, with a message
ggraph(graph) + 
  geom_edge_link() + 
  geom_node_point()

## Using `stress` as default layout

There are other layouts from graphlayouts of interest, e.g. the ‘backbone’ layout that emphasizes community structure, the ‘focus’ layout that places all nodes in concentric circles based on their distance to a selected node, etc. I won’t show them all here but instead direct you to its github page that describes all its different layouts.

Another type of layout that has become available is the unrooted equal-angle and equal-daylight algorithms for drawing unrooted trees. These types of trees differ from those resulting from e.g. hierarchical clustering in that they do not contain direction or a specific root node. The tree structure is only given by the branch length. To support this the ‘dendrogram’ layout has gained a length argument that allows the layout to be calculated from branch length:

library(ape)
data(bird.families)
# Using the bird.families dataset from ape
ggraph(bird.families, 'dendrogram', length = length) + 
  geom_edge_elbow()

Often the dendrogram layout is a bad choice for unrooted trees, as it implicitly shows a node as the root and draws everything else according to that. Instead one can choose the ‘unrooted’ layout, which attempts to spread the leaves evenly across the plane.

ggraph(bird.families, 'unrooted', length = length) + 
  geom_edge_link()

By default the equal-daylight algorithm is used, but it is also possible to get the simpler, but less well-dispersed, equal-angle version by setting daylight = FALSE.

The new version also brings two new special layouts (special meaning non-standard): ‘matrix’ and ‘fabric’, which, like the ‘hive’ layout, bring their own edge and node geoms. The matrix layout places nodes on a diagonal and shows edges by placing points at the horizontal and vertical intersection of the terminal nodes. The selling point of this layout is that it scales better as there is no possibility of edge crossings. On the other hand, matrix layouts are very dependent on the order in which nodes are placed, and as the network grows so does the number of possible node orderings. There exists, however, a large range of node ranking algorithms that can be used to provide an effective ordering and many of these are available in tidygraph. It can take some time getting used to matrix plots but once you begin to recognize patterns in the plot and how they link to certain topological features of the network, they can become quite effective tools:

# Create a graph where internal edges in communities are grouped
graph <- create_notable('zachary') %>%
  mutate(group = factor(group_infomap())) %>%
  morph(to_split, group) %>%
  activate(edges) %>%
  mutate(edge_group = as.character(.N()$group[1])) %>%
  unmorph()

## Warning: `as_quosure()` requires an explicit environment as of rlang 0.3.0.
## Please supply `env`.
## This warning is displayed once per session.

ggraph(graph, 'matrix', sort.by = node_rank_hclust()) + 
  geom_edge_point(aes(colour = edge_group), mirror = TRUE) + 
  coord_fixed()

As can be seen in the example above it is often useful to mirror edges to both sides of the diagonal to make the patterns stronger. Highly connected nodes are easily recognizable, without suffering from over-plotting, and by choosing an appropriate ranking algorithm communities are easily visible. In addition to geom_edge_point() ggraph also provides geom_edge_tile() for a different look.
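The geometry behind a matrix plot is small enough to sketch in base R. This is only an illustration of the idea, not ggraph's implementation, and the toy edge list and node order are made up: each node gets a position along the diagonal, and each edge becomes a point at the intersection of its two terminal nodes.

```r
# A sketch of the idea, not ggraph's implementation: nodes sit on the
# diagonal in some order, and each edge becomes a point at (from, to)
edges <- data.frame(
  from = c("a", "a", "b", "c"),
  to = c("b", "c", "d", "d"),
  stringsAsFactors = FALSE
)
node_order <- c("a", "b", "c", "d")  # assumed; a ranking algorithm would supply this

pos <- setNames(seq_along(node_order), node_order)
points <- data.frame(x = unname(pos[edges$from]), y = unname(pos[edges$to]))

# Mirroring adds the same point on the other side of the diagonal
mirrored <- rbind(points, data.frame(x = points$y, y = points$x))
mirrored
```

Changing node_order moves the points around without changing the network, which is why the choice of ranking matters so much for what patterns become visible.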

The fabric layout (originally called biofabric, but I have decided to drop the prefix to indicate it can be used generally), is another layout approach that tries to deal with the problems of over-plotting. It does so by drawing all edges as evenly spaced vertical lines, and all nodes as evenly spaced horizontal lines. As with the matrix layout it is highly dependent on the sorting of nodes, and requires some getting used to. I urge you to give it a chance though, potentially with some help from the website its inventor has set up:

ggraph(graph, 'fabric', sort.by = node_rank_fabric()) + 
  geom_node_range(aes(colour = group), alpha = 0.3) + 
  geom_edge_span(aes(colour = edge_group), end_shape = 'circle') + 
  coord_fixed() + 
  theme(legend.position = 'top')

The node_rank_fabric() is the ranking proposed in the original paper, but other ranking algorithms are of course also possible.
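To get a feel for how a fabric plot is constructed, here is a rough base R sketch of the coordinates involved. It is not ggraph's code; the toy edge list is made up, and it assumes nodes are already ranked 1 to 4 and edges are drawn in the order given.

```r
# A sketch of the fabric idea (not ggraph's code): nodes are horizontal
# lines, edges are vertical lines, both evenly spaced
edges <- data.frame(from = c(1, 1, 2, 3), to = c(2, 3, 4, 4))
n_nodes <- 4

# Edge j is a vertical line at x = j, running between its two node rows
edge_lines <- data.frame(
  x = seq_len(nrow(edges)),
  ymin = pmin(edges$from, edges$to),
  ymax = pmax(edges$from, edges$to)
)

# Node i is a horizontal line at y = i, spanning the edges that touch it
node_lines <- do.call(rbind, lapply(seq_len(n_nodes), function(i) {
  touching <- which(edges$from == i | edges$to == i)
  data.frame(y = i, xmin = min(touching), xmax = max(touching))
}))
node_lines
```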

The last new feature in the layout department is that it is now easier to plug in new layouts. First, by providing a matrix or data.frame to the layout argument in ggraph() you can quickly provide a fixed position of the nodes. The same can be obtained by providing an x and y argument to the ‘auto’ layout. Second, you can provide a function directly to the layout argument. The function must take a tbl_graph as input and return a data.frame or an object coercible to one. This means that e.g. layouts defined as physics simulations with the particles package can be used directly:

library(particles)
# Set up simulation
sim <- . %>% simulate() %>% 
  wield(manybody_force) %>% 
  wield(link_force) %>% 
  evolve()

ggraph(graph, sim) + 
  geom_edge_link(colour = 'grey') + 
  geom_node_point(aes(colour = group), size = 3)
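A layout function does not have to be anything as elaborate as a simulation. As a minimal sketch, assuming ggraph reads the node positions from the x and y columns of the returned data.frame, a hand-rolled circular layout could look like this (circle_positions() and circle_layout() are made-up names for the example):

```r
# Pure helper: n evenly spaced positions on the unit circle
circle_positions <- function(n) {
  angle <- 2 * pi * (seq_len(n) - 1) / n
  data.frame(x = cos(angle), y = sin(angle))
}

# Hypothetical layout function: takes the graph, returns one row per node.
# igraph::gorder() gives the node count (a tbl_graph is also an igraph object)
circle_layout <- function(graph, ...) {
  circle_positions(igraph::gorder(graph))
}

# Assumed usage, mirroring the particles example above:
# ggraph(graph, circle_layout) + geom_edge_link() + geom_node_point()

pos <- circle_positions(4)
pos
```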

Geoms for the people

While ggraph has always included quite a large range of different geoms for showing nodes and edges, this release has managed to add some more. Most importantly, geom_edge_fan() has gained a partner in crime for showing multi-edges. geom_edge_parallel() will draw edges as straight lines but, in the case of multi-edges, will offset them slightly orthogonal to their direction so that there is no overlap. This is a geom best suited for smaller graphs (IMO), but here it can add a very classic look to the plot:

small_graph <- create_notable('bull') %>%
  convert(to_directed) %>%
  bind_edges(data.frame(from = c(1, 2, 5, 3), to = c(2, 1, 3, 2)))

ggraph(small_graph, 'stress') + 
  geom_edge_parallel(end_cap = circle(.5), start_cap = circle(.5),
                     arrow = arrow(length = unit(1, 'mm'), type = 'closed')) + 
  geom_node_point(size = 4)

For this edge geom in particular it is often a good idea to use capping to let the edges end before they reach the terminal nodes.
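The orthogonal offset is plain vector geometry. The following base R sketch shows the principle under my own assumptions (it is not how ggraph computes it, and the sep spacing parameter is hypothetical): each of the k multi-edges is shifted along the unit normal of the line between the two nodes.

```r
# A sketch of the geometry (not ggraph's implementation): shift each of k
# parallel edges along the unit normal of the line from (x0, y0) to (x1, y1)
parallel_offsets <- function(x0, y0, x1, y1, k, sep = 0.1) {
  dx <- x1 - x0
  dy <- y1 - y0
  len <- sqrt(dx^2 + dy^2)
  nx <- -dy / len  # unit normal, orthogonal to the edge direction
  ny <- dx / len
  shift <- (seq_len(k) - (k + 1) / 2) * sep  # centred around the original line
  data.frame(
    x = x0 + nx * shift, y = y0 + ny * shift,
    xend = x1 + nx * shift, yend = y1 + ny * shift
  )
}

# Three edges between (0, 0) and (1, 0) end up stacked vertically
offs <- parallel_offsets(0, 0, 1, 0, k = 3)
offs
```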

Another edge geom that has become available is geom_edge_bend() which is sort of an organic elbow geom:

ggraph(iris_clust, 'dendrogram', height = height) + 
  geom_edge_bend()

Lastly, in addition to the node and edge geoms shown in the Layout section, geom_node_voronoi() has been added. It is a ggraph specific version of ggforce::geom_voronoi_tile() that allows you to create a Voronoi tessellation of the nodes and use the resulting tiles to show the nodes. As with the ggforce version it is possible to constrain the tiles to a specific radius around the node, making it a great way of showing which nodes dominate certain areas without any problems with over-plotting.

ggraph(graph, 'stress') + 
  geom_node_voronoi(aes(fill = group), max.radius = 0.5, colour = 'white') + 
  geom_edge_link() + 
  geom_node_point()

A last little thing pertaining to edge geoms is that many have gained a strength argument, which controls their level of non-linearity (this is obviously only available for non-linear edges). Setting strength = 0 will result in a linear edge, while setting strength = 1 will give the standard look. Everything in between is fair game, while everything outside that range will look exceptionally weird, probably.

ggraph(iris_clust, 'dendrogram', height = height) + 
  geom_edge_bend(alpha = 0.3) + 
  geom_edge_bend(strength = 0.5, alpha = 0.3) + 
  geom_edge_bend(strength = 0.2, alpha = 0.3)

ggraph(iris_clust, 'dendrogram', height = height) + 
  geom_edge_elbow(alpha = 0.3) + 
  geom_edge_elbow(strength = 0.5, alpha = 0.3) + 
  geom_edge_elbow(strength = 0.2, alpha = 0.3)

A few geoms have had arguments such as curvature or spread that served a similar purpose, but those arguments have been deprecated in favor of the same argument across all (applicable) geoms.

And then one more last thing, but it is really not something new in ggraph. As you can use standard geoms for drawing nodes some of the new features in ggforce are of particular interest to ggraph users. The geom_mark_*() family in particular is great for annotating single nodes or groups of nodes, and going forward it will be the advised approach:

library(ggforce)
ggraph(graph, 'stress') + 
  geom_edge_link() + 
  geom_node_point() + 
  geom_mark_ellipse(aes(x, y, label = 'Group 3', 
                        description = 'A very special collection of nodes',
                        filter = group == 3))

All the rest

That is the exciting new stuff, but the release also includes numerous bug fixes and small tweaks… Far too many to be interesting to list, so you must take my word for it 😄.

As with ggforce I hope that ggraph never goes this long without a release again. Feel free to flood me with feature requests after you have played with the new version and I’ll do my best to take them on.

I’ll spend some time on ggplot2 and grid for now, but still plan on taking a development sprint with patchwork with the intent of getting it on CRAN before the end of this year.

A Flurry of Facets

Thu, 08 Aug 2019 00:00:00 GMT

When I announced the last release of ggforce I hinted that I would like to transition to a more piecemeal release habit and avoid those monster releases that the last one was. True to my word, I am now thrilled to announce that a new version of ggforce is available on CRAN for your general consumption. It goes without saying that this release contains fewer features and fixes than the last one, but those it packs are considerable so let’s get to it.

Built for gganimate

The gganimate package facilitates the creation of animations from ggplot2 plots. It is built to be as general purpose as possible, but it still makes a few assumptions about how the layers in the plot behave. Some of these assumptions were not met in a few of the ggforce geoms (the technical explanation was that some stats and geoms stripped group information from the data which trips up gganimate). This has been rectified in the new version of ggforce and all geoms should now be ready for use with gganimate (please report back if you run into any problems).

Facets for the people

The remainder of the release centers around facets and a few geoms that have been made specifically for them.

Enter the matrix

The biggest news is undoubtedly the introduction of facet_matrix(), a facet that allows you to create a grid of panels with different data columns in the different rows and columns of the grid. Examples of such arrangements are known as scatterplot matrices and pairs plots, but these are just a subset of the general approach.

Before we go on I will, in the interest of full disclosure, mention that certain types of scatterplot matrices have been possible for a long time. Most powerful has perhaps been the ggpairs() function in GGally that provides an API for pairs plots built on top of ggplot2. More low-level and limited has been the possibility of converting the data to a long format by stacking the columns of interest and using facet_grid(). The latter approach requires that all columns of interest are of the same type and further moves a crucial operation of the visualization out of the visualization API. The former approach, while powerful, is a wrapper around ggplot2 rather than an extension of the API. This means that you are limited to what the wrapper function provides, thus losing the flexibility of the ggplot2 API. A plurality of choices is good though, and I’m certain that there is room for all approaches to thrive.
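For comparison, the long-format approach can be sketched with base R and the built-in mtcars data (the choice of dataset and columns is mine, purely for illustration). Every pair of columns is stacked into a single x and a single y column, which is also why all columns need to share a type:

```r
# Stack every pair of columns into long format (base R only)
vars <- c("mpg", "disp", "hp")
combos <- expand.grid(x_var = vars, y_var = vars, stringsAsFactors = FALSE)

long <- do.call(rbind, lapply(seq_len(nrow(combos)), function(i) {
  data.frame(
    x_var = combos$x_var[i],
    y_var = combos$y_var[i],
    x = mtcars[[combos$x_var[i]]],
    y = mtcars[[combos$y_var[i]]]
  )
}))

# With ggplot2 attached, the long data can then be facetted:
# ggplot(long, aes(x, y)) +
#   geom_point() +
#   facet_grid(y_var ~ x_var, scales = 'free')
dim(long)
```

Note how the reshaping happens before ggplot() is ever called, which is the "crucial operation moved out of the visualization API" mentioned above.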

To show off facet_matrix() I’ll start with a standard use of scatterplot matrices, namely plotting multiple components from a PCA against each other.

library(recipes)
# Data described here: https://bookdown.org/max/FES/chicago-intro.html 
load(url("https://github.com/topepo/FES/blob/master/Data_Sets/Chicago_trains/chicago.RData?raw=true"))

pca_on_stations <- 
  recipe(~ ., data = training %>% select(starts_with("l14_"))) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5) %>% 
  prep() %>% 
  juice()

pca_on_stations

## # A tibble: 5,698 x 5
##       PC1   PC2     PC3     PC4   PC5
##     <dbl> <dbl>   <dbl>   <dbl> <dbl>
##  1   1.37 4.41   0.347   0.150  0.631
##  2   1.86 4.50   0.618   0.161  0.523
##  3   2.03 4.50   0.569   0.0468 0.543
##  4   2.37 4.43   0.498  -0.209  0.559
##  5   2.37 4.13   0.422  -0.745  0.482
##  6 -15.7  1.23   0.0164 -0.180  1.04 
##  7 -21.2  0.771 -0.653   1.35   1.23 
##  8  -8.45 2.36   1.07   -0.143  0.404
##  9   3.04 4.30   0.555  -0.0476 0.548
## 10   2.98 4.45   0.409  -0.125  0.677
## # … with 5,688 more rows

library(ggforce)

ggplot(pca_on_stations, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(alpha = 0.2, shape = 16, size = 0.5) + 
  facet_matrix(vars(everything()))

Let’s walk through that last piece of code. We construct a standard ggplot using geom_point() but we map x and y to .panel_x and .panel_y. These are placeholders created by facet_matrix(). Lastly we add the facet_matrix() specification. At a minimum we’ll need to specify which columns to use. For that we can use standard tidyselect syntax as known from e.g. dplyr::select() (here we use everything() to select all columns).

Now, the above plot has some obvious shortcomings. The diagonal is pretty useless for starters, and these panels are often used to plot the distributions of the individual variables. Using e.g. geom_density() won’t work as it always starts at 0, thus messing with the y-scale of each row. ggforce provides two new geoms tailored for the diagonal: geom_autodensity() and geom_autohistogram(), which automatically position themselves inside the panel without affecting the y-scale. We’d still need to have this geom only in the diagonal, but facet_matrix() provides exactly this sort of control:

ggplot(pca_on_stations, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(alpha = 0.2, shape = 16, size = 0.5) + 
  geom_autodensity() +
  facet_matrix(vars(everything()), layer.diag = 2)
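
The same pattern should work with the histogram variant. Here is a minimal sketch, assuming the built-in iris data as a stand-in for the PCA scores (the column selection and the bins value are illustrative choices, not taken from the example above):

```r
library(ggplot2)
library(ggforce)

# Scatter plots off the diagonal, histograms on the diagonal.
# iris replaces pca_on_stations so the sketch is self-contained.
p_hist <- ggplot(iris, aes(x = .panel_x, y = .panel_y)) +
  geom_point(alpha = 0.5, shape = 16, size = 0.5) +
  geom_autohistogram(bins = 20) +
  facet_matrix(vars(Sepal.Length, Sepal.Width, Petal.Length), layer.diag = 2)

p_hist
```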

As the y-scale no longer affects the diagonal we’ll emphasize this by removing the horizontal grid lines there:

ggplot(pca_on_stations, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(alpha = 0.2, shape = 16, size = 0.5) + 
  geom_autodensity() +
  facet_matrix(vars(everything()), layer.diag = 2, grid.y.diag = FALSE)

There is still some redundancy left. As the grid is symmetrical, the upper and lower triangles show basically the same thing (with flipped axes). We could add some insight by using another geom in one of the areas that shows some summary statistic instead:

ggplot(pca_on_stations, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(alpha = 0.2, shape = 16, size = 0.5) + 
  geom_autodensity() +
  geom_density2d() +
  facet_matrix(vars(everything()), layer.diag = 2, layer.upper = 3, 
               grid.y.diag = FALSE)
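
The layer.diag and layer.upper arguments used above route layers by panel position. As a rough mental model only (this is a base R sketch with a hypothetical helper, not how ggforce implements it), every panel is a pair of variables, and its position follows from comparing the row and column index:

```r
# Hypothetical helper: enumerate the panels of a matrix layout and label
# each one as diagonal, upper triangle, or lower triangle.
panel_pairs <- function(vars) {
  pairs <- expand.grid(col_var = vars, row_var = vars,
                       stringsAsFactors = FALSE)
  row_idx <- match(pairs$row_var, vars)
  col_idx <- match(pairs$col_var, vars)
  pairs$position <- ifelse(
    row_idx == col_idx, "diagonal",
    ifelse(row_idx < col_idx, "upper", "lower")
  )
  pairs
}

pairs <- panel_pairs(c("PC1", "PC2", "PC3"))
table(pairs$position)
```

With three variables this gives nine panels, three in each group, which is why a layer restricted to one triangle leaves the mirrored panels free for something else.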

While we could call this a day and be pretty pleased with ourselves, I’ll need to show the final party trick of facet_matrix(). The above example was kind of easy because all the variables were continuous. What if we had a mix?

ggplot(mpg, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(shape = 16, size = 0.5) + 
  facet_matrix(vars(fl, displ, hwy))

As we can see facet_matrix() itself handles the mix of scale types quite well, but geom_point() is not that telling when used on a mix of continuous and discrete position scales. ggforce handles this by providing a new position adjustment (position_auto()) that jitters the data based on the scale types. For continuous vs discrete it does a sina-like jitter, whereas for discrete vs discrete it jitters inside a disc (continuous vs continuous makes no jitter):

ggplot(mpg, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(shape = 16, size = 0.5, position = 'auto') + 
  facet_matrix(vars(fl, displ, hwy))
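
To give a feel for what "jitter inside a disc" means, here is a small base R sketch. It assumes uniform sampling over the disc area and a hypothetical radius of 0.4; position_auto() may well choose its offsets differently:

```r
# Hypothetical helper: draw n random offsets uniformly from a disc.
# Taking the square root of a uniform variate keeps the density even over
# the disc area instead of clustering points near the centre.
jitter_disc <- function(n, radius = 0.4) {
  r <- radius * sqrt(runif(n))
  theta <- runif(n, 0, 2 * pi)
  data.frame(dx = r * cos(theta), dy = r * sin(theta))
}

set.seed(1)
offsets <- jitter_disc(500)
plot(offsets, asp = 1, pch = 16, cex = 0.5)
```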

geom_autodensity() and geom_autohistogram() also know how to handle both discrete and continuous data, so these can be used safely in all circumstances (here also showing that you can of course also map other aesthetics):

ggplot(mpg, aes(x = .panel_x, y = .panel_y, fill = drv, colour = drv)) + 
  geom_point(shape = 16, size = 0.5, position = 'auto') + 
  geom_autodensity(alpha = 0.3, colour = NA, position = 'identity') + 
  facet_matrix(vars(fl, displ, hwy), layer.diag = 2)

Lastly, if you need to use a geom that only makes sense with a specific combination of scales, you can pick these layers directly, though you may end up fiddling a bit to get all the right layers where you want them:

ggplot(mpg, aes(x = .panel_x, y = .panel_y, fill = drv, colour = drv)) + 
  geom_point(shape = 16, size = 0.5, position = 'auto') + 
  geom_autodensity(alpha = 0.3, colour = NA, position = 'identity') + 
  geom_smooth(aes(colour = NULL, fill = NULL)) + 
  facet_matrix(vars(fl, displ, hwy), layer.diag = 2, layer.continuous = TRUE,
               layer.mixed = -3, layer.discrete = -3)

The last example I’m going to show is simply that you don’t have to create symmetric grids. By default facet_matrix() sets the column selection to be the same as the row selection, but you can override that:

ggplot(mpg, aes(x = .panel_x, y = .panel_y)) + 
  geom_point(shape = 16, size = 0.5, position = 'auto') + 
  facet_matrix(vars(manufacturer, hwy), vars(drv, cty))

As you can hopefully appreciate, facet_matrix() is maximally flexible, while keeping the API of the standard use cases relatively clean. The lack of a ggplot2-like API for plotting different variables against each other in a grid has been a major annoyance for me, and I’m very pleased with how I finally solved it. I hope you’ll put it to good use as well.

Who needs two dimensions anyway?

The last new pack of facets is more benign, but something repeatedly requested. facet_row() and its cousin facet_col() are one-dimensional mixes of facet_grid() and facet_wrap(). They arrange the panels in a single row or single column respectively (like setting nrow or ncol to 1 in facet_wrap()), but by doing so allow the addition of a space argument as known from facet_grid(). In contrast to using facet_grid() with a single column or row, these new facets retain the facet_wrap() ability of having completely separate scale ranges as well as positioning the facet strip wherever you please:

ggplot(mpg) + 
  geom_bar(aes(x = manufacturer)) + 
  facet_col(~drv, scales = 'free_y', space = 'free', labeller = label_both) + 
  coord_flip()
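
For completeness, a sketch of the row-wise counterpart. The rotated axis labels are only my own choice to keep the manufacturer names readable:

```r
library(ggplot2)
library(ggforce)

# One row of panels, each as wide as the number of bars it needs
p_row <- ggplot(mpg) +
  geom_bar(aes(x = manufacturer)) +
  facet_row(~drv, scales = 'free_x', space = 'free', labeller = label_both) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

p_row
```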

So, this was the flurry of facets I was going to bring you today. I hope you’ll put them to good use and create some awesome visualizations with them.

Next up: the next ggraph release!

The ggforce Awakens (again)

Thu, 07 Mar 2019 00:00:00 GMT

After what seems like a lifetime (at least to me), a new feature release of ggforce is available on CRAN. ggforce is my general purpose extension package for ggplot2, my first early success, what got me on twitter in the first place, and ultimately instrumental in my career move towards full-time software/R development. Despite this pedigree ggforce hasn’t really received much love in the form of a feature release since, well, since it was released. One of the reasons for this is that after the first release I began pushing changes to ggplot2 that allowed for different stuff I wanted to do in ggforce, so the release of the next ggforce version became tied to the release of ggplot2. This doesn’t happen every day, and when it eventually transpired, I was deep in patchwork and gganimate development, and couldn’t take time off to run the last mile with ggforce. In the future I’ll probably be more conservative with my ggplot2 version dependency, or at least keep it out of the main branch until a ggplot2 release is in sight.

Enough excuses though, a new version is finally here and it’s a glorious one. Let’s celebrate! This version brings both a slew of refinements to existing functionality and a vast expanse of new features, so there’s enough to dig into.

New features

This is why we’re all here, right? The new and shiny! Let’s get going; the list is pretty long.

The Shape of Geoms

Many of the new and current geoms and stats in ggforce are really there to allow you to draw different types of shapes easily. This means that the workhorse of these has been geom_polygon(), while ggforce provided the means to describe the shapes in meaningful ways (e.g. wedges, circles, thick arcs). With the new release all of these geoms (as well as the new ones) will use the new geom_shape() under the hood. The shape geom is an extension of the polygon one that allows a bit more flourish in how the final shape is presented. It does this by providing two additional parameters: expand and radius, which will allow fixed unit expansion (and contraction) of the polygons as well as rounding of the corners based on a fixed unit radius. What do I mean with fixed unit? In the same way as the points in geom_point stay the same size during resizing of the plot, so does the corner radius and expansion of the polygon.

Let us modify the geom_polygon() example to use geom_shape() to see what it is all about:

library(ggforce)

ids <- factor(c("1.1", "2.1", "1.2", "2.2", "1.3", "2.3"))
values <- data.frame(
  id = ids,
  value = c(3, 3.1, 3.1, 3.2, 3.15, 3.5)
)
positions <- data.frame(
  id = rep(ids, each = 4),
  x = c(2, 1, 1.1, 2.2, 1, 0, 0.3, 1.1, 2.2, 1.1, 1.2, 2.5, 1.1, 0.3,
  0.5, 1.2, 2.5, 1.2, 1.3, 2.7, 1.2, 0.5, 0.6, 1.3),
  y = c(-0.5, 0, 1, 0.5, 0, 0.5, 1.5, 1, 0.5, 1, 2.1, 1.7, 1, 1.5,
  2.2, 2.1, 1.7, 2.1, 3.2, 2.8, 2.1, 2.2, 3.3, 3.2)
)
datapoly <- merge(values, positions, by = c("id"))

# Standard look
ggplot(datapoly, aes(x = x, y = y)) +
  geom_polygon(aes(fill = value, group = id))

# Contracted and rounded
ggplot(datapoly, aes(x = x, y = y)) +
  geom_shape(aes(fill = value, group = id), 
             expand = unit(-2, 'mm'), radius = unit(5, 'mm'))

If you’ve never needed this, it may be the kind of thing that makes you go “why even bother”, but if you’ve needed to venture into Adobe Illustrator to add this kind of flourish, you will definitely appreciate skipping that round-trip. And remember: you can use this with anything that expects a geom_polygon, not just the geoms from ggforce.

More shape primitives

While geom_shape() is the underlying engine for drawing, ggforce adds a bunch of new shape parameterisations, which we will quickly introduce:

geom_ellipse makes, you guessed it, ellipses. Apart from standard ellipses it also offers the possibility of making super-ellipses so if you’ve been dying to draw those with ggplot2, now is your time to shine.

# Not an ordinary ellipse — a super-ellipse
ggplot() +
  geom_ellipse(aes(x0 = 0, y0 = 0, a = 6, b = 3, angle = -pi / 3, m1 = 3)) +
  coord_fixed()

geom_bspline_closed allows you to draw closed b-splines. It takes the same type of input as geom_polygon but calculates a closed b-spline from the corner points instead of just connecting them.

# Create 6 random control points
controls <- data.frame(
  x = runif(6),
  y = runif(6)
)

ggplot(controls, aes(x, y)) +
  geom_polygon(fill = NA, colour = 'grey') +
  geom_point(colour = 'red') +
  geom_bspline_closed(alpha = 0.5)

geom_regon draws regular polygons of a set radius and number of sides.

ggplot() +
  geom_regon(aes(x0 = runif(8), y0 = runif(8), sides = sample(3:10, 8),
                 angle = 0, r = runif(8) / 10)) +
  coord_fixed()
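
If you wonder what a regular polygon boils down to, the vertices can be computed by hand. This is a base R sketch with a hypothetical helper that places the corners on a circle of radius r; the orientation and size conventions of geom_regon itself may differ:

```r
# Hypothetical helper: corner points of a regular polygon with the given
# number of sides, centred on (x0, y0), with corners at distance r.
regon_vertices <- function(x0, y0, sides, r, angle = 0) {
  theta <- angle + 2 * pi * (seq_len(sides) - 1) / sides
  data.frame(x = x0 + r * cos(theta), y = y0 + r * sin(theta))
}

hex <- regon_vertices(x0 = 0, y0 = 0, sides = 6, r = 1)
plot(hex, asp = 1, type = 'n')
polygon(hex$x, hex$y, col = 'grey80')
```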

geom_diagonal_wide draws thick diagonals (cubic bezier paths with the two control points pointing towards each other but perpendicular to the same axis).

data <- data.frame(
  x = c(1, 2, 2, 1, 2, 3, 3, 2),
  y = c(1, 2, 3, 2, 3, 1, 2, 5),
  group = c(1, 1, 1, 1, 2, 2, 2, 2)
)

ggplot(data) +
  geom_diagonal_wide(aes(x, y, group = group))

Is it a Sankey? Is it an Alluvial? No, It’s a Parallel Set

Speaking of diagonals, one of the prime uses of this is for creating parallel sets visualizations. There’s a fair bit of nomenclature confusion with this, so you may know this as Sankey diagrams, or perhaps alluvial plots. I’ll insist that Sankey diagrams are specifically for following flows (and often employ a looser positioning of the axes) and alluvial plots are for following temporal changes, but we can all be friends no matter what you call it. ggforce allows you to create parallel sets plots with a standard layered geom approach (for another approach to this problem, see the ggalluvial package). The main problem is that data for parallel sets plots are usually not represented very well in the tidy format expected by ggplot2, so ggforce further provides a reshaping function to get the data in line for plotting:

titanic <- reshape2::melt(Titanic)
# This is how we usually envision data for parallel sets
head(titanic)

##   Class    Sex   Age Survived value
## 1   1st   Male Child       No     0
## 2   2nd   Male Child       No     0
## 3   3rd   Male Child       No    35
## 4  Crew   Male Child       No     0
## 5   1st Female Child       No     0
## 6   2nd Female Child       No     0

# Reshape for putting the first 4 columns as axes in the plot
titanic <- gather_set_data(titanic, 1:4)
head(titanic)

##   Class    Sex   Age Survived value id     x    y
## 1   1st   Male Child       No     0  1 Class  1st
## 2   2nd   Male Child       No     0  2 Class  2nd
## 3   3rd   Male Child       No    35  3 Class  3rd
## 4  Crew   Male Child       No     0  4 Class Crew
## 5   1st Female Child       No     0  5 Class  1st
## 6   2nd Female Child       No     0  6 Class  2nd

# Do the plotting
ggplot(titanic, aes(x, id = id, split = y, value = value)) +
  geom_parallel_sets(aes(fill = Sex), alpha = 0.3, axis.width = 0.1) +
  geom_parallel_sets_axes(axis.width = 0.1) +
  geom_parallel_sets_labels(colour = 'white')

As can be seen, the parallel sets plot consists of several layers, which is something required for many, more involved, composite plot types. Separating them into multiple layers gives you more freedom without over-polluting the argument and aesthetic list.
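
To make the reshaping step less opaque, here is a base R sketch that produces the same id, x and y columns as the printed output above. The helper name is hypothetical, it is not the ggforce implementation, and it uses as.data.frame(Titanic) (with a Freq column) instead of reshape2::melt():

```r
# Hypothetical helper: stack one copy of the data per axis column, recording
# the axis name in x, the category in y, and the original row number in id.
gather_sets_sketch <- function(data, cols) {
  col_names <- names(data)[cols]
  data$id <- seq_len(nrow(data))
  pieces <- lapply(col_names, function(col) {
    piece <- data
    piece$x <- col
    piece$y <- as.character(data[[col]])
    piece
  })
  do.call(rbind, pieces)
}

titanic_wide <- as.data.frame(Titanic)
titanic_long <- gather_sets_sketch(titanic_wide, 1:4)
head(titanic_long)
```

Each original row ends up once per axis, which is how the geoms can follow a single observation across all four axes through its id.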

The markings of a great geom

If there is one thing of general utility lacking in ggplot2 it is probably the ability to annotate data cleanly. Sure, there’s geom_text()/geom_label() but using them requires a fair bit of fiddling to get the best placement and further, they are mainly relevant for labeling and not longer text. ggrepel has improved immensely on the fiddling part, but the lack of support for longer text annotation as well as annotating whole areas is still an issue.

In order to at least partly address this, ggforce includes a family of geoms under the geom_mark_*() moniker. They all behave equivalently except for how they encircle the given area(s). The 4 different geoms are:

geom_mark_rect() encloses the data in the smallest enclosing rectangle
geom_mark_circle() encloses the data in the smallest enclosing circle
geom_mark_ellipse() encloses the data in the smallest enclosing ellipse
geom_mark_hull() encloses the data with a concave or convex hull

All the enclosures are calculated at draw time, so they respond to resizing (most are susceptible to changing aspect ratios). Further, they use geom_shape() with a default expansion and radius set, so that the enclosure is always slightly larger than the data it needs to enclose.
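
As a rough illustration of two of these enclosures, here is a base R sketch that computes a bounding rectangle and a convex hull in data coordinates. Keep in mind that this is a simplification: the mark geoms do their calculation at draw time and add the expansion and rounding on top.

```r
# Petal measurements for a single species
setosa <- iris[iris$Species == 'setosa', c('Petal.Length', 'Petal.Width')]

# Smallest axis-aligned rectangle containing all points
rect <- c(
  xmin = min(setosa$Petal.Length), xmax = max(setosa$Petal.Length),
  ymin = min(setosa$Petal.Width), ymax = max(setosa$Petal.Width)
)

# Convex hull: the subset of points forming the outer boundary
hull <- setosa[grDevices::chull(setosa$Petal.Length, setosa$Petal.Width), ]

rect
hull
```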

Just to give a quick sense of it, here’s an example of geom_mark_ellipse():

ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_ellipse(aes(fill = Species)) +
  geom_point()

If you simply want to show the area where different classes appear, we’re pretty much done now, as the shapes along with the legend tell the story. But I promised you some more: textual annotation. So how does this fit into it all?

In addition to the standard aesthetics for shapes, the mark geoms also take a label and description aesthetic. When used, things get interesting:

ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_ellipse(aes(fill = Species, label = Species)) +
  geom_point()

The text is placed automatically so that it does not overlap with any data used in the layer, and it responds once again to resizing, always trying to find the most optimal placement of the text. If it is not possible to place the desired text it elects to not show it at all.

Anyway, in the plot above we have an overabundance of annotation. Both the legend and the labels. Further, we often want to add annotations to specific data in the plot, not all of it. We can put focus on setosa by ignoring the other groups:

desc <- 'This iris species has a markedly smaller petal than the others.'
ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_ellipse(aes(filter = Species == 'setosa', label = 'Setosa', 
                        description = desc)) +
  geom_point()

We are using another one of the mark geom family’s tricks here, which is the filter aesthetic. It makes it quick to specify the data you want to annotate, but in addition the remaining data is remembered so that any annotation doesn’t overlap with it even if it is not getting annotated (you wouldn’t get this if you pre-filtered the data for the layer). Another thing that happens behind the scenes is that the description text automatically gets word wrapping, based on a desired width of the text-box (defaults to 5 cm).
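
For contrast, this is roughly what the pre-filtered variant mentioned in the parenthesis would look like. It is a sketch to compare against, not a recommendation: the layer only ever sees the setosa rows, so the label placement cannot take the other two species into account.

```r
library(ggplot2)
library(ggforce)

desc <- 'This iris species has a markedly smaller petal than the others.'

# Pre-filtering: the mark layer gets its own, reduced data set
p_prefiltered <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_ellipse(aes(label = 'Setosa', description = desc),
                    data = subset(iris, Species == 'setosa')) +
  geom_point()

p_prefiltered
```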

The mark geoms offer a wide range of possibilities for styling the annotation, too many to go into detail with here, but rest assured that you have full control over text appearance, background, line, distance between data and text-box etc.

Lost in Tessellation

The last of the big additions in this release is a range of geoms for creating and plotting Delaunay triangulation and Voronoi tessellation. How often do you need that, you ask? Maybe never… Does it look wicked cool? Why, yes!

Delaunay triangulation is a way to connect points to their nearest neighbors without any connections overlapping. By nature, this results in triangles being created. This data can either be thought of as a set of triangles, or a set of line segments, and ggforce provides both through the geom_delaunay_tile() and geom_delaunay_segment() geoms. Further, a geom_delaunay_segment2() version exists that mimics geom_link2 in allowing aesthetic interpolation between endpoints.

As we are already quite acquainted with the Iris dataset, let’s take it for a whirl again:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_delaunay_tile(alpha = 0.3) + 
  geom_delaunay_segment2(aes(colour = Species, group = -1), size = 2,
                         lineend = 'round')

The triangulation is not calculated at draw time and is thus susceptible to range differences on the x and y axes. To combat this it is possible to normalize the position data before calculating the triangulation.
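
A minimal sketch of that normalization, using the normalize argument of the geom (the alpha and colour settings are only there for looks):

```r
library(ggplot2)
library(ggforce)

# Rescale x and y to comparable ranges before triangulating
p_tri <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_delaunay_tile(alpha = 0.3, colour = 'black', normalize = TRUE)

p_tri
```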

Voronoi tessellation is sort of an inverse of Delaunay triangulation. It draws perpendicular segments in the middle of all the triangulation segments and connects the neighboring ones. The end result is a tile around each point marking the area where the point is the closest one. In parallel to the triangulation, Voronoi also comes with both a tile and a segment version.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_voronoi_tile(aes(fill = Species, group = -1L)) + 
  geom_voronoi_segment() +
  geom_point()

We need to set the group aesthetic to a scalar in order to force all points to be part of the same tessellation. Otherwise each group would get its own:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_voronoi_tile(aes(fill = Species), colour = 'black')

Let’s quickly move on from that…

As a Voronoi tessellation can in theory expand forever, we need to define a bounding box. The default is to expand an enclosing rectangle 10% to each side, but you can supply your own rectangle, or even an arbitrary polygon. Further, it is possible to set a radius bound for each point instead:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_voronoi_tile(aes(fill = Species, group = -1L), max.radius = 0.2,
                    colour = 'black')

This functionality is only available for the tile geom, not the segment, but this will hopefully change with a later release.
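
Supplying your own rectangle, as mentioned above, goes through the bound argument. Here is a sketch, assuming the xmin, xmax, ymin, ymax ordering and limits I picked by eye to enclose the iris data:

```r
library(ggplot2)
library(ggforce)

# Clip the tessellation to a hand-picked rectangle instead of the default
p_vor <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_voronoi_tile(aes(fill = Species, group = -1L),
                    bound = c(4, 8, 1.5, 5), colour = 'black')

p_vor
```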

A last point, just to beat a dead horse, is that the tile geoms of course inherit from geom_shape() so if you like them rounded corners you can have it your way:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_voronoi_tile(aes(fill = Species, group = -1L), max.radius = 1,
                    colour = 'black', expand = unit(-0.5, 'mm'), 
                    radius = unit(0.5, 'mm'), show.legend = FALSE)

Zoom

Not a completely new feature like the ones above, but facet_zoom() has gained enough new power to warrant a mention. The gist of the facet is that it allows you to zoom in on an area of the plot while keeping the original view as a separate panel. The old version only allowed specifying the zoom region by providing a logical expression that indicated what data should be part of the zoom, but it now has dedicated xlim and ylim arguments to set them directly.

ggplot(diamonds) + 
  geom_histogram(aes(x = price), bins = 50) + 
  facet_zoom(xlim = c(3000, 5000), ylim = c(0, 2500), horizontal = FALSE)

The example above shows a shortcoming in simply zooming in on a plot. Sometimes the resolution (here, bins) isn’t really meaningful for zooming. Because of this, facet_zoom() has gotten a zoom.data argument to indicate what data to put on the zoom panel and what to put on the overview panel (and what to put in both places). It takes a logical expression to evaluate on the data and if it returns TRUE the data is put in the zoom panel, if it returns FALSE it is put on the overview panel, and if it returns NA it is put in both. To improve the visualization above, we’ll add two layers with different number of bins and use zoom.data to put them in the right place:

ggplot() + 
  geom_histogram(aes(x = price), dplyr::mutate(diamonds, z = FALSE), bins = 50) + 
  geom_histogram(aes(x = price), dplyr::mutate(diamonds, z = TRUE), bins = 500) + 
  facet_zoom(xlim = c(3000, 5000), ylim = c(0, 300), zoom.data = z,
             horizontal = FALSE) + 
  theme(zoom.y = element_blank(), validate = FALSE)

The last flourish we did above was to remove the zoom indicator for the y axis zoom by using the zoom.y theme element. We currently need to turn off validation for this to work as ggplot2 by default doesn’t allow unknown theme elements.
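
The TRUE/FALSE/NA rule for zoom.data can be written down in a few lines of base R. This is only a sketch of the rule as described above, with a hypothetical helper name, and not the code facet_zoom() runs:

```r
# Hypothetical helper: decide which panel(s) each row is drawn in.
# TRUE -> zoom panel only, FALSE -> overview panel only, NA -> both.
route_zoom_data <- function(flag) {
  data.frame(
    flag = flag,
    overview = is.na(flag) | !flag,
    zoom = is.na(flag) | flag
  )
}

routing <- route_zoom_data(c(TRUE, FALSE, NA))
routing
```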

All the rest

The above is just the most worthwhile, but the release also includes a slew of other features and improvements. Notable mentions are

geom_sina() rewrite to allow dodging and follow the shape of geom_violin()
position_jitternormal() that jitters points based on a normal distribution instead of a uniform one
facet_stereo() to allow for faux 3D plots

See the NEWS.md file for the full list.
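
As a quick taste of one of these, here is a sketch of position_jitternormal() on the overplotted mpg data. The standard deviations are arbitrary values I chose for illustration:

```r
library(ggplot2)
library(ggforce)

# Normally distributed jitter: most points stay close to their true position
p_jitter <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(position = position_jitternormal(sd_x = 0.2, sd_y = 0.2),
             alpha = 0.5)

p_jitter
```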

Further, ggforce now has a website at https://ggforce.data-imaginist.com, with full documentation overview etc. This is something I plan to roll out to all my major packages during the next release cycle. I’ve found that it is a great incentive to improve the examples in the documentation!

I do hope that it won’t take another two years before ggforce sees the next big update. It is certainly a burden off my shoulders to get this out of the door and I hope I can adhere to smaller, more frequent, releases in the future.

Now go get plotting!

gganimate has transitioned to a state of release

Thu, 03 Jan 2019 00:00:00 GMT

Just to start off the year in a positive way, I’m happy to announce that gganimate is now available on CRAN. This release is the result of a pretty focused development starting in the spring of 2018 prior to my useR keynote about it.

Some History

The gganimate package has been around for quite some time now with David Robinson making the first commit in early 2016. David’s vision of gganimate revolved around the idea of frame-as-an-aesthetic and this easy-to-grasp idea gave it an early success. The version developed by David never made it to CRAN, and as part of ramping down his package development he asked me if I was interested in taking over maintenance. I was initially reluctant because I wanted a completely different API, but he insisted that he supported a complete rewrite. The last version of gganimate as maintained by David is still available but I very quickly made some drastic changes:

While this commit was done in the autumn 2017, nothing further happened until I decided to make gganimate the center of my useR 2018 keynote, at which point I was forced (by myself) to have some sort of package ready by the summer of 2018.

A fair number of users have shown displeasure in the breaking changes this history has resulted in. Many blog posts have already been written focusing on the old API, and there is code on numerous computers that will no longer work. I understand this frustration, of course, but both David and I agreed that doing it this way was for the best in the end. I’m positive that the new API has already greatly exceeded the mind-share of the old API and given a year the old API will be all but a distant memory…

The Grammar

Such drastic breaking changes were required because of a completely different vision for how animation fitted into the grammar of graphics. David’s idea was that it was essentially a third dimension in the graphic and the animation was simply flipping through slices along the third dimension in the same way as you would look through the output of a CT scan. I, on the other hand, wanted a grammar that existed in parallel to the grammar of graphics, not as part of it.

My useR keynote goes into a lot of detail about my motivation and inspiration for taking on this approach, and I’ll not rehash it in this release post. Feel free to take a 1h break from reading as you watch the talk.

The gist of it all is that animations are a multifaceted beast and require much more than an additional aesthetic to be tamed. One of the cornerstones of the talk is the separation of animations into scenes and segues. In short, a segue is an animated change in the underlying laws of the graphic (e.g. changes to coordinate systems, scales, mappings, etc.), whereas a scene is a change in the data on display. Scenes are concerned with what and segues are concerned with how. This separation is important for several reasons: It gives me a natural focus area for the current version of gganimate (scenes), it serves as a theoretical backbone to group animation operations, and it is a central constraint in good animation practice: “You should never change how and what at the same time”.

So, the version I’m presenting here is a grammar of animation uniquely focused on scenes. This does not mean that I’ll never look into segues, but they are both much harder, and less important than getting a scene grammar to make sense, so segues have to play second fiddle for now.

What’s in a scene

There are two main components to a scene: What we are looking at, and where we are looking from. The former is handled by transitions and shadows, whereas the latter is handled by views. In brief:

transitions populate the frames of the animation with data, based on the data assigned to each layer. Several different transitions exist that interpret the layer data differently.
shadows give memory to each frame by letting each frame include data from prior or future frames.
views allow you to modify the range of the positional scales (zoom and pan) either directly or as a function of the data assigned to the frame.

On top of these three main grammar components there is a range of functions to modify how key parts of animations behave. For a general introduction to the ins and outs of the API, please see the Getting Started guide.
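
To make the three components concrete, here is a small sketch that uses one of each on the airquality data. The particular combination and the parameter values are my own illustrative choices; the object is only built here, rendering happens when it is printed:

```r
library(ggplot2)
library(gganimate)

anim <- ggplot(airquality, aes(Day, Temp, colour = factor(Month))) +
  geom_point() +
  transition_time(Day) +            # transition: which data goes in which frame
  shadow_wake(wake_length = 0.1) +  # shadow: keep a fading trail of prior frames
  view_follow(fixed_y = TRUE)       # view: let the x range follow the data

# Printing the object renders the animation
# anim
```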

Grammar vs API

While it may appear that grammar and API are the same, this is not the case. A grammar is a theoretical construct, a backbone from which an API can be defined. Several APIs could implement the same grammar in multiple, incompatible, ways. For gganimate I have tried to align the API as much as possible with the ggplot2 API, so that the line between the two packages becomes blurred. You change a plot to an animation by adding functions from gganimate to it, and the animation is rendered when printing the animation object in the same way as ggplots are rendered when printing the object. An example of this is adding transition_reveal() to a plot to make it appear gradually along a numeric variable:

library(ggplot2)
library(gganimate)

ggplot(airquality) + 
  geom_line(aes(x = Day, y = Temp, group = Month))

last_plot() + 
  transition_reveal(Day)

For the most part, the marriage between the ggplot2 and gganimate APIs is a happy one, though it does show at points that the ggplot2 API was never designed with animation in mind. I am particularly pleased with how powerful the API has turned out, and I have already seen countless uses I had never anticipated.

Making Fireworks

While a proper introduction to its use is better kept for a separate document (such as the Getting Started guide mentioned earlier), I think I would do gganimate a disservice by not showing off at least a single fully fledged example. Below is the code needed to make fireworks with gganimate:

# Firework colours
colours <- c(
  'lawngreen',
  'gold',
  'white',
  'orchid',
  'royalblue',
  'yellow',
  'orange'
)
# Produce data for a single blast
blast <- function(n, radius, x0, y0, time) {
  u <- runif(n, -1, 1)
  rho <- runif(n, 0, 2*pi)
  x <- radius * sqrt(1 - u^2) * cos(rho) + x0
  y <- radius * sqrt(1 - u^2) * sin(rho) + y0
  id <- sample(.Machine$integer.max, n + 1)
  data.frame(
    x = c(x0, rep(x0, n), x0, x),
    y = c(0, rep(y0, n), y0, y),
    id = rep(id, 2),
    time = c((time - y0) * runif(1), rep(time, n), time, time + radius + rnorm(n)),
    colour = c('white', rep(sample(colours, 1), n), 'white', rep(sample(colours, 1), n)),
    stringsAsFactors = FALSE
  )
}
# Make 20 blasts
n <- round(rnorm(20, 30, 4))
radius <- round(n + sqrt(n))
x0 <- runif(20, -30, 30)
y0 <- runif(20, 40, 80)
time <- runif(20, max = 100)
fireworks <- Map(blast, n = n, radius = radius, x0 = x0, y0 = y0, time = time)
fireworks <- dplyr::bind_rows(fireworks)

All of the above is just data preparation. blast() simply creates segments from the center of the blast and out to the periphery, sampling colours from the colour vector. The end result, if plotted statically, looks like this:

ggplot(fireworks) + 
  geom_path(aes(x = x, y = y, group = id, colour = colour)) + 
  scale_colour_identity()

Now, to make it all move, as well as style it a bit for a better effect

ggplot(fireworks) + 
  geom_point(aes(x, y, colour = colour, group = id), size = 0.5, shape = 20) + 
  scale_colour_identity() + 
  coord_fixed(xlim = c(-65, 65), expand = FALSE, clip = 'off') +
  theme_void() + 
  theme(plot.background = element_rect(fill = 'black', colour = NA), 
        panel.border = element_blank()) + 
  # Here comes the gganimate code
  transition_components(time, exit_length = 20) + 
  ease_aes(x = 'sine-out', y = 'sine-out') + 
  shadow_wake(0.05, size = 3, alpha = TRUE, wrap = FALSE, 
              falloff = 'sine-in', exclude_phase = 'enter') + 
  exit_recolour(colour = 'black')

While I won’t go into detail, transition_components() allows all the points to follow their own trajectory and timeline independently, ease_aes() ensures that the velocity of the points tapers off, shadow_wake() is responsible for the trail after each point, and exit_recolour() makes sure the points gradually fade into the black background once they “burn out”.

The Future

While this release is a milestone for gganimate, it is not a signal of it being done as many things are still missing (even if we ignore the whole segue part of the grammar). It does signal a commitment to stability from now on, though, so you should feel confident in using this package without fearing that your code will break in the future. You can follow the state of the package at its website, where I’ll also try to add additional guides and tutorials with time. If you create something with gganimate please share it on twitter, as I’m eager to see what people will make of it.

I’ll do a sort-of live cookbook talk on gganimate at this year’s RStudio conf in Austin, so if you are there and interested to learn more about the package do swing by.

Now, Go Animate!

Entering and Exiting 2018

Wed, 02 Jan 2019 00:00:00 GMT

The year is nearly over and it is the time for reflection and navel-gazing. I don’t have incredibly profound things to say, but a lot of things happened in 2018 and this is as good a time as any to go through it all…

Picking Myself Up

The prospects of my “2017 in review” post were not particularly rosy… I had hit somewhat of a burnout in terms of programming, but was nonetheless positive and had a great job and a lot of positive feedback on patchwork. Further, I had RStudio::conf to look forward to, which would be my first IRL head-to-head with the R community at large. I had also promised to present a fully-fledged tidy approach to network analysis and while both ggraph and tidygraph had already been released there were things I wanted to develop prior to presenting it. All-in-all there was a great impetus to pick myself up and get on with developing (not arguing that this is a fail-safe way to deal with burnout by the way).

RStudio::conf(2018L)

My trip to San Diego was amazing. If you ever get to go to an RStudio conference I don’t think you will be disappointed (full disclosure and spoiler-alert: I now work for RStudio). My suspicion that the R community is as amazing in real life as on Twitter was confirmed and it was great to finally get to see all those people I admire and look up to. My talk went fairly well I think — I haven’t watched the recordings as I don’t particularly enjoy watching myself talk, but you can, if you are so inclined. At the conference I got to chat a bit with Jenny Bryan (one of the admire/look-up-to people referenced above) and we discussed what we were going to talk about in our respective keynotes at useR in Brisbane in the summer. I half-jokingly said that I might talk about gganimate because that would give me the required push to actually begin developing it…

Talk-Driven Development

Around April Dianne Cook was getting pushy with getting at least a talk title for my keynote, and at that point I had already imagined a couple of slides on gganimate and thought “to heck-with-it” and responded with the daunting title of The Grammar of Animation. At that point I had still not written a single line of code for gganimate, and knew that tweenr would need a serious update to support what I had in mind. In addition, I knew I had to develop what ended up as transformr before I could begin with gganimate proper. All-in-all my talk title could not be more stress-inducing…

Thankfully I had a pretty clear vision in my head (which was also why I wanted to talk about it) so the motivation was there to drag me along for the ride. Another great benefit of developing tools for data visualisation in general and animation in particular, is that it sets Twitter on fire. After getting tweenr and transformr into a shape sufficient to support gganimate, I began to create the backbone of the package, and once I shared the first animation created with it, it was clear that I was in the pursuit of something that resonated with a lot of people.

To my great surprise I was able to get gganimate to a state where it actually supported the main grammar I had in mind prior to useR, and I could begin to make the presentation I had in mind.

useR was a great experience, not only because I was able to give the talk I had hoped for, but also due to the greatness of the organisers and the attendees. I was able to get to meet a lot of the members of R Core for the first time and they were very supportive of my quest to improve the performance of the R graphic stack (last slide of my talk), so I had high hopes that this might be achievable within the next 5-10 years (it is no small task). I had been surprised about the support for my ideas about animations and their relevance within the R community, so in general the conference left me invigorated and with the stamina to complete gganimate.

Intermezzo

I managed to release a couple of packages that do not fit into the narrative I’m trying to create for this year, but they deserve a mention nonetheless.

In the beginning of the year I was able to finish off particles, a port and extension of the d3-force algorithm developed by Mike Bostock. It can be used for both great fun and work and did among other things result in this beautiful pixel-decomposition of Hadley:

While making improvements to tweenr in anticipation of gganimate it became clear that colour conversion was a main bottleneck and I ended up developing farver to improve on this. Beyond very fast colour conversion it also allows a range of different colour distance calculations to be performed. Some of the discussion that followed the development of this package led to Brodie Gaslam improving the colour conversion performance in base R and while it is not as fast as farver, it is pretty close and future versions of R will definitely benefit from his great contribution.

I haven’t had much time to make generative art this year, but I did manage to find time for some infrastructure work that will support my endeavours in this space in the future. The ambient package is able to produce all sorts of multidimensional noise in a very performant way due to the speed of the underlying C++ library. I’m planning to expand on this package quite a bit when I get the time as I have lots of cool ideas for how to manipulate noise in a tidy manner.

How you use colours in data visualisation is extremely important, which is also why the data visualisation community has embraced the viridis colour scale to the extent that they have. I’ve personally grown tired of the aesthetic though, so when I saw a range of perceptually uniform palettes developed by Fabio Crameri I was quick to bring them to R with the scico package. To my surprise the development of a colour palette package became my most contentious contribution this year (that I know of), so I welcome everyone who is tired of colour palette packages to ignore it altogether.

transition_hobby_work()

Prior to useR I had begun to receive some cryptic questions from Hadley and it was clear that he was either trolling me or that something was brewing. During the late summer it became clear that it was the latter (thankfully), as RStudio wanted me to work full time on improving the R graphic stack. Working for RStudio on something so aligned with my own interest is beyond what I had hoped for, so despite my joy in working for the Danish tax authorities the switch was a no-brainer. I wish my former office all the best (they are doing incredible work) and look forward to seeing some of them at RStudio conf in Austin later in the month.

Being part of the tidyverse team has so far been a great experience. I’ve been lucky enough to meet several of them already as part of the different conferences I attended this year, so working remotely with them doesn’t feel that strange. It can be intimidating to work with such a talented team, but if that is the least of my concerns I’m pretty sure I can manage that.

I look forward to sharing the performance improvements I’m making with all of you throughout the coming years, and hopefully I’ll have time to also improve on some of my packages that have received less attention during the development of gganimate.

Happy New Year!

transformr: Age of Spatial

Sun, 09 Dec 2018 00:00:00 GMT

Once again, it gives me great pleasure to announce that a new package has joined CRAN. transformr is the spatial brother of tweenr and as with the tweenr update a few months ago, this package is very much driven by the infrastructural needs of gganimate. It is probably the last piece needed before I can begin preparing gganimate for CRAN, so if you are waiting for that there is indeed reason for celebration.

Becoming Spatial

As written above, transformr is tweenr for spatial data (spatial being used in a very broad sense as any data that is partly coordinates). To understand what this means we’ll briefly have to touch on a core concept of tweenr. What is never said out loud, but generally implied, is that tweenr treats all columns of the data frame as independent. This is generally a sound principle as you don’t want values from other columns to influence how e.g. the colour transitions between black and blue. As far as spatial data is concerned, this approach also works fine as long as each row in the data frame encodes a single independent point in space, or if there’s a one-to-one mapping between points in a polygon. Alas, the devil’s in the detail, and tweenr breaks down in magnificent ways if you try to tween between more complicated and heterogeneous shapes, e.g. a star and a circle. This is not something unique to tweenr, mind you; d3.js also has this limitation. The problems in d3 led Noah Veltman to develop the flubber JavaScript library. His reasons for developing it are succinctly described in the animation below, grabbed from the readme of flubber.

The Trials of the Polygon

So, what’s the deal with polygons exactly? Why don’t they just do as you expect them to and morph naturally from one to the other? The sad state of affairs is that there are multiple reasons for that:

1. There might be a discrepancy between the number of points that make up the two polygons. This may lead to part of the shape simply appearing or disappearing at the start or end of the tween.
2. The winding of the polygons may have a different angular offset and/or direction. This means that the tween will include rotation and/or inversion, something that is often undesirable.
3. There may be a discrepancy in the number of polygons that make up the two shapes you tween between and/or a discrepancy between the number of holes. As with 1. this may lead to parts of the shapes suddenly appearing or disappearing during the tween.
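
The winding problem is easy to make concrete in a few lines of base R. The sketch below is only an illustration: the helpers `signed_area()` and `is_clockwise()` are made up for this post and are not part of tweenr or transformr. It uses the shoelace formula to tell in which direction the points of a polygon run, assuming the polygon is stored as plain x and y vectors without a repeated closing point.

```r
# Signed area via the shoelace formula.
# Positive = counter-clockwise winding, negative = clockwise.
signed_area <- function(x, y) {
  x_next <- c(x[-1], x[1])
  y_next <- c(y[-1], y[1])
  sum(x * y_next - x_next * y) / 2
}

# Hypothetical helper: TRUE if the points run clockwise
is_clockwise <- function(x, y) signed_area(x, y) < 0

# A unit square, listed counter-clockwise
square <- data.frame(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))
# The same shape with the point order reversed
reversed <- square[rev(seq_len(nrow(square))), ]

signed_area(square$x, square$y)       # 1
is_clockwise(reversed$x, reversed$y)  # TRUE
```

Two shapes that look identical on screen can thus still disagree on winding, and a row-by-row tween between them would turn the shape inside out halfway through.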

Running the Gauntlet

transformr tries to solve the three problems above in much the same way as flubber does, at least conceptually. There are enough differences between how JavaScript and R (as well as d3 and tweenr) work with data that I decided to only take the ideas behind flubber and implement them in my own way, in a manner fitting for R, rather than doing a direct port of the library. This means that you cannot expect the two libraries to behave equivalently. Below is, at a very high level, what transformr does to address the three problems outlined above:

1. Points are added along the edges of the shape with the fewest corners until the number of points matches between the shapes. Points are added so that long edges will be divided more often than short edges in order to even out the edge lengths of the final shape. Further, if any shape has fewer than a given number of corners, points will be added (following the same strategy) until the number of corners is reached.
2. After the number of points is evened out, the winding direction is matched between the shapes (as clockwise), and the last shape is rotated until the squared distance between point pairs of the two shapes is minimised.
3. This is addressed first (but is the least prevalent problem so it is mentioned last). If there is a different number of polygons in the two states you wish to tween between, the polygons in the state with the fewest polygons are cut until the number matches. Once again, the cuts are distributed so that large polygons are cut more often than small ones. After the cutting, polygons between the states are matched by minimising distance and area difference. If there are differences in the number of holes in the matched polygons, zero-area holes are inserted at the gravitational center of the polygon with the fewest holes until the number matches.
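
To give a feel for the first two steps, here is a rough base R sketch. It is not how transformr is implemented (the real code is more careful about how points are distributed), and the helpers `add_points()` and `best_offset()` are hypothetical names used only for illustration. The first repeatedly splits the longest edge until a target number of points is reached; the second finds the cyclic shift of the second shape that minimises the summed squared distance between point pairs.

```r
# Split the longest edge at its midpoint until the ring has n points.
# poly: data.frame with columns x and y, closing point not repeated.
add_points <- function(poly, n) {
  while (nrow(poly) < n) {
    nxt <- c(seq_len(nrow(poly))[-1], 1)
    len <- sqrt((poly$x[nxt] - poly$x)^2 + (poly$y[nxt] - poly$y)^2)
    i <- which.max(len)
    mid <- data.frame(x = (poly$x[i] + poly$x[nxt[i]]) / 2,
                      y = (poly$y[i] + poly$y[nxt[i]]) / 2)
    poly <- rbind(poly[seq_len(i), , drop = FALSE],
                  mid,
                  poly[-seq_len(i), , drop = FALSE])
  }
  rownames(poly) <- NULL
  poly
}

# Find the cyclic shift k (0-based) of `to` that minimises the summed
# squared distance to `from`. Both must have the same number of rows.
best_offset <- function(from, to) {
  n <- nrow(from)
  cost <- vapply(seq_len(n) - 1, function(k) {
    idx <- (seq_len(n) - 1 + k) %% n + 1
    sum((from$x - to$x[idx])^2 + (from$y - to$y[idx])^2)
  }, numeric(1))
  which.min(cost) - 1
}

triangle <- data.frame(x = c(0, 4, 0), y = c(0, 0, 3))
five <- add_points(triangle, 5)
five$x  # 0 2 4 2 0

# The same ring, but starting two points earlier
shifted <- five[c(4, 5, 1, 2, 3), ]
best_offset(five, shifted)  # 2
```

Splitting the longest edge first is one simple way to get the "long edges are divided more often" behaviour described above; the brute-force search over offsets is quadratic in the number of points, which is fine for a sketch but not necessarily what you would want in production.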

The Ways of the Transformr

At this point we have only talked about shapes (and polygons), so let’s get a bit more concrete. transformr currently recognises three data types: polygons, paths, and simple features. Polygons encompass simple polygons as well as polygons with any number of holes. Paths can be either single or multipaths. Simple features as implemented by the sf package are supported, currently covering the (multi)point, (multi)path, and (multi)polygon types.

In terms of tween type support, transformr currently extends the tween_state() API from tweenr but support for the other types of tweeners will be added with time.

Some Examples

At this point an example is probably in order. We’ll start with what we first identified as a problematic case: morphing between a circle and a star:

library(transformr)
library(ggplot2)

# Helpers included in transformr
circle <- poly_circle()
star <- poly_star()

# The data is a simple data.frame as you would feed into ggplot2
head(star)

##              x          y id
## 1 0.000000e+00  1.0000000  1
## 2 2.938926e-01  0.4045085  1
## 3 9.510565e-01  0.3090170  1
## 4 4.755283e-01 -0.1545085  1
## 5 5.877853e-01 -0.8090170  1
## 6 6.123234e-17 -0.5000000  1

# We use tween_polygon to morph between the two
morph <- tween_polygon(circle, star, 
                       ease = 'linear',
                       id = id,
                       nframes = 12)

# You get back a data.frame with the same special columns as with tweenr
head(morph)

##            x         y id .id .phase .frame
## 1 0.00000000 1.0000000  1   1    raw      1
## 2 0.01745241 0.9998477  1   1    raw      1
## 3 0.03489950 0.9993908  1   1    raw      1
## 4 0.05233596 0.9986295  1   1    raw      1
## 5 0.06975647 0.9975641  1   1    raw      1
## 6 0.08715574 0.9961947  1   1    raw      1

# Let's see the result
ggplot(morph) + 
  geom_polygon(aes(x = x, y = y, group = id), fill = NA, colour = 'black') + 
  facet_wrap(~.frame, labeller = label_both, ncol = 3) + 
  theme_void()

What would happen if we upped the stakes a bit? Let’s try with a star with a hole, morphing into three circles:

circles <- poly_circles()
star_hole <- poly_star_hole()

morph <- tween_polygon(circles, star_hole, 
                       ease = 'linear',
                       id = id, 
                       nframes = 12,
                       match = FALSE)

ggplot(morph) + 
  geom_polygon(aes(x = x, y = y, group = id), fill = NA, colour = 'black') + 
  facet_wrap(~.frame, labeller = label_both, ncol = 3) + 
  theme_void()

We introduced a new argument in tween_polygon() here. match is used to define whether polygons are matched by the value of id or whether all polygons in the first state should somehow morph into all polygons in the last state. If we set match = TRUE, we can use the enter and exit arguments to define what should happen to unmatched polygons:

morph <- tween_polygon(circles, star_hole, 
                       ease = 'linear',
                       id = id, 
                       nframes = 12,
                       match = TRUE,
                       exit = function(.x) transform(.x, x = mean(x), y = mean(y)))

ggplot(morph) + 
  geom_polygon(aes(x = x, y = y, group = id), fill = NA, colour = 'black') + 
  facet_wrap(~.frame, labeller = label_both, ncol = 3) + 
  theme_void()

You’ll see a weird glitch above with the hole in the star reaching out to the edge, but this is simply ggplot2 not knowing how to deal with holed polygons in geom_polygon() — I’ll handle that in another post…

What is not shown above is that transformr and tween_polygon() work well together with keep_state() from tweenr and that it is pipe-able, but if you are used to tween_state() this will all come naturally…

While path and sf morphing works in much the same way as shown above, I’ll quickly showcase it for completeness:

spiral <- path_spiral()
waves <- path_waves()

morph <- tween_path(spiral, waves,
                    ease = 'linear',
                    nframes = 12, 
                    id = id,
                    match = FALSE)

ggplot(morph) + 
  geom_path(aes(x = x, y = y, group = id), colour = 'black') + 
  facet_wrap(~.frame, labeller = label_both, ncol = 3) + 
  theme_void()

circle_st <- sf::st_sf(geometry = sf::st_sfc(poly_circle(st = TRUE)))
north_carolina <- sf::st_read(system.file("shape/nc.shp", package = "sf"), 
                              quiet = TRUE)
north_carolina <- st_normalize(sf::st_combine(north_carolina))
north_carolina <- sf::st_sf(geometry = sf::st_sfc(north_carolina))

morph <- tween_sf(north_carolina, circle_st,
                  ease = 'linear',
                  nframes = 12)

ggplot(morph) + 
  geom_sf(aes(geometry = geometry), colour = 'white', fill = 'black', size = .1) + 
  facet_wrap(~.frame, labeller = label_both, ncol = 3) + 
  coord_sf(datum = NULL) + 
  theme_void()

As can be seen, transformr can handle most of the things you choose to throw at it when it comes to morphing between different shapes. It is used under the hood in gganimate to power polygon, path, and sf geom transitions (and derivatives thereof), but can just as well be used directly in the same way as tweenr can…

I do hope you’ll enjoy transformr either simply through the magic of gganimate or by playing with it directly — the results can be quite mesmerizing…

The tweenr is all grown up

Mon, 22 Oct 2018 00:00:00 GMT

NOTE: tweenr was released some time ago but a theft of my computer while writing the release post meant that I only just finished writing about it now.

I’m very happy to once again announce a package update, as a new major version of tweenr is now on CRAN. This release, while significant in itself, is also an important part of getting gganimate on CRAN, so if you care about gganimate but have never heard of tweenr you should be happy nonetheless.

tweenr was my first sort-of popular package and filled a gap in the gganimate version of yore, where smooth transitions were something you had to bring yourself. It has lived almost unchanged since its initial release, but as I began to develop the next iteration of gganimate it became clear that new functionality was needed. Some of it ended up in transformr but a huge chunk has been added to tweenr itself. A description of everything new follows below.

Something new, something old

The main API of the previous version of tweenr consisted of tween_states(), tween_elements(), and tween_appear(). All of these needed serious changes in both capabilities and API, to the extent that all prior code would break, so instead I decided to keep the old functions unchanged and create new ones for a brighter future.

tween_states ⇒ tween_state/keep_state

tween_states() was perhaps the most used of the old functions. It takes a list of data.frames and a specification of how long transitions between them should take and how long it should pause at each data.frame. This is perfect for situations where you have discrete states and want to have a smooth transition between them. Still it had certain shortcomings, such as requiring that each data.frame included the same number of rows etc. (meaning that each “element” should be present in each state). Further I have ended up finding it a bit clumsy to use. The new functions that should serve the same needs as tween_states() are tween_state() and keep_state(), and they are much more powerful. The biggest difference, perhaps, is that tween_state() only takes a from and to state and not an arbitrarily long list of states. If transitions are needed between more states you’ll need to chain calls together (potentially with keep_state() if the state should pause between transitions). It will look something like this:

library(tweenr)
library(magrittr) # provides the %>% pipe used below
irises <- split(iris, iris$Species)
iris_tween <- irises$setosa %>% 
  tween_state(irises$versicolor, ease = 'cubic-in-out', nframes = 10) %>% 
  keep_state(5) %>% 
  tween_state(irises$virginica, ease = 'linear', nframes = 15) %>% 
  keep_state(5)

As can be seen, if you like piping you’ll feel right at home. Apart from the seemingly superficial change in API, tween_state() also packs some new tricks. One of these is per-column easing, so you can specify different easing functions for different variables. Another, more fundamental one, is the possibility of specifying an id to match rows by. Ultimately this means that rows no longer need to be matched by position and that you can now tween between states with different numbers of elements. All of this is so important that it will get its own section later on in Enter and Exit.
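
To illustrate what matching by id buys you, here is a small base R sketch of the bookkeeping involved. It does not use tweenr at all and the data is made up; it simply shows how two states with different row order and different elements can be split into persisting, entering, and exiting rows once an id is available.

```r
from <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3))
to   <- data.frame(id = c("c", "a", "d"), x = c(30, 10, 40))

# Position of each `from` row in `to` (NA if it has no counterpart)
idx <- match(from$id, to$id)

persisting <- from$id[!is.na(idx)]  # present in both states
exiting    <- setdiff(from$id, to$id)  # only in the first state
entering   <- setdiff(to$id, from$id)  # only in the second state

# Halfway point for the persisting rows, matched by id rather than position
halfway <- data.frame(
  id = persisting,
  x  = (from$x[!is.na(idx)] + to$x[idx[!is.na(idx)]]) / 2
)

persisting  # "a" "c"
exiting     # "b"
entering    # "d"
halfway$x   # 5.5 16.5
```

Had the rows been matched by position instead, "a" would have been tweened towards the value belonging to "c", which is exactly the kind of mix-up the id is there to prevent.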

tween_elements ⇒ tween_components

While tween_states() was probably the most used of the legacy functions it was by no means the only one. tween_elements() was a very powerful function that let you specify different individual states for each element in a single data.frame and then expand this to an arbitrary number of frames. The changes that its heir brings are less dramatic than what happened with tween_states(), and simply add the same features that tween_state() introduced. This means per-column easing and the same features as described below in Enter and Exit. Further it changes the semantics of a couple of variables so they are now tidy evaluated. This means that state specifications can be calculated on the fly, rather than having to exist inside the data.frame to be tweened.

tween_appear ⇒ tween_events

tween_appear() never really felt right. The purpose was to let each row appear in a specific frame while still allowing the user to define how it should appear. To this end it expanded the data.frame by giving each row an age in each frame (negative age meant that it had yet to appear) and then let the user do with this as they pleased. What was missing was the whole idea of Enter and Exit which I have already plugged multiple times. tween_events() is a pretty radical change in order to solve the problem that tween_appear() originally tried (but failed) to solve.

Enter and Exit

Before we go any further with other new tweening functions I think I owe it to you to describe what all this entering and exiting I’ve been talking about really is. If you have dabbled in D3.js the two words will be familiar but they have slightly different meanings in tweenr. In D3 enter and exit describe a selection of data that did not match in a data join between current and next state, while in tweenr they are functions that modify data that is going to appear or disappear. If enter and/or exit is not given, the data will just pop into existence in the first frame it relates to and disappear without a trace after the last frame it relates to has ended. If you provide e.g. an enter function this will be applied to all elements when they first appear. The result of the function will then be inserted into the tweening prior to the original data so that any changes the function does will gradually change to the original data. This may sound quite confusing, but in essence it means that if you pass in an enter function that sets the transparency variable to zero you’ll get a gradual fade-in effect. The exit function is just like it, but in reverse. But let’s see it in effect instead:

df1 <- data.frame(x = 1:2, y = 2, alpha = 1) # 2 rows
df2 <- data.frame(x = 2:0, y = 1, alpha = 1) # 3 rows

fade <- function(data) {
  data$alpha <- 0
  data
}

tween <- tween_state(df1, df2, ease = 'linear', nframes = 5, enter = fade)

tween$alpha[tween$.id == 3]

## [1] 0.25 0.50 0.75 1.00

We can see that the alpha value of the third element (the one that doesn’t exist in the first state) gradually increases from zero to one. The reason why 0 isn’t included is that the enter and exit functions return virtual states that don’t remain in the data - only the transition to and from them does.
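
If the idea of a virtual state still feels abstract, the numbers above can be reproduced by hand. The following base R sketch mimics the arithmetic for the entering row only; it is an illustration of the principle under the assumption of linear easing, not the code tweenr runs internally.

```r
nframes <- 5

# The row that only exists in the second state (third row of df2)
original <- data.frame(x = 0, y = 1, alpha = 1)

# The same enter function as above
fade <- function(data) {
  data$alpha <- 0
  data
}

# The virtual state the row transitions from
virtual <- fade(original)

# Linear tween positions for every frame, 0 = virtual state, 1 = original
pos <- seq(0, 1, length.out = nframes)
alpha <- virtual$alpha + (original$alpha - virtual$alpha) * pos

# The virtual state itself (position 0) is dropped from the result
alpha[-1]  # 0.25 0.50 0.75 1.00
```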

New tweens on the block

Apart from the new versions of the old functionality discussed above this version also includes some brand new tweens, mainly implemented to serve needs in gganimate but of course also available for everyone else.

tween_along

This is the tweening function that powers transition_reveal() in gganimate - if you have played with that you are fairly well situated to understand what it does. In essence it allows you to specify time points for the different rows in your data.frame and then tween between these. You might think that this is exactly what tween_components() does and you’d partly be right. The big difference is that tween_along() ensures equidistant frames, whereas tween_components() assigns all rows in the data to the nearest frame and then uses the frame as a time variable. The latter will always have the raw data appearing in one frame or another while the former will not. Further, tween_along() will optionally keep earlier rows in your data at each frame, which is useful if you e.g. want a line to gradually appear along an axis.
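
The difference between the two strategies can be sketched in base R. This is a simplification with made-up data and it does not call tweenr; `approx()` stands in for a linear tween, and rounding to the nearest frame stands in for how raw rows get assigned to frames.

```r
time  <- c(0, 1.2, 4.6, 10)
value <- 10 * time
nframes <- 6

# "Along" style: frames are equidistant in time and values are interpolated
frame_time <- seq(min(time), max(time), length.out = nframes)
along <- approx(time, value, xout = frame_time)$y

# "Components" style: each raw row is snapped to its nearest frame
nearest_frame <- round((time - min(time)) / diff(range(time)) * (nframes - 1)) + 1

frame_time     # 0 2 4 6 8 10
along          # 0 20 40 60 80 100
nearest_frame  # 1 2 3 6
```

In the first approach the raw observations at time 1.2 and 4.6 never show up in a frame as they are, whereas in the second they do, at the cost of being shifted slightly in time.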

tween_at

This is a pretty low level tweening function intended to get an exact state between two data frames or vectors. It takes two states and a numeric vector giving the tween position for each row and then calculates the intermediary rows.

tween_at(mtcars[1:3, ], mtcars[4:6, ], runif(3), 'linear')

##        mpg      cyl     disp       hp     drat       wt     qsec        vs
## 1 21.27039 6.000000 226.2455 110.0000 3.345701 3.022205 18.47440 0.6759747
## 2 20.51284 6.423621 202.3621 123.7677 3.741142 2.994673 17.02000 0.0000000
## 3 19.77526 5.287121 183.2966 100.7227 3.148519 3.053659 19.64613 1.0000000
##          am     gear     carb
## 1 0.3240253 3.324025 1.972076
## 2 0.7881894 3.788189 3.576379
## 3 0.3564393 3.356439 1.000000

This is unlikely to be useful for directly creating animations (though I guess it is low-level enough to be able to be shoehorned into anything). It is being used in gganimate for calculating shadow falloff in shadow_wake().
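
For the linear case the calculation is simple enough to spell out in base R. The helper `lerp_rows()` below is a hypothetical stand-in written for this post, not a tweenr function, and it only handles numeric columns with linear easing.

```r
# Linear interpolation between two data frames, with one position per row.
# at = 0 returns the row from `from`, at = 1 the row from `to`.
lerp_rows <- function(from, to, at) {
  stopifnot(nrow(from) == nrow(to), length(at) == nrow(from))
  out <- Map(function(a, b) a + (b - a) * at, from, to)
  as.data.frame(out)
}

from <- mtcars[1:3, c("mpg", "vs")]
to   <- mtcars[4:6, c("mpg", "vs")]

res <- lerp_rows(from, to, at = c(0.5, 0.25, 1))
res$mpg  # 21.200 20.425 18.100
res$vs   # 0.5 0.0 1.0
```

Each row gets its own position, so the first row ends up halfway between the states, the second a quarter of the way, and the third is simply the row from the second state.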

tween_fill

This tween takes a page out of tidyr::fill and simply fills out missing elements in a data frame or vector. Instead of being boring and repeating the prior or following data it does the tweenr thing and tweens between them:

mtcars2 <- mtcars[1:7, ]
mtcars2[2:6, ] <- NA

tween_fill(mtcars2, 'cubic-in-out')

##        mpg      cyl     disp    hp     drat       wt     qsec vs
## 1 21.00000 6.000000 160.0000 110.0 3.900000 2.620000 16.46000  0
## 2 20.87593 6.037037 163.7037 112.5 3.887222 2.637593 16.44852  0
## 3 20.00741 6.296296 189.6296 130.0 3.797778 2.760741 16.36815  0
## 4 17.65000 7.000000 260.0000 177.5 3.555000 3.095000 16.15000  0
## 5 15.29259 7.703704 330.3704 225.0 3.312222 3.429259 15.93185  0
## 6 14.42407 7.962963 356.2963 242.5 3.222778 3.552407 15.85148  0
## 7 14.30000 8.000000 360.0000 245.0 3.210000 3.570000 15.84000  0
##           am     gear carb
## 1 1.00000000 4.000000    4
## 2 0.98148148 3.981481    4
## 3 0.85185185 3.851852    4
## 4 0.50000000 3.500000    4
## 5 0.14814815 3.148148    4
## 6 0.01851852 3.018519    4
## 7 0.00000000 3.000000    4

Neat-o…
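
If you are curious where those numbers come from, the mpg column can be reproduced with a few lines of base R. This is a sketch under the assumption that 'cubic-in-out' follows the standard cubic in-out easing formula; `cubic_in_out()` and `fill_eased()` are hypothetical helpers written for this post and only handle a single numeric vector.

```r
# Standard cubic in-out easing on the unit interval
cubic_in_out <- function(t) {
  ifelse(t < 0.5, 4 * t^3, 1 - (-2 * t + 2)^3 / 2)
}

# Fill runs of NA between known values by easing from one to the other
fill_eased <- function(x, ease = cubic_in_out) {
  known <- which(!is.na(x))
  for (i in seq_len(max(length(known) - 1, 0))) {
    lo <- known[i]
    hi <- known[i + 1]
    if (hi - lo < 2) next  # nothing to fill between neighbours
    t <- (seq(lo, hi) - lo) / (hi - lo)
    x[lo:hi] <- x[lo] + (x[hi] - x[lo]) * ease(t)
  }
  x
}

mpg <- c(21, NA, NA, NA, NA, NA, 14.3)
filled <- fill_eased(mpg)
round(filled, 5)
# 21.00000 20.87593 20.00741 17.65000 15.29259 14.42407 14.30000
```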

Grab bag of niceties

There are more subtle additions to tweenr as well, most of which have also been driven by gganimate needs. Here’s an unceremonious list:

More information: In olden days tweenr simply added a .frame column to the output to identify the frame the row belonged to. It still does, but that column is now accompanied by a .phase column that tells if the data is raw, static, entering, exiting, or transitioning, and an .id column that identifies the same data across frames.
More support: The range of supported data types has been expanded considerably. Most notably list columns are now accepted. If the list only contains numeric vectors these vectors will get tweened accordingly, and if not the list will be treated as constant.
Better colour support: Colour has always been tweened in the LAB representation to get a more natural transition. In the old version the native convertColor() function was used, but this could lead to substantial slowdown when tweening lots of data. To address this farver was developed and released and this version of tweenr naturally uses this for colour space conversions now. In addition, tweenr now supports hex-colours with alpha.

I think that is it, but frankly there have been so many additions that I may have missed a few… First to find something I missed gets a sticker!

What Are We Plotting, What Are We Animating

Mon, 24 Sep 2018 00:00:00 GMT

This is my first blog post about gganimate — a package I’ve been working on since mid-spring this year. I have many thoughts and lots to say about animation and gganimate, so much in fact that it has seemed too big a task to begin writing about. Further, I felt like I had to spend my time developing the thing in the first place.

So this is an alternative entrance into writing about gganimate — sort of a tech-note about a specific problem. There will still come a time for some more formal writing about the theory and use of gganimate but until then I’ll refer to my useR keynote for any words on my thoughts behind it all.

The Problem

When we animate data visualisations we often do it by calculating intermediary data points resulting in a smooth transition between the states represented by the raw data. In gganimate this is done by adding a transition which defines how data should be expanded across the animation frames. Underneath it all most transitions calculate intermediary data representations using tweenr and transformr — so far, so good.

What we have glossed over, and what is at the center of the problem, is what state of the data we decide to use as basis for our expansion. If you are not familiar with ggplot2 and the grammar of graphics this might be a strange phrasing (data is data), but if you are, you’ll know that data can undergo several statistical transformations before it is encoded into a visual property and put on paper (or screen). Some of the states the data goes through are:

1. Raw data as it is passed into the plotting function
2. Raw data with only the columns mapped to aesthetics present
3. Data transformed by a statistic
4. Data with aesthetics mapped to a scale
5. Data with default aesthetic values added
6. Data transformed by the geom

If you prepare your data for animation beforehand (e.g. using tweenr), you’re only able to touch the data at the first state and are thus limited in what you can do. If there is a one-to-one mapping between the raw data and the final visual encoding this might not be a problem, but it breaks down spectacularly when the statistical transformation imposes a grouping of the data into a shared visual encoding, e.g. a box-plot. Consider the task of calculating intermediary data for a transition from one box-plot showing statistics for 10 points, to another box-plot showing statistics for 15 points. If you could only use the raw data your atomic observations would suddenly have to change from 10 to 15 values in a smooth manner. On the other hand, if you could calculate the statistics used to draw the two box-plots and then calculate intermediary statistics instead, this discrepancy in the underlying data would not pose any problem. Indeed, the latter approach is what is done in gganimate: all data expansion is performed after statistics have been calculated. In fact, all expansion is done when data has reached state 5. Why wait so long? A simple example to explain this is the case of colour (or fill) aesthetics. If they are mapped to a categorical variable there will be no way to create a smooth transition based on the raw data. On the other hand, if we wait until the raw data has been mapped to its final colour value, we may smoothly transition the colour itself, ignoring the fact that the intermediary colours do not correspond to any meaningful category in the raw data.
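
The box-plot case can be sketched in base R. The snippet below is a simplified stand-in for what happens inside gganimate: it uses `fivenum()` instead of the ggplot2 box-plot statistic (which treats whiskers and outliers differently), and the data and frame count are made up. The point is only that five summary numbers can be tweened regardless of how many observations they were computed from.

```r
set.seed(1)
a <- rnorm(10)            # 10 observations in the first state
b <- rnorm(15, mean = 3)  # 15 observations in the second state

# Summarise first: minimum, lower hinge, median, upper hinge, maximum
stats_a <- fivenum(a)
stats_b <- fivenum(b)

# Then tween the five statistics linearly across 5 frames
positions <- seq(0, 1, length.out = 5)
frames <- t(sapply(positions, function(p) stats_a + (stats_b - stats_a) * p))
colnames(frames) <- c("ymin", "lower", "middle", "upper", "ymax")

frames
```

Every row of `frames` describes a drawable box, even though there is no set of 10, 11, or 15 raw observations behind the intermediary ones.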

The Curious Case of Tessellation

So, “what is the problem?”, you may ask. Indeed, this approach is almost universally good, to the extent that you might just ignore the existence of other approaches… But the devil’s in the detail, so let’s make a plot:

library(ggplot2)
library(ggforce)

data <- data.frame(
  x = runif(20),
  y = runif(20),
  state = rep(c('a', 'b'), 10)
)

ggplot(data, aes(x = x, y = y)) + 
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1)) + 
  geom_point() + 
  facet_wrap(~state)

Now, think about what you would expect a transition between the two panels to look like - my guess is that it is nothing like below:

library(gganimate)
ggplot(data, aes(x = x, y = y)) + 
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1)) + 
  geom_point() + 
  transition_states(state, transition_length = 3, state_length = 1) + 
  ease_aes('cubic-in-out')

Okay, what is going on? To be honest I had a different expectation about how this would fail when I started writing this. The reason why the voronoi tiles are static (and calculated based on all the points) is that the voronoi tessellation is calculated on the full panel data. At the time the voronoi tile statistic receives the data it all just belongs to the same panel since gganimate differentiates states using the group aesthetic. To show you how I expected this example to break down we’ll have to tell the voronoi stat to tessellate based on the groups instead:

ggplot(data, aes(x = x, y = y)) + 
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1),
                    by.group = TRUE) + 
  geom_point() + 
  transition_states(state, transition_length = 3, state_length = 1) + 
  ease_aes('cubic-in-out')

Now, at least it is wrong in the way that I expected it to be. Why is this wrong? The tessellation stat outputs polygon data that is then drawn by a polygon geom, so gganimate does the best it can to transition these polygons smoothly between the states. In this example this is not what we expected though. We expect a tessellation to always be true, even during the transition so the tessellation should be calculated for each frame, based on intermediary point positions. In other words, here we want the expansion to happen on the raw data.

library(tweenr)
library(magrittr)
data <- split(data, data$state)

data <- tween_state(data[[1]], data[[2]], 'cubic-in-out', 40) %>% 
  keep_state(10) %>% 
  tween_state(data[[1]],'cubic-in-out', 40) %>% 
  keep_state(10)

ggplot(data, aes(x = x, y = y)) + 
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1),
                    by.group = TRUE) + 
  geom_point() + 
  transition_manual(.frame)

Ah, we have finally arrived at the expected animation, but what a mess of a journey.

Who Plots Tessellation Anyway?

You may think the above example is laughably contrived; this may even be the first time you’ve heard of voronoi tessellation. Hold my beer, because it is about to get even worse, even using a geom from ggplot2 itself. We’ll start with a plot again:

data <- data.frame(
  x = c(rnorm(50, mean = 5, sd = 3), rnorm(40, mean = 2, sd = 1)),
  y = c(rnorm(50, mean = -2, sd = 7), rnorm(40, mean = 6, sd = 4)),
  state = rep(c('a', 'b'), c(50, 40))
)

ggplot(data, aes(x = x, y = y)) +
  geom_contour(stat = 'density_2d') + 
  facet_wrap(~state)

And how might this look if we transition between a and b?

ggplot(data, aes(x = x, y = y)) +
  geom_contour(stat = 'density_2d') + 
  transition_states(state, transition_length = 3, state_length = 1) + 
  ease_aes('cubic-in-out')

Oh my… The problem is more or less the same as with the tessellation - the stat creates a primitive data representation (here paths and not polygons) and gganimate does its best at transitioning those, but in doing this the intermediary frames do not resemble contour lines at all, but rather a bowl of spaghetti.

So, could we fix it in the same way? Just prepare the data beforehand. Well, not really as we run into the first problem discussed, way up at the beginning of the blog. There is really no meaningful way of transitioning 50 points into 40. We could remove 10 and move the remaining 40, but in terms of the derived density this would look messy (but let’s try anyway):

data2 <- split(data, data$state)
data2 <- tween_state(data2[[1]], data2[[2]], 'cubic-in-out', 40) %>% 
  keep_state(10) %>% 
  tween_state(data2[[1]], 'cubic-in-out', 40) %>% 
  keep_state(10)

ggplot(data2, aes(x = x, y = y)) +
  geom_contour(stat = 'density_2d') + 
  transition_manual(.frame)

It sort of does the right thing, but there is a noticeable switch in the density as the 10 points disappear and reappear.

What we really want to do is to calculate intermediary states of the 2D densities that the contours are derived from. The densities remove the point discrepancy while presenting a statistic that can be truthfully transitioned. Unfortunately the density data is only present ephemerally inside the stat function and is not accessible to the outside world (where gganimate resides). We could rewrite the density_2d stat to defer the contour transformation until after the transition:

StatDensityContour <- ggproto('StatDensityContour', StatDensity2d,
  compute_group = function (data, scales, na.rm = FALSE, h = NULL, contour = TRUE, 
                            n = 100, bins = NULL, binwidth = NULL) {
    StatDensity2d$compute_group(data, scales, na.rm = na.rm, h = h, contour = FALSE, 
                                n = n, bins = bins, binwidth = binwidth)
  },
  finish_layer = function(self, data, params) {
    names(data)[names(data) == 'density'] <- 'z'
    do.call(rbind, lapply(split(data, data$PANEL), function(d) {
      StatContour$compute_panel(d, scales = NULL, bins = params$bins, 
                                binwidth = params$binwidth)
    }))
  }
)

ggplot(data, aes(x = x, y = y)) +
  geom_contour(stat = 'density_contour') + 
  transition_states(state, transition_length = 3, state_length = 1) + 
  ease_aes('cubic-in-out')
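
To make the idea of transitioning the densities rather than the contour paths a bit more concrete, here is a minimal sketch outside of ggplot2 and gganimate. It is not what the stat above does internally; it assumes the MASS package is available, uses made-up data of the same shape as the example, and uses a plain linear blend (the blend_density() helper is hypothetical) instead of an easing function:

```r
library(MASS)

set.seed(1)
a <- data.frame(x = rnorm(50, mean = 5, sd = 3), y = rnorm(50, mean = -2, sd = 7))
b <- data.frame(x = rnorm(40, mean = 2, sd = 1), y = rnorm(40, mean = 6, sd = 4))

# Both densities must be evaluated on the same grid to be comparable
lims <- c(range(c(a$x, b$x)), range(c(a$y, b$y)))
dens_a <- kde2d(a$x, a$y, n = 50, lims = lims)
dens_b <- kde2d(b$x, b$y, n = 50, lims = lims)

# Hypothetical helper: linear blend of the two density surfaces,
# where t = 0 gives state a and t = 1 gives state b
blend_density <- function(t) {
  (1 - t) * dens_a$z + t * dens_b$z
}

# Contour lines for the halfway state, derived from the blended density
halfway <- blend_density(0.5)
halfway_lines <- contourLines(dens_a$x, dens_a$y, halfway)
length(halfway_lines)
```

Because the blend happens on the density grid, the number of points in each state no longer matters, and every intermediate frame is a set of genuine contour lines of some density.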

What to make of this?

You might feel like Alice who has stepped through the looking glass at this point. Should you always second guess whatever gganimate is doing? Of course not. The choice of interpolating the statistically transformed data is sound and will just work for most of what you want to do. I certainly want to allow gganimate to expand based on the raw data as well, though this has proven harder than expected as it is often only a subset of aesthetics you want to expand at that stage (remember the problem with unmapped colour/fill).

Even if early expansion gets implemented it will only solve problems such as the voronoi example. The last contour example runs deeper and touches upon the theory of the grammar of graphics and how ggplot2 itself implements it. Statistical transformations are often envisioned as a single operation, but can just as well be thought of as a chain of transformations (here density_2d -> contour). Alternatively one could think that it was the responsibility of the geom to calculate the contour lines. All-in-all the dichotomy of stat+geom is not as clear cut as it might appear, which has not been much of a problem when generating static plots. With the advent of gganimate this problem becomes more pertinent and I honestly don’t know the best way to address it. In a perfect world, all stats would return the data-state best fitted for expansion but this would require the finish_layer() hook to be more powerful, and would obviously require rewrites of a slew of geoms/stats. Then comes the question of whether it is even the responsibility of geom/stat developers to consider gganimate in the first place…

No matter the eventual solution to all this, I hope this post has made you a bit more aware of what happens to the data you plot as you pass it into ggplot2. Visualisations are after all first and foremost about data transformations…

Scico and the Colour Conundrum

Wed, 30 May 2018 00:00:00 GMT

I’m happy to once again announce the release of a package. This time it happens to be a rather unplanned and quick new package, which is fun for a change. The package in question is scico, which provides access to the colour palettes developed by Fabio Crameri as well as scale functions for ggplot2 so they can be used there. As there is not a lot to talk about in such a simple package I’ll also spend some time discussing why choice of colour is important beyond aesthetic considerations, and discuss how the quality of a palette might be assessed.

An overview of the package

scico provides a total of 17 different continuous palettes, all of which are available with the scico() function. For anyone having used viridis() the scico() API is very familiar:

library(scico)
scico(15, palette = 'oslo')

##  [1] "#000000" "#09131E" "#0C2236" "#133352" "#19456F" "#24588E" "#3569AC"
##  [8] "#4D7CC6" "#668CCB" "#7C99CA" "#94A8C9" "#ABB6C7" "#C4C7CC" "#E1E1E1"
## [15] "#FFFFFF"

In order to get a quick overview of all the available palettes use the scico_palette_show() function:

scico_palette_show()

As can be seen, the collection consists of both sequential and diverging palettes, both of which have their own uses depending on the data you want to show. A special mention goes to the oleron palette which is intended for topographical height data in order to produce the well known atlas look. Be sure to center this palette around 0 or else you will end up with very misleading maps.

ggplot2 support is provided with the scale_[colour|color|fill]_scico() functions, which work as expected:

library(ggplot2)
volcano <- data.frame(
  x = rep(seq_len(ncol(volcano)), each = nrow(volcano)),
  y = rep(seq_len(nrow(volcano)), ncol(volcano)),
  Altitude = as.vector(volcano)
)

ggplot(volcano, aes(x = x, y = y, fill = Altitude)) + 
  geom_raster() + 
  theme_void() +
  scale_fill_scico(palette = 'turku')
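
Returning to the advice about centring oleron around 0: one way to do this, sketched below, is to pass limits that are symmetric around zero to the scale. The symmetric_limits() helper and the elevation data are made up for illustration, and limits is assumed to be handled as the ordinary ggplot2 continuous scale argument:

```r
library(ggplot2)
library(scico)

# Hypothetical helper: limits that are symmetric around zero
symmetric_limits <- function(x) {
  max_abs <- max(abs(x), na.rm = TRUE)
  c(-max_abs, max_abs)
}

# Made-up elevation data with both land (> 0) and sea (< 0)
elevation <- expand.grid(x = 1:50, y = 1:50)
elevation$height <- with(elevation, 300 * sin(x / 8) + 200 * cos(y / 6) + 100)

ggplot(elevation, aes(x = x, y = y, fill = height)) + 
  geom_raster() + 
  theme_void() +
  scale_fill_scico(palette = 'oleron', limits = symmetric_limits(elevation$height))
```

With symmetric limits the midpoint of the palette, where sea colours switch to land colours, lines up with a height of 0 even when the data itself is lopsided.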

This is more or less all there is for this package… Now, let’s get to the meat of the discussion.

What’s in a colour?

If you’ve ever wondered why we are not all just using rainbow colours in our plots (after all, rainbows are pretty…) it’s because our choice of colour scale has a deep impact on what changes in the underlying data our eyes can perceive. The rainbow colour scale is still very common and notoriously bad - see e.g. Borland & Taylor (2007), The Rainbow Colour Map (repeatedly) considered harmful, and How The Rainbow Color Map Misleads - because it fails on two properties that are fundamental to designing good colour scales: perceptual uniformity and colour blind safety. Both of these issues have been taken into account when designing the scico palettes, but let’s tackle them one by one:

Colour blindness

Up to 10% of Northern European males have the most common type of colour blindness (deuteranomaly, also known as red-green colour blindness), while the number is lower for other population groups. In addition, other, rarer, types of colour blindness exist as well. In any case, the chance that a person with a colour vision deficiency will look at your plots is pretty high.

As we have to assume that the plots we produce will be looked at by people with a colour vision deficiency, we must make sure that the colours we use to encode data can be clearly read by them (ornamental colours are less important as they - hopefully - don’t impact the conclusion of the graphic). Thankfully there are ways to simulate how colours are perceived by people with various types of colour blindness. Let’s look at the rainbow colour map:

library(pals)

pal.safe(rainbow, main = 'Rainbow scale')

As can be seen, there are huge areas of the scale where key tints disappear, making it impossible to correctly map colours back to their original data values. Put this in contrast to one of the scico palettes:

pal.safe(scico(100, palette = 'tokyo'), main = 'Tokyo scale')

While colour blindness certainly has an effect here, it is less detrimental as changes along the scale can still be perceived and the same tint does not occur at multiple places.

Perceptual uniformity

While lack of colour blind safety “only” affects a subgroup of your audience, lack of perceptual uniformity affects everyone - even you. Behind the slightly highbrow name lies the criterion that equal jumps in the underlying data should result in equal jumps in perceived colour difference. Said in another way, every step along the palette should be perceived as giving the same amount of difference in colour.

One way to assess perceptual uniformity is by looking at small oscillations inside the scale. Let’s return to our favourite worst rainbow scale:

pal.sineramp(rainbow, main = 'Rainbow scale')

We can see that there are huge differences in how clearly the oscillations appear along the scale, and around the green area they even disappear. In comparison the scico palettes produce much more even results:

pal.sineramp(scico(100, palette = 'tokyo'), main = 'Tokyo scale')
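
As a rough numeric complement to the sine ramp, the "equal steps" criterion can be approximated by measuring the distance between neighbouring colours of a palette in CIE Lab space. This is only a sketch using base R: Euclidean distance in Lab is a crude stand-in for perceived difference, and palette_steps() is a hypothetical helper, not part of pals or scico:

```r
# Hypothetical helper: Euclidean distance in CIE Lab between neighbouring
# colours of a palette (one value per step along the palette)
palette_steps <- function(colours) {
  rgb <- t(grDevices::col2rgb(colours)) / 255
  lab <- grDevices::convertColor(rgb, from = 'sRGB', to = 'Lab')
  sqrt(rowSums(diff(lab)^2))
}

rainbow_steps <- palette_steps(rainbow(100))

# Coefficient of variation of the step sizes: 0 would mean perfectly even steps
sd(rainbow_steps) / mean(rainbow_steps)
```

The same function can be called on scico(100, palette = 'tokyo') for comparison; a lower coefficient of variation suggests more even steps along the palette.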

But wait - there’s more!

This is just a very short overview of the world of colour perception and how it affects information visualisation. The pals package contains more functions to assess the quality of colour palettes, some of which have been collected in an ensemble function:

pal.test(scico(100, palette = 'broc'), main = 'Broc scale')

It also has a vignette that explains in more detail how the different plots can be used to look into different aspects of the palette.

scico is also not the only package that provides well-designed, safe, colour palettes. RColorBrewer has been a beloved utility for a long time, as has the more recent viridis. Still, choice is good and using the same palettes for a prolonged time can make them seem old and boring, so the more the merrier.

A last honourable mention is the overview of palettes in R that Emil Hvitfeldt has put together. Not all of the palettes in it (the lion’s share actually) have been designed with the issues discussed above in mind, but sometimes that’s OK - at least you now know how to assess the impact of your choice and weigh it against the other considerations you have.

Always be wary of colours

lime v0.4: The kitten picture edition

Tue, 06 Mar 2018 00:00:00 GMT

I’m happy to report a new major release of lime has landed on CRAN. lime is an R port of the Python library of the same name by Marco Ribeiro that allows the user to pry open black box machine learning models and explain their outcomes on a per-observation basis. It works by modelling the outcome of the black box in the local neighborhood around the observation to explain and using this local model to explain why (not how) the black box did what it did. For more information about the theory of lime I will direct you to the article introducing the methodology.

New features

The meat of this release centers around two new features that are somewhat linked: Native support for keras models and support for explaining image models.

keras and images

J.J. Allaire was kind enough to namedrop lime during his keynote introduction of the tensorflow and keras packages and I felt compelled to support them natively. As keras is by far the most popular way to interface with tensorflow it is first in line for built-in support. The addition of keras means that lime now directly supports models from the following packages:

If you’re working on something too obscure or cutting edge to be able to use these packages it is still possible to make your model lime compliant by providing predict_model() and model_type() methods for it.
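
As a minimal sketch of what such methods could look like, consider a made-up model class. Everything about my_model below is hypothetical (it is just a list holding a cut-off on one iris column), and the return format, one probability column per class label, is my understanding of what lime expects from classifiers, so check the lime documentation before relying on it:

```r
# A hypothetical model class: a plain list holding a cut-off on one feature
my_model <- structure(list(cutoff = 3), class = 'my_model')

# Tell lime what kind of model this is
model_type.my_model <- function(x, ...) {
  'classification'
}

# For classification: return a data frame with one probability column per class
predict_model.my_model <- function(x, newdata, type, ...) {
  p <- stats::plogis(newdata$Petal.Length - x$cutoff)
  data.frame(small = 1 - p, large = p)
}

# The methods can be tried out directly, without involving lime
head(predict_model.my_model(my_model, iris, type = 'prob'))
```

With lime attached and these methods defined in the global environment, a call such as lime(iris[, 1:4], my_model) should then be able to dispatch to them.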

keras models are used just like any other model, by passing them into the lime() function along with the training data in order to create an explainer object. Because we’re soon going to talk about image models, we’ll be using one of the pre-trained ImageNet models that are available from keras itself:

library(keras)
library(lime)
library(magick)

model <- application_vgg16(
  weights = "imagenet",
  include_top = TRUE
)
model

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## input_1 (InputLayer)             (None, 224, 224, 3)           0           
## ___________________________________________________________________________
## block1_conv1 (Conv2D)            (None, 224, 224, 64)          1792        
## ___________________________________________________________________________
## block1_conv2 (Conv2D)            (None, 224, 224, 64)          36928       
## ___________________________________________________________________________
## block1_pool (MaxPooling2D)       (None, 112, 112, 64)          0           
## ___________________________________________________________________________
## block2_conv1 (Conv2D)            (None, 112, 112, 128)         73856       
## ___________________________________________________________________________
## block2_conv2 (Conv2D)            (None, 112, 112, 128)         147584      
## ___________________________________________________________________________
## block2_pool (MaxPooling2D)       (None, 56, 56, 128)           0           
## ___________________________________________________________________________
## block3_conv1 (Conv2D)            (None, 56, 56, 256)           295168      
## ___________________________________________________________________________
## block3_conv2 (Conv2D)            (None, 56, 56, 256)           590080      
## ___________________________________________________________________________
## block3_conv3 (Conv2D)            (None, 56, 56, 256)           590080      
## ___________________________________________________________________________
## block3_pool (MaxPooling2D)       (None, 28, 28, 256)           0           
## ___________________________________________________________________________
## block4_conv1 (Conv2D)            (None, 28, 28, 512)           1180160     
## ___________________________________________________________________________
## block4_conv2 (Conv2D)            (None, 28, 28, 512)           2359808     
## ___________________________________________________________________________
## block4_conv3 (Conv2D)            (None, 28, 28, 512)           2359808     
## ___________________________________________________________________________
## block4_pool (MaxPooling2D)       (None, 14, 14, 512)           0           
## ___________________________________________________________________________
## block5_conv1 (Conv2D)            (None, 14, 14, 512)           2359808     
## ___________________________________________________________________________
## block5_conv2 (Conv2D)            (None, 14, 14, 512)           2359808     
## ___________________________________________________________________________
## block5_conv3 (Conv2D)            (None, 14, 14, 512)           2359808     
## ___________________________________________________________________________
## block5_pool (MaxPooling2D)       (None, 7, 7, 512)             0           
## ___________________________________________________________________________
## flatten (Flatten)                (None, 25088)                 0           
## ___________________________________________________________________________
## fc1 (Dense)                      (None, 4096)                  102764544   
## ___________________________________________________________________________
## fc2 (Dense)                      (None, 4096)                  16781312    
## ___________________________________________________________________________
## predictions (Dense)              (None, 1000)                  4097000     
## ===========================================================================
## Total params: 138,357,544
## Trainable params: 138,357,544
## Non-trainable params: 0
## ___________________________________________________________________________

The vgg16 model is an image classification model that has been built as part of the ImageNet competition where the goal is to classify pictures into 1000 categories with the highest accuracy. As we can see it is fairly complicated.

In order to create an explainer we will need to pass in the training data as well. For image data the training data is really only used to tell lime that we are dealing with an image model, so any image will suffice. The format for the training data is simply the path to the images, and because the internet runs on kitten pictures we’ll use one of these:

img <- image_read('https://www.data-imaginist.com/assets/img/kitten.jpg')
img_path <- file.path(tempdir(), 'kitten.jpg')
image_write(img, img_path)
plot(as.raster(img))

Figure 1: Photo by Paul on Unsplash

As with text models the explainer will need to know how to prepare the input data for the model. For keras models this means formatting the image data as tensors. Thankfully keras comes with a lot of tools for reshaping image data:

image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(224,224))
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x)))
    x <- imagenet_preprocess_input(x)
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}
explainer <- lime(img_path, model, image_prep)

We now have an explainer model for understanding how the vgg16 neural network makes its predictions. Before we go along, let’s see what the model thinks of our kitten:

res <- predict(model, image_prep(img_path))
imagenet_decode_predictions(res)

## [[1]]
##   class_name class_description      score
## 1  n02124075      Egyptian_cat 0.48913878
## 2  n02123045             tabby 0.15177219
## 3  n02123159         tiger_cat 0.10270492
## 4  n02127052              lynx 0.02638111
## 5  n03793489             mouse 0.00852214

So, it is pretty sure about the whole cat thing. The reason we need to use imagenet_decode_predictions() is that the output of a keras model is always just a nameless tensor:

dim(res)

## [1]    1 1000

dimnames(res)

## NULL

We are used to classifiers knowing the class labels, but this is not the case for keras. Motivated by this, lime now has a way to define/overwrite the class labels of a model, using the as_classifier() function. Let’s redo our explainer:

model_labels <- readRDS(system.file('extdata', 'imagenet_labels.rds', package = 'lime'))
explainer <- lime(img_path, as_classifier(model, model_labels), image_prep)

There is also an as_regressor() function which tells lime, without a doubt, that the model is a regression model. Most models can be introspected to see which type of model they are, but neural networks don’t really care. lime guesses the model type from the activation used in the last layer (linear activation == regression), but if that heuristic fails then as_regressor()/as_classifier() can be used.

We are now ready to poke into the model and find out what makes it think our image is of an Egyptian cat. But… first I’ll have to talk about yet another concept: superpixels (I promise I’ll get to the explanation part in a bit).

In order to create meaningful permutations of our image (remember, this is the central idea in lime), we have to define how to do so. The permutations need to be substantial enough to have an impact on the image, but not so much that the model completely fails to recognise the content in every case - further, they should lead to an interpretable result. The concept of superpixels lends itself well to these constraints. In short, a superpixel is a patch of an area with high homogeneity, and superpixel segmentation is a clustering of image pixels into a number of superpixels. By segmenting the image to explain into superpixels we can turn areas of contextual similarity on and off during the permutations and find out if each area is important. It is still necessary to experiment a bit as the optimal number of superpixels depends on the content of the image. Remember, we need them to be large enough to have an impact but not so large that the class probability becomes effectively binary. lime comes with a function to assess the superpixel segmentation before beginning the explanation and it is recommended to play with it a bit; with time you’ll likely get a feel for the right values:

# default
plot_superpixels(img_path)

# Changing some settings
plot_superpixels(img_path, n_superpixels = 200, weight = 40)

The default is set to a pretty low number of superpixels — if the subject of interest is relatively small it may be necessary to increase the number of superpixels so that the full subject does not end up in one, or a few superpixels. The weight parameter will allow you to make the segments more compact by weighting spatial distance higher than colour distance. For this example we’ll stick with the defaults.
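
To make the idea of turning superpixels on and off a bit more tangible, here is a toy sketch in base R. It is not how lime segments or permutes images: the regular 4 x 4 blocks stand in for a real superpixel segmentation, and permute_image() with its p_remove and background arguments is a hypothetical helper made up for this illustration:

```r
set.seed(42)

# A toy 8 x 8 greyscale "image" with values in [0, 1]
img_toy <- matrix(runif(64), nrow = 8)

# Stand-in segmentation: four 4 x 4 blocks labelled 1 to 4
segments <- matrix(0L, nrow = 8, ncol = 8)
segments[1:4, 1:4] <- 1L
segments[5:8, 1:4] <- 2L
segments[1:4, 5:8] <- 3L
segments[5:8, 5:8] <- 4L

# One permutation: each segment is switched off with probability p_remove
# and its pixels are replaced by a flat background value
permute_image <- function(img, segments, p_remove = 0.5, background = 0.5) {
  ids <- sort(unique(as.vector(segments)))
  keep <- stats::runif(length(ids)) > p_remove
  out <- img
  out[segments %in% ids[!keep]] <- background
  list(image = out, keep = stats::setNames(keep, ids))
}

perm <- permute_image(img_toy, segments)
perm$keep
```

Each permuted image is then fed to the model, and the on/off vector (perm$keep here) becomes the feature row for the local model. This is why segment size matters: with only a handful of huge segments almost every permutation wipes out most of the subject.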

Be aware that explaining image models is much heavier than tabular or text data. In effect it will create 1000 new images per explanation (default permutation size for images) and run these through the model. As image classification models are often quite heavy, this will result in computation time measured in minutes. The permutation is batched (default to 10 permutations per batch), so you should not be afraid of running out of RAM or hard-drive space.

explanation <- explain(img_path, explainer, n_labels = 2, n_features = 20)

The output of an image explanation is a data frame of the same format as that from tabular and text data. Each feature will be a superpixel and the pixel range of the superpixel will be used as its description. Usually the explanation will only make sense in the context of the image itself, so the new version of lime also comes with a plot_image_explanation() function to do just that. Let’s see what our explanation has to tell us:

plot_image_explanation(explanation)

We can see that the model, for both the major predicted classes, focuses on the cat, which is nice since they are both different cat breeds. The plot function has a few different arguments to help you tweak the visual, and it filters low scoring superpixels away by default. An alternative view that puts more focus on the relevant superpixels, but removes the context, can be seen by using display = 'block':

plot_image_explanation(explanation, display = 'block', threshold = 0.01)

While not as common with image explanations it is also possible to look at the areas of an image that contradict the class:

plot_image_explanation(explanation, threshold = 0, show_negative = TRUE, fill_alpha = 0.6)

As each explanation takes longer to create and needs to be tweaked on a per-image basis, image explanations are not something that you’ll create in large batches as you might do with tabular and text data. Still, a few explanations might allow you to understand your model better and be used for communicating the workings of your model. Further, as the time-limiting factor in image explanations is the image classifier and not lime itself, it is bound to improve as image classifiers become more performant.

Grab bag

Apart from keras and image support, a slew of other features and improvements have been added. Here’s a quick overview:

All explanation plots now include the fit of the ridge regression used to make the explanation. This makes it easy to assess how good the assumptions about local linearity are kept.
When explaining tabular data the default distance measure is now ‘gower’ from the gower package. gower makes it possible to measure distances between heterogeneous data without converting all features to numeric and experimenting with different exponential kernels.
When explaining tabular data numerical features will no longer be sampled from a normal distribution during permutations, but from a kernel density defined by the training data. This should ensure that the permutations are more representative of the expected input.
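
To give a feel for why gower suits heterogeneous data, here is a simplified base R sketch of the idea behind the distance: numeric features contribute their absolute difference scaled by the feature range, categorical features contribute 0 or 1 depending on whether they match, and the contributions are averaged. The gower_sketch() function is hypothetical and is not the implementation used by the gower package or by lime:

```r
# Simplified Gower-style distance between one row `x` and every row of `data`
gower_sketch <- function(x, data) {
  per_feature <- vapply(names(data), function(col) {
    if (is.numeric(data[[col]])) {
      # Numeric: absolute difference scaled by the observed range
      span <- diff(range(data[[col]]))
      if (span == 0) rep(0, nrow(data)) else abs(data[[col]] - x[[col]]) / span
    } else {
      # Categorical: 0 when the values match, 1 otherwise
      as.numeric(as.character(data[[col]]) != as.character(x[[col]]))
    }
  }, numeric(nrow(data)))
  rowMeans(matrix(per_feature, nrow = nrow(data)))
}

# Distance from the first iris observation to all observations,
# using four numeric columns and one factor column together
d <- gower_sketch(iris[1, ], iris)
head(order(d))
```

Every feature ends up on a 0 to 1 scale, so the factor column can take part without first being converted to numbers.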

Wrapping up

This release represents an important milestone for lime in R. With the addition of image explanations the lime package is now on par with or above its Python relative, feature-wise. Further development will focus on improving the performance of the model, e.g. by adding parallelisation or improving the local model definition, as well as exploring alternative explanation types such as anchor.

Happy Explaining!