2018-12-09

Smartly select and mutate data frame columns, using dict

Motivation

The dplyr functions select and mutate nowadays are commonly applied to perform data.frame column operations, frequently combined with magrittrs forward %>% pipe. While working well interactively, however, these methods often would require additional checking if used in “serious” code, for example, to catch column name clashes.

In principle, the container package provides a dict-class (resembling Pythons dict type), which allows to cover these issues more easily. In its very recent update, the container package for this reason gained an S3 method interface plus functions to convert back and forth between dict and data.frame. This can be used to extend the set of data.frame column operations and in this post I will show how and when they can serve as a useful alternative to mutate and select.

To keep matters simple, we use a tiny data table.

data <- data.frame(x = c(0.2, 0.5), y = letters[1:2])
data
##     x y
## 1 0.2 a
## 2 0.5 b

Column operations

Add

Let’s add a column using mutate.

library(dplyr)
n <- nrow(data)

data %>% 
    mutate(ID = 1:n)
##     x y ID
## 1 0.2 a  1
## 2 0.5 b  2

For someone not familar with the tidyverse, this code block might read somewhat odd as the column is added and not mutated. To add a column using dict simply use add.

library(container)
data %>% 
    as.dict() %>% 
    add("ID", 1:n) %>%
    as.data.frame()
##     x y ID
## 1 0.2 a  1
## 2 0.5 b  2

The intended add-operation is stated more clearly, but on the downside we also had to add some overhead. Of course, since this has to be done only at the beginning and at the end of the pipe, it will be less of an issue if multiple dict-operations are performed in between. Next, instead of ID, let’s add another numeric column y, which happens to “name-clash” with the already existing column.

data %>% 
    mutate(y = rnorm(n))
##     x         y
## 1 0.2 1.5295729
## 2 0.5 0.4821652

Ooops - we have accidently overwritten the initial y-column. While this was easy to see here, it may not if the data.frame has a lot of columns or if column names are created automatically as part of some script. To catch this, usually some overhead is required, too.

if ("y" %in% colnames(data)) {
    stop("column y already exists")
} else {
    data %>% 
        mutate(y = rnorm(n))
}
## Error in eval(expr, envir, enclos): column y already exists

Let’s see the dict-operation in comparison.

data %>% 
    as.dict() %>% 
    add("y", rnorm(n)) %>%
    as.data.frame()
## Error in x$add(key, value): key 'y' already in Dict

The name clash is catched by default and the overhead does not look so silly anymore. As a bonus, the error message still provides information about the originally intended add-operation.

Modify

If the intend was indeed to overwrite the value, the dict-function setval can be used.

data %>% 
    as.dict() %>% 
    setval("y", rnorm(n)) %>%
    as.data.frame()
##     x          y
## 1 0.2  0.5315501
## 2 0.5 -0.1573612

As we saw above, if a column does not exist, mutate silently creates it for you. If this is not what you want, which means, you want to make sure something is overwritten, again, a workaround is needed.

if ("ID" %in% colnames(data)) {
    data %>% 
        mutate(ID = 1:n)    
} else {
    stop("column ID not in data.frame")
}
## Error in eval(expr, envir, enclos): column ID not in data.frame

Once again, the workaround is already “built-in” in the dict-framework.

data %>% 
    as.dict() %>% 
    setval("ID", 1:n) %>%
    as.data.frame()
## Error in x$set(key, value, add): key 'ID' not in Dict

After all, the intend of the mutate function actually would be something like: overwrite a column, or, create it if it does not exist. If desired, this behaviour can be expressed within the dict-framework as well.

data %>% 
    as.dict() %>% 
    setval("ID", 1:n, add=TRUE) %>%
    as.data.frame()
##     x y ID
## 1 0.2 a  1
## 2 0.5 b  2

Remove

A common tidyverse approach to remove a column is based on the select function. One corresponding dict-function is remove.

data %>% 
        select(-"y")
##     x
## 1 0.2
## 2 0.5

data %>% 
    as.dict() %>% 
    remove("y") %>% 
    as.data.frame()
##     x
## 1 0.2
## 2 0.5

Let’s see what happens if the column does not exist in the first place.

data %>%
    select(-"ID")
## Error: Unknown column `ID`

data %>% 
    as.dict() %>% 
    remove("ID") %>% 
    as.data.frame()
## Error in x$remove(key): key 'ID' not in Dict

Again, we obtain a slightly more informative error message with dict. Assume we want the column to be removed if it exist but otherwise silently ignore the command, for example:

if ("ID" %in% colnames(data)) {
    data %>%
        select(-"ID")    
}

You may have expected this by now - the dict-framework provides a straigh-forward solution, namely, the discard function:

data %>% 
    as.dict() %>% 
    discard("ID") %>% 
    discard("y") %>%
    add("z", c("Hello", "World")) %>%
    as.data.frame()
##     x     z
## 1 0.2 Hello
## 2 0.5 World

Benchmark

The required additional code lines are limited but what about the computational overhead? To examine this, we benchmark some column operations using the famous ‘iris’ data set. As a hallmark reference we will also bring the data.table framework to the competition.

set.seed(123)
library(microbenchmark)
library(ggplot2)
library(data.table)
data <- iris
n <- nrow(data)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

For the benchmark, we add one, transform one and finally delete one column.

bm <- microbenchmark(control = list(order="inorder"), times = 100,
    dplyr = data %>% 
        mutate(ID = 1:n) %>%
        mutate(Species = as.character(Species)) %>%
        select(-Sepal.Width),
    
    dict = data %>%
        as.dict() %>%
        add("ID", 1:n) %>%
        setval(., "Species", as.character(.$get("Species"))) %>%
        remove("Sepal.Width") %>%
        as.data.frame(),
    
    `[.data.table` = data %>% 
        as.data.table() %>%
        .[, ID := 1:n] %>%
        .[, Species := as.character(Species)] %>%
        .[, Sepal.Width := NULL]
)
autoplot(bm) + theme_bw()

Somewhat surprisingly maybe, the dict-implementation is closer to the data.table than to the dplyr performance. Let’s examine each operation in more detail.

bm <- microbenchmark(control = list(order="inorder"), times = 100,
    tbl <- mutate(data, ID = 1:n),
    tbl <- mutate(tbl, Species = as.character(Species)),
    tbl <- select(tbl, -Sepal.Width),
    
    dic <- as.dict(data),
    add(dic, "ID", 1:n),
    setval(dic, "Species", as.character(dic$get("Species"))),
    remove(dic, "Sepal.Width"),
    dic <- as.data.frame(dic),
    
    dt <- as.data.table(data),
    dt[, ID := 1:n],
    dt[, Species := as.character(Species)],
    dt[, Sepal.Width := NULL]
)
autoplot(bm) + theme_bw()

Apparently, the mutate and select operations are the slowest in comparison, I think, because both the dict and data.table approach work by reference while probably some copying is done in the dplyr pipe. We also see that the dict-approach spends most of the computation time for the transformation back and forth between a dict and a data.frame while the actual column operations seem very efficient, even more efficient than that of data.table. This certainly came as a surprise to me, as the focus when developing the container package has never been on speed but rather on providing a concise data structure. Internally the dict simply consists of a named list, so I guess this speaks for the efficiency of base R list operations. Having said that, I found that the data.table code can be further improved by avoiding the overhead of the [.data.table operator and instead use the built-in set function:

bm <- microbenchmark(control = list(order="inorder"), times = 100,
    dplyr = data %>% 
        mutate(ID = 1:n) %>%
        mutate(Species = as.character(Species)) %>%
        select(-Sepal.Width),
    
    dict = data %>%
        as.dict() %>%
        add("ID", 1:n) %>%
        setval(., "Species", as.character(.$get("Species"))) %>%
        remove("Sepal.Width") %>%
        as.data.frame(),
    
    `[.data.table` = data %>% 
        as.data.table() %>%
        .[, ID := 1:n] %>%
        .[, Species := as.character(Species)] %>%
        .[, Sepal.Width := NULL],
    
    data.table.set = data %>% 
        as.data.table() %>%
        set(., j= "ID", value = 1:n) %>%
        set(., j = "Species", value = as.character(.[["Species"]])) %>%
        set(., j = "Sepal.Width", value = NULL)
)
autoplot(bm) + theme_bw()

This puts things back into perspective, I guess :) It might also be interesting to know, how much of the computation time is spent on the non-standard evaluation part of the dplyr and [.data.table implementation, but that’s probably a topic on its own.

Summary

Accidently overwriting existing data columns leads to nasty bugs. The presented workflow allows to increase both reliability and precision of standard data frame column manipulation at very little cost. The intended column operations can be expressed more clearly and, in case of failures, informative error messages are provided by default. As a result, the dict-framework may serve as a useful supplement to “interactive piping”.

4 comments:

  1. Interesting post, thanks. I did not know about the container package. I like the add and setval commands. R is a lot about getting used to commands, but I think add, setval and remove are more intuitive.
    A small note: There is no need to create an object for the row number. The shortcut is
    data %>%
    mutate(ID = 1:n())

    ReplyDelete
    Replies
    1. Unfortunately, I did not recognize you comment until now, so (a delayed) thanks for your input :) I am aware of the n() function, but deliberately did not use it to avoid confusion for readers not very familiar with dplyr, which might be a rare species these days, who knows :)

      Delete
  2. For true fans of this thread I will address is a free online! data entry appraisal

    ReplyDelete
  3. { Inconsistency in output quality: If the provider {you have chosen|you've chosen|you've selected|you have selected} is inexperienced and lacks consistency, {then it|it|this|that} {might lead to|could trigger|might trigger|may cause} problems {such as|for example|including|like} delayed submission of completed projects, processed files without accuracy and quality, inappropriate assignment of responsibilities, {lack of communication|no communication|poor communication} {and so|and thus|therefore|so} on| While the job profile {might seem|may appear|may seem|might appear} simple {it does|it will|it can|it lets you do} {in fact|actually|in reality|the truth is} {require a|need a|demand a|have to have a} certain {degree of|amount of|level of|a higher level} exactness {and an|as well as an|plus an|with an} eye for detail| My writing {is focused|is concentrated|is targeted|concentrates} {more on|more about|read more about|on} {the industry|the|a|that is a} {and quality of|and excellence of|superiority} work, not the worker| By continues monitoring the hurdles and solving it, {one can|it's possible to|you can|one can possibly} easily {increase the|boost the|raise the|improve the} productivity of business| Decline {in the|within the|inside the|inside} quality of service and delay {in the|within the|inside the|inside} execution and delivery of processes are some {of the|from the|with the|in the} risks involved, {besides the|aside from the|in addition to the|apart from the} risk {to the|towards the|for the|on the} security {of the|from the|with the|in the} data and privacy and cost-related risks| The {service provider|company|supplier|vendor} {should also|also needs to|must also|also need to} volunteer {a variety of|a number of|many different|various} profits concerning formulas {of data|of information|of knowledge|of internet data} transmission, turnaround etc}. { A lot of companies are fine with admitting this, but {others are|other medication is|other people are} {not so|not too|not|less than} sure, primarily {because this|as this|since this|simply because this} may put people {off the|from the|off of the|over} service| Such measures would {keep your|keep the|maintain your|maintain} sensitive documents from falling {into the|in to the|to the|in the} hands of unauthorized personnel| When you outsource {to an|for an|to a|with an} experienced BPO company, {they would|they'd|they might|they will} manage these risks professionally {as well as|in addition to|along with|and also} plan and implement appropriate {strategies to|ways of|ways to|methods to} avoid them in future| Outsourcing data entry is most helpful term {for all|for those|for many|for all those} these organizations| With the help of such information, {you can|you are able to|it is possible to|you'll be able to} {improve on|enhance|make improvements to} customer targeting| If you think {you are|you're|you might be|you happen to be} proficient enough in installing the payment processor {on your|in your|on your own|on the} website {on your|in your|on your own|on the} own, {you should not|you shouldn't|you ought not|it's not necassary to} hesitate doing it}. data entry per page rate

    ReplyDelete

My R style guide

R-coding-style-guide.utf8.md This is my take on an R style guide. It covers a lot ...