Category: Debugging

10/27/2015

One potential obstacle to using Rmarkdown with computationally-intensive projects is that waiting for the whole thing to run again every time you make a small change to your document is a pain. So here are a few workarounds.

Option 1: Put commands that only need to be run once in an eval=FALSE chunk

It’s usually important to include commands to download and unpack data with your script, so that it’s clear which data you were working with. However, you probably never need to run those commands more than once. In this case, the best option is probably to put them inside an r chunk, with the eval option set to FALSE:

```{r, eval=FALSE}

#download huge dataset

```

Note that if someone else downloads and runs your file, they will need to set the eval option to TRUE in order to get the data.

Option 2: Use caching

To deal with this very problem, chunks in R markdown have a cache option. If you set it to TRUE, the results of your R code will be saved the first time the chunk is run, and reloaded every time after that. If you make any changes to that chunk, then it will be re-run. If you also set cache.comments to FALSE, then your code will only be re-run if you actually change the code itself, rather than just the comments. If you care, you can also control where the cached results are stored with the cache.path option. Here’s an example:

```{r, cache=TRUE, cache.comments=FALSE}

#code that takes forever to run

#I can add this comment later without rerunning the code that takes forever!

```

Option 3: Load previous results conditionally

Caching is all well and good, but what if you were playing around in the console (rather than your Rmarkdown file), and created some object that took forever to create? Or what if the script that you’re documenting can’t even run on your computer (for instance, maybe you had to run it on the HPCC instead)? In this case, your best bet is to save the object from wherever you created it, and then have your Rmarkdown code look for the file you stored it in before trying to run the code again. This idea was originally suggested in this Stack Overflow post. Here’s an example of how it might work:

First you store the object into a file:

my_huge_object <- intensive_calculation(giant_data_file)

#Save the object into huge_object_file.Rda. ".Rda" is a file extension commonly used
#for storing objects that R knows how to interpret.
#Note that the file name must be passed as the named argument "file".
save(my_huge_object, file = "huge_object_file.Rda")

Then make sure this file is in the current working directory and add something like this to the part of your Rmarkdown script where you need to use my_huge_object:

```{r}

if (file.exists("huge_object_file.Rda")) {
  load("huge_object_file.Rda")
} else {
  #if whoever is running this code doesn't have the data file stored,
  #let's hope they're running this on a powerful computer
  my_huge_object <- intensive_calculation(giant_data_file)
}

```
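As a rough, self-contained sketch of the same pattern, the version below can be pasted into a console and run. The `intensive_calculation` function here is a hypothetical stand-in for real slow code, the file lives in a temporary directory, and `load_or_compute` and `calls` are names made up for this illustration. Unlike the chunk above, this sketch also saves the object after computing it, so the second call finds the file.

```r
# Hypothetical stand-in for a slow computation; it counts how often it runs.
calls <- 0
intensive_calculation <- function(n) {
  calls <<- calls + 1
  cumsum(seq_len(n))
}

# Use a temporary file so the sketch does not touch the working directory.
cache_file <- file.path(tempdir(), "huge_object_file.Rda")
if (file.exists(cache_file)) file.remove(cache_file)

load_or_compute <- function(path, n) {
  if (file.exists(path)) {
    # load() restores my_huge_object into this function's environment
    load(path)
  } else {
    my_huge_object <- intensive_calculation(n)
    save(my_huge_object, file = path)
  }
  my_huge_object
}

first <- load_or_compute(cache_file, 10)   # computes and saves
second <- load_or_compute(cache_file, 10)  # loads from the file
```

After both calls, `calls` should still be 1, because the second call reads the saved file instead of recomputing.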

By following these approaches, you can make sure that your code is reproducible without needing to wait for it to reproduce itself every time you knit the file!


9/30/2015


For the most part, R packages have pretty great documentation on how to use them. When there are areas where the documentation is unclear, there are generally lots of people on the internet who have had the same problem and figured out the solution. But occasionally you come across a problem that no one else seems to have had. The internet being as vast as it is, 90% of the time this is an indication that you have made a typo or something. Sometimes, though, you’re just actually the first person to have encountered this problem (or been sufficiently determined to solve it).

So what are you supposed to do? Here are some steps that you can take to debug your code:

Step 1: Try to re-create the problem in the simplest way possible.

You probably encountered it while you were doing some very specific thing to your data in the midst of all sorts of complicated transformations and plotting. That means that it could be the result of your data, transformations, plotting, an underlying problem with the package you’re using, or any combination thereof. So make a completely new script where you do the thing that isn’t working in as much isolation as possible. Instead of using your actual data, it’s often a good idea to use placeholder data. A dataframe composed entirely of 1s is usually sufficient, unless you’re using a function that depends on you having actual variation in your data. In that case, you can just choose a series of simple placeholder numbers, or get fancy and fill in your data.frame with randomly generated data from a function like rnorm.
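For instance, the three kinds of placeholder data described here might be built as follows. This is only a sketch: the column names, row counts, and seed are arbitrary choices made for illustration.

```r
# A data frame composed entirely of 1s (5 rows, 3 arbitrary columns)
ones <- data.frame(x = rep(1, 5), y = rep(1, 5), z = rep(1, 5))

# Simple placeholder numbers, for functions that need some variation
simple <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# Randomly generated data; set.seed() makes the random draws repeatable
set.seed(42)
random <- data.frame(x = rnorm(5), y = rnorm(5))

# Inspect the structure of the random placeholder data
str(random)
```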

Okay, so now you’ve created a simpler context to test your problem. Great. One of two things should have happened:

- You are no longer getting the error: Yay! You have something to go off of! Start adding the actual complexity of your problem back in gradually and see at what point it breaks.
- You are still having the same problem: This is a sign of a more serious problem. Definitely google any error messages you’re getting, or else a general description of the problem. If you’re not finding anything, then specifically search Stack Overflow with the same criteria. Often, the best way to search Stack Overflow is to start asking a question and look at the list of potentially related questions it suggests. This also situates you well to ask the question, if none of the related questions answer it!

Still don’t have a solution? Congratulations! You have probably found an obscure bug. Sometimes the only option is to dive into the code. This is absolutely a measure of last resort, but the following are some thoughts on how to do it as painlessly as possible.

Step 2: Get the code.

Assuming the thing that’s giving you problems is a function (something that takes input through parentheses and returns output), this is actually pretty straightforward. If you type the name of the function into the console without the parentheses, it will print out all of the code for that function. As an example, we can try this on a common function, read.csv:

read.csv

## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
##     fill = TRUE, comment.char = "", ...) 
## read.table(file = file, header = header, sep = sep, quote = quote, 
##     dec = dec, fill = fill, comment.char = comment.char, ...)
## <bytecode: 0x287a350>
## <environment: namespace:utils>

This tells us that, under the hood, read.csv is actually just one line of code that calls a different function, read.table. Let’s look at something a little more complex:

library(GISTools)

## Loading required package: maptools
## Loading required package: sp
## Checking rgeos availability: TRUE
## Loading required package: RColorBrewer
## Loading required package: MASS
## Loading required package: rgeos
## rgeos version: 0.3-12, (SVN revision 498)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Linking to sp version: 1.2-0 
##  Polygon checking: TRUE

north.arrow

## function (xb, yb, len, lab = "NORTH", cex.lab = 1, tcol = "black", 
##     ...) 
## {
##     s <- len
##     arrow.x = c(-1, 1, 1, 1.5, 0, -1.5, -1, -1)
##     arrow.y = c(0, 0, 2, 2, 4, 2, 2, 0)
##     polygon(xb + arrow.x * s, yb + arrow.y * s, ...)
##     text(xb, yb - strheight(lab, cex = cex.lab), lab, cex = cex.lab, 
##         col = tcol)
## }
## <environment: namespace:GISTools>

So, north.arrow is basically drawing a polygon based on the coordinates you give it and writing some text under it.

Those were pretty short examples. If you’re lucky enough that the function you’re having a problem with is this simple, you can probably just run it line by line in the console to see what it’s doing and why that’s different than what you expect it to be doing. Would that it were always so simple.

Step 3: Search for relevant words.

Odds are, the function you’re dealing with is long and complicated and interacts with lots of other things that you have no desire to take the time to understand. Your console should have some sort of search function (often ctrl-F) that will let you enter a string of text to search for. If your problem has to do with a specific argument to the function, try searching for that.
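If you would rather search from inside R than with your console’s search box, one possible approach (a sketch, not the only way) is to turn the function into lines of text with deparse() and search those lines with grep(). The example below uses read.csv and its header argument purely for illustration.

```r
# deparse() converts a function into a character vector,
# with one element per line of its source code
src <- deparse(read.csv)

# Find the positions of the lines that mention the "header" argument
hits <- grep("header", src, fixed = TRUE)

# Show just those lines
src[hits]
```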

For example, say I’ve got this function that says I can pass it a string of text and it will print that text below the shape it draws on my plot. But when I try to do that (for instance, by typing example_function(my_text="some text!")), nothing happens. So I take a look at the code. It’s long and I don’t want to read all of it, so I try searching for the name of the argument, which happens to be my_text, in this case. It turns out that in the entire function, the name of this argument only shows up twice: in the first line (the function definition), and in an if statement near the end of the function:

function (d, xy = NULL, type = "line", divs = 2, my_text = "", lonlat = NULL, ...) 
{
  ...
  
  if (my_text != "") {
    adj[2] <- -adj[2]
    text(xy[1] + (0.5 * dd), xy[2], labels = below, adj = adj, ...)
  }
  ...
}

(the rest of the function omitted for brevity)

So now I know that little tiny bit of the function is all I really need to worry about to figure out what’s going on. But how?

Step 4: Add print statements.

R has a lovely function called print(), which will write the value of whatever you put in the parentheses to your console. This can be used to great effect in debugging code. There are two primary ways to debug with print statements:

- Print strings of text telling you what lines of code the function even got to.
- Print out the values of variables so that you can make sure they’re what you expect.

In order to edit the function and add print statements, the best thing to do is open a new script file and copy and paste the code for the function into it from your console. You can give it a new name to more easily run it:

test_copy <- function (d, xy = NULL, type = "line", divs = 2, my_text = "", lonlat = NULL, ...) 
{
  #[function body here]
}

#you can run this function with
test_copy(my_text="hi")

In the example above, a first series of print statements might look like this:

function (d, xy = NULL, type = "line", divs = 2, my_text = "", lonlat = NULL, ...) 
{
  print(my_text) #make sure that I have successfully passed the argument to the function
  ...
  
  print("Right before if statement") #if I see this message, it means that R actually got to the if statement
  print(my_text) #make sure my_text still has the value I expect
  if (my_text != "") {
    print("in if block") #if I see this message, I know R actually executed the code that uses my_text
    adj[2] <- -adj[2]
    text(xy[1] + (0.5 * dd), xy[2], labels = below, adj = adj, ...)
  }
  ...
}

Let’s say I run test_copy(my_text="hi"), and I get the following output:

[1] "hi"

That tells me that the code got to the first print statement, printed the value of the argument (which was exactly what it should have been), and then didn’t get to either of the other two print statements. Well, that explains why setting my_text="hi" isn’t doing anything! R is never actually getting to the part of the function that pays attention to it! But why? To find out, we can add more print statements, each indicating how far into the code they are, and see at what point we stop seeing their output. A good way to choose where to put these print statements is to look for the start of new curly-brace-enclosed blocks of code, particularly those that follow if-statements. This should give you a picture of what parts of the code are being executed. You can then read the relevant if-statements to see what conditions would need to be true to take a different path.

In this example, we find out that the code that uses my_text is inside a curly-brace-enclosed block that is controlled by this if-statement: if (type == "bar"):

function (d, xy = NULL, type = "line", divs = 2, my_text = "", lonlat = NULL, ...) 
{
  ... #complicated code here
  
  if (type == "line") {
    ... #more code we don't actually care about
  } else if (type == "bar") {
    ... #also lots of code
    if (my_text != "") {
      adj[2] <- -adj[2]
      text(xy[1] + (0.5 * dd), xy[2], labels = below, adj = adj, ...)
    }
  }
}

That means that my_text is only used if type is equal to “bar”. Since the default value appears to be “line”, that explains why I’m having problems! It also tells us that we can fix this problem by either moving the block of code that uses my_text, or by just setting type equal to “bar” when we call the function.
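To see the whole pattern in something you can actually run, here is a toy sketch. The function label_demo is a hypothetical stand-in invented for this illustration (it is not the real function discussed above): it has the same branching structure, but records what it would draw instead of plotting, so it works without a graphics device.

```r
# Hypothetical stand-in with the same if/else structure as the example.
# It returns a record of what it "drew" rather than plotting anything.
label_demo <- function(type = "line", my_text = "") {
  print(paste("type is:", type))  # check the value of a variable
  drawn <- character(0)
  if (type == "line") {
    print("in line block")        # shows which branch was reached
    drawn <- c(drawn, "line")
  } else if (type == "bar") {
    print("in bar block")
    drawn <- c(drawn, "bar")
    if (my_text != "") {
      print("in my_text block")   # only reached when type is "bar"
      drawn <- c(drawn, my_text)
    }
  }
  drawn
}

# With the default type, the my_text block is never reached
default_result <- label_demo(my_text = "hi")

# Setting type to "bar" lets the function use my_text
bar_result <- label_demo(type = "bar", my_text = "hi")
```

Comparing the printed messages from the two calls shows which blocks ran in each case, which is exactly the information the print statements are meant to reveal.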

That’s one specific example of debugging. What you need to do will vary wildly based on what problem you happen to be encountering. However, I think that this overall outline of how to approach debugging hard problems generalizes pretty well from bug to bug, so I hope it can be helpful to others!


Spatial Ecology & R
(search via "category" below)

Using Rmarkdown without re-running long commands

Emily Dolson

October 3, 2015

Seeking help online

Allison Sussman

October 16, 2015

Debugging R functions

Emily Dolson

09/23/2015

Spatial Ecology @ MSU

Category

Archive
