Profiling

Refs:

Advanced R by Hadley Wickham

# devtools::install_github("hadley/lineprof")
library(lineprof)

Profiling measures where code spends its time and memory, so that bottlenecks can be found and removed.

code = '
  read_delim <- function(file, header = TRUE, sep = ",") {
    # Determine number of fields by reading first line
    first <- scan(file, what = character(1), nlines = 1,
      sep = sep, quiet = TRUE)
    p <- length(first)
  
    # Load all fields as character vectors
    all <- scan(file, what = as.list(rep("character", p)),
      sep = sep, skip = if (header) 1 else 0, quiet = TRUE)
  
    # Convert from strings to appropriate types (never to factors)
    all[] <- lapply(all, type.convert, as.is = TRUE)
  
    # Set column names
    if (header) {
      names(all) <- first
    } else {
      names(all) <- paste0("V", seq_along(all))
    }
  
    # Convert list into data frame
    as.data.frame(all)
  }
'

write(code, "source.R")
source("source.R")  # lineprof relies on source references (srcrefs), which are only kept when code is loaded from a file
library(ggplot2)
write.csv(diamonds, "diamonds.csv", row.names = FALSE)
l <- lineprof(read_delim("diamonds.csv"))
l

## Reducing depth to 2 (from 16)

##      time alloc release dups                                      ref
## 1   0.005 0.018   0.000    2                        "lazyLoadDBfetch"
## 2   0.001 0.003   0.000    0                                   "scan"
## 3   0.020 0.005   0.000   62                        c("scan", "file")
## 4   0.004 0.006   0.000    1                                   "scan"
## 5   0.001 0.003   0.000    1                       c("scan", "close")
## 6   0.001 0.007   0.000    0                                   "scan"
## 7   0.022 0.003   0.000   62                        c("scan", "file")
## 8   0.001 0.003   0.000    1                     c("scan", "as.list")
## 9   0.001 0.001   0.000    1                   c("scan", "identical")
## 10 10.991 2.359   0.890    0                                   "scan"
## 11  0.007 0.002   0.000    1                       c("scan", "close")
## 12  0.002 0.004   0.000    0                 c("lapply", "match.fun")
## 13  0.001 0.003   0.000    0                                 "lapply"
## 14  2.709 0.227   0.337   15                       c("lapply", "FUN")
## 15  0.001 0.001   0.000    0                                 "lapply"
## 16  3.323 0.344   0.000   18                       c("lapply", "FUN")
## 17  0.001 0.002   0.000    1                             character(0)
## 18  0.001 0.001   0.000    0                          "as.data.frame"
## 19  0.008 0.022   0.000    0    c("as.data.frame", "lazyLoadDBfetch")
## 20  0.361 0.931   0.524  294 c("as.data.frame", "as.data.frame.list")
## 21  0.001 0.000   0.000    0                             character(0)
##                                 src
## 1  lazyLoadDBfetch                 
## 2  scan                            
## 3  scan/file                       
## 4  scan                            
## 5  scan/close                      
## 6  scan                            
## 7  scan/file                       
## 8  scan/as.list                    
## 9  scan/identical                  
## 10 scan                            
## 11 scan/close                      
## 12 lapply/match.fun                
## 13 lapply                          
## 14 lapply/FUN                      
## 15 lapply                          
## 16 lapply/FUN                      
## 17                                 
## 18 as.data.frame                   
## 19 as.data.frame/lazyLoadDBfetch   
## 20 as.data.frame/as.data.frame.list
## 21
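
The table above is easier to read once the times are grouped by top-level call. The following is a minimal sketch in base R, not part of lineprof: the `time` and `src` values are transcribed by hand from the printed output, and the names `prof`, `by_call` and `share` are only illustrative.

```r
# Values copied from the `time` and `src` columns of the printed profile above.
prof <- data.frame(
  src = c("lazyLoadDBfetch", "scan", "scan/file", "scan", "scan/close",
          "scan", "scan/file", "scan/as.list", "scan/identical", "scan",
          "scan/close", "lapply/match.fun", "lapply", "lapply/FUN", "lapply",
          "lapply/FUN", "", "as.data.frame", "as.data.frame/lazyLoadDBfetch",
          "as.data.frame/as.data.frame.list", ""),
  time = c(0.005, 0.001, 0.020, 0.004, 0.001, 0.001, 0.022, 0.001, 0.001,
           10.991, 0.007, 0.002, 0.001, 2.709, 0.001, 3.323, 0.001, 0.001,
           0.008, 0.361, 0.001),
  stringsAsFactors = FALSE
)

# The top-level call is the part of `src` before the first "/".
prof$top <- sub("/.*$", "", prof$src)
prof$top[prof$top == ""] <- "(other)"

# Total seconds per top-level call, largest first, with its share of the total.
by_call <- sort(tapply(prof$time, prof$top, sum), decreasing = TRUE)
share <- round(100 * by_call / sum(by_call), 1)
print(cbind(seconds = by_call, percent = share))
```

With these figures, scan accounts for roughly 11 of the 17.5 recorded seconds, followed by the type conversion in lapply, so reading the file is the first place to look for savings.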

A good way to explore the results is lineprof's interactive viewer, which is built on the shiny package:

library(shiny)
# opens a web page that shows your source code annotated with information about how long each line took to run
shine(l)

The t column shows how much time (in seconds) is spent on each line.
The a column shows the memory (in megabytes) allocated by that line of code.
The r column shows the memory (in megabytes) released by that line of code (this may vary between runs, since it depends on when the garbage collector runs).
The d column shows the number of vector duplications that occurred. A vector duplication occurs when R copies a vector as a result of its copy-on-modify semantics.
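
A vector duplication can be observed directly with base R's tracemem(). This is a small sketch independent of lineprof; it assumes R was built with memory profiling (checked with capabilities("profmem")), and the variable names are only illustrative.

```r
x <- c(1, 2, 3)
y <- x                      # no copy yet: x and y refer to the same vector

# tracemem() reports whenever the traced vector is duplicated.
# It is only available when R was built with memory profiling support.
can_trace <- capabilities("profmem")
if (can_trace) tracemem(x)

y[1] <- 10                  # modifying y forces R to duplicate the shared vector

if (can_trace) untracemem(x)

x                           # the original vector is unchanged
y                           # only the copy was modified
```

When tracing is available, the assignment to y[1] prints a tracemem message showing the old and new memory addresses. Each such copy is what lineprof counts as one duplication.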

To see the exact values, hover the mouse over the corresponding bar.

João Neto

October 2014