performance - In R, how do you loop over the rows of a data frame really fast?
Suppose that you have a data frame with many rows and many columns.
The columns have names. You want to access rows by number, and columns by name.
For example, one (possibly slow) way to loop over the rows is
for (i in 1:nrow(df)) {
    print(df[i, "column1"])
    # do more things with the data frame...
}
Another way is to create "lists" for separate columns (like column1_list = df[["column1"]]), and access the lists in one loop. This approach might be fast, but it is also inconvenient if you want to access many columns.
Is there a fast way of looping over the rows of a data frame? Is some other data structure better for looping fast?
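The column-extraction idea described in the question can be sketched as follows. This is only an illustration under assumptions: the data frame, its column names (`column1`, `column2`), and the work done in the loop body are all hypothetical.

```r
# Hypothetical example data; the column names are illustrative only.
df <- data.frame(column1 = c(1.5, 2.5, 3.5),
                 column2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Pull each needed column out once, so the loop indexes plain vectors
# instead of subsetting the data frame on every iteration.
column1_vec <- df[["column1"]]
column2_vec <- df[["column2"]]

out <- character(nrow(df))            # preallocate the result
for (i in seq_len(nrow(df))) {
  out[i] <- paste(column2_vec[i], column1_vec[i] * 2)
}
out
```

The inconvenience mentioned in the question shows up here: every column used in the loop needs its own extracted vector.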
I think I need to make this a full answer because I find comments harder to track and I already lost one comment on this... There is an example by nullglob that demonstrates the differences among for and the apply family of functions much better than the other examples. When one makes the function so slow that all the time is consumed inside it, you won't find differences among the variations on looping. But when you make the function trivial, you can see how much the looping itself influences things.
I'd also like to add that some members of the apply family, unexplored in the other examples, have interesting performance properties. First I'll show replications of nullglob's relative results on my machine.
n <- 1e6
sini <- numeric(n)  # assumed: the result vector exists before the loop is timed
system.time(for(i in 1:n) sini[i] <- sin(i))
   user  system elapsed
  5.721   0.028   5.712

lapply runs much faster for the same result:

system.time(sini <- lapply(1:n, sin))
   user  system elapsed
  1.353   0.012   1.361
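For anyone who wants to rerun this comparison, here is a minimal self-contained sketch. It assumes a smaller `n` than the 1e6 used above to keep the run short, and the variable names (`sin_for`, `sin_lapply`, `sin_sapply`) are my own. Absolute timings will differ by machine and R version, so only the relative ordering is of interest.

```r
n <- 1e4   # smaller than the 1e6 used above, to keep the run short

# for loop writing into a preallocated numeric vector
sin_for <- numeric(n)
t_for <- system.time(for (i in 1:n) sin_for[i] <- sin(i))

# lapply returns a list, so it needs unlist() before comparing
t_lapply <- system.time(sin_lapply <- lapply(1:n, sin))

# sapply simplifies the list to a vector itself
t_sapply <- system.time(sin_sapply <- sapply(1:n, sin))

# All three hold the same values; only the container and the timing differ
same_lapply <- isTRUE(all.equal(sin_for, unlist(sin_lapply)))
same_sapply <- isTRUE(all.equal(sin_for, sin_sapply))

# Elapsed seconds for each variant
c(for_loop = t_for[["elapsed"]],
  lapply   = t_lapply[["elapsed"]],
  sapply   = t_sapply[["elapsed"]])
```

Note that lapply hands back a list rather than a numeric vector, so "the same result" holds only after flattening it.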
He also found sapply to be much slower. Here are some others that weren't tested.
Plain old apply on a matrix version of the data...
mat <- matrix(1:n, ncol = 1)
system.time(sini <- apply(mat, 1, sin))
   user  system elapsed
  8.478   0.116   8.531
So, the apply() command itself is substantially slower than the for loop. (The for loop is not slowed down appreciably if I use sin(mat[i,1]).)
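The apply-versus-for comparison on the matrix can be reproduced with a sketch like the one below. It assumes a smaller `n` than in the timings above, and the names `sin_apply` and `sin_for` are illustrative. The timings it prints are machine-dependent.

```r
n <- 1e4
mat <- matrix(1:n, ncol = 1)

# apply() over rows: each row reaches sin() as a length-1 vector
t_apply <- system.time(sin_apply <- apply(mat, 1, sin))

# Equivalent for loop that indexes the matrix directly
sin_for <- numeric(n)
t_for <- system.time(for (i in 1:n) sin_for[i] <- sin(mat[i, 1]))

# Both approaches produce the same numeric vector
same_result <- isTRUE(all.equal(sin_apply, sin_for))

# Elapsed seconds for each variant
c(apply = t_apply[["elapsed"]], for_loop = t_for[["elapsed"]])
```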
Another one that doesn't seem to have been tested in the other posts is tapply.
system.time(sini <- tapply(1:n, 1:n, sin))
   user  system elapsed
 12.908   0.266  13.589
Of course, one would never use tapply this way, and its utility is far beyond any such speed problem in most cases.
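A small sketch of what this tapply call actually builds may help explain the cost. As far as I understand it, tapply treats the index as a grouping factor, so with `1:n` as the index every element becomes its own group. The tiny `n` and the variable names here are assumptions for illustration.

```r
n <- 5
res <- tapply(1:n, 1:n, sin)

# The result is a 1-d array with one named cell per group,
# not a bare numeric vector
res_names  <- names(res)
res_values <- as.vector(res)

# The values match a direct call on the whole vector
same_values <- isTRUE(all.equal(res_values, sin(1:n)))
res
```

The grouped form is what tapply is meant for, for example computing one summary per level of a real grouping variable.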