Julian.Taylor at csiro.au
2010-Jul-07 01:04 UTC
[Rd] Large discrepancies in the same object being saved to .RData
Hi developers,

After some investigation I have found there can be large discrepancies in the same object being saved as an external "xx.RData" file. The immediate repercussion of this is the possible increased size of your .RData workspace for no apparent reason.

The function and its three scenarios below highlight these discrepancies. Note that the object being returned is exactly the same in each circumstance. The first scenario simply loops over a set of lm() models from a simulated set of data. The second adds a reasonably large matrix calculation within the loop. The third highlights exactly where the discrepancy lies. It appears that when the object is saved to an "xx.RData" file it is still burdened, in some capacity, with the objects created in the function. Only deleting these objects at the end of the function ensures the realistic size of the returned object. Performing gc() after each of these short simulations shows that the "Vcells" accumulated in the function environment appear to remain after the function returns. These cached remains are then transferred to the .RData file upon saving of the object(s). This occurs quite broadly across the Windows 7 (R 2.10.1) and 64-bit Ubuntu Linux (R 2.9.0) systems that I use.

A similar problem was partially pointed out four years ago

http://tolstoy.newcastle.edu.au/R/help/06/03/24060.html

and has been made more obvious in the scenarios given below.

Admittedly I have had many problems with workspace .RData sizes over the years and it has taken me some time to realise what is actually occurring. Can someone enlighten me and my colleagues as to why the objects created and evaluated in a function call stack are saved, in some capacity, with the returned object?
Cheers,
Julian

####################### small simulation from a clean directory

lmfunc <- function(loop = 20, add = FALSE, gr = FALSE){
  lmlist <- rmlist <- list()
  set.seed(100)
  dat <- data.frame(matrix(rnorm(100*100), ncol = 100))
  rm <- matrix(rnorm(100000), ncol = 1000)
  names(dat)[1] <- "y"
  i <- 1
  for(i in 1:loop) {
    lmlist[[i]] <- lm(y ~ ., data = dat)
    if(add)
      rmlist[[i]] <- rm
  }
  fm <- lmlist[[loop]]
  if(gr) {
    print(what <- ls(envir = sys.frame(which = 1)))
    remove(list = setdiff(what, "fm"))
  }
  fm
}

# baseline gc()

> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  153325  4.1     350000  9.4   350000  9.4
Vcells   99228  0.8     786432  6.0   386446  3.0

###### 1. simple lm() simulation

> lmtest1 <- lmfunc()
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  184470  5.0     407500 10.9   350000  9.4
Vcells  842169  6.5    1300721 10.0  1162577  8.9
> save(lmtest1, file = "lm1.RData")
> system("ls -s lm1.RData")
4312 lm1.RData

## A moderate increase in Vcells; .RData object around 4.5 Mb

###### 2. add matrix calculation to loop

> lmtest2 <- lmfunc(add = TRUE)
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  209316  5.6     407500 10.9   405340 10.9
Vcells 3584244 27.4    4175939 31.9  3900869 29.8
> save(lmtest2, file = "lm2.RData")
> system("ls -s lm2.RData")
19324 lm2.RData

## An enormous increase in Vcells; .RData object is now 19Mb+

###### 3. delete all objects in function call stack

> lmtest3 <- lmfunc(add = TRUE, gr = TRUE)
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  210766  5.7     467875 12.5   467875 12.5
Vcells 3615863 27.6    6933688 52.9  6898609 52.7
> save(lmtest3, file = "lm3.RData")
> system("ls -s lm3.RData")
320 lm3.RData

## A minimal increase in Vcells; .RData object is now 320Kb

> sapply(ls(pattern = "lmtest*"), function(x) object.size(get(x, envir = .GlobalEnv)))
lmtest1 lmtest2 lmtest3
 358428  358428  358428

## all objects are deemed the same size by object.size()
######################### End sim

--
---
Dr. Julian Taylor        phone:  +61 8 8303 8792
Postdoctoral Fellow      fax:    +61 8 8303 8763
CMIS, CSIRO              mobile: +61 4 1638 8180
Private Mail Bag 2       email:  julian.taylor@csiro.au
Glen Osmond, SA, 5064
---
Duncan Murdoch
2010-Jul-07 12:12 UTC
[Rd] Large discrepancies in the same object being saved to .RData
On 06/07/2010 9:04 PM, Julian.Taylor at csiro.au wrote:
> Hi developers,
>
> After some investigation I have found there can be large discrepancies in
> the same object being saved as an external "xx.RData" file. The immediate
> repercussion of this is the possible increased size of your .RData
> workspace for no apparent reason.
>
> [...]
I haven't worked through your example, but in general the way that local
objects get captured is when part of the return value includes an
environment. Examples of things that include an environment are locally
created functions and formulas. It's probably the latter that you're seeing.

When R computes the result of "y ~ ." or a similar formula, it attaches a
pointer to the environment in which the calculation took place, so that
later when the formula is used, it can look up y there. For example, in
your line

  lm(y ~ ., data = dat)

from your code, the formula "y ~ ." needs to be computed before R knows
that you've explicitly listed a data frame holding the data, and before it
knows whether the variable y is in that data frame or is just a local
variable in the current function.

Since these are just pointers to the environment, this doesn't take up much
space in memory, but when you save the object to disk, a copy of the whole
environment will be made, and that can end up wasting a lot of space if the
environment contains a lot of things that aren't needed by the formula.

Duncan Murdoch

> Cheers,
> Julian
>
> [...]
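The capture Duncan describes can be sketched in a few lines (this example is not from the thread; the object names and the temp files are arbitrary, and exact file sizes depend on the platform):

```r
## A formula returned from a function keeps a pointer to that function's
## evaluation frame; save() then serializes the whole frame.
f <- function() {
  junk <- rnorm(1e6)   # a large local object the formula never uses
  y ~ x                # the formula captures f()'s environment, junk and all
}
fo <- f()

tf1 <- tempfile(); tf2 <- tempfile()
save(fo, file = tf1)             # drags junk along into the file

## Re-pointing the formula at the global environment drops the baggage
## (safe here only because the formula needs nothing from f()'s frame).
environment(fo) <- globalenv()
save(fo, file = tf2)

file.size(tf1)   # several megabytes: the serialized junk vector
file.size(tf2)   # a few hundred bytes
```

The `environment<-` reassignment is the key step: it replaces the formula's ".Environment" attribute, so nothing from the fitting frame remains reachable from the saved object.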
Bill.Venables at csiro.au
2010-Jul-11 02:10 UTC
[Rd] Large discrepancies in the same object being saved to .RData
Well, I have answered one of my questions below. The hidden environment is
attached to the 'terms' component of v1. To see this

> lapply(v1, environment)
$coefficients
NULL

$residuals
NULL

$effects
NULL

$rank
NULL

$fitted.values
NULL

$assign
NULL

$qr
NULL

$df.residual
NULL

$xlevels
NULL

$call
NULL

$terms
<environment: 0x021b9e18>

$model
NULL

> rm(junk, envir = with(v1, environment(terms)))
> usedVcells()
[1] 96532

This is still a bit of a trap for young (and old!) players... I think the
main point in my mind is why is it that object.size() excludes enclosing
environments in its reckonings?

Bill Venables.

-----Original Message-----
From: Venables, Bill (CMIS, Cleveland)
Sent: Sunday, 11 July 2010 11:40 AM
To: 'Duncan Murdoch'; 'Paul Johnson'
Cc: 'r-devel at r-project.org'; Taylor, Julian (CMIS, Waite Campus)
Subject: RE: [Rd] Large discrepancies in the same object being saved to .RData

I'm still a bit puzzled by the original question. I don't think it has much
to do with .RData files and their sizes. For me the puzzle comes much
earlier. Here is an example of what I mean using a little session

> usedVcells <- function() gc()["Vcells", "used"]
> usedVcells()  ### the base load
[1] 96345

### Now look at what happens when a function returns a formula as the
### value, with a big item floating around in the function closure:

> f0 <- function() {
+   junk <- rnorm(10000000)
+   y ~ x
+ }
> v0 <- f0()
> usedVcells()  ### much bigger than base, why?
[1] 10096355
> v0  ### no obvious environment
y ~ x
> object.size(v0)  ### so far, no clue given where
                   ### the extra Vcells are located.
372 bytes

### Does v0 have an enclosing environment?

> environment(v0)  ### yep.
<environment: 0x021cc538>
> ls(envir = environment(v0))  ### as expected, there's the junk
[1] "junk"
> rm(junk, envir = environment(v0))  ### this does the trick.
> usedVcells()
[1] 96355

### Now consider a second example where the object
### is not a formula, but contains one.

> f1 <- function() {
+   junk <- rnorm(10000000)
+   x <- 1:3
+   y <- rnorm(3)
+   lm(y ~ x)
+ }
> v1 <- f1()
> usedVcells()  ### as might have been expected.
[1] 10096455

### in this case, though, there is no
### (obvious) enclosing environment

> environment(v1)
NULL
> object.size(v1)  ### so where are the junk Vcells located?
7744 bytes
> ls(envir = environment(v1))  ### clearly will not work
Error in ls(envir = environment(v1)) : invalid 'envir' argument
> rm(v1)  ### removing the object does clear out the junk.
> usedVcells()
[1] 96366

And in this second case, as noted by Julian Taylor, if you save() the
object the .RData file is also huge. There is an environment attached to
the object somewhere, but it appears to be occluded and entirely
inaccessible. (I have poked around the object components trying to find
the thing but without success.) Have I missed something?

Bill Venables.

-----Original Message-----
From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Duncan Murdoch
Sent: Sunday, 11 July 2010 10:36 AM
To: Paul Johnson
Cc: r-devel at r-project.org
Subject: Re: [Rd] Large discrepancies in the same object being saved to .RData

On 10/07/2010 2:33 PM, Paul Johnson wrote:
> On Wed, Jul 7, 2010 at 7:12 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> [...]
>
> Hi, can I ask a follow up question?
>
> Is there a tool to browse *.Rdata files without loading them into R?

I don't know of one. You can load the whole file into an empty
environment, but then you lose information about "where did it come from"?

Duncan Murdoch

> In HDF5 (a data storage format we use sometimes), there is a CLI
> program "h5dump" that will spit out line-by-line all the contents of a
> storage entity. It will literally track through all the metadata, all
> the vectors of scores, etc. I've found that handy to "see what's
> really in there" in cases like the one that OP asked about.
> Sometimes, we find that there are things that are "in there" by
> mistake, as Duncan describes, and then we can try to figure why they
> are in there.
>
> pj

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
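As a postscript (not part of the thread): serialize() gives a file-free way to see the discrepancy Bill describes, and environment() applied to the 'terms' component reaches the occluded environment directly. A sketch mirroring his f1() example, with a smaller junk vector:

```r
## object.size() does not follow the terms environment, but
## serialization (which is what save() does) will. Comparing the two
## exposes the hidden baggage directly.
f1 <- function() {
  junk <- rnorm(1e6)
  x <- 1:3
  y <- rnorm(3)
  lm(y ~ x)
}
v1 <- f1()

object.size(v1)               # small: enclosing environments not counted
length(serialize(v1, NULL))   # large: junk rides along via the terms env

## The occluded environment hangs off the 'terms' component:
env <- environment(v1$terms)  # same as attr(v1$terms, ".Environment")
ls(env)                       # "junk" is in there, alongside x and y
rm(junk, envir = env)
length(serialize(v1, NULL))   # small again
```

Because environments have reference semantics, removing junk from env also shrinks any other reference to the same frame, such as the terms attribute of the stored model frame.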
Terry Therneau
2010-Jul-12 13:44 UTC
[Rd] Large discrepancies in the same object being saved to .RData
I only wish to add a request for further documentation of hidden
environments, their consequences, and how to turn them off. Perhaps a page
in the Extending R guide, and a suggestion for book authors.

I was bitten by this with the coxph frailty functions. They are called
during the model frame creation and create a matrix object with various
small attached functions as attributes. In creating the 'x' columns they
have to deal with factors and can create a huge transient temporary matrix
while doing so; something that will never be needed again. A user was
exceeding disk quotas when he saved a model fit.

As someone with years of experience with functional languages (which S once
was), I wasn't used to the idea that one would have to take explicit ---
and mysterious --- steps to make local variables go away. This discussion
has revealed that the hidden rules causing local variables to be kept are
more complex than I thought. Perhaps a "don't save environments" option to
save() could be added to help mere mortals get rid of all this stuff in the
attic (with its secret staircase)?

[[Soapbox on]]
Environments have proven useful for many things, and certainly aren't going
away. But to quote the bard: "Oh what tangled webs we weave, when first we
practice to deceive."

Terry Therneau
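No such save() option exists in the R versions discussed in this thread. As a workaround, the environments a fit carries can be re-pointed by hand before saving; the helper below (strip_env is a hypothetical name, not an existing function) sketches this under the assumption that nothing from the fitting frame will be needed later, e.g. by update():

```r
## Hypothetical helper: re-point the environments a fitted model carries
## (its terms, and the model frame's terms attribute) at globalenv()
## so that save()/serialize() no longer drag the fitting frame along.
strip_env <- function(fit) {
  if (!is.null(fit$terms))
    environment(fit$terms) <- globalenv()
  if (!is.null(fit$model) && !is.null(attr(fit$model, "terms")))
    environment(attr(fit$model, "terms")) <- globalenv()
  fit
}

## A fit whose terms environment has picked up a large bystander:
fit <- local({
  junk <- rnorm(1e6)
  lm(dist ~ speed, data = cars)
})

length(serialize(fit, NULL))               # large: junk is captured
length(serialize(strip_env(fit), NULL))    # small: baggage dropped
```

Other model classes stash environments in other components (as the coxph frailty example shows), so any such helper has to be adapted per class rather than applied blindly.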