We are new to R and are evaluating whether we can use it for a project we need to do. We have read that R is not well suited to handling very large data sets. Assuming we have the data prepped and stored in an RDBMS (Oracle, Teradata, SQL Server), what can R reasonably handle from a volume perspective? Are there guidelines on memory/machine sizing based on data volume? We need to be able to handle millions of rows from several sources. Any advice is much appreciated. Thanks.
Dear Jeff,

R works fine for the 220,000 rows I tested on a home PC running Windows XP. Memory is limited by the hardware you have; I suggest beefing up RAM to 2 GB (and disk space) and then working it out. I evaluated R too, on my site www.decisionstats.com, and I found it comparable to, if not better than, SPSS and SAS.

As a beginner, and for corporate projects, try the GUI R Commander or the data mining GUI Rattle. They are faster, help you skip some steps, and let you look at the generated code side by side to learn the language. I am not sure about the server/client version, but that should work too.

Also look at the book http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf, which serves well as a reference guide. The rest of the details are on my site www.decisionstats.com. You could also try the software WPS (http://www.teamwpc.co.uk/products/wps), which uses the SAS language and provides the same functionality at 10-20% of the cost for millions of rows.

Hope this helps,
Ajay
On Tue, Apr 08, 2008 at 09:26:22AM -0500, Jeff Royce wrote:
> We need to be able to handle millions of rows from several sources.

As so often, the answer is "it depends". R does not have an inherent maximum number of rows it can deal with; the available memory determines how big a dataset you can fit into RAM. So often the answer is simply "yes, just buy more RAM". A couple of million rows are no problem at all if you don't have too many columns (I've done that).

If you really have a very large set of data which you cannot fit into memory, you may still be able to use R. Do you really need ALL the data in memory at the same time? Often, very large datasets actually contain many different subsets which you want to analyze separately anyway. Storing the full data in an RDBMS and selecting the required subsets as needed is the best solution in that case.

In your situation, I would simply load the full dataset into R and see what happens.

cu
Philipp

--
Dr. Philipp Pagel                               Tel. +49-8161-71 2131
Lehrstuhl für Genomorientierte Bioinformatik    Fax. +49-8161-71 2186
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany

and

Institut für Bioinformatik und Systembiologie / MIPS
Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt
Ingolstädter Landstrasse 1
85764 Neuherberg, Germany
http://mips.gsf.de/staff/pagel
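Philipp's "it depends on RAM" point can be made concrete with a back-of-the-envelope sizing rule: each numeric value in R is a double taking 8 bytes, so a table of r rows by c numeric columns needs roughly 8 * r * c bytes, and (as an assumption, not a hard rule) you would want perhaps 2-3x that free for the copies R makes during analysis. A minimal sketch in base R, checking the estimate against R's own accounting:

```r
# Rule of thumb: 8 bytes per double, so rows * cols * 8 bytes of raw data.
rows <- 5e6
cols <- 10
est_bytes <- 8 * rows * cols          # ~400 MB for 5 million rows x 10 numeric cols
est_bytes / 2^20                      # in MiB

# Sanity-check the rule against object.size() on a smaller data frame:
df <- as.data.frame(replicate(cols, numeric(1e5)))
print(object.size(df), units = "Mb")  # roughly 7.6 Mb, matching 8 * 1e5 * 10 bytes
```

The per-column overhead of a data frame is small, so for millions of rows the 8-bytes-per-value estimate dominates; character columns are harder to size this way because strings share storage in R's string cache.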
> We need to be able to handle millions of rows from several sources.

The most important thing is: what type of analysis do you want to do with the data? Is the algorithm that implements the analysis O(n), O(n log n), or O(n^2)?

Hadley

--
http://had.co.nz/
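Hadley's question about complexity can be answered empirically: time the analysis at a few doubling input sizes, and see how the elapsed time grows. A ratio near 2 between successive timings suggests O(n); a ratio near 4 suggests O(n^2). A small sketch (the choice of `lm` as the example analysis is mine, not from the thread):

```r
# Time a candidate analysis at doubling sizes to see how it scales.
# lm() is roughly linear in the number of rows for a fixed model.
sizes <- c(1e4, 2e4, 4e4)
scaling <- sapply(sizes, function(n) {
  x <- rnorm(n)
  y <- 2 * x + rnorm(n)
  system.time(lm(y ~ x))["elapsed"]   # elapsed seconds for this n
})
names(scaling) <- sizes
scaling
```

Timings at these small sizes are noisy, so in practice you would use larger sizes and repeat each measurement a few times before trusting the ratios.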
Millions of rows can be a problem if everything is loaded into memory, depending on the type of data. Numeric columns should be fine, but if you have strings and want to process based on those columns (string comparisons, etc.), it will be slow. You may want to combine the sources outside R, with stored procedures perhaps, and then load the result into R: joining data within R code can be costly if you are selecting from a data frame based on a string.

Personally, I have run into out-of-memory problems only beyond 1 GB of data on a 32-bit Windows system with 3 GB of RAM, and that happens with C++ as well. Regarding speed, I find MATLAB faster than R for matrix operations; in other areas they are in the same range. R is much better to program in, as it has a much more complete programming language. R can use multiple cores/CPUs with a suitable multi-threaded linear algebra library, though this only helps for linear algebra operations. A 64-bit binary of R is not available for Windows.

Sankalp
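One way to soften the string-selection cost Sankalp describes, if the joins do stay inside R: instead of rescanning a character column with `==` for every lookup, build an index over the key column once and reuse it. A sketch in base R (the column names `key` and `val` are illustrative, not from the thread):

```r
set.seed(1)
n <- 1e5
df <- data.frame(key = sample(LETTERS, n, replace = TRUE),
                 val = rnorm(n), stringsAsFactors = FALSE)

# Naive selection: compares all n strings on every lookup.
slow <- df[df$key == "Q", ]

# Build a key -> row-indices index once, then look up by name.
idx  <- split(seq_len(n), df$key)
fast <- df[idx[["Q"]], ]

identical(slow$val, fast$val)   # same rows either way
```

The one-time `split()` pays off when the same data frame is queried repeatedly; for a single selection the naive comparison is fine.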