laurent.duperval@microcell.ca
2003-Mar-19 16:45 UTC
[R] How would I analyse data like this?
Hello,

I'm a new R user and I'm having a little trouble getting started. I'm hoping
someone can help me out.

I have data that looks like this:

phone|state|code|amount|left|channel|time|mtd
15555551234|3|983|1000|266|IN|2003-03-16 23:57:21-05|C
15555552345|3|983|3000|0|IN|2003-03-16 23:58:16-05|C
15555552346|3|983|1000|40|IN|2003-03-16 23:58:24-05|C

Which I've read using scan().

data <- scan(file = "data.dat", what = list("",0,0,0,0,"","",""), sep = "|", skip = 1)

Now, I want to do things like this:

- A histogram for the 5th column for every 50 units. I can generate the
  histogram but most of my values are between 0-500. A few are above that.
  I'd like to bundle them all in a generic 500+ category. I can't figure out
  how. This is what I'm doing:

  hist(data[[5]], br=c(0,50,100,150,200,250,300,350,400,450,500,1000))
  Error in hist.default(data[[5]], br = c(0, 50, 100, 150, 200, 250, 300, :
  some `x' not counted; maybe `breaks' do not span range of `x'

- How do I count the number of times channel "IN" occurs with code = 983? How
  about if I want to combine IN and code=983 or 982 or 981?

- Finally (for today at least) how do I count the number of times code=983 and
  date=2003-03-16 (without the time) occur. I'm hoping this will also help
  me build histograms for days of the week and for hours of the day.

Thanks,

L

--
Laurent Duperval <laurent.duperval at microcell.ca>

ZYMURGY'S FIRST LAW OF EVOLVING SYSTEM DYNAMICS
Once you open a can of worms, the only way to recan them is to use
a larger can.
Laurent -

Good for you for figuring out scan().

For hist(), append one more number to "breaks" that's guaranteed to be above
all the data values. (In the return value from hist(), "breaks" has n+1
entries when the histogram has n bars, and it's a good guess that
input=output.) For example:

hist(data[[5]], br=c(50*(seq(11)-1), 1000, 1+max(data[[5]])))

For counting, try the function table(), e.g.:

table(data[[3]], data[[6]])
table(data[[3]], data[[7]], data[[6]])

- tom blackwell - u michigan medical school - ann arbor -
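One way to get a single "500+" category is to bin the values first and then count the bins, rather than asking hist() to draw a very wide last bar. The following is only a sketch: the vector `left` is made-up sample data standing in for data[[5]], and the bin labels are an arbitrary choice that assumes integer values.

```r
# Hypothetical sample values standing in for data[[5]] (the "left" column)
left <- c(266, 0, 40, 120, 499, 500, 750, 1260)

# Bin edges every 50 units up to 500, then one open-ended top bin
edges <- c(seq(0, 500, by = 50), Inf)

# Labels for integer data: "0-49", "50-99", ..., "450-499", "500+"
labels <- c(paste(seq(0, 450, by = 50), seq(49, 499, by = 50), sep = "-"),
            "500+")

# right = FALSE makes the bins [0,50), [50,100), ..., [500,Inf)
bins <- cut(left, breaks = edges, labels = labels, right = FALSE)

# One count per bin, including bins that are empty
counts <- table(bins)
print(counts)
```

Because the result is a plain table of counts, every category is treated as equally wide when it is plotted or printed, whatever range of values the top bin covers.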
laurent.duperval@microcell.ca
2003-Mar-19 17:40 UTC
[R] How would I analyse data like this?
On 19 Mar, james.holtman at convergys.com wrote:
> Have you tried:
> data <- read.table("data.dat", header=TRUE, sep="|", as.is=TRUE)

Yes I did. However, it takes a LOT more time because of the date/time string.
The result looks like this:

str(data)
`data.frame': 317437 obs. of 8 variables:
 $ phone   : num 1.52e+10 1.42e+10 1.82e+10 1.65e+10 1.65e+10 ...
 $ state   : int 3 3 3 3 3 3 3 3 3 3 ...
 $ code    : int 983 983 983 983 3000 983 983 983 983 5203 ...
 $ amount  : int 1000 1000 2500 2500 2500 1000 1000 2500 2500 2500 ...
 $ left    : int 260 0 0 25 0 1260 273 0 0 0 ...
 $ channel : Factor w/ 5 levels "CSR","IN","IVR",..: 2 5 4 2 3 2 2 3 4 3 ...
 $ time    : Factor w/ 312198 levels "2002-10-16 ..",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ mtd     : Factor w/ 2 levels "C","D": 1 1 1 1 1 1 1 1 1 1 ...

I think the 312198 factor levels for time are wrong. Also, the phone column is
a string, not a number. I didn't see how to specify that with read.table().

(In my original post, I think I forgot to mention that I had over 300,000
entries in my file.)

> change your 'br' range to:
> br=c(0,50,100,150,200,250,300,350,400,450,500,1e10)
> to make sure that you include everything in the last range.

I tried that, but the result is a graph that is too wide. It treats the range
as numerical values instead of bins. Well, to me, anyway. If it's acceptable
policy, I can post a screenshot of the result here (about 25K). Everything is
bunched up on the left, but the right portion is much larger and contains
nothing.

>> - How do I count the number of times channel "IN" occurs with code = 983?
>> How about if I want to combine IN and code=983 or 982 or 981?
> sum(data$channel == "IN" & (data$code %in% c(983,982,981)))

Thanks, I'll try that.

>> - Finally (for today at least) how do I count the number of times code=983
>> and date=2003-03-16 (without the time) occur. I'm hoping this will also
>> help me build histograms for days of the week and for hours of the day.
> You need to split off the date from that column with:
>
> data$date <- unlist(lapply(strsplit(data$time, " "), function(x) x[1])) # get just date
> counts <- table(list(data$date, data$code)) # computes all the counts at once into matrix

Ok, I'll try all this. While I was writing this message, a few more answers
came in. Let me try all those before I reply to them.

Thanks to all,

L

--
Laurent Duperval <laurent.duperval at microcell.ca>

"I'm not going to do my maths homework. Look at these unsolved problems.
Here's a number in mortal combat with another. One of them is going to get
subtracted. But why? What will be left of him? If I answered these, it would
kill the suspense. It would resolve the conflict and turn intriguing
possibilities into boring old facts."
"I never really thought about the literary possibilities of maths."
"I prefer to savour the mystery."
-Calvin & Hobbes
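The points raised in this message (keeping phone as a string, counting rows that match several conditions, and splitting the date from the time) can be sketched together as follows. This is an illustration under assumptions, not a tested recipe for the 300,000-row file: the three sample rows from the original post are inlined so the code runs on its own, `dat`, `n_in`, `n_day` and `stamp` are made-up names, and the trailing "-05" offset is simply dropped, with times treated as UTC. Declaring colClasses may also reduce read time, since read.table() no longer has to guess types or build a factor for the time column, but that is not measured here.

```r
# Three sample rows from the original post, inlined so the sketch is self-contained
lines <- c(
  "phone|state|code|amount|left|channel|time|mtd",
  "15555551234|3|983|1000|266|IN|2003-03-16 23:57:21-05|C",
  "15555552345|3|983|3000|0|IN|2003-03-16 23:58:16-05|C",
  "15555552346|3|983|1000|40|IN|2003-03-16 23:58:24-05|C"
)

# colClasses declares each column's type, so phone and time stay character
dat <- read.table(text = lines, header = TRUE, sep = "|",
                  colClasses = c("character", "integer", "integer", "integer",
                                 "integer", "character", "character",
                                 "character"))

# Elementwise & gives one TRUE/FALSE per row; sum() counts the TRUEs
n_in <- sum(dat$channel == "IN" & dat$code %in% c(981, 982, 983))

# The date is the first 10 characters of the timestamp
dat$date <- substr(dat$time, 1, 10)
n_day <- sum(dat$code == 983 & dat$date == "2003-03-16")

# Parse the first 19 characters (date and time, without the offset)
stamp <- as.POSIXct(substr(dat$time, 1, 19),
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Counts per hour of day and per day of week
by_hour <- table(format(stamp, "%H"))
by_weekday <- table(weekdays(stamp))
```

The two tables at the end can be passed to barplot() to get the hour-of-day and day-of-week charts mentioned in the original question.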
> ... the graph is too wide ... (from hist()).

Ah, yes. Try setting parameter xlim=c(0,1200) in the call to hist(). If that
doesn't work, then catch the return value from hist(), and re-plot using
barplot(). For example:

temp <- hist(data[[5]], breaks=c(50*(seq(11)-1), 1000, max(data[[5]])))
barplot(temp$counts)   # bars are drawn at unit spacing, so no xlim is needed

- tom blackwell - u michigan medical school - ann arbor -
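The hist()-then-barplot() approach can be sketched as below, with the bars labelled so the last one reads as "500+". It is a sketch under assumptions: `left` is made-up sample data standing in for data[[5]], plot = FALSE is used so that hist() only computes the counts, and right = FALSE is chosen so that a value of exactly 500 lands in the top bin.

```r
# Hypothetical sample values standing in for data[[5]]
left <- c(266, 0, 40, 120, 499, 500, 750, 1260)

# The last break sits just above the largest value so every point is counted
top <- max(left) + 1
h <- hist(left, breaks = c(seq(0, 500, by = 50), top),
          right = FALSE, plot = FALSE)

# Label each bar with its lower edge; the final bar collects everything >= 500
bar_labels <- c(head(h$breaks, -2), "500+")

# barplot() gives every bar the same width, whatever range its bin covers
barplot(h$counts, names.arg = bar_labels, las = 2,
        xlab = "left", ylab = "count")
```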