thr3ads.net - R help - [R] How would I analyse data like this? [Mar 2003]

laurent.duperval@microcell.ca

2003-Mar-19 16:45 UTC

[R] How would I analyse data like this?

Hello,

I'm a new R user and I'm having a little trouble getting started.
I'm hoping someone can help me out.

I have data that looks like this:

phone|state|code|amount|left|channel|time|mtd
15555551234|3|983|1000|266|IN|2003-03-16 23:57:21-05|C
15555552345|3|983|3000|0|IN|2003-03-16 23:58:16-05|C
15555552346|3|983|1000|40|IN|2003-03-16 23:58:24-05|C

Which I've read using scan().

data <- scan(file = "data.dat", what = list("",0,0,0,0,"","",""),
             sep = "|", skip = 1)

Now, I want to do things like this:

- A histogram for the 5th column for every 50 units. I can generate the
  histogram but most of my values are between 0-500. A few are above that.
  I'd like to bundle them all in a generic 500+ category. I can't figure
  out how. This is what I'm doing:

hist(data[[5]], br=c(0,50,100,150,200,250,300,350,400,450,500,1000))
Error in hist.default(data[[5]], br = c(0, 50, 100, 150, 200, 250, 300,  : 
	some `x' not counted; maybe `breaks' do not span range of `x'


- How do I count the number of times channel "IN" occurs with code = 983?
  How about if I want to combine IN and code=983 or 982 or 981?

- Finally (for today at least) how do I count the number of times code=983 and
  date=2003-03-16 (without the time) occur. I'm hoping this will also help
  me build histograms for days of the week and for hours of the day.

  Thanks,

  L
  
-- 
Laurent Duperval <laurent.duperval at microcell.ca>

ZYMURGY'S FIRST LAW OF EVOLVING SYSTEM DYNAMICS
    Once you open a can of worms, the only way to recan them is to use
    a larger can.

Thomas W Blackwell

2003-Mar-19 17:22 UTC

Laurent  -

Good for you for figuring out scan().  For hist(), append one more
number to "breaks" that's guaranteed to be above all the data values.
(In the return value from hist(), "breaks" has n+1 entries when the
histogram has n bars, and it's a good guess that input=output.)

For example:
hist(data[[5]], br=c(50*(seq(11)-1), 1000, 1+max(data[[5]])))

For counting, try the function table().  eg:

table(data[[3]], data[[6]])
table(data[[3]], data[[7]], data[[6]])
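
As a minimal sketch of how the result of table() can be read, assuming two
toy vectors named code and channel (made up here; they stand in for
data[[3]] and data[[6]]), individual counts can be picked out of the
cross-tabulation by label:

```r
# Toy data in the same shape as the post (values are made up for illustration)
code    <- c(983, 983, 982, 981, 983, 3000)
channel <- c("IN", "IN", "IVR", "IN", "CSR", "IN")

# Cross-tabulate code by channel: one cell per (code, channel) pair
counts <- table(code, channel)
print(counts)

# Look up a single cell by its labels (the labels are character strings)
n_983_in <- counts["983", "IN"]

# Add up several rows of one column: codes 981, 982 or 983 on channel "IN"
n_98x_in <- sum(counts[c("981", "982", "983"), "IN"])
```

Indexing by label only works for values that actually occur in the data, so
a code that never appears would need to be checked for first.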

-  tom blackwell  -  u michigan medical school  -   ann arbor  -



laurent.duperval@microcell.ca

2003-Mar-19 17:40 UTC

On 19 Mar, james.holtman at convergys.com wrote:
> Have you tried:
>       data <- read.table("data.dat", header=TRUE, sep="|", as.is=TRUE)
> 
Yes I did. However, it takes a LOT more time because of the date/time
string. The result looks like this:


str(data)
`data.frame':	317437 obs. of  8 variables:
 $ phone   : num  1.52e+10 1.42e+10 1.82e+10 1.65e+10 1.65e+10 ...
 $ state   : int  3 3 3 3 3 3 3 3 3 3 ...
 $ code    : int  983 983 983 983 3000 983 983 983 983 5203 ...
 $ amount  : int  1000 1000 2500 2500 2500 1000 1000 2500 2500 2500 ...
 $ left    : int  260 0 0 25 0 1260 273 0 0 0 ...
 $ channel : Factor w/ 5 levels "CSR","IN","IVR",..: 2 5 4 2 3 2 2 3 4 3 ...
 $ time    : Factor w/ 312198 levels "2002-10-16 ..",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ mtd     : Factor w/ 2 levels "C","D": 1 1 1 1 1 1 1 1 1 1 ...

I think the 312198 factor levels for the time column are wrong. Also, the
phone column should be a string, not a number. I didn't see how to specify
that with read.table(). (In my original post, I think I forgot to mention
that I had over 300,000 entries in my file.)
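
One possible way to control the column types is the colClasses argument of
read.table(). The sketch below is only an illustration: it reads the three
sample rows from the original post out of a string (the name dat and the use
of text= instead of a file name are choices made here), keeping phone and
time as character strings so that no factor is built:

```r
# Three sample rows copied from the original post
txt <- "phone|state|code|amount|left|channel|time|mtd
15555551234|3|983|1000|266|IN|2003-03-16 23:57:21-05|C
15555552345|3|983|3000|0|IN|2003-03-16 23:58:16-05|C
15555552346|3|983|1000|40|IN|2003-03-16 23:58:24-05|C"

# One class per column, in file order; "character" prevents conversion
# of phone to a number and of time to a factor
dat <- read.table(text = txt, header = TRUE, sep = "|",
                  colClasses = c("character", "integer", "integer", "integer",
                                 "integer", "character", "character", "character"))
str(dat)
```

For the real file, text = txt would be replaced by the file name, for
example "data.dat".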
> change your 'br' range to:
>       br=c(0,50,100,150,200,250,300,350,400,450,500,1e10)
> to make sure that you include everything in the last range.
I tried that, but the result is a graph that is too wide. It treats the range
as numerical values instead of bins. Well, to me, anyway. If it's acceptable
policy, I can post a screenshot of the result here (about 25K). Everything is
bunched up on the left, but the right portion is much larger and contains
nothing.
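
A sketch of one way to avoid the wide, empty right-hand side, assuming it is
acceptable to lose the exact values above 500: move them all into a single
extra bin of the same width as the others. The vector left below is made up
for illustration, and 525 is just an arbitrary point inside the (500, 550]
bin.

```r
# Made-up values: most below 500, a few far above
left <- c(0, 25, 40, 260, 266, 273, 499, 500, 1260, 7000)

# Move everything above 500 into the middle of a single (500, 550] bin
clamped <- ifelse(left > 500, 525, left)

# Equal-width breaks, so bar heights are plain counts and the axis stays narrow
h <- hist(clamped, breaks = seq(0, 550, by = 50), plot = FALSE)
h$counts

# To draw it: plot(h, xlab = "left (last bar = 500+)", main = "")
```

The last bar then stands for "500+", which has to be said in the axis label
because hist() itself still treats the axis as numeric.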
>> - How do I count the number of times channel "IN" occurs with code = 983?
>>   How about if I want to combine IN and code=983 or 982 or 981?
> sum(data$channel == "IN" & (data$code %in% c(983, 982, 981)))
Thanks, I'll try that.
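
A small self-contained sketch of this kind of count, using made-up vectors
code and channel in place of the real columns. The element-wise operator &
compares the vectors position by position, whereas && is meant for single
TRUE/FALSE values, so & is the one to use when counting rows:

```r
# Toy data (made up for illustration)
code    <- c(983, 983, 982, 981, 983, 3000)
channel <- c("IN", "IN", "IVR", "IN", "CSR", "IN")

# TRUE counts as 1 in sum(), so summing a logical vector counts matching rows
n_in_983 <- sum(channel == "IN" & code == 983)

# %in% tests membership in a set of codes
n_in_any <- sum(channel == "IN" & code %in% c(981, 982, 983))
```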
>>
>> - Finally (for today at least) how do I count the number of times code=983 and
>>   date=2003-03-16 (without the time) occur. I'm hoping this will also help
>>   me build histograms for days of the week and for hours of the day.
> You need to split off the date from that column with:
>
> data$date <- unlist(lapply(strsplit(data$time, " "), function(x) x[1]))  # get just date
> counts <- table(list(data$date, data$code))  # computes all the counts at once into matrix
> 
Ok, I'll try all this.
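
Because every timestamp in the sample has the same fixed layout, a sketch
using substr() is another option, assuming the time column is a character
vector (a factor would first need as.character()). The vectors below are made
up for illustration, and the names day and hour are choices made here:

```r
# Toy timestamps in the format shown in the original post
time <- c("2003-03-16 23:57:21-05", "2003-03-16 23:58:16-05",
          "2003-03-17 00:03:24-05")
code <- c(983, 983, 983)

day  <- substr(time, 1, 10)                # characters 1-10: "YYYY-MM-DD"
hour <- as.integer(substr(time, 12, 13))   # characters 12-13: hour of day

# Rows with code 983 on one particular day
n_day <- sum(code == 983 & day == "2003-03-16")

# Counts per hour of the day and per day of the week
# (weekday names depend on the current locale)
by_hour    <- table(hour)
by_weekday <- table(weekdays(as.Date(day)))
```

Either table can be passed to barplot() to get the per-hour or per-weekday
picture.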

While I was writing this message, a few more answers came in. Let me try
all those before I reply to them.


Thanks to all,

L

-- 
Laurent Duperval <laurent.duperval at microcell.ca>

"I'm not going to do my maths homework. Look at these unsolved
problems. Here's a number in mortal combat with another. One of them is
going to get subtracted. But why? What will be left of him? If I answered these,
it would kill the suspense. It would resolve the conflict and turn intriguing
possibilities into boring old facts."
"I never really thought about the literary possibilities of maths."
"I prefer to savour the mystery."
                                           -Calvin & Hobbes

Thomas W Blackwell

2003-Mar-19 18:14 UTC

> ... the graph is too wide ... (from hist()).
Ah, yes.

Try setting parameter  xlim=c(0,1200)  in the call to hist().

If that doesn't work, then catch the return value from hist(),
and re-plot using barplot().  For example:

temp <- hist(data[[5]], breaks=c(50*(seq(11)-1), 1000, max(data[[5]])))
barplot(temp$counts, xlim=c(0,600))
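
A sketch of the same idea with bin labels attached, so that the last bar
reads "500+". It assumes made-up values in a vector called left and at least
one value above 500 (otherwise the last two breaks would not be increasing);
hist() is called with plot = FALSE only to do the counting:

```r
# Made-up values: most below 500, a few far above
left <- c(0, 25, 40, 260, 266, 273, 499, 500, 1260, 7000)

# Ten bins of width 50, then one last bin from 500 up to the maximum
brk <- c(seq(0, 500, by = 50), max(left))
h   <- hist(left, breaks = brk, plot = FALSE)

# Label the bars by their range, with a generic label for the last one
labels <- c(paste(brk[1:10], brk[2:11], sep = "-"), "500+")
counts <- h$counts
names(counts) <- labels
counts

# To draw it: barplot(counts, las = 2)
```

barplot() draws every bar with the same width, so the 500+ bin no longer
stretches the axis.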

-  tom blackwell  -  u michigan medical school  -  ann arbor  -
