There is an unloading data frame with a large number of lines (several tens of thousands). The problem is that the date frame was loaded incorrectly - there are gaps in the data.
Example:
df <- data.frame(Region=c(NA,'region2','region1',NA,'region3','region3',NA,'region5'), Event=c('region1','event1','event1','region2','event1','event1','region4','event1'), Campaign=c('event1','campaign2','campaing1','event1','campaing1','campaign2','event1',NA))
I need to count the number of Event
for each Region
.
All I found is to apply such code:
df <- df %>%
select(Region, Event) %>%
group_by(Region, Event) %>%
summarise(Event = n())
But as you understand, this does not solve the problem :) Because it leaves a lot of missing data...
But in this date frame, we see that a certain pattern is preserved - after the Region column there is always an Event column! (The Event is always the same, but there are many Regions).
How can you use R in this case to count all the Events for each Region?
CodePudding user response:
from what I understand your main problem is in the rows of your data. Some rows have a blank in the first colum and than start with the value of the frist in the second.
df
Region Event Campaign
1 <NA> region1 event1
2 region2 event1 campaign2
3 region1 event1 campaing1
4 <NA> region2 event1
5 region3 event1 campaing1
6 region3 event1 campaign2
7 <NA> region4 event1
8 region5 event1 <NA>
one possible way to correct this is, to work the data as two parts. One with the correct value allocation per column and the problematic values which are corrected and everything is binded together in the end for analysis:
library(dplyr)
dfn <- df %>%
# lets get the problematic values first
dplyr::filter(is.na(Region)) %>%
# migrate values between columns
dplyr::transmute(Region = Event,
Event = Campaign,
Campaign = NA) %>%
# bind the data with correct allocations
dplyr::union_all(dplyr::filter(df, !is.na(Region)))
# now the analysis should give the wanted results
dfn %>%
dplyr::count(Region)
Region n
1 region1 2
2 region2 2
3 region3 2
4 region4 1
5 region5 1
as per the comment here is an alternative way to get the value of the colum prior to "event1" occurence (same row)
df %>%
# work the data row wise
dplyr::rowwise() %>%
# look up content in vector of column content of that row, subtract 1 and look this up the the vector of all column content
dplyr::mutate(res = c(Region, Event, Campaign)[match("event1", c(Region, Event, Campaign)) - 1]) %>%
# release row binding (prevents unwanted behaviour down stream)
dplyr::ungroup() %>%
# count
dplyr::count(res)
# A tibble: 5 x 2
res n
<chr> <int>
1 region1 2
2 region2 2
3 region3 2
4 region4 1
5 region5 1