Home > Back-end >  Counting data in a curved Data Frame in R
Counting data in a curved Data Frame in R

Time:07-02

There is an unloading data frame with a large number of lines (several tens of thousands). The problem is that the date frame was loaded incorrectly - there are gaps in the data.

Example:

df <- data.frame(Region=c(NA,'region2','region1',NA,'region3','region3',NA,'region5'), Event=c('region1','event1','event1','region2','event1','event1','region4','event1'), Campaign=c('event1','campaign2','campaing1','event1','campaing1','campaign2','event1',NA))

I need to count the number of Event for each Region. All I found is to apply such code:

df <- df %>%
  select(Region, Event) %>% 
  group_by(Region, Event) %>%
  summarise(Event = n())

But as you understand, this does not solve the problem :) Because it leaves a lot of missing data...

But in this date frame, we see that a certain pattern is preserved - after the Region column there is always an Event column! (The Event is always the same, but there are many Regions).

How can you use R in this case to count all the Events for each Region?

CodePudding user response:

from what I understand your main problem is in the rows of your data. Some rows have a blank in the first colum and than start with the value of the frist in the second.

df

   Region   Event  Campaign
1    <NA> region1    event1
2 region2  event1 campaign2
3 region1  event1 campaing1
4    <NA> region2    event1
5 region3  event1 campaing1
6 region3  event1 campaign2
7    <NA> region4    event1
8 region5  event1      <NA>

one possible way to correct this is, to work the data as two parts. One with the correct value allocation per column and the problematic values which are corrected and everything is binded together in the end for analysis:

library(dplyr)

dfn <- df %>% 
    # lets get the problematic values first
    dplyr::filter(is.na(Region)) %>% 
    # migrate values between columns
    dplyr::transmute(Region = Event, 
                     Event = Campaign, 
                     Campaign = NA) %>%
    # bind the data with correct allocations 
    dplyr::union_all(dplyr::filter(df, !is.na(Region)))

# now the analysis should give the wanted results
dfn %>% 
    dplyr::count(Region)

   Region n
1 region1 2
2 region2 2
3 region3 2
4 region4 1
5 region5 1

as per the comment here is an alternative way to get the value of the colum prior to "event1" occurence (same row)

df %>% 
    # work the data row wise
    dplyr::rowwise() %>% 
    # look up content in vector of column content of that row, subtract 1 and look this up the the vector of all column content
    dplyr::mutate(res = c(Region, Event, Campaign)[match("event1", c(Region, Event, Campaign)) - 1]) %>%
    # release row binding (prevents unwanted behaviour down stream)
    dplyr::ungroup() %>%
    # count
    dplyr::count(res)

# A tibble: 5 x 2
  res         n
  <chr>   <int>
1 region1     2
2 region2     2
3 region3     2
4 region4     1
5 region5     1
  • Related