Data file organized in block per year: add column for year to combine all data-CodePudding

I have large quantities of datasets in Excel that I would like to analyze in R. The files have a format that organizes all information per block of the same year, which looks like:

Group    <- c(2010, 'Group', 'A', 'B', 'C', 2011, 'Group', 'A', 'B', 'E', 2012, 'Group', 'A', 'B')
Value <- c(NA,'Value', 1, 2, 9, NA, 'Value', 3, 5, 2, NA, 'Value', 9, 1)

df <- cbind(Group, Value)
        
  Group   Value  
 1: 2010  NA     
 2: Group Value
 3: A     1    
 4: B     2    
 5: C     9    
 6: 2011  NA     
 7: Group Value
 8: A     3    
 9: B     5    
10: E     2    
11: 2012  NA     
12: Group Value
13: A     9  
14: B     1

To be able to analyze the data, I would like to automatically add a column for the year so that all data can be combined, as follows:

   Year Group Value  
 1: 2010  A  1       
 2: 2010  B  2
 3: 2010  C  9        
 4: 2011  A  3     
 5: 2011  B  5        
 6: 2011  E  2
 7: 2012  A  9 
 8: 2012  B  1

CodePudding user response：

Here is one method with tidyverse - create the 'Year' column where the 'Group' values have 4 digits numbers, then filter out the 'Group' rows where value is 'Group', fill the 'Year' column with the previous non-NA values, filter out the first row with duplicated and convert the type (type.convert)

library(dplyr)
library(stringr)
library(tidyr)
 df %>%
   mutate(Year = case_when(str_detect(Group, "^\\d{4}$") ~ Group)) %>% 
   filter(Group != 'Group') %>% 
   fill(Year) %>%
   filter(duplicated(Year)) %>%
   type.convert(as.is = TRUE) %>%
   select(Year, Group, Value)

-output

   Year Group Value
1 2010     A     1
2 2010     B     2
3 2010     C     9
4 2011     A     3
5 2011     B     5
6 2011     E     2
7 2012     A     9
8 2012     B     1

data

df <- data.frame(Group, Value)

CodePudding user response：

library(data.table)
dt <- data.table(df)
dt[, Year := Group[1], cumsum(is.na(Value))][Value != 'Value']

   Group Value Year
1:     A     1 2010
2:     B     2 2010
3:     C     9 2010
4:     A     3 2011
5:     B     5 2011
6:     E     2 2011
7:     A     9 2012
8:     B     1 2012

in Base R:

subset(transform(df, Year = ave(Group, cumsum(is.na(Value)), FUN=\(x)x[1])), Value != 'Value')
   Group Value Year
3      A     1 2010
4      B     2 2010
5      C     9 2010
8      A     3 2011
9      B     5 2011
10     E     2 2011
13     A     9 2012
14     B     1 2012

Note that the above columns are character. You can use type.convert(new_df, as.is = TRUE) where new_df is the resultant df to convert the columns to respective classes