I have large quantities of datasets in Excel that I would like to analyze in R. The files have a format that organizes all information per block of the same year, which looks like:
Group <- c(2010, 'Group', 'A', 'B', 'C', 2011, 'Group', 'A', 'B', 'E', 2012, 'Group', 'A', 'B')
Value <- c(NA,'Value', 1, 2, 9, NA, 'Value', 3, 5, 2, NA, 'Value', 9, 1)
df <- cbind(Group, Value)
Group Value
1: 2010 NA
2: Group Value
3: A 1
4: B 2
5: C 9
6: 2011 NA
7: Group Value
8: A 3
9: B 5
10: E 2
11: 2012 NA
12: Group Value
13: A 9
14: B 1
To be able to analyze the data, I would like to automatically add a column for the year so that all data can be combined, as follows:
Year Group Value
1: 2010 A 1
2: 2010 B 2
3: 2010 C 9
4: 2011 A 3
5: 2011 B 5
6: 2011 E 2
7: 2012 A 9
8: 2012 B 1
CodePudding user response:
Here is one method with tidyverse
- create the 'Year' column where the 'Group' values have 4 digits numbers, then filter
out the 'Group' rows where value is 'Group', fill
the 'Year' column with the previous non-NA values, filter
out the first row with duplicated
and convert the type (type.convert
)
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(Year = case_when(str_detect(Group, "^\\d{4}$") ~ Group)) %>%
filter(Group != 'Group') %>%
fill(Year) %>%
filter(duplicated(Year)) %>%
type.convert(as.is = TRUE) %>%
select(Year, Group, Value)
-output
Year Group Value
1 2010 A 1
2 2010 B 2
3 2010 C 9
4 2011 A 3
5 2011 B 5
6 2011 E 2
7 2012 A 9
8 2012 B 1
data
df <- data.frame(Group, Value)
CodePudding user response:
library(data.table)
dt <- data.table(df)
dt[, Year := Group[1], cumsum(is.na(Value))][Value != 'Value']
Group Value Year
1: A 1 2010
2: B 2 2010
3: C 9 2010
4: A 3 2011
5: B 5 2011
6: E 2 2011
7: A 9 2012
8: B 1 2012
in Base R:
subset(transform(df, Year = ave(Group, cumsum(is.na(Value)), FUN=\(x)x[1])), Value != 'Value')
Group Value Year
3 A 1 2010
4 B 2 2010
5 C 9 2010
8 A 3 2011
9 B 5 2011
10 E 2 2011
13 A 9 2012
14 B 1 2012
Note that the above columns are character. You can use type.convert(new_df, as.is = TRUE)
where new_df
is the resultant df to convert the columns to respective classes