I have a lot of old untidy data in excel files (50 sheets with 400-500 rows each). A part of my data looks like this:
Elements= c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega")
Units= c("NA", "3", "5", "6", "NA", "21", "19", "31", "24", "1", "5", "NA", "21", "21", "22")
df= data.frame(Elements, Units)
In my large data set each project has a very different number of rows.
I want to create new column "Group" where each line for every project is defined. For the above sample the results would be like this
Group =c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3)
df =c(Elements, Units, Group)
But I also would like to sum values from the column "Units" for all values below each "empty" cell into a new column "Sum".
Sum= c("14", "NA", "NA", "NA", "101", "NA", "NA", "NA", "NA", "NA", "NA", "9", "NA", "NA", "NA")
My final product would then look like this:
df =c(Elements, Units, Group, Sum)
CodePudding user response:
You could do sth. like this:
Note that I changed your input example so that missings are not declared as strings of "NA", but are real missings (NA):
df <- data.frame(Elements = c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega"),
Units = c(NA, "3", "5", "6", NA, "21", "19", "31", "24", "1", "5", NA, "21", "21", "22"))
library(tidyverse)
df %>%
mutate(project = if_else(is.na(Units), Elements, NA_character_),
Units = as.numeric(Units)) %>%
fill(project) %>%
group_by(project) %>%
filter(row_number() != 1) %>%
mutate(Sum = if_else(row_number() == 1, sum(Units, na.rm = TRUE), NA_real_)) %>%
ungroup()
# A tibble: 12 x 4
Elements Units project Sum
<chr> <dbl> <chr> <dbl>
1 John 3 Project name ONE 14
2 Smith 5 Project name ONE NA
3 Sara 6 Project name ONE NA
4 stardust 21 Project name TWO 101
5 soil 19 Project name TWO NA
6 sunflower 31 Project name TWO NA
7 juice 24 Project name TWO NA
8 doe 1 Project name TWO NA
9 tobacco 5 Project name TWO NA
10 phi 21 Project name THREE 64
11 rho 21 Project name THREE NA
12 omega 22 Project name THREE NA
So what are we doing?
- We define the project (or group) by taking the "Elements" for which Units are NA and then fill it downwards.
- We also turn your Units column to numeric (in your example it is a character variable).
- We then group by the project.
- We filter out the first row because it contains the project name which now has its own column.
- And then we calculate the sum of Units for each project and put taht into the first row of each project.
If you don't want to cut teh first row that contains the project name in "Elements" you can simply delete the code line with filter(...)
.
CodePudding user response:
You could also do:
df %>%
type_convert() %>%
group_by(grp = cumsum(is.na(Units))) %>%
mutate(Sum = (NA^(row_number() != 1))* sum(Units, na.rm = TRUE))
# A tibble: 15 x 4
# Groups: grp [3]
Elements Units grp Sum
<chr> <dbl> <int> <dbl>
1 Project name ONE NA 1 14
2 John 3 1 NA
3 Smith 5 1 NA
4 Sara 6 1 NA
5 Project name TWO NA 2 101
6 stardust 21 2 NA
7 soil 19 2 NA
8 sunflower 31 2 NA
9 juice 24 2 NA
10 doe 1 2 NA
11 tobacco 5 2 NA
12 Project name THREE NA 3 64
13 phi 21 3 NA
14 rho 21 3 NA
15 omega 22 3 NA