Define group between empty column values (untidy data with no data "title rows")-CodePudding

I have a lot of old untidy data in excel files (50 sheets with 400-500 rows each). A part of my data looks like this:

Elements= c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega")

Units= c("NA", "3", "5", "6", "NA", "21", "19", "31", "24", "1", "5", "NA", "21", "21", "22")

df= data.frame(Elements, Units)

In my large data set each project has a very different number of rows.

I want to create new column "Group" where each line for every project is defined. For the above sample the results would be like this

Group =c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3)

df =c(Elements, Units, Group)

But I also would like to sum values from the column "Units" for all values below each "empty" cell into a new column "Sum".

Sum= c("14", "NA", "NA", "NA", "101", "NA", "NA", "NA", "NA", "NA", "NA", "9", "NA", "NA", "NA")

My final product would then look like this:

df =c(Elements, Units, Group, Sum)

CodePudding user response：

You could do sth. like this:

Note that I changed your input example so that missings are not declared as strings of "NA", but are real missings (NA):

df <- data.frame(Elements = c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega"),
                 Units    = c(NA, "3", "5", "6", NA, "21", "19", "31", "24", "1", "5", NA, "21", "21", "22"))

library(tidyverse)
df %>%
  mutate(project = if_else(is.na(Units), Elements, NA_character_),
         Units   = as.numeric(Units)) %>%
  fill(project) %>%
  group_by(project) %>%
  filter(row_number() != 1) %>%
  mutate(Sum = if_else(row_number() == 1, sum(Units, na.rm = TRUE), NA_real_)) %>%
  ungroup()

# A tibble: 12 x 4
   Elements  Units project              Sum
   <chr>     <dbl> <chr>              <dbl>
 1 John          3 Project name ONE      14
 2 Smith         5 Project name ONE      NA
 3 Sara          6 Project name ONE      NA
 4 stardust     21 Project name TWO     101
 5 soil         19 Project name TWO      NA
 6 sunflower    31 Project name TWO      NA
 7 juice        24 Project name TWO      NA
 8 doe           1 Project name TWO      NA
 9 tobacco       5 Project name TWO      NA
10 phi          21 Project name THREE    64
11 rho          21 Project name THREE    NA
12 omega        22 Project name THREE    NA

So what are we doing?

We define the project (or group) by taking the "Elements" for which Units are NA and then fill it downwards.
We also turn your Units column to numeric (in your example it is a character variable).
We then group by the project.
We filter out the first row because it contains the project name which now has its own column.
And then we calculate the sum of Units for each project and put taht into the first row of each project.

If you don't want to cut teh first row that contains the project name in "Elements" you can simply delete the code line with filter(...).

CodePudding user response：

You could also do:

df %>%
  type_convert() %>%
  group_by(grp = cumsum(is.na(Units))) %>%
  mutate(Sum = (NA^(row_number() != 1))* sum(Units, na.rm = TRUE))

# A tibble: 15 x 4
# Groups:   grp [3]
   Elements           Units   grp   Sum
   <chr>              <dbl> <int> <dbl>
 1 Project name ONE      NA     1    14
 2 John                   3     1    NA
 3 Smith                  5     1    NA
 4 Sara                   6     1    NA
 5 Project name TWO      NA     2   101
 6 stardust              21     2    NA
 7 soil                  19     2    NA
 8 sunflower             31     2    NA
 9 juice                 24     2    NA
10 doe                    1     2    NA
11 tobacco                5     2    NA
12 Project name THREE    NA     3    64
13 phi                   21     3    NA
14 rho                   21     3    NA
15 omega                 22     3    NA