Home > Software design >  How to create a new column based on an untidy data?
How to create a new column based on an untidy data?

Time:12-07

Im working with a dataset where some of the table elements are mixed. Ideally, rows from the location column containing the word "District", i.e., "District 1", "District 2" etc.., should be in a separate column from the location column.

A snippet of the dataset looks like this:

location     iri length
   <chr>      <dbl>  <dbl>
 1 District 1 NA        NA
 2 S00180CB    4.60  32171
 3 S00186CB    5.17  31286
 4 S00188CB    6.02   4742
 5 District 2 NA        NA
 6 S00136CB    4.91   5968
 7 S00285CB    4.33  29340
 8 S00288CB   10.9     141
 9 S00289CB    4.41   9126
10 District 3 NA        NA
11 S00231CB    4.34   6895
12 S00266CB    5.65  18985
13 S00381CB    4.39  13799
14 S00382CB    8.96    124
15 District 4 NA        NA
16 S00303CB    4.11  79599
17 S00311CB    3.17    625
18 District 5 NA        NA
19 S00163CB    3.49  17996
20 S00253CB    2.65    905
21 S00259CB    2.26    905

Code i tried:

library(tidyverse)

df |> 
  mutate(district = ifelse(str_detect(location, "District"), 
                           rep_len("District 1", length.out = n()), "none")

I expected it to come out like this:

district   location   iri length
   <chr>      <chr>    <dbl>  <dbl>
 1 District 1 S00180CB  4.60  32171
 2 District 1 S00186CB  5.17  31286
 3 District 1 S00188CB  6.02   4742
 4 District 2 S00136CB  4.91   5968
 5 District 2 S00285CB  4.33  29340
 6 District 2 S00288CB 10.9     141
 7 District 2 S00289CB  4.41   9126
 8 District 3 S00231CB  4.34   6895
 9 District 3 S00266CB  5.65  18985
10 District 3 S00381CB  4.39  13799
11 District 3 S00382CB  8.96    124
12 District 4 S00303CB  4.11  79599
13 District 4 S00311CB  3.17    625
14 District 5 S00163CB  3.49  17996
15 District 5 S00253CB  2.65    905
16 District 5 S00259CB  2.26    905

but the result did not come out as expected.

Is there a more efficient way to accomplish this?

CodePudding user response:

You could use mutate to create the District column, then fill to replace the missing values, then omit any rows where values are NA.

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(District = ifelse(str_detect(location, "District"), location, NA)) %>%
  fill(District) %>% 
  na.omit()

Result:

   location   iri length   District
2  S00180CB  4.60  32171 District 1
3  S00186CB  5.17  31286 District 1
4  S00188CB  6.02   4742 District 1
6  S00136CB  4.91   5968 District 2
7  S00285CB  4.33  29340 District 2
8  S00288CB 10.90    141 District 2
9  S00289CB  4.41   9126 District 2
11 S00231CB  4.34   6895 District 3
12 S00266CB  5.65  18985 District 3
13 S00381CB  4.39  13799 District 3
14 S00382CB  8.96    124 District 3
16 S00303CB  4.11  79599 District 4
17 S00311CB  3.17    625 District 4
19 S00163CB  3.49  17996 District 5
20 S00253CB  2.65    905 District 5
21 S00259CB  2.26    905 District 5

The data. Note: named df1 not df, which is bad practice.

df1 <- structure(list(location = c("District 1", "S00180CB", "S00186CB", 
"S00188CB", "District 2", "S00136CB", "S00285CB", "S00288CB", 
"S00289CB", "District 3", "S00231CB", "S00266CB", "S00381CB", 
"S00382CB", "District 4", "S00303CB", "S00311CB", "District 5", 
"S00163CB", "S00253CB", "S00259CB"), iri = c(NA, 4.6, 5.17, 6.02, 
NA, 4.91, 4.33, 10.9, 4.41, NA, 4.34, 5.65, 4.39, 8.96, NA, 4.11, 
3.17, NA, 3.49, 2.65, 2.26), length = c(NA, 32171L, 31286L, 4742L, 
NA, 5968L, 29340L, 141L, 9126L, NA, 6895L, 18985L, 13799L, 124L, 
NA, 79599L, 625L, NA, 17996L, 905L, 905L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21"))

CodePudding user response:

Here is an alternative dplyr way with grouping and filtering:

library(dplyr)

df %>% 
  group_by(group = cumsum(grepl('District', location))) %>% 
  mutate(District = first(location)) %>% 
  filter(!grepl('District', location)) %>% 
  ungroup() %>% 
  select(-group)
location   iri length District  
   <chr>    <dbl>  <int> <chr>     
 1 S00180CB  4.6   32171 District 1
 2 S00186CB  5.17  31286 District 1
 3 S00188CB  6.02   4742 District 1
 4 S00136CB  4.91   5968 District 2
 5 S00285CB  4.33  29340 District 2
 6 S00288CB 10.9     141 District 2
 7 S00289CB  4.41   9126 District 2
 8 S00231CB  4.34   6895 District 3
 9 S00266CB  5.65  18985 District 3
10 S00381CB  4.39  13799 District 3
11 S00382CB  8.96    124 District 3
12 S00303CB  4.11  79599 District 4
13 S00311CB  3.17    625 District 4
14 S00163CB  3.49  17996 District 5
15 S00253CB  2.65    905 District 5
16 S00259CB  2.26    905 District 5
  • Related