Extract min and max information by sequential similar parts of data frame in R

I have a data frame that describes the path taken by a river in terms of elevation and distance. I need to evaluate each consecutive stretch of ground cover the river crosses and extract the minimum and maximum distance and elevation for it.

Example:

df = data.frame(Soil = c("Forest", "Forest",
               "Grass", "Grass","Grass",
               "Scrub", "Scrub","Scrub","Scrub",
               "Grass", "Grass","Grass","Grass",
               "Forest","Forest","Forest","Forest","Forest","Forest"),
      Distance = c(1, 5, 
                   10, 15, 56,
                   59, 67, 89, 99,
                   102, 105, 130, 139,
                   143, 145, 167, 189, 190, 230),
      Elevation = c(1500, 1499,
                    1470, 1467, 1456,
                    1450, 1445, 1440, 1435,
                    1430, 1420, 1412, 1400,
                    1390, 1387, 1384, 1380, 1376, 1370))

     Soil Distance Elevation
1  Forest        1      1500
2  Forest        5      1499
3   Grass       10      1470
4   Grass       15      1467
5   Grass       56      1456
6   Scrub       59      1450
7   Scrub       67      1445
8   Scrub       89      1440
9   Scrub       99      1435
10  Grass      102      1430
11  Grass      105      1420
12  Grass      130      1412
13  Grass      139      1400
14 Forest      143      1390
15 Forest      145      1387
16 Forest      167      1384
17 Forest      189      1380
18 Forest      190      1376
19 Forest      230      1370

But I need something like this:

    Soil Distance.Min Distance.Max Elevation.Min Elevation.Max
1 Forest            1            5          1499          1500
2  Grass           10           56          1456          1470
3  Scrub           59           99          1435          1450
4  Grass          102          139          1400          1430
5 Forest          143          230          1370          1390

I tried to use group_by() and which.min(Soil), but that considers the whole df at once rather than each consecutive path separately.

CodePudding user response：

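We need run-length encoding to give each run of consecutive identical Soil values its own group id, so that the two separate Grass stretches (and the two Forest stretches) are not merged.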

Using this function (fashioned to mimic data.table::rleid):

myrleid <- function (x) {
    r <- rle(x)
    rep(seq_along(r$lengths), times = r$lengths)
}
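
To see what the helper returns, here is a small sketch on a toy vector (the vector `x` is made up for illustration and is not part of the question's data). It uses only base R's `rle()`. Note that the second "Grass" run gets its own id instead of sharing one with the first.

```r
# Same helper as above, repeated so the snippet runs on its own
myrleid <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Toy vector, invented for illustration
x <- c("Forest", "Forest", "Grass", "Scrub", "Scrub", "Grass")

r <- rle(x)
r$values   # "Forest" "Grass" "Scrub" "Grass"
r$lengths  # 2 1 2 1

ids <- myrleid(x)
ids        # 1 1 2 3 3 4
```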

We can do

library(dplyr)

df %>%
  # one group per run of consecutive identical Soil values
  group_by(grp = myrleid(Soil)) %>%
  summarize(Soil = Soil[1], across(c(Distance, Elevation), list(min = min, max = max))) %>%
  select(-grp)
# # A tibble: 5 x 5
#   Soil   Distance_min Distance_max Elevation_min Elevation_max
#   <chr>         <dbl>        <dbl>         <dbl>         <dbl>
# 1 Forest            1            5          1499          1500
# 2 Grass            10           56          1456          1470
# 3 Scrub            59           99          1435          1450
# 4 Grass           102          139          1400          1430
# 5 Forest          143          230          1370          1390

CodePudding user response：

You can try this:

library(dplyr)

# tag each run of consecutive identical Soil values with an id
df = df %>% mutate(id=data.table::rleid(Soil))

inner_join(
  distinct(df %>% select(Soil,id)),  
  df %>% 
    group_by(id) %>% 
    summarize(across(Distance:Elevation, .fns = list("min" = min,"max"=max))),
  by="id"
) %>% select(!id)

Output:

    Soil Distance_min Distance_max Elevation_min Elevation_max
1 Forest            1            5          1499          1500
2  Grass           10           56          1456          1470
3  Scrub           59           99          1435          1450
4  Grass          102          139          1400          1430
5 Forest          143          230          1370          1390

Or, more concisely (thanks to r2evans):

df %>% 
  group_by(id = data.table::rleid(Soil)) %>% 
  summarize(Soil=first(Soil),across(Distance:Elevation, .fns = list("min" = min,"max"=max))) %>% 
  select(!id)
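
If neither dplyr nor data.table is available, the same idea can be sketched in base R. This is an alternative written for illustration, not the method from the answers above: `run_id` and `result` are hypothetical names, the data frame is rebuilt from the question's values so the snippet runs on its own, and `stringsAsFactors = FALSE` is set because `rle()` does not accept factors.

```r
# Data from the question, rebuilt so this snippet is self-contained
df <- data.frame(
  Soil = rep(c("Forest", "Grass", "Scrub", "Grass", "Forest"),
             times = c(2, 3, 4, 4, 6)),
  Distance = c(1, 5, 10, 15, 56, 59, 67, 89, 99,
               102, 105, 130, 139, 143, 145, 167, 189, 190, 230),
  Elevation = c(1500, 1499, 1470, 1467, 1456, 1450, 1445, 1440, 1435,
                1430, 1420, 1412, 1400, 1390, 1387, 1384, 1380, 1376, 1370),
  stringsAsFactors = FALSE
)

# One id per run of consecutive identical Soil values
r <- rle(df$Soil)
run_id <- rep(seq_along(r$lengths), times = r$lengths)

# tapply() returns one value per run id, in increasing id order,
# which matches the order of r$values
result <- data.frame(
  Soil          = r$values,
  Distance.Min  = as.vector(tapply(df$Distance,  run_id, min)),
  Distance.Max  = as.vector(tapply(df$Distance,  run_id, max)),
  Elevation.Min = as.vector(tapply(df$Elevation, run_id, min)),
  Elevation.Max = as.vector(tapply(df$Elevation, run_id, max))
)

print(result)
```

The column names here follow the dotted style from the question's desired output rather than the `_min`/`_max` suffixes that `across()` generates.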