How to select last n values per row in R?-CodePudding

I like to select the last 3 values that are not empty per row from my dataframe:

df <- structure(list(V1 = c("Johannes Gutenberg University of Mainz", 
"Eldagsener Str. 38", "Linneper Weg 1", "Gohrstraße 74", "Düppelstraße 36", 
"Blutspende: Haus A3"), V2 = c(" Gebäude 900", " 31832 Springe", 
" 40885 Ratingen", " 47475 Kamp-Lintfort", " 12163 Berlin", " Ebene -3"
), V3 = c(" Augustuspl. 4", " Germany", " Germany", " Germany", 
" Germany", " Zentrum Innere Medizin (ZIM Blutbank / Immunhämatologisches Labor Haus A1"
), V4 = c(" 55131 Mainz", "", "", "", "", " Zentrum Operative Medizin (ZOM"
), V5 = c(" Germany", "", "", "", "", " Oberdürrbacher Str. 6"
), V6 = c("", "", "", "", "", " 97080 Würzburg"), V7 = c("", 
"", "", "", "", " Germany")), row.names = 24:29, class = "data.frame")

Sometimes there is non-relevant text information at the beginning. And there are empty cells at the end. The crucial information is always the last 3 non-empty entries per row.

I'd like to go with tidyverse, but any other solution is also appreciated.

CodePudding user response：

Row wise loop - apply, remove blanks, get last 3 values:

data.frame(t(apply(df, 1, function(i){ tail(i[ i != "" ], 3) })))
#                        X1                   X2       X3
# 24          Augustuspl. 4          55131 Mainz  Germany
# 25     Eldagsener Str. 38        31832 Springe  Germany
# 26         Linneper Weg 1       40885 Ratingen  Germany
# 27          Gohrstraße 74  47475 Kamp-Lintfort  Germany
# 28        Düppelstraße 36         12163 Berlin  Germany
# 29  Oberdürrbacher Str. 6       97080 Würzburg  Germany

Note: if there is a blank in between values, it will be dropped, and the columns might not line up, for example, compare the first row:

df[1, 4] <- ""
data.frame(t(apply(df, 1, function(i){ tail(i[ i != "" ], 3)})))
#                        X1                   X2       X3
# 24            Gebäude 900        Augustuspl. 4  Germany
# 25     Eldagsener Str. 38        31832 Springe  Germany
# 26         Linneper Weg 1       40885 Ratingen  Germany
# 27          Gohrstraße 74  47475 Kamp-Lintfort  Germany
# 28        Düppelstraße 36         12163 Berlin  Germany
# 29  Oberdürrbacher Str. 6       97080 Würzburg  Germany

CodePudding user response：

library(tidyverse)

df |> 
  mutate(rn = row_number()) |> 
  pivot_longer(cols = V1:V7) |> 
  mutate(isem = value !="") |>
  filter(isem) |> 
  group_by(rn) |> 
  slice_tail(n=3) |> 
  select(-name, -isem) |> 
  mutate(rn = 1:3) |> 
  pivot_wider(names_from = rn, values_from = value) |> 
  unnest()

#> # A tibble: 6 × 3
#>   `1`                      `2`                    `3`       
#>   <chr>                    <chr>                  <chr>     
#> 1 " Augustuspl. 4"         " 55131 Mainz"         " Germany"
#> 2 "Eldagsener Str. 38"     " 31832 Springe"       " Germany"
#> 3 "Linneper Weg 1"         " 40885 Ratingen"      " Germany"
#> 4 "Gohrstraße 74"          " 47475 Kamp-Lintfort" " Germany"
#> 5 "Düppelstraße 36"        " 12163 Berlin"        " Germany"
#> 6 " Oberdürrbacher Str. 6" " 97080 Würzburg"      " Germany"

CodePudding user response：

Using tidyr and dplyr:

library(dplyr)
library(tidyr)

df %>% 
  mutate(row = row_number()) %>%
  pivot_longer(!row) %>%
  filter(value != "") %>%
  group_by(row) %>%
  slice_tail(n = 3) %>%
  mutate(name = paste0("V", 1:3)) %>%
  ungroup() %>%
  pivot_wider()

# A tibble: 6 × 4
    row V1                       V2                     V3        
  <int> <chr>                    <chr>                  <chr>     
1     1 " Augustuspl. 4"         " 55131 Mainz"         " Germany"
2     2 "Eldagsener Str. 38"     " 31832 Springe"       " Germany"
3     3 "Linneper Weg 1"         " 40885 Ratingen"      " Germany"
4     4 "Gohrstraße 74"          " 47475 Kamp-Lintfort" " Germany"
5     5 "Düppelstraße 36"        " 12163 Berlin"        " Germany"
6     6 " Oberdürrbacher Str. 6" " 97080 Würzburg"      " Germany"

CodePudding user response：

Assuming that yours last three columns are named "V5", "V6" and "V7" like in the example, you could use filter in a one line command

 filter(df,V5!= "",V6 != "",V7 != "")

and if you only need these columns you could do

df |> 
  select(V5:V7) |> 
  filter(V5!= "",V6 != "",V7 != "")