Given a vector of file paths as follows, where each file name contains three parts separated by _:
file_paths <- c("./test/foo_foo1foo2_2021-01.txt", "./test/foo_foo2foo1_2021-01.txt",
"./test/bar_bar2bar3_2021-01.txt", "./test/bar_bar3bar4_2021-01.txt",
"./test/bar_bar3bar2_2021-01.txt")
I hope to drop duplicated files when the second parts are identical, i.e. we consider foo1foo2 and foo2foo1, and likewise bar2bar3 and bar3bar2, to be duplicated files. Thus the final expected result will look like file_paths_filtered:
file_paths_filtered <- c("./test/foo_foo1foo2_2021-01.txt", "./test/bar_bar2bar3_2021-01.txt", "./test/bar_bar3bar4_2021-01.txt")
How could I achieve that using R? Thanks a lot in advance.
EDITS:
file_paths <- c('../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx',
                '../raw_data/steel/steel_coke_spread_coke01coke09_2022-01-07.xlsx',
                '../raw_data/steel/steel_coke_spread_coke05coke01_2022-01-07.xlsx')
Test with the base R sub()/match() approach:
fp1 <- file_paths
fp2 <- file.path(dirname(fp1),
                 sub("^([^_]*)_([^0-9_]+[0-9]+)([^0-9_]+[0-9]+)_(.*)$",
                     "\\1_\\2_\\3_\\4\\5_\\6", basename(fp1)))
fp1[seq_along(fp1) <= match(fp1, fp2, length(fp1))]
Out:
[1] "../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx"
[2] "../raw_data/steel/steel_coke_spread_coke01coke09_2022-01-07.xlsx"
[3] "../raw_data/steel/steel_coke_spread_coke05coke01_2022-01-07.xlsx"
Test with @Michel Dewar's solution:
library(tidyverse)

out2 <- data.frame(file_paths) %>%
  separate(file_paths, c('dir', 'file', 'ext'), sep = '_', remove = FALSE) %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  distinct(f1, f2, .keep_all = TRUE) %>%
  pull(file_paths)
print(out2)
Out:
Warning messages:
1: Expected 3 pieces. Additional pieces discarded in 3 rows [1, 2, 3].
2: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
> print(out2)
[1] "../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx"
CodePudding user response:
Here is an approach using base R only. It uses sub to reverse the first and second components of the substring between the underscores, then match to identify file paths that are not duplicates (by your definition).
fp1 <- file_paths
## earlier version, which assumed the paired component sits right after the first underscore:
## fp2 <- file.path(dirname(fp1), sub("^([^_]*)_([^0-9_]+[0-9]+)([^0-9_]+[0-9]+)_(.*)$", "\\1_\\3\\2_\\4", basename(fp1)))
## build a "swapped" counterpart of every path by reversing the two letter+digit halves
fp2 <- file.path(dirname(fp1), sub("^(.*)_([^0-9_]+[0-9]+)([^0-9_]+[0-9]+)_([^_]*)$", "\\1_\\3\\2_\\4", basename(fp1)))
## keep a path only if it comes no later than its swapped counterpart
fp1[seq_along(fp1) <= match(fp1, fp2, length(fp1))]
[1] "./test/foo_foo1foo2_2021-01.txt"
[2] "./test/bar_bar2bar3_2021-01.txt"
[3] "./test/bar_bar3bar4_2021-01.txt"
CodePudding user response:
library(tidyverse)

data.frame(file_paths) %>%
  separate(file_paths, c('dir', 'file', 'ext'), sep = '_') %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  unite('file') %>%
  distinct()
file
1 ./test/foo_foo1_foo2_2021-01.txt
2 ./test/bar_bar2_bar3_2021-01.txt
3 ./test/bar_bar3_bar4_2021-01.txt
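The key step is that pmin() and pmax() also work elementwise on character vectors, so the two halves of each name are put into a fixed alphabetical order and the swapped pairs collapse onto the same row before distinct(). A small illustration:
pmin(c("foo2", "bar3"), c("foo1", "bar2"))
## [1] "foo1" "bar2"
pmax(c("foo2", "bar3"), c("foo1", "bar2"))
## [1] "foo2" "bar3"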
CodePudding user response:
A slight variation on @Onyambu's answer that keeps the original file names:
library(tidyverse)

file_paths <- c("./test/foo_foo1foo2_2021-01.txt", "./test/foo_foo2foo1_2021-01.txt",
                "./test/bar_bar2bar3_2021-01.txt", "./test/bar_bar3bar4_2021-01.txt",
                "./test/bar_bar3bar2_2021-01.txt")

data.frame(file_paths) %>%
  separate(file_paths, c('dir', 'file', 'ext'), sep = '_', remove = FALSE) %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  distinct(f1, f2, .keep_all = TRUE) %>%
  pull(file_paths)
For your updated example you could use:
file_paths2 <- c('../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx',
                 '../raw_data/steel/steel_coke_spread_coke01coke09_2022-01-07.xlsx',
                 '../raw_data/steel/steel_coke_spread_coke05coke01_2022-01-07.xlsx')

data.frame(file_paths2) %>%
  separate(file_paths2, c('a', 'b', 'c', 'd', 'file', 'e'), sep = '_', remove = FALSE) %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  distinct(f1, f2, .keep_all = TRUE) %>%
  pull(file_paths2)
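Assuming the steel names all follow the ..._cokeXXcokeYY_<date> pattern, this should return only the coke01coke05 and coke01coke09 paths, with the coke05coke01 duplicate dropped.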
But your updated filenames are very different from your original filenames. It is impossible to have a method that will work for every conceivable filename structure. Whether you use regular expressions, or separate, or fixed positions, or something else, your final method will have to be tailored to the actual filenames you have.