Given a vector of file paths as follows, where each file name contains three parts separated by _:
file_paths <- c("./test/foo_foo1foo2_2021-01.txt", "./test/foo_foo2foo1_2021-01.txt",
"./test/bar_bar2bar3_2021-01.txt", "./test/bar_bar3bar4_2021-01.txt",
"./test/bar_bar3bar2_2021-01.txt")
I hope to drop duplicated files when the second parts are identical, i.e. we consider foo1foo2 and foo2foo1, and likewise bar2bar3 and bar3bar2, to be duplicated files. Thus the final expected result will look like file_paths_filtered:
file_paths_filtered <- c("./test/foo_foo1foo2_2021-01.txt", "./test/bar_bar2bar3_2021-01.txt", "./test/bar_bar3bar4_2021-01.txt")
How could I achieve that using R? Thanks a lot in advance.
EDITS:
file_paths <- c('../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx',
                '../raw_data/steel/steel_coke_spread_coke01coke09_2022-01-07.xlsx',
                '../raw_data/steel/steel_coke_spread_coke05coke01_2022-01-07.xlsx')
Test with the base R sub()/match() approach:
fp1 <- file_paths
fp2 <- file.path(dirname(fp1),
                 sub("^([^_]*)_([^0-9_]+[0-9]+)([^0-9_]+[0-9]+)_(.*)$",
                     "\\1_\\2_\\3_\\4\\5_\\6", basename(fp1)))
fp1[seq_along(fp1) <= match(fp1, fp2, length(fp1))]
Out:
[1] "../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx"
[2] "../raw_data/steel/steel_coke_spread_coke01coke09_2022-01-07.xlsx"
[3] "../raw_data/steel/steel_coke_spread_coke05coke01_2022-01-07.xlsx"
Test with @Michel Dewar's solution:
library(tidyverse)

out2 <- data.frame(file_paths) %>%
  separate(file_paths, c('dir', 'file', 'ext'), sep = '_', remove = FALSE) %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  distinct(f1, f2, .keep_all = TRUE) %>%
  pull(file_paths)
print(out2)
Out:
Warning messages:
1: Expected 3 pieces. Additional pieces discarded in 3 rows [1, 2, 3].
2: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
> print(out2)
[1] "../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx"
CodePudding user response:
Here is an approach using base R only. It uses sub to reverse the first and second components of the substring between the underscores, then match to identify file paths that are not duplicates (by your definition).
fp1 <- file_paths
## earlier version, which assumed the paired component sits right after the first underscore:
## fp2 <- file.path(dirname(fp1), sub("^([^_]*)_([^0-9_]+[0-9]+)([^0-9_]+[0-9]+)_(.*)$", "\\1_\\3\\2_\\4", basename(fp1)))
## build a "swapped" counterpart of every path by reversing the two letter+digit halves
fp2 <- file.path(dirname(fp1), sub("^(.*)_([^0-9_]+[0-9]+)([^0-9_]+[0-9]+)_([^_]*)$", "\\1_\\3\\2_\\4", basename(fp1)))
## keep a path only if it comes no later than its swapped counterpart
fp1[seq_along(fp1) <= match(fp1, fp2, length(fp1))]
[1] "./test/foo_foo1foo2_2021-01.txt"
[2] "./test/bar_bar2bar3_2021-01.txt"
[3] "./test/bar_bar3bar4_2021-01.txt"
CodePudding user response:
library(tidyverse)

data.frame(file_paths) %>%
  separate(file_paths, c('dir', 'file', 'ext'), sep = '_') %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  unite('file') %>%
  distinct()
file
1 ./test/foo_foo1_foo2_2021-01.txt
2 ./test/bar_bar2_bar3_2021-01.txt
3 ./test/bar_bar3_bar4_2021-01.txt
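The key step is that pmin() and pmax() also work elementwise on character vectors, so the two halves of each name are put into a fixed alphabetical order and the swapped pairs collapse onto the same row before distinct(). A small illustration:
pmin(c("foo2", "bar3"), c("foo1", "bar2"))
## [1] "foo1" "bar2"
pmax(c("foo2", "bar3"), c("foo1", "bar2"))
## [1] "foo2" "bar3"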
CodePudding user response:
A slight variation on @Onyambu's answer that keeps the original file names:
library(tidyverse)

file_paths <- c("./test/foo_foo1foo2_2021-01.txt", "./test/foo_foo2foo1_2021-01.txt",
                "./test/bar_bar2bar3_2021-01.txt", "./test/bar_bar3bar4_2021-01.txt",
                "./test/bar_bar3bar2_2021-01.txt")

data.frame(file_paths) %>%
  separate(file_paths, c('dir', 'file', 'ext'), sep = '_', remove = FALSE) %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  distinct(f1, f2, .keep_all = TRUE) %>%
  pull(file_paths)
For your updated example you could use:
file_paths2 <- c('../raw_data/steel/steel_coke_spread_coke01coke05_2022-01-07.xlsx',
                 '../raw_data/steel/steel_coke_spread_coke01coke09_2022-01-07.xlsx',
                 '../raw_data/steel/steel_coke_spread_coke05coke01_2022-01-07.xlsx')

data.frame(file_paths2) %>%
  separate(file_paths2, c('a', 'b', 'c', 'd', 'file', 'e'), sep = '_', remove = FALSE) %>%
  separate(file, c('f1', 'f2'), sep = '(?<=\\d)(?=\\D)') %>%
  mutate(f1_new = pmin(f1, f2),
         f2 = pmax(f1, f2), f1 = f1_new, f1_new = NULL) %>%
  distinct(f1, f2, .keep_all = TRUE) %>%
  pull(file_paths2)
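Assuming the steel names all follow the ..._cokeXXcokeYY_<date> pattern, this should return only the coke01coke05 and coke01coke09 paths, with the coke05coke01 duplicate dropped.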
But your updated filenames are very different from your original filenames. It is impossible to have a method that will work for every conceivable filename structure. Whether you use regular expressions, or separate, or fixed positions, or something else, your final method will have to be tailored to the actual filenames you have.