I have two directories. Both of them contain TXT files. For every TXT file in dir_1, I'm trying to get the matching TXT file in dir_2. There is a match if the date in a dir_1 file name matches the date in a dir_2 file name.
dir_1 = c:\data\desktop\pointA\txt-files
dir_2 = c:\data\desktop\pointB\txt-files
dir_1 contains files like this (this directory can contain multiple files with the same date but different times):
AAA_C1_20200907-07u43m55s_ab.txt
AAA_T8_20200907-08u18m55s_ab.txt
FFF_T8_20201025-12u58m55s_ab.txt
FFF_T8_20201025-05u53m55s_ab.txt
dir_2 contains files like this (each file has a unique date):
MM2020-09-07-AB.dat
MM2020-10-25-AB.dat
MM2020-10-26-AB.dat
This is my code:
dir_1 = os.listdir(dir_1)
dir_2 = os.listdir(dir_2)
dir_2_dates = []
for file in dir_2:
    find = re.search("MM(.*?)AB", file).group(1)
    dir_2_dates.append(find)
for dir_1_file in dir_1:
    dir_1_date = re.search("_(.\d+)-", dir_1_date ).group(1)
    if dir_1_date in dir_2_dates:
        matching_dir_2 = [file for file in dir_2 if file == dir_1_date ]
        print(matching_dir_2)
Desired output:
MM20200907AB.dat
MM20200907AB.dat
MM20201025AB.dat
MM20201025AB.dat
My current output is empty
CodePudding user response:
There are a few issues. First, your date pattern for dir_2 leaves a trailing hyphen in the captured date, so the pattern here excludes it. Next, you are comparing dates that contain hyphens with dates that do not, so this strips the hyphens. Last, I don't see why there is a comprehension in the last for loop. Here is a working implementation, but I'm unsure whether it is what you're after, since your desired output lists four file names that don't exist in either list.
import re
DIR1 = [
"AAA_C1_20200907-07u43m55s_ab.txt",
"AAA_T8_20200907-08u18m55s_ab.txt",
"FFF_T8_20201025-12u58m55s_ab.txt",
"FFF_T8_20201025-05u53m55s_ab.txt",
]
DIR2 = [
"MM2020-09-07-AB.dat",
"MM2020-10-25-AB.dat",
"MM2020-10-26-AB.dat",
]
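dir_2_dates = []
for file in DIR2:
    # Capture the date without the trailing hyphen, then drop the inner hyphens.
    find = re.search(r"MM(.*?)-AB", file).group(1).replace("-", "")
    dir_2_dates.append(find)
# Keep the DIR1 names whose date appears among the DIR2 dates.
print([x for x in DIR1 if re.search(r"_(.\d+)-", x).group(1) in dir_2_dates])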
output:
[
'AAA_C1_20200907-07u43m55s_ab.txt',
'AAA_T8_20200907-08u18m55s_ab.txt',
'FFF_T8_20201025-12u58m55s_ab.txt',
'FFF_T8_20201025-05u53m55s_ab.txt'
]
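To see why the original pattern never produced a match, it can help to print what each capture group actually returns. The snippet below is only a small sketch using two of the sample names from the question; the variable names (loose, tight, normalized) are made up for illustration.

```python
import re

dir_2_name = "MM2020-09-07-AB.dat"
dir_1_name = "AAA_C1_20200907-07u43m55s_ab.txt"

# Pattern from the question: the lazy group runs up to "AB",
# so it keeps the hyphen that sits right before "AB".
loose = re.search(r"MM(.*?)AB", dir_2_name).group(1)

# Anchoring on "-AB" leaves that hyphen out of the group;
# replace() then removes the hyphens inside the date.
tight = re.search(r"MM(.*?)-AB", dir_2_name).group(1)
normalized = tight.replace("-", "")

# dir_1 pattern: "_", any one character, one or more digits, then "-".
dir_1_date = re.search(r"_(.\d+)-", dir_1_name).group(1)

print(loose)       # 2020-09-07-
print(tight)       # 2020-09-07
print(normalized)  # 20200907
print(dir_1_date)  # 20200907
```

Only the normalized form has the same shape as the date taken from the dir_1 name, which is why the two can be compared directly.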
CodePudding user response:
I'd suggest first grouping the filenames from dir_2 in a dictionary, using the date as the key. This way, you do not have to search the entire dir_2 again for each file from dir_1. Then you can just look up the date of each file from dir_1 in that dict. Make sure to normalize the dates, though, so that they have the same format, i.e. remove the - from the dates in dir_2:
import re, collections
# dir_1 and dir_2 are the lists of file names, e.g. from os.listdir
by_date = collections.defaultdict(list)
for f in dir_2:
    d = re.search(r"MM(.*?)-AB", f).group(1).replace("-", "")
    by_date[d].append(f)
for f in dir_1:
    d = re.search(r"_(.\d+)-", f).group(1)
    if d in by_date:
        print(f, "->", by_date[d])
This does not use a list comprehension, but should be considerably faster.
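As a variation on the dictionary idea, the dates could also be parsed into real date objects instead of being compared as strings. The following is a rough sketch, not a drop-in solution: the helper names (dir_1_date, dir_2_date, match_by_date) are hypothetical, and the stricter patterns assume that every name follows exactly the formats shown in the question.

```python
import re
from collections import defaultdict
from datetime import datetime


def dir_1_date(name):
    """Date from a name like AAA_C1_20200907-07u43m55s_ab.txt, or None."""
    m = re.search(r"_(\d{8})-", name)
    return datetime.strptime(m.group(1), "%Y%m%d").date() if m else None


def dir_2_date(name):
    """Date from a name like MM2020-09-07-AB.dat, or None."""
    m = re.search(r"MM(\d{4}-\d{2}-\d{2})-AB", name)
    return datetime.strptime(m.group(1), "%Y-%m-%d").date() if m else None


def match_by_date(dir_1_names, dir_2_names):
    """Map each dir_1 name to the list of dir_2 names with the same date."""
    by_date = defaultdict(list)
    for name in dir_2_names:
        d = dir_2_date(name)
        if d is not None:
            by_date[d].append(name)
    # .get() does not create new keys in the defaultdict.
    return {name: by_date.get(dir_1_date(name), []) for name in dir_1_names}


# Sample names taken from the question.
dir_1_names = [
    "AAA_C1_20200907-07u43m55s_ab.txt",
    "AAA_T8_20200907-08u18m55s_ab.txt",
    "FFF_T8_20201025-12u58m55s_ab.txt",
    "FFF_T8_20201025-05u53m55s_ab.txt",
]
dir_2_names = [
    "MM2020-09-07-AB.dat",
    "MM2020-10-25-AB.dat",
    "MM2020-10-26-AB.dat",
]

matches = match_by_date(dir_1_names, dir_2_names)
for name, found in matches.items():
    print(name, "->", found)
```

Names that do not fit the expected pattern simply map to an empty list here. A name that fits the pattern but holds an impossible date (month 13, say) would make strptime raise a ValueError, so that case may need extra handling.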
CodePudding user response:
You are doing
[file for file in dir_2 if file == dir_1_date ]
where file is the name of a file in dir_2, i.e.:
MM2020-09-07-AB.dat
MM2020-10-25-AB.dat
MM2020-10-26-AB.dat
and comparing it (==) to dir_1_date, which is
dir_1_date = re.search("_(.\d+)-", dir_1_date ).group(1)
Therefore dir_1_date is any character followed by one or more digits. The two can never be equal, because the file names listed above consist of letters, digits, - and .
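One way to keep the shape of the original loop while comparing like with like is to extract the date from each dir_2 name before testing for equality. This is only a sketch under the assumption that all names follow the patterns in the question; dir_2_date is a hypothetical helper, and the loop reads the date from dir_1_file rather than from dir_1_date.

```python
import re

dir_1 = [
    "AAA_C1_20200907-07u43m55s_ab.txt",
    "AAA_T8_20200907-08u18m55s_ab.txt",
    "FFF_T8_20201025-12u58m55s_ab.txt",
    "FFF_T8_20201025-05u53m55s_ab.txt",
]
dir_2 = [
    "MM2020-09-07-AB.dat",
    "MM2020-10-25-AB.dat",
    "MM2020-10-26-AB.dat",
]


def dir_2_date(name):
    """Date part of a dir_2 name with the hyphens removed, e.g. 20200907."""
    return re.search(r"MM(.*?)-AB", name).group(1).replace("-", "")


# The comparison from the question: a whole file name against a bare date.
print("MM2020-09-07-AB.dat" == "20200907")  # False

results = []
for dir_1_file in dir_1:
    dir_1_date = re.search(r"_(.\d+)-", dir_1_file).group(1)
    # Compare date with date instead of file name with date.
    matching_dir_2 = [f for f in dir_2 if dir_2_date(f) == dir_1_date]
    results.append(matching_dir_2)
    print(matching_dir_2)
```

With the sample names this prints one single-element list per dir_1 file, naming the dir_2 file that has the same date.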