How to properly use a list comprehension in Python?

Time:09-28

I have two directories. Both of them contain TXT files. For every TXT file in dir_1 I'm trying to get the matching file in dir_2. Two files match if the date in the dir_1 file name equals the date in the dir_2 file name.

dir_1 = c:\data\desktop\pointA\txt-files
dir_2 = c:\data\desktop\pointB\txt-files

dir_1 contains files like this (this directory can contain multiple files with the same date but different times):

AAA_C1_20200907-07u43m55s_ab.txt
AAA_T8_20200907-08u18m55s_ab.txt
FFF_T8_20201025-12u58m55s_ab.txt
FFF_T8_20201025-05u53m55s_ab.txt

dir_2 contains files like this (these files have a unique date):

MM2020-09-07-AB.dat
MM2020-10-25-AB.dat
MM2020-10-26-AB.dat

This is my code:

dir_1 = os.listdir(dir_1)
dir_2 = os.listdir(dir_2)

dir_2_dates = []
for file in dir_2:
    find = re.search("MM(.*?)AB", file).group(1)
    dir_2_dates.append(find)

for dir_1_file in dir_1:
    dir_1_date = re.search("_(.\d+)-", dir_1_date ).group(1)
    if dir_1_date in dir_2_dates:
            
            matching_dir_2 = [file for file in dir_2 if file == dir_1_date ]
            print(matching_dir_2)

Desired output:

MM20200907AB.dat
MM20200907AB.dat
MM20201025AB.dat
MM20201025AB.dat

My current output is empty.

CodePudding user response:

There are a few issues. First, your dir_2 pattern leaves a trailing hyphen on the captured date, so the pattern below moves that hyphen outside the group. Second, you are comparing dates that contain hyphens with dates that do not, so the code below strips the hyphens. Finally, I don't see why the last for loop needs a comprehension. Here is a working implementation, though I'm unsure it is what you're after, since your desired output lists four file names that don't exist in either directory.

import re

DIR1 = [
    "AAA_C1_20200907-07u43m55s_ab.txt",
    "AAA_T8_20200907-08u18m55s_ab.txt",
    "FFF_T8_20201025-12u58m55s_ab.txt",
    "FFF_T8_20201025-05u53m55s_ab.txt",
]

DIR2 = [
    "MM2020-09-07-AB.dat",
    "MM2020-10-25-AB.dat",
    "MM2020-10-26-AB.dat",
]

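# Collect the dates from DIR2, normalized to YYYYMMDD.
dir_2_dates = []
for file in DIR2:
    find = re.search(r"MM(.*?)-AB", file).group(1).replace("-", "")
    dir_2_dates.append(find)

# Keep the DIR1 files whose date also appears in DIR2.
print([x for x in DIR1 if re.search(r"_(.\d+)-", x).group(1) in dir_2_dates])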

Output:

[
    'AAA_C1_20200907-07u43m55s_ab.txt',
    'AAA_T8_20200907-08u18m55s_ab.txt',
    'FFF_T8_20201025-12u58m55s_ab.txt',
    'FFF_T8_20201025-05u53m55s_ab.txt'
]
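
To see the two issues described above in isolation, here is a small sketch that compares what each pattern captures from one dir_2 file name. The variable names loose, tight and normalized are only for illustration.

```python
import re

name = "MM2020-09-07-AB.dat"

# Pattern from the question: the lazy group runs up to "AB",
# so the hyphen just before "AB" ends up inside the capture.
loose = re.search(r"MM(.*?)AB", name).group(1)

# Pattern from this answer: the hyphen sits outside the group.
tight = re.search(r"MM(.*?)-AB", name).group(1)

# Stripping the remaining hyphens gives the same format as dir_1 dates.
normalized = tight.replace("-", "")

print(repr(loose))       # '2020-09-07-'
print(repr(tight))       # '2020-09-07'
print(repr(normalized))  # '20200907'
```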

CodePudding user response:

I'd suggest first grouping the filenames from dir_2 in a dictionary, using the date as the key. This way, you do not have to search all of dir_2 again for each file from dir_1; you just look up each dir_1 file's date in that dict. Make sure to normalize the dates to the same format, i.e. remove the - from the dates in dir_2:

import re, collections

# Map each normalized date (YYYYMMDD) to the dir_2 files carrying it.
by_date = collections.defaultdict(list)
for f in dir_2:
    d = re.search(r"MM(.*?)-AB", f).group(1).replace("-", "")
    by_date[d].append(f)

# Look up each dir_1 file's date in the dictionary.
for f in dir_1:
    d = re.search(r"_(.\d+)-", f).group(1)
    if d in by_date:
        print(f, "->", by_date[d])

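This does not use a list comprehension, but each lookup is a dictionary access instead of a scan over the whole list, so it should be considerably faster for large directories.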
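
If a list comprehension is still wanted, it can sit on top of the same kind of dictionary. This is a minimal sketch, assuming dir_1 and dir_2 hold the file names listed in the question; the helper date_in_dir_1 and the variable matches are made-up names for illustration.

```python
import re
from collections import defaultdict

dir_1 = [
    "AAA_C1_20200907-07u43m55s_ab.txt",
    "AAA_T8_20200907-08u18m55s_ab.txt",
    "FFF_T8_20201025-12u58m55s_ab.txt",
    "FFF_T8_20201025-05u53m55s_ab.txt",
]
dir_2 = ["MM2020-09-07-AB.dat", "MM2020-10-25-AB.dat", "MM2020-10-26-AB.dat"]

# Group dir_2 file names by normalized date (YYYYMMDD).
by_date = defaultdict(list)
for name in dir_2:
    date = re.search(r"MM(.*?)-AB", name).group(1).replace("-", "")
    by_date[date].append(name)


def date_in_dir_1(name):
    """Hypothetical helper: pull the YYYYMMDD date out of a dir_1 file name."""
    return re.search(r"_(.\d+)-", name).group(1)


# One dir_2 name per matching dir_1 file; .get() avoids adding empty keys.
matches = [m for f in dir_1 for m in by_date.get(date_in_dir_1(f), [])]
print(matches)
```

With the sample names this yields one dir_2 file per dir_1 file, which is close to the shape of the desired output in the question. A name that does not fit either pattern would make re.search return None, so real directories may need a guard before calling .group(1).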

CodePudding user response:

You are doing

[file for file in dir_2 if file == dir_1_date ]

where file is the name of a file in dir_2, i.e.:

MM2020-09-07-AB.dat
MM2020-10-25-AB.dat
MM2020-10-26-AB.dat

and comparing it (==) to dir_1_date, which is

dir_1_date = re.search("_(.\d+)-", dir_1_date ).group(1)

so dir_1_date is any character followed by one or more digits. The two can never be equal, because the filenames above also contain letters, - and .
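
One way to repair the comprehension is to compare the date extracted from each dir_2 name, not the whole name. This is only a sketch under that assumption: date_of is a hypothetical helper, and dir_1_date is hard-coded to one value the question's second pattern would produce.

```python
import re

dir_2 = ["MM2020-09-07-AB.dat", "MM2020-10-25-AB.dat", "MM2020-10-26-AB.dat"]
dir_1_date = "20200907"  # date taken from a dir_1 file name

# Comparison from the question: a whole file name never equals a bare date.
broken = [file for file in dir_2 if file == dir_1_date]


def date_of(name):
    """Hypothetical helper: date of a dir_2 file name as YYYYMMDD."""
    return re.search(r"MM(.*?)-AB", name).group(1).replace("-", "")


# Compare like with like: normalized date against normalized date.
fixed = [file for file in dir_2 if date_of(file) == dir_1_date]

print(broken)  # []
print(fixed)   # ['MM2020-09-07-AB.dat']
```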
