How to process files based on their date in Python?

I have two kinds of files: xml files and txt files. Each file has a date in its name. If the date of an xml file matches the date of a txt file, I want to open the txt file, do some processing, and write the output to a list. After that I want to change the xml file. Multiple xml files can have the same date, but each txt file has a unique date, so more than one xml file can be linked to the same txt file.

Right now I have a problem: my to_csv list contains data for both 20200907 and 20201025. I don't want it to work like that. I want my to_csv list to handle just one file (and thus one date) at a time.

import csv
import os
import re

output_xml = r"c:\desktop\energy\XML_Output"
output_txt = r"c:\desktop\energy\TXT_Output"

xml_name = os.listdir(output_xml)
txt_name = os.listdir(output_txt)
txt_name = [x.replace('-', '') for x in txt_name]  # remove the - in the filenames

# Extract the date from the xml and txt files.
xml_dates = []
for file in xml_name:
    find = re.search(r"_(.\d+)-", file).group(1)
    xml_dates.append(find)

txt_dates = []
for file in txt_name:
    find = re.search(r"MM(.+?)AB", file).group(1)
    txt_dates.append(find)

# THIS IS SOME REPRODUCIBLE OUTPUT FROM WHAT IS RECEIVED FROM THE SNIPPET ABOVE.
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

to_csv = []

for date_xml in xml_dates:
    for date_txt in txt_dates:
        if date_xml == date_txt:

            match_txt = [s for s in txt_name if date_txt in s]  # matching txt file
            match_xml = [s for s in xml_name if date_xml in s]  # matching xml file

            # Put the hyphens back to recover the real txt file name.
            match_txt_temp = match_txt[0]
            match_txt_score = [match_txt_temp[:6] + '-' + match_txt_temp[6:8] + '-' + match_txt_temp[8:10] + '-' + match_txt_temp[10:12] + match_txt_temp[12:]]

            with open(output_txt + "/" + match_txt_score[0], "r") as outer:
                reader = csv.reader(outer, delimiter="\t")

                for row in reader:
                    read = [row for row in reader if row]
                    for row in read:

                        # csv.reader yields strings, so convert before comparing
                        energy_level = float(row[20])

                        if energy_level > 250:
                            to_csv.append(row)

print(to_csv)

Current output:

[['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20201025', '4', '5'],
['1', '2', '3', '20201025', '4', '5']]

Desired output:

[[['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5']],
[['1', '2', '3', '20201025', '4', '5'],
['1', '2', '3', '20201025', '4', '5']]]

CodePudding user response：

You said that there is only one txt file per date and that you only want to process xml files that are linked to a txt file. That means a single loop over txt_dates is enough:

...
for date_txt in txt_dates:
    date_xml = date_txt

    match_txt = [s for s in txt_name if date_txt in s]  # the matching txt file
    match_xml = [s for s in xml_name if date_xml in s]  # possible matching xml files
    if len(match_xml) == 0:   # no matching xml files
        continue

    # put the hyphens back to recover the real txt file name
    match_txt_temp = match_txt[0]
    match_txt_score = [match_txt_temp[:6] + '-' + match_txt_temp[6:8] + '-'
                       + match_txt_temp[8:10] + '-' + match_txt_temp[10:12]
                       + match_txt_temp[12:]]

    # prepare a new list for that date
    curr = list()

    with open(output_txt + "/" + match_txt_score[0], "r") as outer:
        reader = csv.reader(outer, delimiter="\t")

        for row in reader:
            read = [row for row in reader if row]
            for row in read:
                # csv.reader yields strings, so convert before comparing
                energy_level = float(row[20])
                if energy_level > 250:
                    curr.append(row)

    if len(curr) > 0:    # if the current date list is not empty append it
        to_csv.append(curr)

print(to_csv)

BEWARE: since what you have provided is not a reproducible example, I could not test the code above, so typos are possible...
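
To illustrate the grouping without touching the file system, here is a minimal sketch that replaces the txt files with in-memory rows. The rows_by_date dictionary, the ENERGY_COL index, and the sample values are assumptions made up for the example, not data from the question. Since csv.reader yields strings, the sketch also converts the energy value with float before comparing it.

```python
# Hypothetical stand-in for the contents of each txt file, keyed by date.
rows_by_date = {
    "20200907": [["a", "20200907", "300"], ["b", "20200907", "100"]],
    "20201025": [["c", "20201025", "260"], ["d", "20201025", "270"]],
}
xml_dates = ["20200907", "20200908", "20201025", "20201025"]
txt_dates = ["20200907", "20201025"]

ENERGY_COL = 2   # assumed position of the energy level (index 20 in the question)
THRESHOLD = 250

to_csv = []
for date_txt in txt_dates:
    if date_txt not in xml_dates:   # no xml file is linked to this txt file
        continue
    # One fresh list per date keeps the dates separated in the result.
    curr = [row for row in rows_by_date[date_txt]
            if float(row[ENERGY_COL]) > THRESHOLD]
    if curr:
        to_csv.append(curr)

print(to_csv)
```

Each element of to_csv is then the list of rows for exactly one date, which is the nesting shown in the desired output.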

CodePudding user response：

You could append the rows to a dictionary instead of a list, keyed by date, so that the rows for each date stay separate. After parsing the files you can build whatever list structure you want from the dictionary.

xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

to_csv = {'20200907': [], '20201025': []}

for date_xml in xml_dates:
    for date_txt in txt_dates:
        if date_xml == date_txt:
            # output_t2m and match_t2m_score are not defined in this snippet;
            # they presumably correspond to output_txt and match_txt_score
            # in the question.
            with open(output_t2m + "/" + match_t2m_score[0], "r") as outer:
                reader = csv.reader(outer, delimiter="\t")

                for row in reader:
                    read = [row for row in reader if row]
                    for row in read:

                        # csv.reader yields strings, so convert before comparing
                        energy_level = float(row[20])

                        if energy_level > 250:
                            to_csv[date_txt].append(row)

final_csv = [to_csv['20200907'], to_csv['20201025']]
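
If the dates are not known in advance, a collections.defaultdict avoids listing the keys by hand. The following is only a sketch: the parsed list of (date, row) pairs is a made-up stand-in for the rows that would come out of the txt files.

```python
from collections import defaultdict

# Hypothetical (date, row) pairs standing in for rows read from the txt files.
parsed = [
    ("20201025", ["c", "260"]),
    ("20200907", ["a", "300"]),
    ("20201025", ["d", "270"]),
]

to_csv = defaultdict(list)   # a date seen for the first time starts as an empty list
for date, row in parsed:
    to_csv[date].append(row)

# Build the nested list in date order without naming each date explicitly.
# YYYYMMDD strings sort chronologically, so a plain sort is enough here.
final_csv = [to_csv[date] for date in sorted(to_csv)]
print(final_csv)
```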

CodePudding user response：

Based on your update and on this comment, I can tell you that the following would be equivalent to what you're trying to do. It doesn't seem very useful, though, because you're just duplicating the contents of the CSV files for each XML file with a matching date:

import csv
import os
import re
from collections import defaultdict

# Count how many XML files exist for each date
xml_file_re = re.compile(r'_(.\d+)-')
xml_dates = defaultdict(int)
for filename in os.listdir(output_xml):
    if m := xml_file_re.search(filename):
        xml_dates[m.group(1)] += 1

txt_file_re = re.compile(r'MM(.+?)AB')
csv_by_date = []

for filename in os.listdir(output_txt):
    # The question strips '-' from the txt file names before extracting the date
    if not (m := txt_file_re.search(filename.replace('-', ''))):
        continue

    date = m.group(1)

    if date not in xml_dates:
        continue

    with open(os.path.join(output_txt, filename)) as fobj:
        reader = csv.reader(fobj, delimiter='\t')
        # Take only rows with energy_level > 250 (csv yields strings, so convert)
        rows = [row for row in reader if row and float(row[20]) > 250]
        # Make a list of copies of the rows for each matching XML file
        # Here we make duplicates of the rows just to be on the safe side...
        copies = [rows[:] for _ in range(xml_dates[date])]
        csv_by_date.append(copies)

Again, this is equivalent to what you seem to want to do, but I'm not sure what it accomplishes for you (especially where the XML files come in...)
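
The counting step can be tried on its own with collections.Counter. This is a sketch under assumptions: the file names below are invented, and the pattern assumes an 8-digit date between "_" and "-", which may not match your real naming scheme.

```python
import re
from collections import Counter

# Hypothetical file names; the real naming scheme is not shown in the question.
xml_files = [
    "out_20200907-1.xml",
    "out_20201025-1.xml",
    "out_20201025-2.xml",
    "notes.xml",          # no date, so it is ignored
]

xml_file_re = re.compile(r"_(\d{8})-")   # assumed: 8-digit date between "_" and "-"

# Count the XML files per date, skipping names the pattern does not match.
xml_counts = Counter(
    m.group(1) for name in xml_files if (m := xml_file_re.search(name))
)
print(xml_counts)
```

A date that has no XML file simply does not appear as a key, so a membership test such as date in xml_counts is enough to decide whether a txt file should be processed.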

CodePudding user response：

If you loop through txt_dates first, then I think you can achieve your desired output. You can see that the dates in txt_dates are grouped together, so you get one date at a time.

xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

# sample csv data for each xml_date since we don't have the actual file contents
xml_rows = {
    "20200907": ["a,b,c", "1,2,3", "10,11,12"],
    "20201025": ["a,c,b", "7,8,9"]
}

to_csv = []

for date_txt in txt_dates:
    # Filter xml_dates for those that match the current date_txt
    xml_matches = [xd for xd in xml_dates if xd == date_txt]
    print("txt date:", date_txt)
    for date_xml in xml_matches:
        print("    ", date_xml, end=" ")
        # simulate rows in a csv file
        file_rows = [row.split(",") + [date_xml] for row in xml_rows[date_xml]]
        to_csv.append(file_rows)
    print()

print(to_csv)

Result:

txt date: 20200907
     20200907
txt date: 20201025
     20201025      20201025      20201025      20201025
[[['a', 'b', 'c', '20200907'],
  ['1', '2', '3', '20200907'],
  ['10', '11', '12', '20200907']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']]]

Edit: explanation of the file_rows line

file_rows = [row.split(",") + [date_xml] for row in xml_rows[date_xml]]

This is a list comprehension. The idea is to mimic processing a csv file. The xml_rows[date_xml] is a list such as what might be created with

xml_rows = {}
date_xml = "2021-09-27"
with open("data.csv") as fd:
   xml_rows[date_xml] = [line.strip() for line in fd]

where data.csv contains

a,b,c
1,2,3
10,11,12

Note that there are more robust ways to process csv files, e.g. by using the csv library.
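
As a rough sketch of that more robust route, the same rows can be read with csv.reader. Here io.StringIO is an in-memory stand-in for data.csv, used only so the snippet runs without a file on disk.

```python
import csv
import io

# In-memory stand-in for data.csv so the sketch needs no file on disk.
fd = io.StringIO("a,b,c\n1,2,3\n10,11,12\n")

date_xml = "2021-09-27"
# csv.reader splits each line into a list of strings and also handles quoting;
# the "if row" filter drops blank lines.
rows = [row + [date_xml] for row in csv.reader(fd) if row]
print(rows)
```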

Given the xml_rows[date_xml] list, the script then splits each row by comma with row.split(","). If we did

for row in xml_rows[date_xml]:
    print(row, "=>", row.split(","))

The output would be

a,b,c => ['a', 'b', 'c']
1,2,3 => ['1', '2', '3']
10,11,12 => ['10', '11', '12']

To make it clear where these lines came from, I appended the date using row.split(",") + [date_xml] so the list would have the date that was being processed. So

for row in xml_rows[date_xml]:
    print(row, "=>", row.split(",") + ["2021-09-27"])

would produce

a,b,c => ['a', 'b', 'c', '2021-09-27']
1,2,3 => ['1', '2', '3', '2021-09-27']
10,11,12 => ['10', '11', '12', '2021-09-27']

Since I was mocking the input data anyway, this all might have been clearer with

xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

# sample csv data for each xml_date since we don't have the actual file contents
xml_rows = {
    "20200907": [["a", "b", "c"], ["1", "2", "3"], ["10", "11", "12"]],
    "20201025": [["a", "c", "b"], ["7", "8", "9"]]
}

to_csv = []

for date_txt in txt_dates:
    # Filter xml_dates for those that match the current date_txt
    xml_matches = [xd for xd in xml_dates if xd == date_txt]
    print("txt date:", date_txt)
    for date_xml in xml_matches:
        print("    ", date_xml, end=" ")
        # Append the date currently being processed to the row
        file_rows = [row + [date_xml] for row in xml_rows[date_xml]]
        to_csv.append(file_rows)
    print()

print(to_csv)