I have two sorts of files: xml files and txt files. The files have a date in their name. If the date of an xml file matches the date of a txt file, I want to open the txt file, do some processing and write the output to a list. After that I want to change the xml file. Multiple xml files can have the same date, but the txt file is unique, so more than one xml file can be linked to a single txt file.
Right now I have a problem: my to_csv list contains data of both 20200907 and 20201025. I don't want it to work like that. I want my to_csv list to handle just one file (and thus one date) at a time.
import csv
import os
import re

output_xml = r"c:\desktop\energy\XML_Output"
output_txt = r"c:\desktop\energy\TXT_Output"
xml_name = os.listdir(output_xml)
txt_name = os.listdir(output_txt)
txt_name = [x.replace('-', '') for x in txt_name]  # remove the - in the filenames
# Extract the date from the xml and txt files.
xml_dates = []
for file in xml_name:
    find = re.search(r"_(.\d+)-", file).group(1)
    xml_dates.append(find)
txt_dates = []
for file in txt_name:
    find = re.search(r"MM(.+?)AB", file).group(1)
    txt_dates.append(find)
# THIS IS SOME REPRODUCIBLE OUTPUT FROM WHAT IS RECEIVED FROM THE ABOVE SNIPPET.
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']
to_csv = []
for date_xml in xml_dates:
    for date_txt in txt_dates:
        if date_xml == date_txt:
            match_txt = [s for s in txt_name if date_txt in s]  # matching txt file
            match_xml = [s for s in xml_name if date_xml in s]  # matching xml file
            match_txt_temp = match_txt[0]
            match_txt_score = [match_txt_temp[:6] + '-' + match_txt_temp[6:8] + '-' + match_txt_temp[8:10] + '-' + match_txt_temp[10:12] + match_txt_temp[12:]]
            with open(output_txt + "/" + match_txt_score[0], "r") as outer:
                reader = csv.reader(outer, delimiter="\t")
                for row in reader:
                    read = [row for row in reader if row]
                    for row in read:
                        energy_level = float(row[20])  # csv returns strings
                        if energy_level > 250:
                            to_csv.append(row)
print(to_csv)
Current output:
[['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20201025', '4', '5'],
['1', '2', '3', '20201025', '4', '5']]
Desired output:
[[['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5'],
['1', '2', '3', '20200907', '4', '5']],
[['1', '2', '3', '20201025', '4', '5'],
['1', '2', '3', '20201025', '4', '5']]]
CodePudding user response:
You said that you have only one txt file by date and only want to process xml files if they are linked to a txt file. That means that one single loop over txt_dates is enough:
...
for date_txt in txt_dates:
    date_xml = date_txt
    match_txt = [s for s in txt_name if date_txt in s]  # the matching txt file
    match_xml = [s for s in xml_name if date_xml in s]  # possible matching xml files
    if len(match_xml) == 0:  # no matching xml files
        continue
    match_txt_temp = match_txt[0]
    match_txt_score = [match_txt_temp[:6] + '-' + match_txt_temp[6:8] + '-'
                       + match_txt_temp[8:10] + '-' + match_txt_temp[10:12]
                       + match_txt_temp[12:]]
    # prepare a new list for that date
    curr = list()
    with open(output_txt + "/" + match_txt_score[0], "r") as outer:
        reader = csv.reader(outer, delimiter="\t")
        for row in reader:
            read = [row for row in reader if row]
            for row in read:
                energy_level = float(row[20])  # csv returns strings
                if energy_level > 250:
                    curr.append(row)
    if len(curr) > 0:  # if the current date list is not empty append it
        to_csv.append(curr)
print(to_csv)
BEWARE: since what you have provided is not a reproducible example, I could not test the above code, and typos are possible...
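Since the snippet above could not be run against the real files, here is a minimal, self-contained sketch of the same one-loop idea on in-memory data. The sample_txt dictionary, the ENERGY_COL index and the THRESHOLD name are assumptions made for illustration only; the real files are read from disk and use column 20.

```python
import csv
import io

# Hypothetical stand-ins for the tab-separated txt files, keyed by date.
sample_txt = {
    "20200907": "id\tdate\tenergy\n1\t20200907\t300\n2\t20200907\t100\n3\t20200907\t251\n",
    "20201025": "id\tdate\tenergy\n4\t20201025\t500\n",
}
xml_dates = ['20200907', '20200908', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

ENERGY_COL = 2   # assumption: the real files use column 20
THRESHOLD = 250


def rows_above_threshold(text):
    """Return the non-empty data rows whose energy value exceeds THRESHOLD."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the header row
    # csv yields strings, so convert before comparing with a number
    return [row for row in reader if row and float(row[ENERGY_COL]) > THRESHOLD]


to_csv = []
for date_txt in txt_dates:
    if date_txt not in xml_dates:  # no xml file linked to this txt file
        continue
    curr = rows_above_threshold(sample_txt[date_txt])
    if curr:  # one sub-list per date
        to_csv.append(curr)

print(to_csv)
```

Each date ends up in its own sub-list, no matter how many xml files share that date.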
CodePudding user response:
You could append rows to a dictionary instead of a list, which keeps the rows separated under a key representing each date. After parsing the files, you can build whatever list composition you want from the dictionary.
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']
to_csv = {'20200907': [], '20201025': []}
for date_xml in xml_dates:
    for date_txt in txt_dates:
        if date_xml == date_txt:
            # match_txt_score is built as in the question
            with open(output_txt + "/" + match_txt_score[0], "r") as outer:
                reader = csv.reader(outer, delimiter="\t")
                for row in reader:
                    read = [row for row in reader if row]
                    for row in read:
                        energy_level = float(row[20])  # csv returns strings
                        if energy_level > 250:
                            to_csv[date_txt].append(row)
final_csv = [to_csv['20200907'], to_csv['20201025']]
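The dictionary keys do not have to be hard-coded. As a sketch, assuming each parsed row has already been paired with the date of the file it came from (the parsed_rows list below is made-up sample data), collections.defaultdict creates the per-date list on first use:

```python
from collections import defaultdict

# Made-up (date, row) pairs standing in for rows read from the txt files.
parsed_rows = [
    ("20200907", ['1', '2', '3']),
    ("20201025", ['7', '8', '9']),
    ("20200907", ['4', '5', '6']),
]

to_csv = defaultdict(list)
for date_txt, row in parsed_rows:
    to_csv[date_txt].append(row)  # the list for a new date is created on demand

# One sub-list per date, ordered by date.
final_csv = [to_csv[date] for date in sorted(to_csv)]
print(final_csv)
```

This way a new date in the input needs no change to the code.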
CodePudding user response:
Based on your update, and on this comment I can tell you that the following would be equivalent to what you're trying to do, though it doesn't seem very useful, because you're just duplicating the contents of the CSV files for each XML file with a matching date:
from collections import defaultdict

xml_file_re = re.compile(r'_(.\d+)-')
xml_dates = defaultdict(int)
for filename in os.listdir(output_xml):
    if m := xml_file_re.search(filename):
        xml_dates[m.group(1)] += 1  # count the XML files per date
txt_file_re = re.compile(r'MM(.+?)AB')
csv_by_date = []
for filename in os.listdir(output_txt):
    if not (m := txt_file_re.search(filename)):
        continue
    date = m.group(1)
    if date not in xml_dates:
        continue
    with open(os.path.join(output_txt, filename)) as fobj:
        reader = csv.reader(fobj, delimiter='\t')
        # Take only rows with energy_level > 250 (csv returns strings)
        rows = [row for row in reader if float(row[20]) > 250]
    # Make a list of copies of the rows for each matching XML file
    # Here we make duplicates of the rows just to be on the safe side...
    copies = [rows[:] for _ in range(xml_dates[date])]
    csv_by_date.append(copies)
Again, this is equivalent to what you seem to be wanting to do, but I'm not sure what it accomplishes for you (especially where the XML files come in...)
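To make the counting step concrete, here is a small sketch that applies the two patterns to in-memory file names instead of os.listdir. The file names are invented for illustration; the real naming scheme is not shown in the question, so the patterns may need adjusting.

```python
import re
from collections import Counter

# Invented file names; only the "_<date>-" and "MM<date>AB" parts matter here.
xml_names = ["run_20200907-1.xml", "run_20200908-1.xml",
             "run_20201025-1.xml", "run_20201025-2.xml"]
txt_names = ["MM20200907AB.txt", "MM20201025AB.txt", "notes.txt"]

xml_file_re = re.compile(r'_(.\d+)-')
txt_file_re = re.compile(r'MM(.+?)AB')

# Count how many XML files exist for each date.
xml_counts = Counter(
    m.group(1) for name in xml_names if (m := xml_file_re.search(name))
)

# Keep only the txt files whose date also occurs among the XML files,
# and remember how many XML files are linked to each of them.
matched = {}
for name in txt_names:
    m = txt_file_re.search(name)
    if m and m.group(1) in xml_counts:
        matched[name] = xml_counts[m.group(1)]

print(matched)
```

Files that match neither pattern, such as notes.txt here, are simply skipped.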
CodePudding user response:
If you loop through txt_dates first, then I think you can achieve your desired output. You can see that the dates in txt_dates are grouped together, so you get one date at a time.
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']
# sample csv data for each xml_date since we don't have the actual file contents
xml_rows = {
    "20200907": ["a,b,c", "1,2,3", "10,11,12"],
    "20201025": ["a,c,b", "7,8,9"]
}
to_csv = []
for date_txt in txt_dates:
    # Filter xml_dates for those that match the current date_txt
    xml_matches = [xd for xd in xml_dates if xd == date_txt]
    print("txt date:", date_txt)
    for date_xml in xml_matches:
        print(" ", date_xml, end=" ")
        # simulate rows in a csv file
        file_rows = [row.split(",") + [date_xml] for row in xml_rows[date_xml]]
        to_csv.append(file_rows)
    print()
print(to_csv)
Result:
txt date: 20200907
20200907
txt date: 20201025
20201025 20201025 20201025 20201025
[[['a', 'b', 'c', '20200907'],
['1', '2', '3', '20200907'],
['10', '11', '12', '20200907']],
[['a', 'c', 'b', '20201025'],
['7', '8', '9', '20201025']],
[['a', 'c', 'b', '20201025'],
['7', '8', '9', '20201025']],
[['a', 'c', 'b', '20201025'],
['7', '8', '9', '20201025']],
[['a', 'c', 'b', '20201025'],
['7', '8', '9', '20201025']]]
Edit: explanation of the file_rows line
file_rows = [row.split(",") + [date_xml] for row in xml_rows[date_xml]]
This is a list comprehension. The idea is to mimic processing a csv file. The xml_rows[date_xml] value is a list such as what might be created with
xml_rows = {}
date_xml = "2021-09-27"
with open("data.csv") as fd:
    xml_rows[date_xml] = [line.strip() for line in fd]
where data.csv contains
a,b,c
1,2,3
10,11,12
Note that there are more robust ways to process csv files, e.g. by using the csv library.
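As a sketch of the csv-library route, the same three lines can be parsed with csv.reader. Here io.StringIO stands in for the open file so the example runs without data.csv on disk; that substitution is an assumption for illustration only.

```python
import csv
import io

# Stand-in for open("data.csv"); the contents match the sample file above.
fd = io.StringIO("a,b,c\n1,2,3\n10,11,12\n")

date_xml = "2021-09-27"
xml_rows = {}
# csv.reader already returns each line as a list of fields,
# so no manual split(",") is needed; empty lines are dropped.
xml_rows[date_xml] = [row for row in csv.reader(fd) if row]

print(xml_rows[date_xml])
```

The walkthrough that follows keeps the plain-string rows from the original script.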
Given the xml_rows[date_xml] list, the script then splits each row by comma with row.split(","). If we did
for row in xml_rows[date_xml]:
    print(row, "=>", row.split(","))
The output would be
a,b,c => ['a', 'b', 'c']
1,2,3 => ['1', '2', '3']
10,11,12 => ['10', '11', '12']
To make it clear where these lines came from, I appended the date using row.split(",") + [date_xml] so the list would have the date that was being processed. So
for row in xml_rows[date_xml]:
    print(row, "=>", row.split(",") + ["2021-09-27"])
would produce
a,b,c => ['a', 'b', 'c', '2021-09-27']
1,2,3 => ['1', '2', '3', '2021-09-27']
10,11,12 => ['10', '11', '12', '2021-09-27']
Since I was mocking the input data anyway, this all might have been clearer with
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']
# sample csv data for each xml_date since we don't have the actual file contents
xml_rows = {
    "20200907": [["a", "b", "c"], ["1", "2", "3"], ["10", "11", "12"]],
    "20201025": [["a", "c", "b"], ["7", "8", "9"]]
}
to_csv = []
for date_txt in txt_dates:
    # Filter xml_dates for those that match the current date_txt
    xml_matches = [xd for xd in xml_dates if xd == date_txt]
    print("txt date:", date_txt)
    for date_xml in xml_matches:
        print(" ", date_xml, end=" ")
        # Append the date currently being processed to the row
        file_rows = [row + [date_xml] for row in xml_rows[date_xml]]
        to_csv.append(file_rows)
    print()
print(to_csv)