Home > Software design >  Why is my Regex search yielding more than expected and why isn't my loop removing certain dupli
Why is my Regex search yielding more than expected and why isn't my loop removing certain dupli

Time:11-05

I'm working on a program that parses weather information. In this part of my code, I am trying to re-organise the results in order of time before continuing to append more items later on.

The time in these lines is usually the first 4 digits of any line (first 2 digits are the day and the others are the hour). The exception to this is the line that starts with 11010KT, this line is always assumed to be the first line in any weather report, and those numbers are a wind vector and NOT a time.

You will see that I am removing any line that has TEMPO INTER or PROB at the start of this example because I want lines containing these words to be added to the end of the other restructured list. These lines can be thought of as a separate list in which I want organised by time in the same way as the other items.

I am trying to use Regex to pull the times from the lines that remain after removing the TEMPO INTER and PROB lines and then sort them, then once sorted, use regex again to find that line in full and create a restructured list. Once that list has been completed, I am sorting the TEMPO INTER and PROB list and then appending that to the newly completed list I had just made.

I have also tried a for loop that will remove any duplicate lines added, but this seems to only remove one duplicate of the TEMPO line???

Can someone please help me figure this out? I am kind of new to this, thank you...

This ideally should come back looking like this:

ETA IS 0230 which is 1430 local

11010KT 5000 MODERATE DRIZZLE BKN004
FM050200 12012KT 9999 LIGHT DRIZZLE BKN008 
TEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002
INTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008

Instead of this, I am getting repeats of the line that starts with FM050200 and then repeats of the line starting with TEMPO. It doesn't find the line starting with INTER either...

I have made a minimal reproducible example for anyone to try and help me. I will include that here:

import re

total_print = ['\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008', '\n11010KT 5000 MODERATE DRIZZLE BKN004', '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008', '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002']

removed_lines = []
for a in total_print:  # finding and removing lines with reference to TEMPO INTER PROB
    if 'TEMPO' in a:
        total_print.remove(a)
        removed_lines.append(a)
for b in total_print:
    if 'INTER' in b:
        total_print.remove(b)
        removed_lines.append(b)
for f in total_print:
    if 'PROB' in f:
        total_print.remove(f)
        removed_lines.append(f)

list_time_in_line = []
for line in total_print: # finding the times in the remaining lines
    time_in_line = re.search(r'\d\d\d\d', line)
    list_time_in_line.append(time_in_line.group())
sorted_time_list = sorted(list_time_in_line)

removed_time_in_line = []
for g in removed_lines:  # finding the times in the lines that were originally removed
    removed_times = re.search(r'\d\d\d\d', g)
    removed_time_in_line.append(removed_times.group())
sorted_removed_time_list = sorted(removed_time_in_line)


final = []
final.append('ETA IS 1230 which is 1430 local\n')  # appending the time display
search_for_first_line = re.search(r'[\n]\d\d\d\d\dKT', ' '.join(total_print))  # searching for line that has wind vector instead of time
search_for_first_line = search_for_first_line.group()

if search_for_first_line:  # adding wind vector line so that its the firs line listed in the group
    search_for_first_line = re.search(r'%s.*' % search_for_first_line, ' '.join(total_print)).group()
    final.append('\n'   search_for_first_line)

print(sorted_time_list)  # the list of possible times found (the second item in list is the wind vector and not a time)
d = 0
for c in sorted_time_list:  # finding the whole line for the corresponding time
    print(sorted_time_list[d])
    search_for_whole_line = re.search(r'.*\w \s*%s.*' % sorted_time_list[d], ' '.join(total_print))
    print(search_for_whole_line.group())  # it is doubling up on the 0502 time???????
    d  = 1
    final.append('\n'   str(search_for_whole_line.group()))

h = 0
for i in sorted_removed_time_list:  # finding the whole line for the corresponding times from the previously removed items
    whole_line_in_removed_srch = re.search(r'.*%s.*' % sorted_removed_time_list[h], ' '.join(removed_lines))
    h  = 1
    final.append('\n'   str(whole_line_in_removed_srch.group()))  # appending them

l_new = []
for item in final:  # this doesn't seeem to properly remove duplicates ?????
    if item not in l_new:
        l_new.append(item)
total_print = l_new

print(' '.join(total_print))

CodePudding user response:

just sort your data into a dict, you're always creating lists and removing items: it's too confusing.

your regex to catch the wind vector catches also 12012KT, that's why that line was repeated. the ^ ensures it matches only your pattern if it's a the beginning of the line

import re

total_print = ['\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008', '\n11010KT 5000 MODERATE DRIZZLE BKN004', '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008', '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002']

data = {
    'windvector': [],
    'other': [],
    'tip': [] #tempo/inter/prob
}

wind_vector=re.compile('^\s\d{5}KT')
for line in total_print:
    if 'TEMPO' in line \
            or 'INTER' in line \
            or 'PROB' in line:
        key='tip'
    elif re.match(wind_vector, line):
        key='windvector'
    else:
        key='other'
    data[key].append(line)
        
data['tip']=sorted(data['tip'], key=lambda x: x.split()[1])
print('ETA IS 0230 which is 1430 local')
print()
for lst in data.values():
    for line in lst:
        print(line[1:]) #get rid of newline
  • Related