Home > database >  Multiple split conditions
Multiple split conditions

Time:11-12

I have a table with a street number field that should have been all numeric values but the creators of the dataset allowed invalid values like 567M or 4321.5 to indicate an apartment number or granny unit. I need to get whole numbers into a new field and place the letter values and decimal values into a street suffix field. I've been playing around with regex and isalpha() and isalnum():

# import regex
import re

# list of values that should be all numbers
lst = ['1234', '4321.5', '567M']

# create new lists where l1 will be numeric datatypes and l2 will contain suffixes
# if there is a decimal symbol the string should be split with first value being all numbers
# and going into l1 and everything after decimal in l2
l1 = []
l2 =[]

for i in lst:
    if i.isalnum(): # all characters are numeric and good to go. Maybe need to do int()? 
        l1.append(i)
    elif '.' in i: # a decimal was found and values need to to be split and placed into two different lists
        i.split(".")
        l1.append(i[0])
        l2.append(i[-1])
    else:
        if i.isnumeric() == False: # if a letter is found in a list item everything prior to letter goes to l1 and letter goes to l2
            i = re.split('(\d )', i)
            l1.append(i[0])
            l2.append(i[-1])

I immediately got this back when running the code:

['4321', '5']

And then got this for l1 and l2 (l1 being new numeric list and l2 being string suffix list):

['4321', '5']
l1
['1234', '4', '567M']
l2
['5']

Am I headed in the right direction here? I was hoping this would be simpler but the data is pretty wonky.

CodePudding user response:

Your first issue is that i.split(".") doesn't do anything useful by itself, it returns a list with the split pieces. You're not using that (and it seems like your interpreter is printing it out for you), and you go on to index into the original, unsplit string. You could use i = i.split(".") to fix that part of your code, but below I suggest a better approach.

The other issue is that your regex splitting code isn't working either. Rather than try to fix it, I'd suggest using an entirely different regex approach, which will also handle the other cases you're looking at:

for i in lst:
    match = re.match(r'(\d )\.?(.*)', i)
    if match is None:
        raise ValueError("invalid address format {i}")
    l1.append(match.group(1))
    l2.append(match.group(2))

This uses a regex pattern to match the two parts of the street number you want. Any leading numeric part gets matched by the first capturing group and put into l1, and whatever else is trailing (optionally after a decimal point) goes into l2. The second part may be an empty string, if there is no extra part to the street number, but I still put it into l2, which you'll appreciate when you try to iterate over l1 and l2 together.

CodePudding user response:

You can just use a regex to do the pattern matching. The expression would consist of three parts:

  1. the street number: [1-9]\d*
  2. a optional delimiter, e.g .,- : [\.,\- ]*
  3. the suffix (any chars except whitespaces): [^\s]*
import re

pattern = re.compile(r'(?P<number>[1-9]\d*)[\.,\- ]*(?P<suffix>[^\s]*)')

list = ['1234', '4321.5', '567M', ' 31-a ']
l1 = []
l2 = []
for item in list:
    match = pattern.search(item)
    if match:
        l1.append(int(match.group('number')))
        l2.append(match.group('suffix'))
    
print(f'l1: {l1}')
print(f'l2: {l2}')

Result:

l1: [1234, 4321, 567, 31]
l2: ['', '5', 'M', 'a']
  • Related