How to split a string after the last digit

To clean a dataset, I need to split a string after the last digit. Any ideas?

My dataframe:

import pandas as pd

data = {'addr':[
         "510 -1, Cleveland St", 
         "RC-20-5345 Poplar Street", 
         "3600 Race Avenue Richardson"]}

df = pd.DataFrame(data)

   addr
_____________________________________
   510 -1, Cleveland St
   RC-20-5345 Poplar Street
   3600 Race Avenue Richardson

I tried this expression, but it drops the floor number (RC) in the second row.

df["split1"] = df["addr"].str.extract(r"(\d+[-\ ]+\d*)")

  split1   | split2
___________|_________________________
510 -1     |  , Cleveland St
20-5345    |  Poplar Street
3600       |  Race Avenue Richardson

What I'm looking for:

  split1   | split2
___________|_________________________
510 -1     |  , Cleveland St
RC-20-5345 |  Poplar Street
3600       |  Race Avenue Richardson

CodePudding user response：

What about just adding a wildcard match to the front of the regex?

df["split1"] = df["addr"].str.extract(r"(.*\d+[-\ ]+\d*)")
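To see why the leading wildcard helps, here is a minimal sketch using the standard `re` module instead of pandas. It assumes the quantifiers in the pattern are `+` (that is, `\d+[-\ ]+\d*`); the helper `first_part` and the two constant names are made up for illustration. Because `.*` is greedy, it consumes everything up to the last place where the rest of the pattern can still match, so the `RC-` prefix ends up inside the group.

```python
import re

# Pattern from the question: matching starts at the first run of digits.
WITHOUT_PREFIX = re.compile(r"(\d+[-\ ]+\d*)")
# Same pattern with a leading greedy wildcard, as suggested above.
WITH_PREFIX = re.compile(r"(.*\d+[-\ ]+\d*)")


def first_part(pattern, addr):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(addr)
    return match.group(1) if match else None


addr = "RC-20-5345 Poplar Street"
print(repr(first_part(WITHOUT_PREFIX, addr)))  # '20-5345'
print(repr(first_part(WITH_PREFIX, addr)))     # 'RC-20-5345 ' (trailing space)
```

Note that the captured text can end with a space, so a `.strip()` (or `.str.strip()` on a pandas column) may be needed afterwards.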

CodePudding user response：

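def splitByLastDigit(x):
    lastDigit=0
    splitOne=""
    splitTwo=""
    finalArray=[]
    # First pass: record the index of the last digit in the string.
    for i in range(0,len(x)):
        if x[i].isdigit() and i > lastDigit:
            lastDigit=i

    # Second pass: characters up to and including that index go to
    # splitOne, everything after it goes to splitTwo.
    for i in range(0,len(x)):
        if i <= lastDigit:
            splitOne +=x[i]
        else:
            splitTwo +=x[i]
    finalArray.append(splitOne)
    finalArray.append(splitTwo)
    return finalArray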

I just wrote up this solution. It is a bit rough (it can definitely be done more elegantly), but I tested it with the three examples you provided and it gets the job done.

The idea is pretty simple: the first loop collects the index of the last digit, then a second loop checks which characters fall before and after that index. Finally, both parts are appended to a list, which is returned.
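The same idea can be sketched more compactly by scanning from the end of the string and slicing at the first digit found. The function name `split_after_last_digit` is hypothetical, and returning an empty first part when the string has no digit at all is an assumption, not something the question specifies.

```python
def split_after_last_digit(text):
    """Split text into [up to and including the last digit, the rest]."""
    # Walk backwards so the first digit we meet is the last one in the string.
    for i in range(len(text) - 1, -1, -1):
        if text[i].isdigit():
            return [text[:i + 1], text[i + 1:]]
    # Assumption: with no digit present, everything goes into the second part.
    return ["", text]


for addr in ["510 -1, Cleveland St",
             "RC-20-5345 Poplar Street",
             "3600 Race Avenue Richardson"]:
    print(split_after_last_digit(addr))
```

It could be applied to the column with something like df["addr"].apply(split_after_last_digit); the second part keeps its leading space unless it is stripped.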

CodePudding user response：

To piggyback on zyd's answer, capture the remainder in another group:

data = {'addr':[
         "510 -1, Cleveland St", 
         "RC-20-5345 Poplar Street", 
         "3600 Race Avenue Richardson"]}

df = pd.DataFrame(data)
df[['split1','split2']] = df["addr"].str.extract(r"(.*\d+[-\ ]+\d*)(.+)")

                          addr       split1                  split2
0         510 -1, Cleveland St       510 -1          , Cleveland St
1     RC-20-5345 Poplar Street  RC-20-5345            Poplar Street
2  3600 Race Avenue Richardson        3600   Race Avenue Richardson
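If the goal is literally "everything up to the last digit, then the rest", a simpler pattern may be enough. The sketch below uses the standard `re` module; the pattern `^(.*\d)(.*)$`, the constant `LAST_DIGIT`, and the helper `split_on_last_digit` are illustrative assumptions rather than part of the answers above. Stripping whitespace from both parts is also a choice made here.

```python
import re

# Greedy .* followed by \d forces group 1 to end on the last digit.
LAST_DIGIT = re.compile(r"^(.*\d)(.*)$")


def split_on_last_digit(addr):
    """Return (head, tail), split after the last digit, both stripped."""
    match = LAST_DIGIT.match(addr)
    if match is None:
        # Assumption: a string without digits has an empty head.
        return "", addr.strip()
    return match.group(1).strip(), match.group(2).strip()


for addr in ["510 -1, Cleveland St",
             "RC-20-5345 Poplar Street",
             "3600 Race Avenue Richardson"]:
    print(split_on_last_digit(addr))
```

The same two-group pattern should also work when passed to str.extract on the addr column, followed by str.strip() on each resulting column.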