How to split the characters of a string by spaces and then resultant elements of list by special cha

Time:10-28

What I want to do is replace certain words in a string with their abbreviations from a dictionary and leave everything else as it is. For example, given the input:

standarisationn("well-2-34 2   @$#beach bend com")

I want the output to be:

"well-2-34 2 @$#bch bnd com"

The code I was using is:

import re

def standarisationn(addr):
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw","northwest":"nw"}

    # Tokenise into alphanumeric runs and single non-space characters
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)

but it gives the wrong output:

'well - 2 - 34 2 @ $ % 23beach bnd com'

That is, there are too many spaces and "beach" is not even converted to "bch". My idea is to first split the string by spaces, then split the resulting elements by special characters and numbers, apply the dictionary, then rejoin the pieces that were separated by special characters without spaces, and finally join the whole list by spaces. Can anyone suggest how to go about this, or a better method?
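For reference, the split-and-rejoin idea described in this question could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the function name standardise_by_splitting and the shortened LOOKUP table are hypothetical, and it uses re.split with a capturing group so that the separators are kept and can be joined back without extra spaces.

```python
import re

# Hypothetical, shortened lookup table for illustration only.
LOOKUP = {"beach": "bch", "bend": "bnd", "avenue": "ave"}

def standardise_by_splitting(addr):
    # Turn commas into spaces and collapse runs of whitespace.
    addr = re.sub(r"[,\s]+", " ", addr)
    # The capturing group makes re.split keep the separators, so the
    # list alternates between letter runs and non-letter runs.
    parts = re.split(r"([^A-Za-z]+)", addr)
    # Only whole letter runs can match a key; everything else is kept as-is.
    return "".join(LOOKUP.get(part, part) for part in parts)

print(standardise_by_splitting("well-2-34 2   @$#beach bend com"))
# prints: well-2-34 2 @$#bch bnd com
```

Because the separators are preserved by the split, joining with an empty string restores the original punctuation and spacing instead of inserting a space between every token.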

CodePudding user response:

You can build your regular expression from the keys of your dictionary, ensuring they are not enclosed in another word (i.e. neither directly preceded nor directly followed by a letter):

import re
def standarisationn(addr):
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                "arcade":"arc",
                "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                "beach":"bch",
                "bend":"bnd",
                "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                "bottm":"bot","bottom":"bot",
                "branch":"br","brnch":"br",
                "brdge":"brg","bridge":"brg",
                "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                "camp":"cmp",
                "canyn":"cny","canyon":"cny","cnyn":"cny",
                "southwest":"sw" ,"northwest":"nw"}

    for wrd in lookp_dict:
        addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2   @$#beach bend com"))

The expression is built in four parts:

  • ^ matches the beginning of the string
  • (?<=[^a-zA-Z]) is a lookbehind (i.e. a zero-width assertion that consumes no characters), checking that the preceding character is not a letter
  • {wrd} is the key of your dictionary
  • (?=[^a-zA-Z]|$) is a lookahead (also zero-width), checking that the key is followed by a character that is not a letter, or by the end of the string
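To see what these boundary checks do, here is a small sketch that applies the same pattern for a single key. The helper name replace_word and the sample strings are made up for illustration; "beach" and "bch" come from the dictionary above.

```python
import re

# Same boundary pattern as in the answer, built for one example key.
wrd = "beach"
pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)")

def replace_word(text):
    # Replace the key only where it stands alone between non-letters.
    return pattern.sub("bch", text)

print(replace_word("beach road"))     # key at the start of the string: bch road
print(replace_word("@$#beach bend"))  # preceded by a non-letter: @$#bch bend
print(replace_word("beaches"))        # followed by a letter: beaches
print(replace_word("sunbeach"))       # preceded by a letter: sunbeach
print(replace_word("beach2"))         # digits count as non-letters: bch2
```

Note that digits and punctuation count as word boundaries here, which is what lets "@$#beach" be converted while "beaches" is left alone.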

Output:

well-2-34 2 @$#bch bnd com

Edit: you can compile a whole expression and use re.sub only once if you replace the loop with:

repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=([^a-zA-Z]|$))")
addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)

This should be much faster if your dictionary grows because we build a single expression with all your dictionary keys:

  • ({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...
  • a lambda function passed to re.sub as the replacement receives each match object and returns the corresponding value from lookp_dict
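Put together, the single-pattern variant might look like the sketch below. The names LOOKUP, PATTERN and standardise are hypothetical, the table is shortened for illustration, and the use of re.escape is an added precaution (an assumption, not part of the answer) in case a key ever contains a regex metacharacter.

```python
import re

# Hypothetical, shortened lookup table for illustration only.
LOOKUP = {"av": "ave", "avenue": "ave", "beach": "bch", "bend": "bnd"}

# Build the alternation once, at import time, instead of on every call.
_KEYS = "|".join(re.escape(key) for key in LOOKUP)
PATTERN = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({_KEYS})(?=[^a-zA-Z]|$)")

def standardise(addr):
    # Replace commas with a space and collapse runs of whitespace.
    addr = re.sub(r"(,|\s+)", " ", addr)
    # group(1) is the matched key; look up its abbreviation.
    return PATTERN.sub(lambda m: LOOKUP[m.group(1)], addr)

print(standardise("well-2-34 2   @$#beach bend com"))
# prints: well-2-34 2 @$#bch bnd com
```

A short key such as "av" does not break longer keys such as "avenue": the lookahead rejects "av" when it is followed by a letter, so the regex engine backtracks and tries the next alternative.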