I am an R user who is trying to learn more about Python.
I found this Python library that I would like to use for address parsing: https://github.com/zehengl/ez-address-parser
I was able to try an example over here:
from ez_address_parser import AddressParser
ap = AddressParser()
result = ap.parse("290 Bremner Blvd, Toronto, ON M5V 3L9")
print(result)
[('290', 'StreetNumber'), ('Bremner', 'StreetName'), ('Blvd', 'StreetType'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('M5V', 'PostalCode'), ('3L9', 'PostalCode')]
I have the following file that I imported:
import pandas as pd
df = pd.read_csv(r'C:/Users/me/OneDrive/Documents/my_file.csv', encoding='latin-1')
name address
1 name1 290 Bremner Blvd, Toronto, ON M5V 3L9
2 name2 291 Bremner Blvd, Toronto, ON M5V 3L9
3 name3 292 Bremner Blvd, Toronto, ON M5V 3L9
I then applied the parser to the address column and exported the file, and everything worked:
df['Address_Parse'] = df['ADDRESS'].apply(ap.parse)
df = pd.DataFrame(df)
df.to_csv(r'C:/Users/me/OneDrive/Documents/python_file.csv', index=False, header=True)
Problem: I now have another file (similar format) - but this time, I am getting an error:
df1 = pd.read_csv(r'C:/Users/me/OneDrive/Documents/my_file1.csv', encoding='latin-1')
df1['Address_Parse'] = df1['ADDRESS'].apply(ap.parse)
AttributeError: 'float' object has no attribute 'replace'
I am confused as to why the same code will not work for this file. As I am still learning Python, I am not sure where to begin to debug this problem. My guesses are that perhaps there are special characters in the second file, formatting issues or incorrect variable types that are preventing the ap.parse function from working, but I am still not sure.
Can someone please show me what to do?
Thank you!
CodePudding user response:
Looking at the code from the library, we have this parse method in the AddressParser class, and then this tokenize function that is called by parse:
# method of AddressParser
def parse(self, address):
    if not self.crf:
        raise RuntimeError("Model is not loaded")
    tokens = tokenize(address)
    labels = self.crf.predict([transform(address)])[0]
    return list(zip(tokens, labels))

def tokenize(s):
    s = s.replace("#", " # ")
    return [token for token in split(fr"[{puncts}\s] ", s) if token]
We can see here that tokenize calls s.replace(), and so that is likely where your error is coming from. replace is a string method, so tokenize is expecting a str here, not a float.
So, your column likely has floats in it where strings are expected. The tokenize function should probably handle that better, but for now it is up to you.
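To see the failure in isolation, here is a minimal sketch. tokenize_like is a hypothetical stand-in that copies only the first line of the library's tokenize, and the NaN input is an assumption about what the float in your column might be (pandas uses NaN, which is a float, for missing values).

```python
def tokenize_like(s):
    # Hypothetical stand-in: mirrors only the s.replace(...) step of tokenize
    return s.replace("#", " # ")

# A normal string works
print(tokenize_like("Unit #5"))

# A float (here NaN, as pandas would give for a missing value) has no .replace
try:
    tokenize_like(float("nan"))
except AttributeError as exc:
    message = str(exc)
    print(message)
```

The message printed in the except branch is the same one you are seeing.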
You should be able to resolve this by forcing your Address column to be strings (pandas will call it 'object').
df1['string_address'] = df1['ADDRESS'].astype(str)
df1['Address_Parse'] = df1['string_address'].apply(ap.parse)
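Before converting, it may help to look at which rows are not strings. The sketch below assumes the culprits are missing values; the two-row df1 is made-up data, parse_or_none is a hypothetical helper, and str.split stands in for ap.parse so the example runs without the library. Note that astype(str) can turn a missing value into the literal text 'nan', which would then be parsed as if it were an address, so skipping those rows is one alternative.

```python
import pandas as pd

# Made-up stand-in for df1: the second address is missing (NaN is a float)
df1 = pd.DataFrame(
    {"ADDRESS": ["290 Bremner Blvd, Toronto, ON M5V 3L9", float("nan")]}
)

# Flag the rows whose ADDRESS is not a string; these are the ones that fail
not_str = ~df1["ADDRESS"].map(lambda v: isinstance(v, str))
print(df1[not_str])

def parse_or_none(value, parse):
    """Hypothetical helper: call parse only on real strings, else return None."""
    return parse(value) if isinstance(value, str) else None

# str.split is a placeholder for ap.parse
df1["Address_Parse"] = df1["ADDRESS"].apply(lambda v: parse_or_none(v, str.split))
print(df1)
```

With the real parser you would pass ap.parse in place of str.split.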
CodePudding user response:
You can try reading every column of the CSV file as strings by adding dtype=str:
df1 = pd.read_csv(r'C:/Users/me/OneDrive/Documents/my_file1.csv', encoding='latin-1', dtype=str)
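One caveat, assuming the floats come from empty cells: dtype=str on its own still reads an empty cell as NaN, which is a float. Adding keep_default_na=False makes read_csv keep it as an empty string instead. The sketch below uses a small in-memory CSV (made-up data) rather than your file; whether ap.parse gives a useful result for an empty string is something you would need to check.

```python
import io
import pandas as pd

# Made-up CSV text: the second row has an empty ADDRESS cell
csv_text = "name,ADDRESS\nname1,290 Bremner Blvd\nname2,\n"

# dtype=str alone: the empty cell is still read as NaN (a float)
with_nan = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(type(with_nan.loc[1, "ADDRESS"]))

# keep_default_na=False: the empty cell is kept as an empty string
no_nan = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(type(no_nan.loc[1, "ADDRESS"]))
```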