I'm using pandas to analyze data from 3 different sources, which are imported into dataframes and require modification to account for human error, as this data was all entered by humans and contains errors.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue ', '', regex=True)
I've decided I would like to use regex to identify (and remove all other characters from the address column's fields): any number of integers, followed by a space, and then the first 3 number of alphabetic characters. For example, "3762 pearl street" might become "3762 pea" if x is 3 with the following regex:
(\d ) \w{0,3}
How can I use panda's .str.replace to do this? I don't want to specify WHAT I want to replace with the second argument. I want to replace the original string with the pattern matched from regex. Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r' (\d ) \w{0,3}, regex=True)
which might make 43 milford st. into "43 mil".
Thank you, please let me know if I'm being unclear.
CodePudding user response:
you could use the extract method to overwrite the column with its own content
pat = r'(\d \s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
Just an observation: The regex you shared (\d ) \w{0,3}
matches the following patterns and returns some funky stuff as well
- 1131 1313 street
- 121 avenue
- 1 1 1 1 1 1 avenue
- 42
I've changed it up a bit based on what you described, but i'm not sure if that works for all your datapoints.