Here is a sample of a dataset I have:
ID | Project |
---|---|
1 | 21st Townhouse 318 |
2 | The Residences 6 |
3 | Villanova Tower B |
4 | The Hills H |
5 | City Park |
I need to transform 'Project' column so that:
- if a row ends with numeric values, they should be dropped
- if a row ends with single letter, it should be dropped
- else, leave as it is
Here is how I want it to look like:
ID | Project |
---|---|
1 | 21st Townhouse |
2 | The Residences |
3 | Villanova Tower |
4 | The Hills |
5 | City Park |
I tried to search for some solution, and found this(for first condition with numeric values only):
df['Project']=df.Project[~((df.Project.astype(str).str.match("(.*\d)")) & (df.Project.astype(str).str.len() > 1))]
It worked, however, I tried to apply it for the second condition as well:
df['Project']=df.Project[~((df.Project.astype(str).str.match("(.*\w)")) & (df.Project.astype(str).str.len() == 1))]
But, It failed
Can you help me, please? Thank you!
CodePudding user response:
You can use
df['Project']=df['Project'].str.replace(r'\s (?:\d |[A-Za-z])$', '', regex=True)
See the regex demo.
Details:
\s
- one or more whitespaces(?:\d |[A-Za-z])
- a non-capturing group matching either\d
- one or more digits|
- or[A-Za-z]
- an ASCII letter (use[^\W\d_]
to match any Unicode letter)
$
- end of string.
If your strings have no whitespace before the last number/letter, you may use
df['Project'].str.replace(r'\b(?:\d |[A-Za-z])$', '', regex=True).str.rstrip()
The \b
matches a word boundary making sure there is no letter, digit or _
right before the final digits/letter, and .str.rstrip()
will strip the trailing whitespace.