I have some strings, some of which are gibberish: a mixture of digits and letters. I would like to remove the gibberish but keep the strings that follow a pattern.
I am providing an example for illustrative purposes.
strings = ["1Z83E0590391137855",
"55t5555t5t5tttt5t5555tttttttgggggggggggggggsss",
"1st", "2nd", "3rd", "4th", "5th"
]
import pandas as pd
df = pd.DataFrame(strings, columns=['strs'])
df
I would like to remove strings that look like
1Z83E0590391137855
55t5555t5t5tttt5t5555tttttttgggggggsss
and keep strings that look like the ones below
1st
2nd
3rd
4th
5th
Given my limited regex and Python experience, I am having some difficulty coming up with the right formulation. What I have tried removed everything except the first row:
df['strs'] = df['strs'].str.replace(r'(?=.*[a-z])(?=.*[\d])[a-z\d]+', '', regex=True)
CodePudding user response:
I suggest matching only alphanumeric strings that contain both letters and digits and that reach a certain minimum length.
In the example below, I set the threshold to 18, i.e. strings shorter than 18 chars won't be matched and thus will remain in the column. All strings of 18 or more chars will be replaced with an empty string:
df['strs'] = df['strs'].str.replace(r'^(?=.{18})(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z\d]*$', '', regex=True)
Details:
^ - start of string
(?=.{18}) - the string must start with 18 chars other than line break chars
(?:[a-zA-Z]+\d|\d+[a-zA-Z]) - one or more letters and then a digit, or one or more digits and then a letter
[a-zA-Z\d]* - zero or more alphanumeric chars
$ - end of string.
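To sanity-check the length-threshold pattern outside pandas, here is a minimal sketch using the standard `re` module on the sample strings from the question. The names `THRESHOLD_PATTERN`, `removed` and `kept` are only illustrative, and the threshold of 18 is an assumption that should be tuned to the real data.

```python
import re

strings = ["1Z83E0590391137855",
           "55t5555t5t5tttt5t5555tttttttgggggggggggggggsss",
           "1st", "2nd", "3rd", "4th", "5th"]

# Alphanumeric strings that mix letters and digits and are at least 18 chars long.
# The 18 is an assumed threshold; adjust it for your own data.
THRESHOLD_PATTERN = re.compile(
    r'^(?=.{18})(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z\d]*$'
)

# Strings the pattern matches are the ones the replacement would blank out.
removed = [s for s in strings if THRESHOLD_PATTERN.search(s)]
kept = [s for s in strings if not THRESHOLD_PATTERN.search(s)]

print(removed)
print(kept)
```

With the sample data, the two long strings end up in `removed` and the five ordinals in `kept`. The same pattern string can be passed to `df['strs'].str.replace(..., regex=True)`; note that this leaves empty cells in place rather than dropping the rows.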
CodePudding user response:
You could assert that the string is not an ordinal such as 1st, 2nd, ... and match (and remove) only the strings that fail that check.
^(?!\d+(?:st|nd|rd|th)$).*$
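As a rough illustration of the negative-lookahead approach, the sketch below applies the pattern to the sample strings with the standard `re` module. The names `NOT_ORDINAL` and `cleaned` are hypothetical, and the sketch assumes that ordinals (digits followed by st, nd, rd or th) are the only strings worth keeping.

```python
import re

strings = ["1Z83E0590391137855",
           "55t5555t5t5tttt5t5555tttttttgggggggggggggggsss",
           "1st", "2nd", "3rd", "4th", "5th"]

# Matches a whole string only when it is NOT digits followed by an ordinal suffix.
NOT_ORDINAL = re.compile(r'^(?!\d+(?:st|nd|rd|th)$).*$')

# Non-ordinal strings are replaced with an empty string; ordinals are untouched.
cleaned = [NOT_ORDINAL.sub("", s) for s in strings]

print(cleaned)
```

In pandas, the same pattern can be used as `df['strs'].str.replace(r'^(?!\d+(?:st|nd|rd|th)$).*$', '', regex=True)`. Unlike the length-based pattern, this one removes every string that is not an ordinal, so it is stricter about what counts as a string to keep.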