I have a long list dataframe that has texts similar to the example provided in the image. dataframe is called gbs2. I want to gbs2.drop
rows with numerical values that has 4 decimal places or 5 numbers ("1.2345" or "12345")
Code used:
name = pd.Series(['Product A', 'Product B', 'Product C'])
Produced_Qty = pd.Series(['ABC 1.407', 'NOP 1.6345', 'XYZ 1.3144/1.7523 ect'])
Passed_Qty = pd.Series([499768, 756662, 301297])
gbs2 = pd.DataFrame({'Name':name,
'Produced Qty': Produced_Qty,
'Passed Qty':Passed_Qty})
gbs2
In this case, Product B and Product C should be dropped. My original case which I cannot post has longer texts where the interested number is in the middle, but has many words from the start and end of the string.
I could not find ways to identify numerical values in a string that has both alphabets and numbers.
CodePudding user response:
You can use \d .\d{4}
to match a floating point number with 4 decimals and \d{5}
for a number with 5 digits combined with str.contains
, then invert the boolean with ~
and perform boolean indexing:
out = gbs2.loc[~gbs2['Produced Qty'].str.contains(r'\d .\d{4}|\d{5}')]
Output:
Name Produced Qty Passed Qty
0 Product A ABC 1.407 499768
If you want to match exactly 4/5 digits and no more, you additionally have to ensure the pattern is not having extra digits around it, then use '(?<!\d)(?:\d .\d{4}|\d{5})(?!\d)'