I have big DF with values and names. After interpolating values with increment 0.1 I want to remove the rows with numbers such as 1111.123, and keep only rows with 1111.0, 1111.1, 1111.2 (or 1111.100...)
Value | Name |
---|---|
1111.1 | Mark |
1111.2 | Mark |
1111.234 | Mark |
1111.3 | Mark |
1111.346 | Mark |
1111.4 | Mark |
I've tried Series.str.split()
by string but it is too slow for CSV bigger than 1 GB. I've tried regex with
m = df['Value'].str.filter(regex='(\d*)\.(\d{3})')
But it returns empty Series.
Another option I've tried is math.modf
, but it returns
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Here is the code itself:
import pandas as pd
from math import modf
df = pd.read_csv("file.csv")
split = ((df['Value'].apply(lambda x: modf(x)[0])) * 10).apply(lambda x: modf(x)[0])
target_value = 0 < split < 1
df1 = df.loc[split == target_value]
*some code to remove rows*
The final result I expect is:
Value | Name |
---|---|
1111.1 | Mark |
1111.2 | Mark |
1111.3 | Mark |
1111.4 | Mark |
I am new to Python, but still believe there are more efficient ways to deal with it.
CodePudding user response:
If you do not want to convert your floats to string, use this simple math trick to check whether there are decimals after the first one:
df[(df['Value']*10%1).eq(0)]
Alternatively, you can simple check if the second to last character is a .
:
df[df['Value'].astype(str).str[-2].eq('.')]
output:
Value Name
0 1111.1 Mark
1 1111.2 Mark
3 1111.3 Mark
5 1111.4 Mark
CodePudding user response:
Try this:
df["value"] = df["value"].apply(lambda x: x - x % 0.1)
CodePudding user response:
First is necessary read values like strings and then test pattern digits, '.' and one last digit:
df = pd.read_csv("file.csv", dtype={'Value':str})
df = df[df['Value'].str.match('\d \.\d{1}$')]
Another idea without converting to strings, but possible float accuracy problems like 1111.100000000094894
:
s = df['Value'] * 10
df = df[s.eq(s.astype(int))]
print (df)
Value Name
0 1111.1 Mark
1 1111.2 Mark
3 1111.3 Mark
5 1111.4 Mark
CodePudding user response:
Exclude any row which doesnt have 1 as a single digit between the dot and the end or does not have 0 between the dot and and end.
df[(df['Value'].astype(str).str.contains('(?<=\.)[^0]$|(?<=\.)[1]$'))]
Value Name
0 1111.1 Mark
1 1111.2 Mark
3 1111.3 Mark
5 1111.4 Mark
Based on your comment lets try
Value Name
0 1111.100000 Mark
1 1111.200000 Mark
2 1111.230000 Mark
3 1111.300000 Mark
4 1111.346000 Mark
5 1111.400000 Mark
6 1111.100099 Added
Output
Value Name
0 1111.1 Mark
1 1111.2 Mark
3 1111.3 Mark
5 1111.4 Mark