I have problems filtering numeric data in pandas.
I have 10 000 data and i need them to filter out values in 3rd column > then 10.
Data in 3rd column are dtype object and cells of this column contains 3 types of data: dot (as None); 12.25 (single values); 12,45,12.5 (multiple values separated by ,).
I tried:
- separating the values by ',' with str.methods
- filtering splited values with greater then 10
- with df.loc then filter main dataframe with filtered columns (filter value - values from filtered column == same column from main dataframe)
#Data sample
{'POS': {0: 20482821,
1: 20482980,
2: 20483463,
3: 20485526,
4: 20485536,
5: 20485630,
6: 20485811,
7: 20485948,
8: 109274570,
9: 109274623,
10: 109274677,
11: 109274857,
12: 109274968,
13: 109275216,
14: 109275325,
15: 109275506,
16: 109275536,
17: 109275600,
18: 109275641,
19: 109275648,
20: 109275684,
21: 197042891,
22: 197042926,
23: 197043092,
24: 197043111},
'CHROM': {0: 'chr1',
1: 'chr1',
2: 'chr1',
3: 'chr1',
4: 'chr1',
5: 'chr1',
6: 'chr1',
7: 'chr1',
8: 'chr1',
9: 'chr1',
10: 'chr1',
11: 'chr1',
12: 'chr1',
13: 'chr1',
14: 'chr1',
15: 'chr1',
16: 'chr1',
17: 'chr1',
18: 'chr1',
19: 'chr1',
20: 'chr1',
21: 'chr3',
22: 'chr3',
23: 'chr3',
24: 'chr3'},
'CADD_phred': {0: 14.27,
1: '.',
2: '.',
3: 17.1,
4: 17.61,
5: '20.1,19.64',
6: 15.99,
7: 15.95,
8: 1.551,
9: 5.142,
10: 14.05,
11: 6.579,
12: 1.225,
13: 14.38,
14: 5.841,
15: 3.85,
16: 4.373,
17: '.',
18: 16.95,
19: 16.94,
20: 3.067,
21: '.',
22: 5.925,
23: 10.3,
24: 9.495}}
CodePudding user response:
If you have the data in a dataframe, e.g. df, all you have to do:
df = df[df['#3'] < 10]
CodePudding user response:
If you want to keep rows whose values are less than threshold (e.g. 10.0)
Code
def greater(x, threshold = 10.0):
'''
Checks if there is a value > threshold
x values
- comma separated string
- float
- empty
'''
if x and isinstance(x, str) and "," in x:
# comma separated string
# Check if list has a value > threshold
return any(v > threshold for v in [float(i) for i in x.split(',')])
elif x and isinstance(x, float):
# Float
return x > threshold
else:
# Empty
return False
# Generate dataframe from posted data
df = pd.DataFrame(data)
# Keep rows that don't have a value > 10 in column CADD_phred
filtered = df[df.CADD_phred.apply(lambda x: not greater(x, 10.0))]
print(filtered)
Output
POS CHROM CADD_phred
1 20482980 chr1 .
2 20483463 chr1 .
8 109274570 chr1 1.551
9 109274623 chr1 5.142
11 109274857 chr1 6.579
12 109274968 chr1 1.225
14 109275325 chr1 5.841
15 109275506 chr1 3.850
16 109275536 chr1 4.373
17 109275600 chr1 .
20 109275684 chr1 3.067
21 197042891 chr3 .
22 197042926 chr3 5.925
24 197043111 chr3 9.495