I try to separate a Dataframe based on rows with a certain value in multiple columns, so that the original Dataframe is split in two with all rows containing the value in one Dataframe and the other Dataframe with the residual rows.
df = pd.DataFrame(np.random.randint(-1,100,size=(100, 4)), columns=list('ABCD'))
df
A B C D
0 51 86 15 80
1 61 53 75 66
2 80 48 23 58
3 86 25 37 99
4 50 11 87 71
... ... ... ... ...
95 34 40 43 40
96 89 16 83 72
97 97 32 24 26
98 27 83 75 29
99 24 50 40 43
100 rows × 4 columns
df[~df.isin([-1])].dropna()
A B C D
0 51 86 15 80.0
1 61 53 75 66.0
2 80 48 23 58.0
3 86 25 37 99.0
4 50 11 87 71.0
... ... ... ... ...
95 34 40 43 40.0
96 89 16 83 72.0
97 97 32 24 26.0
98 27 83 75 29.0
99 24 50 40 43.0
98 rows × 4 columns
df[df.isin([-1])].dropna()
A B C D
is what i tried so far and the first part worked correctly. However df[df.isin([-1])].dropna()
failed.
CodePudding user response:
Possible solution is the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(-1,100,size=(100, 4)), columns=list('ABCD'))
df
# df1 = df[~df.isin([80])].dropna().reset_index(drop=True)
# or
df1 = df[~df.eq(80).any(1)].reset_index(drop=True)
df1
df2 = df[df.eq(80).any(1)].reset_index(drop=True)
df2
CodePudding user response:
Your code is almost correct. Use any(axis=1)
to keep only one boolean value for each row instead of using dropna(how='all')
The same with a reproducible example:
import pandas as pd
import numpy as np
np.random.seed(2022)
vals = np.random.choice([-1, 0, 1], size=(10, 4), p=[.2, .4, .4])
df = pd.DataFrame(vals, columns=list('ABCD'))
m = df.isin([-1]).any(axis=1) # or df.eq(-1).any(axis=1)
df1, df2 = df[m], df[~m]
Output:
>>> df.assign(M=m)
A B C D M
0 -1 0 -1 -1 True
1 1 0 1 1 False
2 1 1 1 1 False
3 1 1 0 0 False
4 0 1 1 -1 True
5 1 0 0 1 False
6 -1 0 1 0 True
7 0 0 0 0 False
8 1 -1 1 0 True
9 1 1 0 1 False
>>> df1
A B C D
0 -1 0 -1 -1
4 0 1 1 -1
6 -1 0 1 0
8 1 -1 1 0
>>> df2
A B C D
1 1 0 1 1
2 1 1 1 1
3 1 1 0 0
5 1 0 0 1
7 0 0 0 0
9 1 1 0 1