My data: 62 people have 400 data points, some have 200 (and 200 NaN rows).
(Rows = 400, Columns = 62)
I have to drop every other value for the individuals that have 400 data points, so that everyone has 200. I have keep the first row (index 0), drop the second (index 1), keep the third row (index 2), and so on. I cannot just drop all the rows with Nan, because then they wouldn't be aligned.
p1 p2 p3 p4
0 0.0 1.0 0.0 1.0
1 2.0 3.0 0.3 0.4
2 4.0 5.0 NaN NaN
3 6.0 7.0 NaN NaN
Result
p1 p2 p3 p4
0 0.0 1.0 0.0 1.0
1 4.0 5.0 0.3 0.4
CodePudding user response:
New answer:
Given the input, we could select the even numbered rows from the columns without NaN, reset its index and concatenate it with the columns with NaNs, after dropping their NaNs:
msk = df.isna().any()
pd.concat([df.loc[::2, ~msk].reset_index(drop=True), df.loc[:, msk].dropna()], axis=1)
Output:
p1 p2 p3 p4
0 0.0 1.0 0.0 1.0
1 4.0 5.0 0.3 0.4
Old answer:
IIUC, you have a DataFrame that looks like below:
p1 p2 p3 p4
0 0.0 1.0 0.0 1.0
1 2.0 3.0 NaN NaN
2 4.0 5.0 4.0 NaN
3 6.0 7.0 NaN 7.0
4 8.0 9.0 8.0 9.0
As you noted, if we use df.dropna()
, we'll remove most of the observations. IIUC, your approach would not get rid of all the NaN values. I propose an alternative approach, one that shifts all non-NaN values up in each column:
n = 3 # for your case n=200; this is the number of non-NaN values in columns with NaN
out = (df.stack()
.groupby(level=1).sample(n)
.reset_index(name='val')
.assign(i=lambda x: x.groupby('level_1').cumcount())
.pivot('i', 'level_1', 'val')
.reset_index(drop=True).rename_axis(columns=[None]))
Output:
p1 p2 p3 p4
0 2.0 3.0 8.0 9.0
1 4.0 5.0 4.0 1.0
2 8.0 1.0 0.0 7.0