I have the below df:
ID Number Number 2 Number 3
1 10001 NaN NaN 5
2 10001 25 NaN 12
3 10001 78 4 NaN
4 10002 3 NaN NaN
5 10002 234 201 NaN
6 10002 NaN 510 34
7 10003 NaN 765 NaN
8 10003 NaN 422 NaN
9 10003 NaN 753 56
10 10003 231 7587 2345
I want to restructure the data so that the leading NaN values are removed from each column (within each ID).
Resulting df:
ID Number Number 2 Number 3
1 10001 25 4 5
2 10001 78 NaN 12
3 10001 NaN NaN NaN
4 10002 3 201 34
5 10002 234 510 NaN
6 10002 NaN NaN NaN
7 10003 231 765 56
8 10003 NaN 422 2345
9 10003 NaN 753 NaN
10 10003 NaN 7587 NaN
I'm essentially trying to shift each column's data up by n rows, depending on where the data starts for that column, so that in the first rows of each ID there is always data in at least one of the Number columns.
I've tried first_row_index, but this doesn't work per individual column. I've tried dropna, but I can't find a way to define how many rows to drop per column.
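For reference, a minimal sketch that rebuilds the sample frame above (the column names Number, Number 2 and Number 3 come from the printout; the 1-based index is an assumption):

import pandas as pd
import numpy as np

# Rebuild the example frame shown above; NaN marks the empty cells
df = pd.DataFrame({
    'ID':       [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
    'Number':   [np.nan, 25, 78, 3, 234, np.nan, np.nan, np.nan, np.nan, 231],
    'Number 2': [np.nan, np.nan, 4, np.nan, 201, 510, 765, 422, 753, 7587],
    'Number 3': [5, 12, np.nan, np.nan, np.nan, 34, np.nan, np.nan, 56, 2345],
}, index=range(1, 11))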
CodePudding user response:
# Long format, drop the NaN cells, then renumber each (column, ID) group from 0
df1 = df.melt('ID').dropna()
df1['var1'] = df1.groupby(['variable', 'ID']).cumcount()
# Pivot back to wide format; recent pandas requires keyword arguments for pivot
df1.pivot(index=['ID', 'var1'], columns='variable', values='value').reset_index(0)
variable ID Number Number 2 Number 3
var1
0 10001 25.0 4.0 5.0
1 10001 78.0 NaN 12.0
0 10002 3.0 201.0 34.0
1 10002 234.0 510.0 NaN
0 10003 231.0 765.0 56.0
1 10003 NaN 422.0 2345.0
2 10003 NaN 753.0 NaN
3 10003 NaN 7587.0 NaN
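Note that this result only keeps as many rows per ID as that ID has non-NaN values, so the trailing all-NaN rows from the desired output are gone. A hedged sketch of one way to pad them back, assuming pandas is imported as pd and the original df and the df1 from above are still around:

res = df1.pivot(index=['ID', 'var1'], columns='variable', values='value').reset_index(0)
sizes = df.groupby('ID').size()                   # original number of rows per ID
res = pd.concat(
    [g.reindex(range(sizes[i])).assign(ID=i)      # reindex pads the missing rows with NaN
     for i, g in res.groupby('ID')],
    ignore_index=True,
)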
CodePudding user response:
EDIT: I missed the ID condition. You need to use groupby in that case.
# As a one-liner
df.groupby('ID').apply(lambda x: x.reset_index().apply(lambda y: y.shift(-y.first_valid_index())).set_index('index'))
Explanation:
- Group the dataframe by ID
- The first apply (the one with lambda x) receives each grouped dataframe as its parameter; reset_index() makes the grouped dataframe start at index 0 (otherwise it keeps the index from the whole dataframe)
- The second apply (the one with lambda y) receives each column of the grouped dataframe as its parameter
- Get the first_valid_index of each column and shift that column upwards by it
- As the reset index is now useless, we set the index column back as the index
ID Number Number 2 Number 3
0 10001.0 25.0 4.0 5.0
1 10001.0 78.0 NaN 12.0
2 10001.0 NaN NaN NaN
3 10002.0 3.0 201.0 34.0
4 10002.0 234.0 510.0 NaN
5 10002.0 NaN NaN NaN
6 10003.0 231.0 765.0 56.0
7 10003.0 NaN 422.0 2345.0
8 10003.0 NaN 753.0 NaN
9 10003.0 NaN 7587.0 NaN
The original answer, without the ID grouping:
df.apply(lambda x: x.shift(-x.first_valid_index()))
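A small hedged illustration of why the reset_index matters: first_valid_index returns an index label, while shift expects a number of positions, so inside a group cut from the full frame the two only agree once the index is reset. Using the Number column of ID 10003 from the example:

import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan, np.nan, 231], index=[7, 8, 9, 10])  # 'Number' for ID 10003
s.first_valid_index()                          # 10 -- an index label, not a position
s.reset_index(drop=True).first_valid_index()   # 3  -- the positional offset shift() needs
s.reset_index(drop=True).shift(-3)             # 231.0, NaN, NaN, NaN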
CodePudding user response:
Using a stacked version of the dataframe, then shifting by the number of leading NaNs per (ID, column) group:
(df.set_index('ID', append=True).stack(dropna=False)       # long format, index: (row, ID, column)
 .groupby(level=[1, 2])                                     # one group per (ID, column)
 .apply(lambda s: s.shift(-(~s.notna().cummax()).sum()))    # shift up by the leading-NaN count
 .unstack(-1)                                               # back to one column per Number column
 .reset_index()
)
Output:
ID Number Number 2 Number 3
1 10001 25.0 4.0 5
2 10001 78.0 NaN 12
3 10001 NaN NaN NaN
4 10002 3.0 201.0 34
5 10002 234.0 510.0 NaN
6 10002 NaN NaN NaN
7 10003 231.0 765.0 56
8 10003 NaN 422.0 2345
9 10003 NaN 753.0 NaN
10 10003 NaN 7587.0 NaN
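The shift amount inside the lambda is just the count of leading NaNs: notna().cummax() is False until the first value and True from there on, so inverting it and summing counts the leading gap. A quick hedged illustration on the Number column of ID 10003:

import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan, np.nan, 231.0])
s.notna()                      # False, False, False, True
s.notna().cummax()             # False, False, False, True
(~s.notna().cummax()).sum()    # 3 -> number of leading NaNs, i.e. how far to shift up
s.shift(-3)                    # 231.0, NaN, NaN, NaN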
CodePudding user response:
Here is another approach; with it the first column will always have a value, and for any ID only the last row will have NaNs, if at all.
# melt to make it a single column, so we can drop all the NaN cells/rows
df2 = df.melt('ID').dropna(axis=0)
# count the number of values for an ID
df2['ID_Count'] = df2.groupby(['ID']).cumcount()
# group the result into sets of 3, since we have three columns: Number, Number 2, Number 3
df2['new_var'] = df2['ID_Count'] // 3
# generate a new column name
df2['new_var_group'] = 'Number_' + df2.groupby(['ID', 'new_var']).cumcount().astype(str)
# finally reverse the melt and generate a table shaped like the original
df2 = df2.pivot_table(index=['ID', 'new_var'], columns='new_var_group', values='value').reset_index().drop(columns='new_var')
df2
new_var_group ID Number_0 Number_1 Number_2
0 10001 25.0 78.0 4.0
1 10001 5.0 12.0 NaN
2 10002 3.0 234.0 201.0
3 10002 510.0 34.0 NaN
4 10003 231.0 765.0 422.0
5 10003 753.0 7587.0 56.0
6 10003 2345.0 NaN NaN
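Worth noting about this approach: because melt flattens column by column, each ID's non-NaN values are repacked three per row in that order, so they no longer line up with their original Number columns (compare the 10001 rows above with the desired output). A quick hedged check of the ordering that drives the regrouping, assuming the df from the question:

df.melt('ID').dropna().query('ID == 10001')['value'].tolist()
# [25.0, 78.0, 4.0, 5.0, 12.0] -> packed into rows as [25, 78, 4] and [5, 12]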