I have a grouped dataframe like so:
│ product │ category
│ spot1 spot2 spot3 │ spot1 spot2 spot3
──────────┼───────────────────────────┼─────────────────────────────
basket 1 │ NaN apple banana │ NaN fruits fruits
basket 2 │ almond carrot NaN │ nuts veggies NaN
One row represents a "basket" containing different food products (vegtables, fruits, nuts).
Each basket has 3 spots that can either contain a food product or not (=NaN).
I would like the first column of group product
to be as populated as possible. That means if there is a NaN value in the first column of the product group and some value in the 2nd or n-th column if should shift to the left for each group.
Categories are related: in the example above a baskets'
spot1
of group product
and spot1
of group category
belong together. Every data combination must have a value for product. If product is NaN then all the related items will be NaN as well.
The output should look something like:
│ product │ category
│ spot1 spot2 spot3 │ spot1 spot2 spot3
──────────┼───────────────────────────┼─────────────────────────────
basket 1 │ apple banana NaN │ fruits fruits NaN <-- this row shifted to left to "fill" first spot of product group
basket 2 │ almond carrot NaN │ nuts veggies NaN
jezrael's answer here was a good starting point for me:
#for each row remove NaNs and create new Series - rows in final df
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
#if possible different number of columns like original df is necessary reindex
df1 = df1.reindex(columns=range(len(df.columns)))
#assign original columns names
df1.columns = df.columns
print (df1)
However, this solution ignores grouping. I only want values to shift left based on the specific group product
.
edit / minimal reproducible example
please use this code to get to the "starting point" of problem. The way I get to this point in my production code is more complex but this should do fine.
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[1, 'NaN','NaN'], [1, 'apple','fruits'], [1,'banana', 'fruits'], [2, 'carrot','veggies'], [2, 'almond','nuts']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['basket','product', 'category'])
# print dataframe.
df
dfg = df.groupby(['basket', df.groupby(['basket']).cumcount() 1]).first().unstack().reset_index()
print(dfg)
CodePudding user response:
I trust there is an easier way to accomplish this, but the following should work.
Setup
import pandas as pd
import numpy as np
# make sure that `NaNs` are *actual* `NaNs`, e.g. not "NaN" (this is a string
# like any other); or use `[1, None, None]`
data = [[1, np.nan,np.nan],
[1, 'apple','fruits'],
[1,'banana', 'fruits'],
[2, 'carrot','veggies'],
[2, 'almond','nuts']]
df = pd.DataFrame(data, columns=['basket','product', 'category'])
dfg = df.groupby(['basket', df.groupby(['basket']).cumcount()
1]).first().unstack().reset_index()
print(dfg)
basket product category
1 2 3 1 2 3
0 1 None apple banana None fruits fruits
1 2 carrot almond NaN veggies nuts NaN
Code
dfg.set_index('basket',drop=True, inplace=True)
out = dfg.unstack().dropna()
out.index = pd.MultiIndex.from_arrays([
out.index.get_level_values(0),
out.groupby(level=[2,0]).cumcount().add(1).to_numpy(),
out.index.get_level_values(2)])
out = out.reset_index(drop=False).pivot(index='basket',
columns=['level_0','level_1'],
values=0)\
.reindex(dfg.columns, axis='columns').reset_index(drop=False)
print(out)
basket product category
1 2 3 1 2 3
0 1 apple banana NaN fruits fruits NaN
1 2 carrot almond NaN veggies nuts NaN
Explanation
- First, we use
df.unstack()
withSeries.dropna
to get a Series with aMultiIndex
that consists ofcol level 0
,col level 1
,index
. I.e.:
out = df.unstack().dropna()
print(out.head(4))
basket
product 1 2 carrot
2 1 apple
2 almond
3 1 banana
- Next, we use
df.groupby
on levels0,2
(i.e. originalcol level 0
andbasket
), and we usecumcount
to get consecutive numbers for the items in each group (adding 1:add(1)
. I.e. we are doing:
print(out.groupby(level=[2,0]).cumcount().add(1).to_numpy())
[1 1 2 2 1 1 2 2]
- We use this result inside
pd.MultiIndex.from_arrays
to overwrite theMultiIndex
(specificallylevel 1
) with a new index. I.e. we now have:
print(out.head(4))
basket
product 1 2 carrot
1 apple
2 2 almond
1 banana
- Now, finally, we can reset the index and use
df.pivot
to change the shape ofout
so that it matches the shape of the originaldf
. Chainingdf.reindex
applied to the columns will both reset the order of the columns and add all the missing columns (e.g.3
atcol level 1
for both values incol level 0
, and they will be automatically filled with NaNs.