How to create top worst sales product?-CodePudding

My idea is merge all product to each day in year to Sales Data. Because I don't have the new product launch date data, so basing on the first order containing the product, I will remove the previous ones. I don't known how to coding it right. Here is the simple data I created:

import pandas as pd
data = [
[ 1/1/21, nan, A, nan, nan ],
[ 1/1/21, nan, B, nan, nan ],
[ 1/1/21, nan, C, nan, nan ],
[ 1/2/21, PO_1, A, 50000, 1],
[ 1/2/21, nan, B, nan, nan ],
[ 1/2/21, nan, C, nan, nan ],
[ 1/3/21, nan, A, nan, nan],
[ 1/3/21, nan, B, nan, nan ],
[ 1/3/21, nan, C, nan, nan ]]
df = pd.DataFrame(data, columns=['order_date', 'po', 'product_code', 'sales', 'qty sold'])
print(df)

Based on the first order of product A (1/2/21), how to delete previous rows containing product A (the first row) and keep rows containing product A after 1/2/21?

CodePudding user response：

IIUC, group by product_code then find rows with valid po and compute cumulative sum. Finally remove, all rows where cumsum equals 0.

Suppose the following dataframe. I slightly modified yours to have another valid value for 'C'

>>> df
  order_date    po product_code    sales  qty sold
0     1/1/21   NaN            A      NaN       NaN  # drop
1     1/1/21   NaN            B      NaN       NaN  # drop
2     1/1/21   NaN            C      NaN       NaN  # drop
3     1/2/21  PO_1            A  50000.0       1.0  # keep
4     1/2/21   NaN            B      NaN       NaN  # drop
5     1/2/21  PO_2            C  10000.0       1.0  # keep
6     1/3/21   NaN            A      NaN       NaN  # keep
7     1/3/21   NaN            B      NaN       NaN  # drop
8     1/3/21   NaN            C      NaN       NaN  # keep

>>> df.loc[df.groupby('product_code', sort=False)['po']
             .apply(lambda x: pd.notna(x).cumsum())
             .loc[lambda x: x > 0].index]

  order_date    po product_code    sales  qty sold
3     1/2/21  PO_1            A  50000.0       1.0
5     1/2/21  PO_2            C  10000.0       1.0
6     1/3/21   NaN            A      NaN       NaN
8     1/3/21   NaN            C      NaN       NaN

Note, assuming your dataframe is sorted by order_date.

CodePudding user response：

From your question, it seems you want to keep NaN values of product codes that appear before A in the dataframe even if they didn't sell (e.g. B and C):

df = df.sort_values(by=['product_code','order_date'])
lst = []
for _, x in df.groupby('product_code'):
    y = x[pd.notna(x['po'])]
    if not y.empty:
        lst.append(x[x.index>=y.index[0]])
    else:
        lst.append(x)
out = pd.concat(lst).sort_index()

Output:

    order_date  po  product_code    sales   qty sold
1   1/1/21  NaN B   NaN NaN
2   1/1/21  NaN C   NaN NaN
3   1/2/21  PO_1    A   50000.0 1.0
4   1/2/21  NaN B   NaN NaN
5   1/2/21  NaN C   NaN NaN
6   1/3/21  NaN A   NaN NaN
7   1/3/21  NaN B   NaN NaN
8   1/3/21  NaN C   NaN NaN