My idea is merge all product to each day in year to Sales Data. Because I don't have the new product launch date data, so basing on the first order containing the product, I will remove the previous ones. I don't known how to coding it right. Here is the simple data I created:
import pandas as pd
data = [
[ 1/1/21, nan, A, nan, nan ],
[ 1/1/21, nan, B, nan, nan ],
[ 1/1/21, nan, C, nan, nan ],
[ 1/2/21, PO_1, A, 50000, 1],
[ 1/2/21, nan, B, nan, nan ],
[ 1/2/21, nan, C, nan, nan ],
[ 1/3/21, nan, A, nan, nan],
[ 1/3/21, nan, B, nan, nan ],
[ 1/3/21, nan, C, nan, nan ]]
df = pd.DataFrame(data, columns=['order_date', 'po', 'product_code', 'sales', 'qty sold'])
print(df)
Based on the first order of product A (1/2/21), how to delete previous rows containing product A (the first row) and keep rows containing product A after 1/2/21?
CodePudding user response:
IIUC, group by product_code
then find rows with valid po
and compute cumulative sum. Finally remove, all rows where cumsum equals 0.
Suppose the following dataframe. I slightly modified yours to have another valid value for 'C'
>>> df
order_date po product_code sales qty sold
0 1/1/21 NaN A NaN NaN # drop
1 1/1/21 NaN B NaN NaN # drop
2 1/1/21 NaN C NaN NaN # drop
3 1/2/21 PO_1 A 50000.0 1.0 # keep
4 1/2/21 NaN B NaN NaN # drop
5 1/2/21 PO_2 C 10000.0 1.0 # keep
6 1/3/21 NaN A NaN NaN # keep
7 1/3/21 NaN B NaN NaN # drop
8 1/3/21 NaN C NaN NaN # keep
>>> df.loc[df.groupby('product_code', sort=False)['po']
.apply(lambda x: pd.notna(x).cumsum())
.loc[lambda x: x > 0].index]
order_date po product_code sales qty sold
3 1/2/21 PO_1 A 50000.0 1.0
5 1/2/21 PO_2 C 10000.0 1.0
6 1/3/21 NaN A NaN NaN
8 1/3/21 NaN C NaN NaN
Note, assuming your dataframe is sorted by order_date
.
CodePudding user response:
From your question, it seems you want to keep NaN
values of product codes that appear before A
in the dataframe even if they didn't sell (e.g. B
and C
):
df = df.sort_values(by=['product_code','order_date'])
lst = []
for _, x in df.groupby('product_code'):
y = x[pd.notna(x['po'])]
if not y.empty:
lst.append(x[x.index>=y.index[0]])
else:
lst.append(x)
out = pd.concat(lst).sort_index()
Output:
order_date po product_code sales qty sold
1 1/1/21 NaN B NaN NaN
2 1/1/21 NaN C NaN NaN
3 1/2/21 PO_1 A 50000.0 1.0
4 1/2/21 NaN B NaN NaN
5 1/2/21 NaN C NaN NaN
6 1/3/21 NaN A NaN NaN
7 1/3/21 NaN B NaN NaN
8 1/3/21 NaN C NaN NaN