Python Pandas: Drop Rows Based on a Recurring Value From Previous Rows


I'm at my wits' end here. Any help would be very much appreciated!

The product changed from A.1 to A.2. What I want is that once A.1 turns into A.2, the rest of my data should only count cycle time for A.2 (i.e., any later A.1 rows should be dropped). The same goes for B.1 and B.3.

Here is the current data.

| Product | CycleTime |
|---------|-----------|
| A.1     | 10        |
| A.1     | 5         |
| A.1     | 1         |
| A.1     | 2         |
| A.1     | 1         |
| A.2     | 5         |
| A.2     | 1         |
| A.1     | 2         |
| A.1     | 10        |
| A.2     | 10        |
| B.1     | 1         |
| B.1     | 2         |
| B.1     | 1         |
| B.3     | 5         |
| B.1     | 1         |
| B.3     | 2         |
| B.1     | 10        |

Here is what I'm trying to produce.

| Product | CycleTime |
|---------|-----------|
| A.1     | 10        |
| A.1     | 5         |
| A.1     | 1         |
| A.1     | 2         |
| A.1     | 1         |
| A.2     | 5         |
| A.2     | 1         |
| A.2     | 10        |
| B.1     | 1         |
| B.1     | 2         |
| B.1     | 1         |
| B.3     | 5         |
| B.3     | 2         |

CodePudding user response:

setup:

import pandas as pd

df = pd.DataFrame(
    {
        "ProductType": ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"],
        "Product": ["A.1", "A.1", "A.1", "A.1", "A.1", "A.2", "A.2", "A.1", "A.1", "A.2", "B.1", "B.1", "B.1", "B.3", "B.1", "B.3", "B.1"],
        "CycleTime": [10, 5, 1, 2, 1, 5, 1, 2, 10, 10, 1, 2, 1, 5, 1, 2, 10],
    }
)
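
The question's data only has `Product` and `CycleTime`; the `ProductType` column in this setup can be derived from the product code itself (the second answer below uses the same idea inline). A minimal sketch, assuming the type is everything before the dot:

    # Assumption: the product type is the prefix before the ".", e.g. "A.1" -> "A".
    df["ProductType"] = df["Product"].str.split(".").str[0]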

Create an order between products. The order between different product types does not matter, only the order within each product type: in effect you take several partial orderings and stitch them together into one total ordering. You may have to do this manually, or you may be able to do it programmatically, depending on your problem (see the sketch after the hard-coded list below).

order = ["A.1", "A.2", "B.1", "B.2", "B.3"]
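
If the version numbers sort naturally, the order can also be built programmatically rather than hard-coded; a rough sketch (my assumption, not part of the original answer) that sorts the observed products by type and numeric version:

    # Assumption: every product is named "<type>.<version>" and a higher
    # version number always means a newer product.
    order = sorted(
        df["Product"].unique(),
        key=lambda p: (p.split(".")[0], int(p.split(".")[1])),
    )
    # With the sample data this yields ['A.1', 'A.2', 'B.1', 'B.3'].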

Create a map from each position in the order to its product, and the inverse map, which associates each product with its position in the order.

mapping = dict(enumerate(order))
inversemapping = {v: k for k, v in mapping.items()}

The idea is, for each product type, to map the product to its order value, and calculate the cumulative maximum of these order values, before mapping back to the products.

df.groupby("ProductType").apply(lambda d: d["Product"].map(inversemapping).cummax().map(mapping)).values

You will get the following numpy array:

array(['A.1', 'A.1', 'A.1', 'A.1', 'A.1', 'A.2', 'A.2', 'A.2', 'A.2',
       'A.2', 'B.1', 'B.1', 'B.1', 'B.3', 'B.3', 'B.3', 'B.3'],
      dtype=object)
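
The answer stops at this relabelled array. If the goal is the filtered table from the question (dropping rows whose product has already been superseded), a minimal sketch building on the same mappings might look like this; the `Current` column name and the use of `group_keys=False` are my additions, not part of the original answer:

    # Relabel each row with the "current" product (the cumulative maximum in
    # the order), then keep only rows that already carry that label.
    # group_keys=False keeps the original index so the result aligns with df.
    df["Current"] = (
        df.groupby("ProductType", group_keys=False)
          .apply(lambda d: d["Product"].map(inversemapping).cummax().map(mapping))
    )
    out = df.loc[df["Product"] == df["Current"], ["Product", "CycleTime"]]
    print(out)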

CodePudding user response:

If you drop duplicates and keep the first values, the index of the next product shows you from which row onwards the current product is no longer valid:

find_pos = lambda x: x.drop_duplicates('Product')['Product'].shift().dropna() \
          .rename_axis('not_valid_after').reset_index()

conds = df.groupby(df['Product'].str.split('.').str[0]) \
          .apply(find_pos).reset_index(drop=True)

print(conds)

# Output:
   not_valid_after Product
0                5     A.1  # 5 is the first index of A.2
1               13     B.1  # 13 is the first index of B.3

Now you can filter your dataframe:

out = df.drop(conds.apply(lambda x: df.loc[df['Product'] == x['Product']]
                                      .loc[x['not_valid_after']:].index.tolist(),
                           axis=1).explode().dropna().tolist())

Output:

>>> out
   Product  CycleTime
0      A.1         10
1      A.1          5
2      A.1          1
3      A.1          2
4      A.1          1
5      A.2          5
6      A.2          1
9      A.2         10
10     B.1          1
11     B.1          2
12     B.1          1
13     B.3          5
15     B.3          2
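
Note that `out` keeps the original row labels (7, 8, 14 and 16 were dropped); call `out.reset_index(drop=True)` if you want a contiguous 0-based index afterwards.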