Python - how do I filter data based on suffix number?

I have a table Transactions with a column OrderNumber. The column holds one value per order, in this format:

OrderNumber
-----------
0001
0002
0003
0003-1
0004
0005
0006
0006-1
0006-2
0007
0008
0009
0010
0010-1
0011
0011-1
0011-2
0011-3
0012

Some orders are canceled and recreated, which is where the -1, -2 and -3 suffixes come from. The highest suffix is the latest version (-3 for order 0011 above), and an order can be recreated as many as four or five times (-4, -5).

What I need to do is select only the latest version of each order, as shown below:

OrderNumber
-----------
0001
0002
0003-1
0004
0005
0006-2
0007
0008
0009
0010-1
0011-3
0012

I tried using pandas

import pandas as pd
df = df[df["OrderNumber"].str.contains("-")]

This selects all recreated orders, but if an order was recreated multiple times I need only the latest one.

And if the order was never recreated, I need the original one, as shown in the example above.

How do I do this with pandas?

CodePudding user response:

You can extract the first part of the order numbers, then use duplicated to detect the last order for each number:

# Leading digits = base order number (str.extract(r'([^-]+)') also works)
ordernumbers = df['OrderNumber'].str.extract(r'(\d+)')
# Keep only the last row for each base number
df[~ordernumbers.duplicated(keep='last')]

Output:

   OrderNumber
0         0001
1         0002
3       0003-1
4         0004
5         0005
8       0006-2
9         0007
10        0008
11        0009
13      0010-1
17      0011-3
18        0012
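
Note that duplicated(keep='last') relies on the rows already being ordered so that the latest recreation comes last for each order. If that isn't guaranteed, here is a rough sketch that compares the suffixes numerically instead. The helper name latest_orders, the base/rev group names and the shuffled sample frame are my own, made up for illustration, and it assumes every value is a base part optionally followed by -N:

```python
import pandas as pd


def latest_orders(df, col="OrderNumber"):
    """Return the row with the highest numeric suffix for each base order number."""
    # Split "0011-3" into base="0011" and rev="3"; rev is NaN when there is no suffix.
    # Values that do not match the pattern get a NaN base and are dropped by groupby.
    parts = df[col].str.extract(r"^(?P<base>[^-]+)(?:-(?P<rev>\d+))?$")
    # Treat an order without a suffix as revision 0 and compare numerically,
    # so that -10 ranks above -3.
    rev = parts["rev"].fillna("0").astype(int)
    # Index label of the highest revision within each base number.
    keep = rev.groupby(parts["base"]).idxmax()
    # Restore the original row order.
    return df.loc[sorted(keep)]


# Made-up sample, deliberately out of order.
shuffled = pd.DataFrame(
    {"OrderNumber": ["0006-2", "0001", "0006", "0011-10", "0006-1", "0011-3"]}
)
result = latest_orders(shuffled)
print(result["OrderNumber"].tolist())
```

On data that is already sorted like in the question, this should give the same rows as the duplicated approach; it is just slower because of the groupby.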

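CodePudding user response:

This works for me, provided the data is loaded into a DataFrame with the plain order numbers read as integers, so that only the recreated orders (the ones with a suffix) remain strings.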

lista_df = []

for i in df['OrderNumber']:
    if isinstance(i, str):
        # Suffixed value such as '0003-1': overwrite the previous entry
        lista_df[-1] = i
    else:
        # Plain integer order number: pad with zeros on the left
        lista_df.append(str(i).zfill(4))

new_df = pd.DataFrame(lista_df)
new_df.columns = ['OrderNumber']
print(new_df)

# Output
   OrderNumber
0         0001
1         0002
2       0003-1
3         0004
4         0005
5       0006-2
6         0007
7         0008
8         0009
9       0010-1
10      0011-3
11        0012
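
The loop above only works if the suffixed values are strings and the plain ones are not. A small variation, still assuming the rows are in order, checks for the "-" itself so it doesn't matter how the column was read. The sample column below is made up to mimic the mixed int/string situation, and the variable names are mine:

```python
import pandas as pd

# Hypothetical mixed-type column: plain orders as ints, recreated ones as strings.
df = pd.DataFrame(
    {"OrderNumber": [1, 2, 3, "0003-1", 4, 5, 6, "0006-1", "0006-2", 7]}
)

latest = []
for value in df["OrderNumber"]:
    text = str(value)
    if "-" in text and latest:
        # Recreated order: overwrite the entry for the same base number.
        latest[-1] = text
    else:
        # Original order: pad with zeros on the left to four digits.
        latest.append(text.zfill(4))

new_df = pd.DataFrame({"OrderNumber": latest})
print(new_df["OrderNumber"].tolist())
```

The same loop gives the same result when every value is already a string like "0003", since zfill(4) leaves a four-character string unchanged.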