I have a table Transactions with a column OrderNumber. The column holds one value per order, in this format:
OrderNumber
-----------
0001
0002
0003
0003-1
0004
0005
0006
0006-1
0006-2
0007
0008
0009
0010
0010-1
0011
0011-1
0011-2
0011-3
0012
Some orders get canceled and recreated, which is where the -1, -2 and -3 suffixes come from. The highest suffix is the latest version (for 0011 that is -3), and an order can be recreated up to -4 or even -5.
What I need to do is select only the latest version of each order, as shown below:
OrderNumber
-----------
0001
0002
0003-1
0004
0005
0006-2
0007
0008
0009
0010-1
0011-3
0012
I tried this with pandas:
import pandas as pd
df = df[df["OrderNumber"].str.contains("-")]
which selects all recreated orders, but if an order was recreated multiple times I only need the latest one.
And if the order was never recreated, I need the original one, as in the example above.
How do I do this with pandas?
CodePudding user response:
You can extract the first part of the order numbers, then use duplicated
to detect the last order for each number:
# an alternative pattern also works: str.extract(r'([^-]+)')
ordernumbers = df['OrderNumber'].str.extract(r'(\d+)')
# keep only the last row for each base order number
df[~ordernumbers.duplicated(keep='last')]
Output:
OrderNumber
0 0001
1 0002
3 0003-1
4 0004
5 0005
8 0006-2
9 0007
10 0008
11 0009
13 0010-1
17 0011-3
18 0012
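For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt by hand from the sample in the question, and the names base and latest are only illustrative; it also assumes the rows are already ordered so that the newest version of an order comes last.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"OrderNumber": [
    "0001", "0002", "0003", "0003-1", "0004", "0005", "0006", "0006-1",
    "0006-2", "0007", "0008", "0009", "0010", "0010-1", "0011", "0011-1",
    "0011-2", "0011-3", "0012",
]})

# Base order number: the leading digits before any "-N" suffix
base = df["OrderNumber"].str.extract(r"^(\d+)", expand=False)

# Mark every row except the last one seen for each base number, then drop them
latest = df[~base.duplicated(keep="last")]
print(latest)
```

Passing expand=False makes str.extract return a Series rather than a one-column DataFrame; duplicated works on either, so this is a matter of taste.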
CodePudding user response:
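This one works for me, provided the column was read so that plain order numbers are integers and only the suffixed ones are strings. It also relies on each recreated order coming right after its original.
lista_df = []
for i in df['OrderNumber']:
    if isinstance(i, str):
        # A suffixed value such as "0003-1" replaces the previous entry
        lista_df[-1] = i
    else:
        lista_df.append(str(i).zfill(4))  # Put zeros to the left
new_df = pd.DataFrame(lista_df)
new_df.columns = ['OrderNumber']
print(new_df)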
# Output
OrderNumber
0 0001
1 0002
2 0003-1
3 0004
4 0005
5 0006-2
6 0007
7 0008
8 0009
9 0010-1
10 0011-3
11 0012
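Both answers above depend on the rows being in order. If that cannot be guaranteed, one possible variation is to parse the suffix as a number and pick the highest one per order. This is only a sketch: the small shuffled sample and the names parts, base, rev and latest are made up for illustration, and it assumes every value looks like digits optionally followed by -N.

```python
import pandas as pd

# Hypothetical shuffled sample, including a two-digit suffix
df = pd.DataFrame({"OrderNumber": ["0006-2", "0006", "0006-10", "0007", "0006-1"]})

# Split each value into a base number and an optional numeric suffix
parts = df["OrderNumber"].str.extract(r"^(?P<base>\d+)(?:-(?P<rev>\d+))?$")

# Orders without a suffix count as revision 0
parts["rev"] = parts["rev"].fillna("0").astype(int)

# For each base number, find the index label of the row with the highest revision
latest_idx = parts.groupby("base")["rev"].idxmax()
latest = df.loc[latest_idx].sort_index()
print(latest)
```

Comparing the suffix as an integer means -10 ranks above -2, which a plain string comparison would get wrong.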