How can I handle my dataset (insert a column that depends on existing values) in Python?

Time:12-22

My dataset is like this:

ExecutionTime Code Amount
09102021D081020 HUUSNJUNJ 500000
09102021D081020 HUNSKMWKKS 500000
09102021D093042 NAUWJMKMS 498333
09102021D093042 SJIAJIJIJI 498333
09102021D093042 BHDYWHWHY 498333
09102021D081020 JHWJDNJWND 454555
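For reference, the sample above can be reconstructed as a pandas DataFrame (column names taken from the header row; this is only a sketch of the data shown, not the real dataset):

```python
import pandas as pd

# Rebuild the sample dataset shown above
df = pd.DataFrame({
    "ExecutionTime": [
        "09102021D081020", "09102021D081020",
        "09102021D093042", "09102021D093042", "09102021D093042",
        "09102021D081020",
    ],
    "Code": [
        "HUUSNJUNJ", "HUNSKMWKKS",
        "NAUWJMKMS", "SJIAJIJIJI", "BHDYWHWHY",
        "JHWJDNJWND",
    ],
    "Amount": [500000, 500000, 498333, 498333, 498333, 454555],
})
```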

Then I want to detect rows that share the same ExecutionTime: when 2 rows have the same ExecutionTime it is Type 1, when 3 rows have the same ExecutionTime it is Type 2, and I want one such row per group in a new dataframe like this:

ExecutionTime Code Type of Operation
09102021D081020 HUUSNJUNJ Type 1
09102021D093042 NAUWJMKMS Type 2

Any ideas and suggestions are welcome! Thanks!

CodePudding user response:

IIUC, you can use groupby on "ExecutionTime" and "Amount" and transform the count; then, with a helper dictionary, map the counts to labels and aggregate first with groupby on ExecutionTime:

# Size of each (ExecutionTime, Amount) group, broadcast back to every row
s = df.groupby(["ExecutionTime", "Amount"])['ExecutionTime'].transform('count')
d = {2: 'Type 1', 3: 'Type 2'}  # expand your dict as necessary
# Map counts to labels, then keep the first Code/Type per ExecutionTime
out = (df.assign(Type=s.map(d))
         .groupby('ExecutionTime', as_index=False)[['Code', 'Type']].first())

print(out)

     ExecutionTime       Code    Type
0  09102021D081020  HUUSNJUNJ  Type 1
1  09102021D093042  NAUWJMKMS  Type 2
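Note that counts with no entry in the helper dict map to NaN, and groupby(...).first() returns the first non-null value per column, so the single-row group (JHWJDNJWND) does not leak a missing Type into the result. A quick check of the intermediate Series, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ExecutionTime": ["09102021D081020"] * 2 + ["09102021D093042"] * 3 + ["09102021D081020"],
    "Code": ["HUUSNJUNJ", "HUNSKMWKKS", "NAUWJMKMS", "SJIAJIJIJI", "BHDYWHWHY", "JHWJDNJWND"],
    "Amount": [500000, 500000, 498333, 498333, 498333, 454555],
})

# Per-row size of each (ExecutionTime, Amount) group
s = df.groupby(["ExecutionTime", "Amount"])["ExecutionTime"].transform("count")
print(s.tolist())           # [2, 2, 3, 3, 3, 1]

# Counts without a dict entry (here: 1) become NaN after mapping
typ = s.map({2: "Type 1", 3: "Type 2"})
print(typ.isna().tolist())  # only the last row (count 1) is NaN
```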

CodePudding user response:

Assuming you want to group by ExecutionTime and Amount and only keep the groups with 2 or 3 rows: use groupby with a custom function and a defaultdict, then dropna.

from collections import defaultdict

# Group sizes without an explicit entry map to None (dropped below)
maps = defaultdict(lambda: None, {2: 'Type 1', 3: 'Type 2'})

(df.groupby(['ExecutionTime', 'Amount'])['Code']
   .agg([('Code', 'first'), ('Type of operation', lambda s: maps[s.count()])])
   .dropna(subset=['Type of operation'])  # discard groups of other sizes
   .droplevel(1)                          # drop the Amount index level
)

output:

                      Code Type of operation
ExecutionTime                               
09102021D081020  HUUSNJUNJ            Type 1
09102021D093042  NAUWJMKMS            Type 2
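The defaultdict is what makes the filtering work: for any group size without an entry (e.g. the 1-row group for JHWJDNJWND) it returns None, which becomes NaN in the aggregated column and is removed by dropna. A runnable sketch, assuming the sample data from the question:

```python
from collections import defaultdict
import pandas as pd

df = pd.DataFrame({
    "ExecutionTime": ["09102021D081020"] * 2 + ["09102021D093042"] * 3 + ["09102021D081020"],
    "Code": ["HUUSNJUNJ", "HUNSKMWKKS", "NAUWJMKMS", "SJIAJIJIJI", "BHDYWHWHY", "JHWJDNJWND"],
    "Amount": [500000, 500000, 498333, 498333, 498333, 454555],
})

# Sizes 2 and 3 get a label; anything else maps to None
maps = defaultdict(lambda: None, {2: "Type 1", 3: "Type 2"})

out = (df.groupby(["ExecutionTime", "Amount"])["Code"]
         .agg([("Code", "first"), ("Type of operation", lambda s: maps[s.count()])])
         .dropna(subset=["Type of operation"])  # drops the 1-row group
         .droplevel(1))
print(out)
```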

CodePudding user response:

Try this:

new_df = (
    df.assign(
        **{'Type of Operation': (
            df.groupby(
                # Start a new group whenever ExecutionTime changes between rows
                df.ExecutionTime.ne(
                    df.ExecutionTime.shift(1)
                ).cumsum()
            )['Code']
            # Label each run of rows by its size minus one
            .transform(lambda x: f'Type {x.shape[0] - 1}')
        )}
    )
    .drop_duplicates(['Type of Operation'])
    .reset_index(drop=True)
)

Output:

>>> new_df
     ExecutionTime        Code  Amount Type of Operation
0  09102021D081020   HUUSNJUNJ  500000            Type 1
1  09102021D093042   NAUWJMKMS  498333            Type 2
2  09102021D081020  JHWJDNJWND  454555            Type 0

Compact version:

df.assign(**{'Type of Operation':df.groupby(df.ExecutionTime.ne(df.ExecutionTime.shift(1)).cumsum())['Code'].transform(lambda x: f'Type {x.shape[0] - 1}')}).drop_duplicates(['Type of Operation']).reset_index(drop=True)
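One caveat worth noting about this approach: df.ExecutionTime.ne(df.ExecutionTime.shift()).cumsum() groups consecutive runs of equal ExecutionTime, not all equal values. The last row here (JHWJDNJWND) shares its ExecutionTime with the first two rows but is not adjacent to them, so it forms its own run and gets "Type 0". A sketch contrasting the two groupings, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ExecutionTime": ["09102021D081020"] * 2 + ["09102021D093042"] * 3 + ["09102021D081020"],
    "Code": ["HUUSNJUNJ", "HUNSKMWKKS", "NAUWJMKMS", "SJIAJIJIJI", "BHDYWHWHY", "JHWJDNJWND"],
    "Amount": [500000, 500000, 498333, 498333, 498333, 454555],
})

# Run-based grouping: a new group id starts whenever ExecutionTime changes
runs = df.ExecutionTime.ne(df.ExecutionTime.shift()).cumsum()
print(runs.tolist())  # [1, 1, 2, 2, 2, 3] -> three runs

# Value-based grouping: all equal ExecutionTimes fall into one group
print(df.groupby("ExecutionTime").ngroups)  # 2
```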