How can I assign a tag to the smallest in one group, second smallest in another group and third smal-CodePudding

I have the below data frame,

ID	Group	Date_Time_1	Date_Time_2	Difference	New_Column
123	A	14-10-2021 15:19	14-10-2021 15:32	13	First
123	A	14-10-2021 15:19	14-10-2021 15:36	17	null
123	A	14-10-2021 15:19	14-10-2021 15:37	18	null
123	A	14-10-2021 15:19	14-10-2021 16:29	70	null
123	A	14-10-2021 15:19	14-10-2021 17:04	105	null
123	B	14-10-2021 15:21	14-10-2021 15:32	11	null
123	B	14-10-2021 15:21	14-10-2021 15:36	15	Second
123	B	14-10-2021 15:21	14-10-2021 15:37	16	null
123	B	14-10-2021 15:21	14-10-2021 16:29	68	null
123	B	14-10-2021 15:21	14-10-2021 17:04	103	null
123	C	14-10-2021 15:22	14-10-2021 15:32	10	null
123	C	14-10-2021 15:22	14-10-2021 15:36	14	null
123	C	14-10-2021 15:22	14-10-2021 15:37	15	Third
123	C	14-10-2021 15:23	14-10-2021 16:29	67	Third_A
123	C	14-10-2021 15:48	14-10-2021 17:04	102	Third_B
789	A	14-10-2021 15:19	14-10-2021 15:32	13	First
789	A	14-10-2021 15:19	14-10-2021 15:36	17	null
789	B	14-10-2021 15:21	14-10-2021 15:32	11	null
789	B	14-10-2021 15:21	14-10-2021 15:36	15	Second
789	C	14-10-2021 15:22	14-10-2021 15:32	10	null

I am trying to create a new column that assigns "First" to the smallest "Date_Time_2" in group A, "Second" to the second smallest "Date_Time_2" in group B, and "Third" to the third smallest "Date_Time_2" in group C.

Once the loop reaches the last "Group" of an "ID", it should assign "Third" (or 3, as there are only three unique groups in the dataset) to the third-lowest "Date_Time_2" that was not used in the previous groups. If it then finds another "Date_Time_2" for a new "Date_Time_1" in that group, it should assign "Third_A", "Third_B" and so on.

I have tried the code below, but it is not working:

`df.drop('New_Column', axis = 1, inplace = True)
df['New_Column'] = pd.Series()
for i, v in df['Difference'].items():
    a = 0
    b = 1
    diff = df[df['Group'] == df['Group'].unique()[a]]['Difference'].nsmallest(b).min()
    if diff == v:
        df.loc[i, 'New_Column'] = "Yes"
        b = b + 1
    a = a + 1`

Any help here would be great!

CodePudding user response：

You could try the following:

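from string import ascii_uppercase as letters

import pandas as pd

# The sample dates are day-first (14-10-2021), so say so explicitly.
df["Date_Time_2"] = pd.to_datetime(df["Date_Time_2"], dayfirst=True)
for n, (_, gdf) in enumerate(df.sort_values("Date_Time_2").groupby("Group")):
    # Number each distinct Date_Time_2 within the group: 0 for the earliest, 1 for the next, ...
    nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
    df.loc[gdf[nths == n].index, "New"] = str(n + 1)
# After the loop, gdf, nths and n still refer to the last group.
for i, c in zip(gdf[nths > n].index, letters):
    df.at[i, "New"] = f"{n + 1}_{c}"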

First, make sure the column Date_Time_2 contains datetimes.
Then group df by Group after sorting by Date_Time_2.
Then, in each group, identify the indices belonging to the nth Date_Time_2 sub-group (counting from 0) and set n + 1 in the corresponding rows of the New column.
Finally, take the last group and add the lettered values to the New column for the remaining, later timestamps.

Maybe you have to replace the last part with

for k, c in zip(range(n + 1, nths.max() + 1), letters):
    df.loc[gdf[nths == k].index, "New"] = f"{n + 1}_{c}"

if the lettered values should be grouped too.

Result for the sample in the question:

     ID Group       Date_Time_1         Date_Time_2  Difference New_Column  New
0   123     A  14-10-2021 15:19 2021-10-14 15:32:00          13      First    1
1   123     A  14-10-2021 15:19 2021-10-14 15:36:00          17        NaN  NaN
2   123     A  14-10-2021 15:19 2021-10-14 15:37:00          18        NaN  NaN
3   123     A  14-10-2021 15:19 2021-10-14 16:29:00          70        NaN  NaN
4   123     A  14-10-2021 15:19 2021-10-14 17:04:00         105        NaN  NaN
5   123     B  14-10-2021 15:21 2021-10-14 15:32:00          11        NaN  NaN
6   123     B  14-10-2021 15:21 2021-10-14 15:36:00          15     Second    2
7   123     B  14-10-2021 15:21 2021-10-14 15:37:00          16        NaN  NaN
8   123     B  14-10-2021 15:21 2021-10-14 16:29:00          68        NaN  NaN
9   123     B  14-10-2021 15:21 2021-10-14 17:04:00         103        NaN  NaN
10  123     C  14-10-2021 15:22 2021-10-14 15:32:00          10        NaN  NaN
11  123     C  14-10-2021 15:22 2021-10-14 15:36:00          14        NaN  NaN
12  123     C  14-10-2021 15:22 2021-10-14 15:37:00          15      Third    3
13  123     C  14-10-2021 15:23 2021-10-14 16:29:00          67    Third_A  3_A
14  123     C  14-10-2021 15:48 2021-10-14 17:04:00         102    Third_B  3_B
15  789     A  14-10-2021 15:19 2021-10-14 15:32:00          13      First    1
16  789     A  14-10-2021 15:19 2021-10-14 15:36:00          17        NaN  NaN
17  789     B  14-10-2021 15:21 2021-10-14 15:32:00          11        NaN  NaN
18  789     B  14-10-2021 15:21 2021-10-14 15:36:00          15     Second    2
19  789     C  14-10-2021 15:22 2021-10-14 15:32:00          10        NaN  NaN
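
To see the ngroup() step from the answer above in isolation, here is a minimal sketch. The mini-frame is made up for illustration and only stands in for one "Group" of the question's data; it is not part of the original answer.

```python
import pandas as pd

# Hypothetical mini-frame standing in for a single "Group" of the question's data.
gdf = pd.DataFrame(
    {
        "Date_Time_2": pd.to_datetime(
            [
                "14-10-2021 15:36",
                "14-10-2021 15:32",
                "14-10-2021 15:36",
                "14-10-2021 16:29",
            ],
            format="%d-%m-%Y %H:%M",
        )
    }
)

# ngroup() numbers each distinct Date_Time_2 in sorted key order, so rows
# with equal timestamps share a number and the earliest timestamp gets 0.
nths = gdf.groupby("Date_Time_2").ngroup()
ranks = nths.tolist()
print(ranks)  # [1, 0, 1, 2]
```

Selecting `gdf[nths == n]` then picks every row carrying the (n + 1)-th smallest timestamp, which is how the loop in the answer finds the rows to tag.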

CodePudding user response：

First, make sure you read the CSV values correctly. That means the date-time values should be parsed as datetimes, e.g.

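# Note: date_parser is deprecated since pandas 2.0; there, pass date_format="%d-%m-%Y %H:%M" instead.
date_parse = lambda x: pd.to_datetime(x, format="%d-%m-%Y %H:%M")
df = pd.read_csv('filename.csv', parse_dates=['Date_Time_1', 'Date_Time_2'], date_parser=date_parse)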

If you already have a dataframe, you can use the following code to parse the datetime columns inside it:

df['Date_Time_1'] = pd.to_datetime(df['Date_Time_1'], format="%d-%m-%Y %H:%M")
df['Date_Time_2'] = pd.to_datetime(df['Date_Time_2'], format="%d-%m-%Y %H:%M")
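
As a small sketch of why the explicit format matters for day-first strings like the ones in the question (the second timestamp is a made-up ambiguous example, not taken from the data):

```python
import pandas as pd

# With an explicit format, "14-10-2021" is read as 14 October 2021.
ts = pd.to_datetime("14-10-2021 15:32", format="%d-%m-%Y %H:%M")
parsed = (ts.year, ts.month, ts.day, ts.hour, ts.minute)
print(parsed)

# Hypothetical ambiguous value: without a format, pandas defaults to
# month-first, so "03-10-2021" is read as 10 March rather than 3 October.
ambiguous = pd.to_datetime("03-10-2021 15:32")
explicit = pd.to_datetime("03-10-2021 15:32", format="%d-%m-%Y %H:%M")
print(ambiguous.month, explicit.month)
```

Passing the format (or dayfirst=True) keeps days below 13 from being silently swapped with months.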

Now iterate over the groups. For each group, take the sorted unique values of the Date_Time_2 column and pick the one at the group's position: index 0 for group 'A', index 1 for group 'B', and so on. Then select the matching rows and write the value into the new column.

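df['New_Column'] = 'NA'
for index, group in enumerate(df['Group'].unique()):
    # Sort the distinct timestamps so that position `index` really is the (index + 1)-th smallest.
    unique_times = df.loc[df['Group'] == group, 'Date_Time_2'].drop_duplicates().sort_values()
    unique_time = unique_times.iloc[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = index
print(df)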

Note: assigning a number is a lot easier than assigning a word like 'first' or 'second'. If you want words, create a list and look the value up by index, like below:

df['New_Column'] = 'NA'
number_as_string = ['first', 'second', 'third']
for index, group in enumerate(df['Group'].unique()):
    # Sort the distinct timestamps so that position `index` really is the (index + 1)-th smallest.
    unique_times = df.loc[df['Group'] == group, 'Date_Time_2'].drop_duplicates().sort_values()
    unique_time = unique_times.iloc[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = number_as_string[index]
print(df)
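
The loops above can be wrapped into a self-contained sketch that also adds the lettered labels for the last group and skips groups with too few distinct timestamps. The helper name tag_nth_per_group, its labels parameter and the sample frame are assumptions made for illustration; it works on the whole frame at once, does not split by ID, and assumes there are at least as many labels as groups.

```python
from string import ascii_uppercase

import pandas as pd


def tag_nth_per_group(df, labels=("First", "Second", "Third")):
    """Hypothetical helper: tag the n-th smallest Date_Time_2 in the n-th group."""
    out = df.copy()
    out["New_Column"] = None
    groups = list(out["Group"].unique())
    for n, group in enumerate(groups):
        in_group = out["Group"] == group
        times = out.loc[in_group, "Date_Time_2"].drop_duplicates().sort_values()
        if n >= len(times):
            continue  # fewer than n + 1 distinct timestamps in this group
        hit = in_group & (out["Date_Time_2"] == times.iloc[n])
        out.loc[hit, "New_Column"] = labels[n]
        if group == groups[-1]:
            # Later timestamps in the last group get lettered suffixes.
            for t, letter in zip(times.iloc[n + 1:], ascii_uppercase):
                later = in_group & (out["Date_Time_2"] == t)
                out.loc[later, "New_Column"] = f"{labels[n]}_{letter}"
    return out


# Made-up sample shaped like the question's data for a single ID.
times = ["15:32", "15:36", "15:37"]
sample = pd.DataFrame(
    {
        "Group": ["A"] * 3 + ["B"] * 3 + ["C"] * 5,
        "Date_Time_2": pd.to_datetime(
            [f"14-10-2021 {t}" for t in times * 3 + ["16:29", "17:04"]],
            format="%d-%m-%Y %H:%M",
        ),
    }
)
tagged = tag_nth_per_group(sample)
print(tagged)
```

Here each distinct later timestamp in the last group gets its own letter, so rows sharing a timestamp share a label. To handle several IDs independently, one option would be to call the helper on each ID's rows separately.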