I have the below data frame,
ID | Group | Date_Time_1 | Date_Time_2 | Difference | New_Column |
---|---|---|---|---|---|
123 | A | 14-10-2021 15:19 | 14-10-2021 15:32 | 13 | First |
123 | A | 14-10-2021 15:19 | 14-10-2021 15:36 | 17 | null |
123 | A | 14-10-2021 15:19 | 14-10-2021 15:37 | 18 | null |
123 | A | 14-10-2021 15:19 | 14-10-2021 16:29 | 70 | null |
123 | A | 14-10-2021 15:19 | 14-10-2021 17:04 | 105 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:32 | 11 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:36 | 15 | Second |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:37 | 16 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 16:29 | 68 | null |
123 | B | 14-10-2021 15:21 | 14-10-2021 17:04 | 103 | null |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:32 | 10 | null |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:36 | 14 | null |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:37 | 15 | Third |
123 | C | 14-10-2021 15:23 | 14-10-2021 16:29 | 67 | Third_A |
123 | C | 14-10-2021 15:48 | 14-10-2021 17:04 | 102 | Third_B |
789 | A | 14-10-2021 15:19 | 14-10-2021 15:32 | 13 | First |
789 | A | 14-10-2021 15:19 | 14-10-2021 15:36 | 17 | null |
789 | B | 14-10-2021 15:21 | 14-10-2021 15:32 | 11 | null |
789 | B | 14-10-2021 15:21 | 14-10-2021 15:36 | 15 | Second |
789 | C | 14-10-2021 15:22 | 14-10-2021 15:32 | 10 | null |
I am trying to create a new column that assigns "First" to the smallest "Date_Time_2" in group "A", "Second" to the second smallest "Date_Time_2" in group "B" and, similarly, "Third" to the third smallest "Date_Time_2" in group "C".
Once the loop reaches the last "Group" of an "ID", it should assign "Third" (or 3, as there are only three unique groups in the dataset) to the third lowest "Date_Time_2" that has not been used in the previous groups. If it then finds further "Date_Time_2" values for a new "Date_Time_1", it should assign "Third_A", "Third_B" and so on.
I have tried the below code but it is not working,
`df.drop('New_Column', axis = 1, inplace = True)
df['New_Column'] = pd.Series()
for i, v in df['Difference'].items():
    a = 0
    b = 1
    diff = df[df['Group'] == df['Group'].unique()[a]]['Difference'].nsmallest(b).min()
    if diff == v:
        df.loc[i, 'New_Column'] = "Yes"
    b = b + 1
    a = a + 1`
Any help here would be great!
CodePudding user response:
You could try the following:
from string import ascii_uppercase as letters
import pandas as pd

df["Date_Time_2"] = pd.to_datetime(df["Date_Time_2"], dayfirst=True)
for n, (_, gdf) in enumerate(df.sort_values("Date_Time_2").groupby("Group")):
    nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
    df.loc[gdf[nths == n].index, "New"] = str(n + 1)
for i, c in zip(gdf[nths > n].index, letters):
    df.at[i, "New"] = f"{n + 1}_{c}"
- First make sure column `Date_Time_2` contains datetimes.
- Then group `df` by `Group` after sorting along `Date_Time_2`.
- Then in each group identify the indices belonging to the `n`-th `Date_Time_2` sub-group (starting from 0) and set `n + 1` in the respective rows of the `New` column.
- Then take the last group and add the lettered values to the `New` column.
Maybe you have to replace the last part with
for k, c in zip(range(n + 1, nths.max() + 1), letters):
    df.loc[gdf[nths == k].index, "New"] = f"{n + 1}_{c}"
if the lettered values should be grouped too, i.e. if rows of the last group that share the same `Date_Time_2` should get the same letter.
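To see how the two lettering variants differ, here is a minimal sketch on a made-up frame in which the last group contains a repeated `Date_Time_2`. The helper name `label`, its `group_letters` flag and the sample data are assumptions for illustration only; the labelling logic itself follows the loops above.

```python
from string import ascii_uppercase as letters

import pandas as pd

# Hypothetical sample: "B" is the last group and 15:37 occurs twice in it.
df = pd.DataFrame(
    {
        "Group": ["A", "A", "B", "B", "B", "B"],
        "Date_Time_2": pd.to_datetime(
            [
                "2021-10-14 15:32",
                "2021-10-14 15:36",
                "2021-10-14 15:32",
                "2021-10-14 15:36",
                "2021-10-14 15:37",
                "2021-10-14 15:37",
            ]
        ),
    }
)


def label(frame, group_letters):
    """Label a copy of frame; group_letters picks the lettering variant."""
    out = frame.copy()
    out["New"] = None
    # A stable sort keeps rows with equal timestamps in their original order.
    ordered = out.sort_values("Date_Time_2", kind="stable")
    for n, (_, gdf) in enumerate(ordered.groupby("Group")):
        # Number the distinct timestamps inside this group: 0, 1, 2, ...
        nths = gdf.groupby("Date_Time_2").ngroup()
        out.loc[gdf[nths == n].index, "New"] = str(n + 1)
    # After the loop, n, gdf and nths refer to the last group.
    if group_letters:
        # One letter per remaining timestamp, shared by all of its rows.
        for k, c in zip(range(n + 1, nths.max() + 1), letters):
            out.loc[gdf[nths == k].index, "New"] = f"{n + 1}_{c}"
    else:
        # One letter per remaining row.
        for i, c in zip(gdf[nths > n].index, letters):
            out.at[i, "New"] = f"{n + 1}_{c}"
    return out


per_row = label(df, group_letters=False)["New"].tolist()
grouped = label(df, group_letters=True)["New"].tolist()
print(per_row)
print(grouped)
```

In the per-row variant the two 15:37 rows receive different letters, while in the grouped variant they share one.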
Result for the sample in the question:
ID Group Date_Time_1 Date_Time_2 Difference New_Column New
0 123 A 14-10-2021 15:19 2021-10-14 15:32:00 13 First 1
1 123 A 14-10-2021 15:19 2021-10-14 15:36:00 17 NaN NaN
2 123 A 14-10-2021 15:19 2021-10-14 15:37:00 18 NaN NaN
3 123 A 14-10-2021 15:19 2021-10-14 16:29:00 70 NaN NaN
4 123 A 14-10-2021 15:19 2021-10-14 17:04:00 105 NaN NaN
5 123 B 14-10-2021 15:21 2021-10-14 15:32:00 11 NaN NaN
6 123 B 14-10-2021 15:21 2021-10-14 15:36:00 15 Second 2
7 123 B 14-10-2021 15:21 2021-10-14 15:37:00 16 NaN NaN
8 123 B 14-10-2021 15:21 2021-10-14 16:29:00 68 NaN NaN
9 123 B 14-10-2021 15:21 2021-10-14 17:04:00 103 NaN NaN
10 123 C 14-10-2021 15:22 2021-10-14 15:32:00 10 NaN NaN
11 123 C 14-10-2021 15:22 2021-10-14 15:36:00 14 NaN NaN
12 123 C 14-10-2021 15:22 2021-10-14 15:37:00 15 Third 3
13 123 C 14-10-2021 15:23 2021-10-14 16:29:00 67 Third_A 3_A
14 123 C 14-10-2021 15:48 2021-10-14 17:04:00 102 Third_B 3_B
15 789 A 14-10-2021 15:19 2021-10-14 15:32:00 13 First 1
16 789 A 14-10-2021 15:19 2021-10-14 15:36:00 17 NaN NaN
17 789 B 14-10-2021 15:21 2021-10-14 15:32:00 11 NaN NaN
18 789 B 14-10-2021 15:21 2021-10-14 15:36:00 15 Second 2
19 789 C 14-10-2021 15:22 2021-10-14 15:32:00 10 NaN NaN
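Note that the loop groups by `Group` only, so rows from different IDs are ranked together. If the labels should instead be computed separately for each `ID`, as the wording of the question may suggest, one possible sketch is to run the same logic per ID. The helper `label_one_id` and the sample frame are hypothetical, not part of the original answer.

```python
from string import ascii_uppercase as letters

import pandas as pd


def label_one_id(sub):
    """Return labels for the rows of one ID, using the same rule as above."""
    labels = pd.Series(None, index=sub.index, dtype="object")
    ordered = sub.sort_values("Date_Time_2", kind="stable")
    for n, (_, gdf) in enumerate(ordered.groupby("Group")):
        nths = gdf.groupby("Date_Time_2").ngroup()
        labels.loc[gdf[nths == n].index] = str(n + 1)
    # Lettered labels for the remaining rows of this ID's last group.
    for i, c in zip(gdf[nths > n].index, letters):
        labels.loc[i] = f"{n + 1}_{c}"
    return labels


# Hypothetical sample: ID 2 only has group "A", with a later timestamp.
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 1, 2],
        "Group": ["A", "A", "B", "B", "B", "A"],
        "Date_Time_2": pd.to_datetime(
            [
                "2021-10-14 15:32",
                "2021-10-14 15:36",
                "2021-10-14 15:32",
                "2021-10-14 15:36",
                "2021-10-14 15:37",
                "2021-10-14 15:40",
            ]
        ),
    }
)

# Label each ID on its own; the assignment aligns on the row index.
df["New"] = pd.concat([label_one_id(sub) for _, sub in df.groupby("ID")])
print(df)
```

With this arrangement the single row of ID 2 is labelled "1", whereas ranking all IDs together would leave it unlabelled because 15:40 is not the smallest `Date_Time_2` of group A overall.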
CodePudding user response:
First, make sure the CSV is read correctly, i.e. that the date-time values are parsed as datetimes, e.g.
date_parse = lambda x: pd.to_datetime(x, format="%d-%m-%Y %H:%M")
df = pd.read_csv('filename.csv', parse_dates=['Date_Time_1', 'Date_Time_2'], date_parser=date_parse)
Note that `date_parser` is deprecated as of pandas 2.0; there, `date_format="%d-%m-%Y %H:%M"` can be passed instead.
If you already have a dataframe, you can use the following code to parse the datetime columns inside it:
df['Date_Time_1'] = pd.to_datetime(df['Date_Time_1'], format="%d-%m-%Y %H:%M")
df['Date_Time_2'] = pd.to_datetime(df['Date_Time_2'], format="%d-%m-%Y %H:%M")
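As a quick sanity check of the format string, the following sketch parses the first row of the sample and recomputes its `Difference` in minutes. The one-row frame is built by hand here purely for illustration.

```python
import pandas as pd

# First row of the sample in the question, still as strings.
df = pd.DataFrame(
    {
        "Date_Time_1": ["14-10-2021 15:19"],
        "Date_Time_2": ["14-10-2021 15:32"],
    }
)

# Parse both columns with the explicit day-month-year format.
for col in ["Date_Time_1", "Date_Time_2"]:
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y %H:%M")

# Difference in whole minutes between the two timestamps.
df["Difference"] = (df["Date_Time_2"] - df["Date_Time_1"]).dt.total_seconds() // 60
print(df)
```

If the columns were parsed correctly, the recomputed value should agree with the 13 minutes listed for that row in the question.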
Now iterate over the groups. For each group, take the sorted unique `Date_Time_2` values and pick the one at the group's position (index 0 for group 'A', index 1 for group 'B', and so on). Then select the matching rows and update the value in the new column:
df['New_Column'] = 'NA'
for index, group in enumerate(df['Group'].unique()):
    unique_time = df.loc[df['Group'] == group, 'Date_Time_2'].sort_values().unique()[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = index
print(df)
Note: assigning a number is a lot easier than assigning a word like 'first' or 'second'. If you want words, create a list of them and look up the value by index, like below:
df['New_Column'] = 'NA'
number_as_string = ['first', 'second', 'third']
for index, group in enumerate(df['Group'].unique()):
    unique_time = df.loc[df['Group'] == group, 'Date_Time_2'].sort_values().unique()[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = number_as_string[index]
print(df)
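One caveat with the loops above: indexing the unique timestamps by position raises an `IndexError` when a group has fewer distinct `Date_Time_2` values than its position requires. A possible guarded variant is sketched below; the sample frame and the names `words`, `in_group` and `times` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: group "C" has a single timestamp, so it has no third one.
df = pd.DataFrame(
    {
        "Group": ["A", "A", "B", "B", "C"],
        "Date_Time_2": pd.to_datetime(
            [
                "14-10-2021 15:36",
                "14-10-2021 15:32",
                "14-10-2021 15:32",
                "14-10-2021 15:36",
                "14-10-2021 15:32",
            ],
            format="%d-%m-%Y %H:%M",
        ),
    }
)

words = ["first", "second", "third"]
df["New_Column"] = None
for index, group in enumerate(df["Group"].unique()):
    in_group = df["Group"] == group
    # Distinct timestamps of this group in ascending order.
    times = df.loc[in_group, "Date_Time_2"].drop_duplicates().sort_values().tolist()
    # Skip groups without enough timestamps (or positions without a word).
    if index >= len(times) or index >= len(words):
        continue
    df.loc[in_group & (df["Date_Time_2"] == times[index]), "New_Column"] = words[index]
print(df)
```

Here group 'C' is simply left unlabelled instead of stopping the loop with an error.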