Find All IDs with that have both categories and calculate time difference-CodePudding

I have a pandas dataframe that looks something like this:

ID	Category	Date
1	A	1/1/22 10:14:12 AM
1	A	1/1/22 10:12:12 AM
1	B	1/2/22 10:14:12 AM
2	A	1/1/22 10:14:12 AM
3	A	1/2/22 10:14:12 AM
3	B	1/1/22 10:14:12 AM
3	B	1/1/22 10:18:12 AM

What I want is to get only the IDs that have category A and category B. Then, I want to calculate the absolute (value) time difference between their respective two dates (in hours). The ideal result would look something like:

ID	Time Difference
1	24
3	24

CodePudding user response：

Using .groupby .diff .assign

df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["ID", "Date"], ascending=[1, 1])
df = (
    df
    .assign(DeltaHours=df.groupby(["ID"]).Date.diff().dt.days * 24)
    .dropna()
    .reset_index(drop=True)[["ID", "DeltaHours"]]
)
print(df)

   ID  DeltaHours
1   1  24.0
3   3  24.0

CodePudding user response：

Using a pivot:

out = (df
   .assign(Date=pd.to_datetime(df['Date']))
   .pivot_table(index='ID', columns='Category', values='Date', aggfunc='max')
   .pipe(lambda d: d['A'].sub(d['B']).abs().dt.total_seconds().div(3600))
   .dropna()
   .reset_index(name='Time Difference')
 )

Output:

   ID  Time Difference
0   1             24.0
1   3             24.0

CodePudding user response：

here is one way to do it

# convert the date to datetime (unless its already datetime)
df['Date'] = pd.to_datetime(df['Date'])

# sort and drop duplicates, keeping latest
df=df.sort_values(['ID', 'Category','Date']).drop_duplicates(subset=['ID','Category','Date'], keep='last')

# pivot to put 'A' and 'B' as two columns
df2=df.pivot(index='ID', columns='Category', values='Date').reset_index()

# eliminate the rows where either of the two column (dates) are null
df2.dropna(inplace=True)



# this to get rid of spaces from Category, unless its already stripped of whitespaces characters
df2.columns = [col.strip() for col in df2.columns]


# calculate the difference
df2['time_difference']= df2['A'].sub(df2['B']).dt.total_seconds()/3600
df2


ID  A   B   time_difference
0   1   2022-01-01 10:14:12     2022-01-02 10:14:12     -24.0
2   3   2022-01-02 10:14:12     2022-01-01 10:14:12     24.0

CodePudding user response：

Another possible solution, based on pandas.DataFrame.groupby:

df['Date'] = pd.to_datetime(df['Date'])
g = df.groupby('ID')
(g.max()['Date'].sub(g.min()['Date']).dt.total_seconds().div(3600)
 [g['ID'].count().ne(1)].reset_index())

Output:

   ID  time_diff
0   1       24.0
1   3       24.0