I have a pandas dataframe that looks something like this:
ID | Category | Date |
---|---|---|
1 | A | 1/1/22 10:14:12 AM |
1 | A | 1/1/22 10:12:12 AM |
1 | B | 1/2/22 10:14:12 AM |
2 | A | 1/1/22 10:14:12 AM |
3 | A | 1/2/22 10:14:12 AM |
3 | B | 1/1/22 10:14:12 AM |
3 | B | 1/1/22 10:18:12 AM |
What I want is to get only the IDs that have category A and category B. Then, I want to calculate the absolute (value) time difference between their respective two dates (in hours). The ideal result would look something like:
ID | Time Difference |
---|---|
1 | 24 |
3 | 24 |
CodePudding user response:
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["ID", "Date"], ascending=[1, 1])
df = (
df
.assign(DeltaHours=df.groupby(["ID"]).Date.diff().dt.days * 24)
.dropna()
.reset_index(drop=True)[["ID", "DeltaHours"]]
)
print(df)
ID DeltaHours
1 1 24.0
3 3 24.0
CodePudding user response:
Using a pivot
:
out = (df
.assign(Date=pd.to_datetime(df['Date']))
.pivot_table(index='ID', columns='Category', values='Date', aggfunc='max')
.pipe(lambda d: d['A'].sub(d['B']).abs().dt.total_seconds().div(3600))
.dropna()
.reset_index(name='Time Difference')
)
Output:
ID Time Difference
0 1 24.0
1 3 24.0
CodePudding user response:
here is one way to do it
# convert the date to datetime (unless its already datetime)
df['Date'] = pd.to_datetime(df['Date'])
# sort and drop duplicates, keeping latest
df=df.sort_values(['ID', 'Category','Date']).drop_duplicates(subset=['ID','Category','Date'], keep='last')
# pivot to put 'A' and 'B' as two columns
df2=df.pivot(index='ID', columns='Category', values='Date').reset_index()
# eliminate the rows where either of the two column (dates) are null
df2.dropna(inplace=True)
# this to get rid of spaces from Category, unless its already stripped of whitespaces characters
df2.columns = [col.strip() for col in df2.columns]
# calculate the difference
df2['time_difference']= df2['A'].sub(df2['B']).dt.total_seconds()/3600
df2
ID A B time_difference
0 1 2022-01-01 10:14:12 2022-01-02 10:14:12 -24.0
2 3 2022-01-02 10:14:12 2022-01-01 10:14:12 24.0
CodePudding user response:
Another possible solution, based on pandas.DataFrame.groupby
:
df['Date'] = pd.to_datetime(df['Date'])
g = df.groupby('ID')
(g.max()['Date'].sub(g.min()['Date']).dt.total_seconds().div(3600)
[g['ID'].count().ne(1)].reset_index())
Output:
ID time_diff
0 1 24.0
1 3 24.0