I've got this sample data frame. Each use has 2 rows.
They have an arrival and a departure date, and one of them is always None. The dates are string.
This is what my data currently looks like:
traveller_id | arrival | departure |
---|---|---|
282840560712311 | 2022-10-20 | None |
282840560712311 | None | 2022-10-23 |
439863739170884 | 2022-12-22 | None |
439863739170884 | None | 2022-12-25 |
import pandas as pd
import numpy as np
df = pd.DataFrame(data = {'traveller_id': [712311, 712311, 170884, 170884]
, 'arrival': ['2022-10-20', None, '2022-12-22', None]
, 'departure': [None, '2022-10-23', None, '2022-12-25']
})
The goal is to have only 1 row per user, with dates in the other columns (and no None).
It should look like this:
traveller_id | arrival | departure |
---|---|---|
282840560712311 | 2022-10-20 | 2022-10-23 |
439863739170884 | 2022-12-22 | 2022-12-25 |
CodePudding user response:
Replace None
with pd.NaT
and then do an agg with max
after groupby traveller_id
:
df.replace({None:pd.NaT}).groupby('traveller_id', as_index=False).agg(max)
output on your example from constructor:
traveller_id arrival departure
0 170884 2022-12-22 2022-12-25
1 712311 2022-10-20 2022-10-23
I assumed the strings as dates. If they are strings, yes you can convert them as dates first and no need to replace None
with pd.NaT
.
So with that:
df['arrival'] = pd.to_datetime(df['arrival'])
df['departure'] = pd.to_datetime(df['departure'])
df = df.groupby('traveller_id', as_index=False).agg(max)