This question seems super easy but I just can't figure it out.
I have the following dataframe:
watched_df = pd.DataFrame([['Love Death Robots', '2016-01-29 14:04:22', 'Anna'],
['James Bond', '2016-02-29 14:04:22', 'Anna'],
['Lord of the Rings', '2016-09-29 14:04:22', 'BoB'],
['The Hobbit', '2016-10-29 14:04:22', 'Anna'],
['Beauty and the Beast', '2016-06-29 14:04:22', 'BoB']],
columns=['Title', 'Date', 'Profile_Name'])
watched_df.head()
For each user I want to know what is the 'Title' of the first movie the 'Profile_Name' has watched and on what 'date' it was.
So the wanted result is:
wanted_df = pd.DataFrame([['Love Death Robots', '2016-01-29 14:04:22', 'Anna'],
['Beauty and the Beast', '2016-06-29 14:04:22', 'BoB']],
columns=['Title', 'Date', 'Profile_Name'])
wanted_df.head()
I managed to get a dataframe that is indexed by the 'Profile_Name' and the respective oldest dates. But I cannot manage to add the 'Title' to the dataframe
CodePudding user response:
First you need to change the data type of "Date" to datetime:
watched_df['Date'] = pd.to_datetime(watched_df['Date'])
Then you just need to sort by date first and remove duplicates by keeping it the first row:
watched_df.sort_values('Date').drop_duplicates(['Profile_Name'], keep='first')
This should give the result you are looking for.