Issue 1 - solved by using pd.to_datetime(df.Date, format='%Y-%m-%d'). Thanks to Michael
I am trying to find the latest date of each user using their ID
df['Latest Date'] = df.groupby(['ID'])['Date'].transform.('max')
df.drop_duplicates(subset='ID', keep='last',inplace=True)
But I am getting '>=' not supported between instances of 'str' and 'float'
I have used the same approach in the past and it worked fine.
When I did dytypes, I see 'ID' column is int64 and Date column as object because I converted the date column to df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
Issue 2 solved - See Michael's comment 'For the edit'
But the output does not look right
I am trying to find the latest date of each user using their ID and assign those dates to new columns using the category
Dataframe = df
Data looks like below,
ID CATEGORY NAME DATE
1 fruits 2017-08-07 00:00:00
2 veggies 2018-01-25 00:00:00
1 fruits 2015-08-07 00:00:00
2 veggies 2022-01-01 00:00:00
My code is below
//Converting the date format
df['Date'] = pd.to_datetime(df.Date, format='%Y-%m-%d')
//transforming to identify the latest date
df['Latest Date'] = df.groupby(['ID'])['Date'].transform.('max')
//keeping the last and dropping the duplicates
df.drop_duplicates(subset='ID', keep='last',inplace=True)
//inserting new columns
df['Fruits'] = ' '
df['Veggies'] = ' '
//applying the latest dates to the newly created columns
df.loc[((df['CATEGORY NAME'] == 'fruits')), 'Fruits'] = df['Latest Date']
df.loc[((df['CATEGORY NAME'] == 'veggies')), 'Veggies'] = df['Latest Date']
I want the output like below
ID CATEGORY NAME DATE Latest Date Fruits Veggies
1 fruits 2017-08-07 2017-08-07 2017-08-07
2 veggies 2022-01-01 2022-01-01 2022-01-01
But my output looks odd. I don't have an error message but the output is not right
ID CATEGORY NAME DATE Latest Date Fruits Veggies
1 fruits 2017-08-07
2 veggies 2022-01-01 2021-01-01 2021-01-01 00:00:00
If you notice above
- It did not identify the latest correctly
- When applying the date values to the new column, its 00:00:00 time format also shows up
- It did not drop duplicates
Not sure what's wrong
CodePudding user response:
strftime
converts a date to string. Did you want to keep it as a datetime object but change the format? Try this instead:
df.Date = pd.to_datetime(df.Date, format='%Y-%m-%d')
For the Edit
I'm not sure why you want the "Date" and "Latest Date" columns to be the same, but here is the code that will give you your desired table output:
# Recreate dataframe
ID = [1,2,1,2]
CATEGORY_NAME = ["fruits", "veggies", "fruits", "veggies"]
DATE = ["2017-08-07 00:00:00", "2018-01-25 00:00:00", "2015-08-07 00:00:00", "2022-01-01 00:00:00"]
df = pd.DataFrame({"ID":ID,"CATEGORY NAME":CATEGORY_NAME, "Date":DATE})
# Convert datetime format
df['Date'] = pd.to_datetime(df.Date, format='%Y-%m-%d')
# Get the max date value and assign the group to a new dataframe
dfNew = df.groupby(['ID'], as_index=False).max()
# The new dataframes Date and Latest Date column are the same
dfNew['Latest Date'] = dfNew['Date']
# Fix latest Date formatting
dfNew["Latest Date"] = dfNew["Latest Date"].dt.date
# Add fruit and veggie columns
dfNew['Fruits'] = ' '
dfNew['Veggies'] = ' '
# Place in the desired values
dfNew.loc[((dfNew['CATEGORY NAME'] == 'fruits')), 'Fruits'] = dfNew['Latest Date']
dfNew.loc[((dfNew['CATEGORY NAME'] == 'veggies')), 'Veggies'] = dfNew['Latest Date']
dfNew
Output:
ID CATEGORY NAME Date Latest Date Fruits Veggies
0 1 fruits 2017-08-07 2017-08-07 2017-08-07
1 2 veggies 2022-01-01 2022-01-01 2022-01-01
CodePudding user response:
Are you sure 'Date' is not mixed? It looks to me as if it still contains 'str' and 'float', and maybe 'datetime'. You sure it's clean, and all records converted?