Pandas - add value to series indicating the first time each value in one series appears in each year-CodePudding

I have a dataframe that looks like the following:

Date/time	File
2021-01-01 09:00:00	File A
2021-02-01 09:00:00	File A
2022-01-01 09:00:00	File A
2021-01-01 09:00:00	File B
2022-01-01 09:00:00	File B
2022-01-01 09:00:00	File C
2022-02-01 09:00:00	File C
2022-02-01 10:00:00	File C

What I am trying to do, is add a column for 'First occurrence' where each File is marked with a 1 for the first time it appears in each year

So the desired output would be:

Date/time	File	First occurrence
2021-01-01 09:00:00	File A	1
2021-02-01 09:00:00	File A
2022-01-01 09:00:00	File A	1
2021-01-01 09:00:00	File B	1
2022-01-01 09:00:00	File B	1
2022-01-01 09:00:00	File C	1
2022-02-01 09:00:00	File C
2022-02-01 10:00:00	File C

What I've tried so far which works:

df= df.sort_values(["File", "Date/time"])
df.reset_index(drop=True, inplace=True)
df["First occurrence"] = df.groupby(["File"]).cumcount() 
df["First occurrence"] = np.where(df["First scan"] == 0, 1, 0)

This works at the first occurrence for each file, but doesn't work for each year. Building on that, I tried groupby methods and couldn't get it working so I went with a brute force approach which is:

# Get list of years
year_list = df["Date/time"].dt.year.unique().tolist()

year_frames = []
for year in year_list:
    yr_df = df[df["Date/time"].dt.year==year]
    yr_df = yr_df.sort_values(["File", "Date/time"])
    yr_df = yr_df.reset_index(drop=True)

    yr_df["First scan"] = yr_df.groupby(["File"]).cumcount() 
    yr_df["First scan"] = np.where(yr_df["First scan"] == 0, 1, 0)
    
    year_frames.append(yr_df) 

final_df = pd.concat(year_frames, axis=0, ignore_index=True)

This works for each year df but the concat overwrites values of 1 and 0 as the two frames are joined.

I'd like to get better with Pandas so would rather not brute force it in this way if there's a suitable way to do it but so far hours worth of Googling hasn't give me much to go on!

CodePudding user response：

You can create a year column and use DataFrame.duplicated to check the duplicated item in year and File columns

df['Date/time'] = pd.to_datetime(df['Date/time'])
df['year'] = df['Date/time'].dt.year
df['First occurrence'] = (~df.duplicated(['year', 'File'])).astype(int)

print(df)

            Date/time    File  year  First occurrence
0 2021-01-01 09:00:00  File A  2021                 1
1 2021-02-01 09:00:00  File A  2021                 0
2 2022-01-01 09:00:00  File A  2022                 1
3 2021-01-01 09:00:00  File B  2021                 1
4 2022-01-01 09:00:00  File B  2022                 1
5 2022-01-01 09:00:00  File C  2022                 1
6 2022-02-01 09:00:00  File C  2022                 0
7 2022-02-01 10:00:00  File C  2022                 0