This is a simplified example of my problem. I have a dataframe with filenames and modification dates. I need a flag that marks the latest file for each filename: latest = 1, not latest = 0.
I have this code so far:
import pandas as pd
df = pd.DataFrame({
'FileName' : ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
'FileModDate' : ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18', '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
'DataDate' : ['2024-02-22 10:28:18', '2021-02-22 11:28:18', '2021-02-22 12:28:18', '2021-02-22 14:28:18', '2021-02-22 08:28:18', '2021-02-22 15:28:18']
})
df = df.sort_values('FileModDate', ascending=False)
print(df)
grouped = df.groupby('FileName').first()
grouped['LatestFile'] = 1
print(grouped)
The Result is:
FileModDate LatestFile
FileName
file1.txt 2022-02-22 14:28:18 1
file2.txt 2022-02-22 11:28:18 1
file3.txt 2022-02-22 15:28:18 1
file4.txt 2022-02-22 08:28:18 1
I don't know if this is the right approach. How can I also get the rows that were not returned by groupby.first() into this dataframe?
So the result should look like this:
FileModDate LatestFile
FileName
file1.txt 2022-02-22 14:28:18 1
file2.txt 2022-02-22 11:28:18 1
file3.txt 2022-02-22 15:28:18 1
file4.txt 2022-02-22 08:28:18 1
file3.txt 2022-02-22 12:28:18 0
file1.txt 2022-02-22 10:28:18 0
Best Zio
Edit:
I need one more flag that depends on the first flag:
DataDateFlag should only be 1 if LatestFile = 1, so the output is:
FileName FileModDate DataDate LatestFile DataDateFlag
0 file1.txt 2022-02-22 10:28:18 2024-02-22 10:28:18 0 0
1 file2.txt 2022-02-22 11:28:18 2021-02-22 11:28:18 1 1
2 file3.txt 2022-02-22 12:28:18 2021-02-22 12:28:18 0 0
3 file1.txt 2022-02-22 14:28:18 2021-02-22 14:28:18 1 1
4 file4.txt 2022-02-22 08:28:18 2021-02-22 08:28:18 1 1
5 file3.txt 2022-02-22 15:28:18 2021-02-22 15:28:18 1 1
I have tried something like:
df["DataDateFlag"] = (
df
.groupby("FileName")["DataDate"]
.transform("max")
.eq(df["DataDate"])
.astype(int)
.filter(df["LatestFile"]==1)
)
CodePudding user response:
You can transform each group to its maximum date. That saves you from having to sort the dataframe, and it lets you compare the maximum directly against the actual dates:
import pandas as pd
df = pd.DataFrame({
'FileName' : ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
'FileModDate' : ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18', '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18']
})
df["LatestFile"] = (
df
.groupby("FileName")["FileModDate"]
.transform("max")
.eq(df["FileModDate"])
.astype(int)
)
output (in original ordering):
FileName FileModDate LatestFile
0 file1.txt 2022-02-22 10:28:18 0
1 file2.txt 2022-02-22 11:28:18 1
2 file3.txt 2022-02-22 12:28:18 0
3 file1.txt 2022-02-22 14:28:18 1
4 file4.txt 2022-02-22 08:28:18 1
5 file3.txt 2022-02-22 15:28:18 1
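For the second flag from the edit, one possible reading is that DataDateFlag should mark the newest DataDate among the rows that are already flagged as latest. Here is a minimal sketch under that assumption. The datetime conversion and the helper names `is_latest` and `latest_data_date` are my additions and not part of the question:

```python
import pandas as pd

df = pd.DataFrame({
    'FileName': ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
    'FileModDate': ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18',
                    '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
    'DataDate': ['2024-02-22 10:28:18', '2021-02-22 11:28:18', '2021-02-22 12:28:18',
                 '2021-02-22 14:28:18', '2021-02-22 08:28:18', '2021-02-22 15:28:18'],
})

# Assumption: every value parses cleanly, so comparisons run on real timestamps.
df['FileModDate'] = pd.to_datetime(df['FileModDate'])
df['DataDate'] = pd.to_datetime(df['DataDate'])

# First flag: 1 where the row carries the newest FileModDate of its file.
df['LatestFile'] = (
    df.groupby('FileName')['FileModDate']
    .transform('max')
    .eq(df['FileModDate'])
    .astype(int)
)

is_latest = df['LatestFile'].eq(1)

# Blank out DataDate on non-latest rows (they become NaT), then take the
# per-file maximum of what is left and broadcast it back to every row.
latest_data_date = (
    df['DataDate'].where(is_latest)
    .groupby(df['FileName'])
    .transform('max')
)

# Second flag: only rows that are latest AND hold that maximum get a 1.
df['DataDateFlag'] = (latest_data_date.eq(df['DataDate']) & is_latest).astype(int)
print(df)
```

On the sample data this gives the same 0/1 pattern for DataDateFlag as for LatestFile, which matches the output requested in the edit. If the intended rule is different, only the last two statements would need to change.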
CodePudding user response:
You could use booleans instead of 0/1:
df['LatestFile'] = df['FileModDate'] == df.groupby('FileName')['FileModDate'].transform('max')
Output:
FileName FileModDate LatestFile
5 file3.txt 2022-02-22 15:28:18 True
3 file1.txt 2022-02-22 14:28:18 True
2 file3.txt 2022-02-22 12:28:18 False
1 file2.txt 2022-02-22 11:28:18 True
0 file1.txt 2022-02-22 10:28:18 False
4 file4.txt 2022-02-22 08:28:18 True
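One practical upside of a boolean column is that it can be used directly as a row mask. Here is a small sketch on the same sample data. The names `latest_only` and `LatestFileInt` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'FileName': ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
    'FileModDate': ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18',
                    '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
})

# Boolean flag: True where the row holds the newest date of its file.
df['LatestFile'] = df['FileModDate'] == df.groupby('FileName')['FileModDate'].transform('max')

# The boolean column works as a mask to keep only the latest rows.
latest_only = df[df['LatestFile']]

# If a 0/1 column is still needed, cast the booleans.
df['LatestFileInt'] = df['LatestFile'].astype(int)
print(latest_only)
```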
CodePudding user response:
If your data is already sorted by date in descending order (as in your code), you could enumerate the rows of each group with groupby.cumcount and select the first one (0):
df['LatestFile'] = df.groupby('FileName').cumcount().eq(0).astype(int)
Output:
FileName FileModDate LatestFile
5 file3.txt 2022-02-22 15:28:18 1
3 file1.txt 2022-02-22 14:28:18 1
2 file3.txt 2022-02-22 12:28:18 0
1 file2.txt 2022-02-22 11:28:18 1
0 file1.txt 2022-02-22 10:28:18 0
4 file4.txt 2022-02-22 08:28:18 1
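If the data is not sorted yet, the same idea can be combined with a sort. The sketch below sorts newest first, flags the first row of each group, and writes the flag back to the unsorted dataframe. The datetime conversion, the `kind='mergesort'` choice and the names `ordered` and `flag` are my additions. Unlike the comparison against transform("max"), this flags exactly one row per file even when two rows share the same newest timestamp.

```python
import pandas as pd

df = pd.DataFrame({
    'FileName': ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
    'FileModDate': ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18',
                    '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
})

# Assumption: the strings parse cleanly; sorting then uses real timestamps.
df['FileModDate'] = pd.to_datetime(df['FileModDate'])

# Newest first; a stable sort keeps the original order among equal timestamps.
ordered = df.sort_values('FileModDate', ascending=False, kind='mergesort')

# cumcount numbers the rows inside each file group: 0 is the newest row.
flag = ordered.groupby('FileName').cumcount().eq(0).astype(int)

# Column assignment aligns on the index, so df keeps its original row order.
df['LatestFile'] = flag
print(df)
```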