Python Pandas Group By Flags and depending second flag

Time:01-05

This is a simplified example of my problem. I have a DataFrame with file names and modification dates. I need a flag that marks the latest file for each file name: latest = 1, not latest = 0.

I have this code so far:

    import pandas as pd

    df = pd.DataFrame({
        'FileName' : ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
        'FileModDate' : ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18', '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
        'DataDate' : ['2024-02-22 10:28:18', '2021-02-22 11:28:18', '2021-02-22 12:28:18', '2021-02-22 14:28:18', '2021-02-22 08:28:18', '2021-02-22 15:28:18']
    })

    df = df.sort_values('FileModDate', ascending=False)

    print(df)

    grouped = df.groupby('FileName').first()
    grouped['LatestFile'] = 1
    print(grouped)

The result is:

                   FileModDate  LatestFile
FileName
file1.txt  2022-02-22 14:28:18           1
file2.txt  2022-02-22 11:28:18           1
file3.txt  2022-02-22 15:28:18           1
file4.txt  2022-02-22 08:28:18           1

I don't know if this is the right approach. How can I also get the rows that were not returned by groupby.first() into this DataFrame?

So the result should look like this:

                   FileModDate  LatestFile
FileName
file1.txt  2022-02-22 14:28:18           1
file2.txt  2022-02-22 11:28:18           1
file3.txt  2022-02-22 15:28:18           1
file4.txt  2022-02-22 08:28:18           1
file3.txt  2022-02-22 12:28:18           0
file1.txt  2022-02-22 10:28:18           0

Best Zio

Edit:

I need one more flag that depends on the first flag:

DataDateFlag should only be 1 if LatestFile = 1, so the output is:

    FileName          FileModDate             DataDate  LatestFile  DataDateFlag
0  file1.txt  2022-02-22 10:28:18  2024-02-22 10:28:18           0             0
1  file2.txt  2022-02-22 11:28:18  2021-02-22 11:28:18           1             1
2  file3.txt  2022-02-22 12:28:18  2021-02-22 12:28:18           0             0
3  file1.txt  2022-02-22 14:28:18  2021-02-22 14:28:18           1             1
4  file4.txt  2022-02-22 08:28:18  2021-02-22 08:28:18           1             1
5  file3.txt  2022-02-22 15:28:18  2021-02-22 15:28:18           1             1

I have tried something like:

df["DataDateFlag"] = (
    df
    .groupby("FileName")["DataDate"]
    .transform("max")
    .eq(df["DataDate"])
    .astype(int)
    .filter(df["LatestFile"]==1)
)

CodePudding user response:

You can transform each group to its maximum date. That removes the need to sort the DataFrame, and you can compare the result directly with the actual dates:

import pandas as pd

df = pd.DataFrame({
     'FileName' : ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
     'FileModDate' : ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18', '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18']
})

df["LatestFile"] = (
    df
    .groupby("FileName")["FileModDate"]
    .transform("max")
    .eq(df["FileModDate"])
    .astype(int)
)

Output (in original ordering):

    FileName          FileModDate  LatestFile
0  file1.txt  2022-02-22 10:28:18           0
1  file2.txt  2022-02-22 11:28:18           1
2  file3.txt  2022-02-22 12:28:18           0
3  file1.txt  2022-02-22 14:28:18           1
4  file4.txt  2022-02-22 08:28:18           1
5  file3.txt  2022-02-22 15:28:18           1
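
One caveat worth adding: the comparison above works on plain strings only because the timestamps are in ISO format, which sorts the same way as text and as dates. A minimal sketch, assuming you would rather compare real timestamps, converts the column with pd.to_datetime first (the name latest_per_file is just illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({
    "FileName": ["file1.txt", "file2.txt", "file3.txt",
                 "file1.txt", "file4.txt", "file3.txt"],
    "FileModDate": ["2022-02-22 10:28:18", "2022-02-22 11:28:18",
                    "2022-02-22 12:28:18", "2022-02-22 14:28:18",
                    "2022-02-22 08:28:18", "2022-02-22 15:28:18"],
})

# Parse the strings into timestamps so "max" compares dates, not text.
df["FileModDate"] = pd.to_datetime(df["FileModDate"])

# Broadcast each file's newest modification date to all of its rows.
latest_per_file = df.groupby("FileName")["FileModDate"].transform("max")

# A row is the latest one if its own date equals the group maximum.
df["LatestFile"] = df["FileModDate"].eq(latest_per_file).astype(int)
print(df)
```

With the sample data this should give the same flags as the string version; the conversion mainly matters when the dates arrive in a format that does not sort correctly as text.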

CodePudding user response:

You could use booleans instead of 0/1:

df['LatestFile'] = df['FileModDate'] == df.groupby('FileName')['FileModDate'].transform('max')

Output:

    FileName          FileModDate  LatestFile
5  file3.txt  2022-02-22 15:28:18        True
3  file1.txt  2022-02-22 14:28:18        True
2  file3.txt  2022-02-22 12:28:18       False
1  file2.txt  2022-02-22 11:28:18        True
0  file1.txt  2022-02-22 10:28:18       False
4  file4.txt  2022-02-22 08:28:18        True
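
For the second flag from the question's edit, a boolean mask combines nicely with &. This is only a sketch under one reading of the edit: DataDateFlag is 1 for the row that holds the highest DataDate among the rows with LatestFile = 1 for that file name. The helper names is_latest and max_data_date are made up for illustration, and both date columns are parsed with pd.to_datetime so that masked rows become NaT and are skipped by max:

```python
import pandas as pd

df = pd.DataFrame({
    "FileName": ["file1.txt", "file2.txt", "file3.txt",
                 "file1.txt", "file4.txt", "file3.txt"],
    "FileModDate": ["2022-02-22 10:28:18", "2022-02-22 11:28:18",
                    "2022-02-22 12:28:18", "2022-02-22 14:28:18",
                    "2022-02-22 08:28:18", "2022-02-22 15:28:18"],
    "DataDate": ["2024-02-22 10:28:18", "2021-02-22 11:28:18",
                 "2021-02-22 12:28:18", "2021-02-22 14:28:18",
                 "2021-02-22 08:28:18", "2021-02-22 15:28:18"],
})
df["FileModDate"] = pd.to_datetime(df["FileModDate"])
df["DataDate"] = pd.to_datetime(df["DataDate"])

# First flag as a boolean mask: is this row the newest one for its file?
is_latest = df["FileModDate"].eq(
    df.groupby("FileName")["FileModDate"].transform("max")
)

# Hide DataDate on non-latest rows (NaT), then take the per-file maximum,
# so only rows with LatestFile = 1 can win the comparison.
max_data_date = (
    df["DataDate"].where(is_latest).groupby(df["FileName"]).transform("max")
)

df["LatestFile"] = is_latest.astype(int)
df["DataDateFlag"] = (is_latest & df["DataDate"].eq(max_data_date)).astype(int)
print(df)
```

In the sample data, row 0 has the highest DataDate for file1.txt but is not the latest file, so it stays at 0, which matches the desired output in the question.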

CodePudding user response:

If your data is already sorted by date in descending order (newest first, as in the question), you could number the rows within each group with groupby.cumcount and flag the first one (0):

df['LatestFile'] = df.groupby('FileName').cumcount().eq(0).astype(int)

Output:

    FileName          FileModDate  LatestFile
5  file3.txt  2022-02-22 15:28:18           1
3  file1.txt  2022-02-22 14:28:18           1
2  file3.txt  2022-02-22 12:28:18           0
1  file2.txt  2022-02-22 11:28:18           1
0  file1.txt  2022-02-22 10:28:18           0
4  file4.txt  2022-02-22 08:28:18           1
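
If the data is not sorted yet, a possible sketch is to sort a copy newest-first, number the rows there, and let pandas align the result back on the index (newest_first and first_in_group are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "FileName": ["file1.txt", "file2.txt", "file3.txt",
                 "file1.txt", "file4.txt", "file3.txt"],
    "FileModDate": ["2022-02-22 10:28:18", "2022-02-22 11:28:18",
                    "2022-02-22 12:28:18", "2022-02-22 14:28:18",
                    "2022-02-22 08:28:18", "2022-02-22 15:28:18"],
})

# Newest first, so position 0 within each FileName group is the latest file.
newest_first = df.sort_values("FileModDate", ascending=False)
first_in_group = newest_first.groupby("FileName").cumcount().eq(0)

# Column assignment aligns on the index, so df keeps its original row order.
df["LatestFile"] = first_in_group.astype(int)
print(df)
```

Unlike the transform("max") comparison, this flags exactly one row per file name even when two rows share the same latest timestamp.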