I'm currently trying to iterate through a dataframe/csv and compare the dates of the rows with the same ID. If the dates are different or are a certain time-frame apart I want to create a '1' in another column (not shown) to mark that ID and row/s.
I'm looking to save the DATE values as variables and compare them against other DATE variables with the same ID. If the dates are set amount of time apart I'll create a 1 in another column on the same row.
ID | DATE |
---|---|
1 | 11/11/2011 |
1 | 11/11/2011 |
2 | 5/05/2011 |
2 | 20/06/2011 |
3 | 2/04/2011 |
3 | 10/08/2011 |
4 | 8/12/2011 |
4 | 1/02/2012 |
4 | 12/03/2012 |
For this post, I'm mainly looking to save the multiple values as variables or a list. I'm hoping to figure out the rest once this roadblock has been removed.
Here's what I got currently, but I don't think it'll be much help. Currently it iterates through and converts the date strings to dates. Which is what I want to happen AFTER getting a list of all the dates with the same ID value.
import pandas as pd
from datetime import *
filename = 'TestData.csv'
df = pd.read_csv(filename)
print (df.iloc[0,1])
x = 0
for i in df.iloc:
FixDate = df.iloc[x, 1]
d1, m1, y1 = FixDate.split('/')
d1 = int(d1)
m1 = int(m1)
y1 = int(y1)
finaldate = date(y1, m1, d1)
print(finaldate)
x = x 1
Any help is appreciated, thank you!
CodePudding user response:
In pandas for performance is best avoid loops, if need new column tested if same values in DATE
per groups use GroupBy.transform
with DataFrameGroupBy.nunique
and then compare values by 1
:
df = pd.read_csv(filename)
df['test'] = df.groupby('ID')['DATE'].transform('nunique').eq(1).astype(int)
print (df)
ID DATE test
0 1 11/11/2011 1
1 1 11/11/2011 1
2 2 5/05/2011 0
3 2 20/06/2011 0
4 3 2/04/2011 0
5 3 10/08/2011 0
6 4 8/12/2011 0
7 4 1/02/2012 0
8 4 12/03/2012 0
If need filter matched rows:
mask = df.groupby('ID')['DATE'].transform('nunique').eq(1)
df1 = df[mask]
print (df1)
ID DATE
0 1 11/11/2011
1 1 11/11/2011
In last step convert values to lists:
IDlist = df1['ID'].tolist()