I have a data-frame formatted like so (I simplified it for the sake of my explanation):
Date_1 | Date_2 | Date_3 |
---|---|---|
2017-02-14 | 2017-02-09 | 2017-02-10 |
2018-07-16 | 2019-07-22 | 2018-07-16 |
2014-10-10 | 2017-10-10 | 2017-10-10 |
I would like to create a new column that shows the average difference between my date columns. Specifically, I would like it to calculate the difference between Date_1 & Date_2, Date_2 & Date_3, and Date_1 & Date_3. In row # 1 that would equal mean(5 1 4) = 3.33.
The data frame would look something like this:
Date_1 | Date_2 | Date_3 | Average_Difference |
---|---|---|---|
2017-02-14 | 2017-02-09 | 2017-02-10 | 3.33 |
2018-07-16 | 2019-07-22 | 2018-07-16 | mean(6 6 0) = 4 |
2014-10-10 | 2017-10-10 | 2017-10-10 | 0 |
Do let me know if further explanation is needed.
Edit: I should also add that my actual, un-simplified dataframe has more than just three date columns, so I am trying to think of an answer that is scalable.
CodePudding user response:
Interesting problem. Since you're getting the diffs of several items in each row, itertools.combinations(iterable, N)
will help. It returns all a possible N
-length combinations of the items in iterable
. So we can use that for each row, diff each combination, absolute it (since some might be negative because of the sorting), and compute the mean:
date_cols = df.filter(like='Date_').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime) # Convert the columns to dates
df['Average_Difference'] = df[date_cols].apply(lambda row: np.mean([diff for diff in abs(np.diff(list(it.combinations([date.dayofyear for date in row], 2)))[:, 0])]), axis=1)
Output:
>>> df
Date_1 Date_2 Date_3 Average_Difference
0 2017-02-14 2017-02-09 2017-02-10 3.333333
1 2018-07-16 2019-07-22 2018-07-16 4.000000
2 2014-10-10 2017-10-10 2017-10-10 0.000000