How to create a new column based on the mean difference between date columns in a data frame?-CodePudding

I have a data-frame formatted like so (I simplified it for the sake of my explanation):

Date_1	Date_2	Date_3
2017-02-14	2017-02-09	2017-02-10
2018-07-16	2019-07-22	2018-07-16
2014-10-10	2017-10-10	2017-10-10

I would like to create a new column that shows the average difference between my date columns. Specifically, I would like it to calculate the difference between Date_1 & Date_2, Date_2 & Date_3, and Date_1 & Date_3. In row # 1 that would equal mean(5 1 4) = 3.33.

The data frame would look something like this:

Date_1	Date_2	Date_3	Average_Difference
2017-02-14	2017-02-09	2017-02-10	3.33
2018-07-16	2019-07-22	2018-07-16	mean(6 6 0) = 4
2014-10-10	2017-10-10	2017-10-10	0

Do let me know if further explanation is needed.

Edit: I should also add that my actual, un-simplified dataframe has more than just three date columns, so I am trying to think of an answer that is scalable.

CodePudding user response：

Interesting problem. Since you're getting the diffs of several items in each row, itertools.combinations(iterable, N) will help. It returns all a possible N-length combinations of the items in iterable. So we can use that for each row, diff each combination, absolute it (since some might be negative because of the sorting), and compute the mean:

date_cols = df.filter(like='Date_').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime) # Convert the columns to dates
df['Average_Difference'] = df[date_cols].apply(lambda row: np.mean([diff for diff in abs(np.diff(list(it.combinations([date.dayofyear for date in row], 2)))[:, 0])]), axis=1)

Output:

>>> df
      Date_1     Date_2     Date_3  Average_Difference
0 2017-02-14 2017-02-09 2017-02-10            3.333333
1 2018-07-16 2019-07-22 2018-07-16            4.000000
2 2014-10-10 2017-10-10 2017-10-10            0.000000