Home > Software engineering >  Is there a way to extract the most current values when comparing multiple date columns
Is there a way to extract the most current values when comparing multiple date columns

Time:03-09

The question might be confusing but what I am trying to do is to extract the color from the most recent two years and also identify the year.

For example, I have a dataset below, for item1, the color for the most recent two years are red (2021) and black(2020), the desired outcome is to add 4 columns namely "this year color", "last year color", "this year", "last year" to show the corresponding information.

In the case of current year's color (2021) is missing, then the logic is to take the next year as current year, like item 3, this year's color is green (2020) and last year color is green (2018).

I wonder if there is a way to do it in python? Any advice or suggestion would be gladly appericated.

import pandas as pd 

df1 = { 'item':['item1','item2','item3','item4'],
  '2018':['red','green','',''],
  '2019':['black','','green','red'],
  '2020':['brown', 'blue','','blue'],
  '2021':['','','green','black']}

df1=pd.DataFrame(df1)
df1

The desired output is like below:

enter image description here

CodePudding user response:

IIUC, you can use a custom function:

def getlast(s):
    s = s[s.ne('')][:2]
    return pd.Series(s.to_list() s.index.to_list(),
                     index=['this year color', 'last year color', 'this year', 'last year'])
df2 = df1.join(df1
 .drop('item', axis=1)
 .sort_index(axis=1, ascending=False)
 .apply(getlast, axis=1)
)

output:

    item   2018   2019   2020   2021 this year color last year color this year last year
0  item1    red  black  brown                  brown           black      2020      2019
1  item2  green          blue                   blue           green      2020      2018
2  item3         green         green           green           green      2021      2019
3  item4           red   blue  black           black            blue      2021      2020

or using melt and merge:

df2 = df1.merge(df1
 .melt(id_vars='item')
 .loc[lambda d: d['value'].ne('')]
 .groupby('item')
 .agg(**{'this year color': ('value', lambda g: g.iloc[-1]),
       'last year color': ('value', lambda g: g.iloc[-2]),
       'this year': ('variable', lambda g: g.iloc[-1]),
       'last year': ('variable', lambda g: g.iloc[-2])}),
 left_on='item', right_index=True
)
  • Related