Python pandas - wide data - identify earliest and maximum columns in time series-CodePudding

I am working with a data frame that is written in wide format. Each book has a number of sales, but some quarters have null values because the book was not released before that quarter.

import pandas as pd

data = {'Book Title': ['A Court of Thorns and Roses', 'Where the Crawdads Sing', 'Bad Blood', 'Atomic Habits'],
    'Metric': ['Book Sales','Book Sales','Book Sales','Book Sales'],
   'Q1 2022': [100000,0,0,0],
   'Q2 2022': [50000,75000,0,35000],
   'Q3 2022': [25000,150000,20000,45000],
   'Q4 2022': [25000,20000,10000,65000]}

df1 = pd.DataFrame(data)

What I would like to do is create one field that identifies "ID of first available quarter" ("First Quarter ID"), and another that identifies "ID of quarter with maximum sales" ("Max Quarter ID"). Then I would like to show two fields with the sales in the first available quarter and the second available quarter.

Tips to go about this? Thank you!

CodePudding user response：

A possible solution:

df1 = df1.replace(0, np.nan)

a = df1.shape[1] - np.argmin(df1.notna().values[:, ::-1], axis=1)
a = df1.columns[np.where(a == df1.shape[1], 2, a)]
b = df1.iloc[:, 2:].columns[np.nanargmax(df1.iloc[:, 2:].values, axis=1)]
df1['First Quarter ID'] = a
df1['Max Quarter ID'] = b

Output:

                    Book Title      Metric   Q1 2022  Q2 2022  Q3 2022  \
0  A Court of Thorns and Roses  Book Sales  100000.0  50000.0    25000   
1      Where the Crawdads Sing  Book Sales       NaN  75000.0   150000   
2                    Bad Blood  Book Sales       NaN      NaN    20000   
3                Atomic Habits  Book Sales       NaN  35000.0    45000   

   Q4 2022 First Quarter ID Max Quarter ID  
0    25000          Q1 2022        Q1 2022  
1    20000          Q2 2022        Q3 2022  
2    10000          Q3 2022        Q3 2022  
3    65000          Q2 2022        Q4 2022

CodePudding user response：

Use a custom

CodePudding user response：

Maybe this is what you are looking for.

import pandas as pd
import numpy as np

data = {'Book Title': ['A Court of Thorns and Roses', 'Where the Crawdads Sing', 'Bad Blood', 'Atomic Habits'],
    'Metric': ['Book Sales','Book Sales','Book Sales','Book Sales'],
   'Q1 2022': [100000,0,0,0],
   'Q2 2022': [50000,75000,0,35000],
   'Q3 2022': [25000,150000,20000,45000],
   'Q4 2022': [25000,20000,10000,65000]}

df1 = pd.DataFrame(data)

df1['First Quarter ID'] = [df1.iloc[idx, 2:].replace(0, np.nan).first_valid_index() for idx in df1.index]
df1['Max Quarter ID'] =  df1.set_index(['Book Title', 'Metric']).iloc[:, :-1].idxmax(axis=1).to_list()