Is there a way to transpose the columns in the following dataframe using pandas functions and/or methods?


I have a dataframe like the following:

Person  Date1    Date2    Date3    Code1    Code2     Amount1-1  Amount1-2  Amount1-3  Amount2-1  Amount2-2  Amount2-3
John    12/1/21  12/2/21           REGULAR  OVERTIME  8          8          0          2          3          0
Mary             12/2/21  12/3/21  REGULAR  OVERTIME  0          8          8          0          1          0

In each "Amount" column name (e.g. Amount1-2), the first number corresponds to the number in a "Code" column name (Code1) and the second number corresponds to the number in a "Date" column name (Date2).
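
To make that mapping concrete, the two indices can be read straight out of a column name; this small snippet is only an illustration of the naming scheme, not part of my current code:

import re

# 'Amount1-2' -> the first number picks the Code column, the second picks the Date column
code_idx, date_idx = re.match(r'Amount(\d+)-(\d+)', 'Amount1-2').groups()
print(code_idx, date_idx)  # -> 1 2, i.e. Code1 and Date2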

I want to transpose the dataframe to the following:

Person  Date     Code      Amount
John    12/1/21  REGULAR   8
John    12/1/21  OVERTIME  2
John    12/2/21  REGULAR   8
John    12/2/21  OVERTIME  3
Mary    12/2/21  REGULAR   8
Mary    12/2/21  OVERTIME  1
Mary    12/3/21  REGULAR   8

I am currently using a nested for loop to manually populate the dataframe I need, but I am hoping that there is a function or module in pandas that will let me achieve what I need in far less time. Currently, with a much larger file size, this transpose takes over 4 hours.

Edit: The loop I am using is below. I have adapted it for the condensed example above. I am appending each entry to an empty list, then creating a new dataframe from that list. In my complete dataframe there are 42 dates, 20 codes, and 840 amounts per person. A person can have all, none, or some of those values populated in each column. I also check whether each combination of date, code, and amount has no values; if so, I ignore it. I also ignore any date/code combinations that have a zero amount. It's nasty...

import pandas as pd

df = pd.DataFrame({'Person' : ['John', 'Mary'], 'Date1' : ['12/1/21', ''],
                   'Date2' : ['12/2/21', '12/2/21'], 'Date3' : ['', '12/3/21'], 'Code1' : ['REGULAR', 'REGULAR'],
                   'Code2' : ['OVERTIME', 'OVERTIME'], 'Amount1-1' : [8, 0],
                   'Amount1-2' : [8, 8], 'Amount1-3' : [0, 8], 'Amount2-1' : [2, 0],
                   'Amount2-2' : [3, 1], 'Amount2-3' : [0, 0]})
record_list = []
cols = list(df.columns)
cols.insert(0, "ignore")  # .itertuples starts its index at 1 instead of 0

for row in df.itertuples():
    for i in range(2, 5):           # Date1..Date3 columns
        for j in range(5, 7):       # Code1..Code2 columns
            for k in range(7, 13):  # Amount1-1..Amount2-3 columns
                # match the "<code>-<date>" suffix of the Amount column to the Code and Date columns
                if cols[i].strip("Date") == cols[k].strip("Amount").split("-")[1] \
                and cols[j].strip("Code") == cols[k].strip("Amount").split("-")[0]:
                    # skip blank dates/codes and zero amounts
                    if not row[i].isspace() and not row[j].isspace() \
                    and float(row[k]) != 0:
                        record_list.append([row[1], row[i], row[j], row[k]])
new_columns = ['Person', 'Date', 'Code', 'Amount']
transposed_df = pd.DataFrame(record_list, columns=new_columns)
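
For reference, running the loop above on the condensed example reproduces the seven-row table I'm after (output spacing is approximate):

print(transposed_df)
#   Person     Date      Code  Amount
# 0   John  12/1/21   REGULAR       8
# 1   John  12/1/21  OVERTIME       2
# 2   John  12/2/21   REGULAR       8
# 3   John  12/2/21  OVERTIME       3
# 4   Mary  12/2/21   REGULAR       8
# 5   Mary  12/2/21  OVERTIME       1
# 6   Mary  12/3/21   REGULAR       8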

CodePudding user response:

One way is to put the needed values into lists and use explode (no idea if this is faster than your method, though). Here goes:

def get_num_lists(df, col_pref='Amount'):
    # pull the two index digits out of each "Amount" column name:
    # the first digit refers to a Code column, the second to a Date column
    cols = [c for c in df.columns if col_pref in c]
    code_nums = [x[-3] for x in cols]   # e.g. 'Amount1-2' -> '1'
    date_nums = [x[-1] for x in cols]   # e.g. 'Amount1-2' -> '2'
    return code_nums, date_nums

def num_list_to_values(df, col_pref, num_list):
    # look up e.g. 'Code1' or 'Date2' for each extracted digit
    return [df[col_pref + str(x)].values[0] for x in num_list]

def explode_special(df):
    df0 = pd.DataFrame()
    df0['Person'] = df['Person']
    df0['Amount'] = [df[[c for c in df.columns if 'Amount' in c]].values[0]]
    code_nums, date_nums = get_num_lists(df)
    df0['Date'] = [num_list_to_values(df, 'Date', date_nums)]
    df0['Code'] = [num_list_to_values(df, 'Code', code_nums)]
    return df0.explode(['Amount', 'Date', 'Code'])

Then, applied to a single-row dataframe (see the note below for handling the full frame):

explode_special(df.iloc[[0]])
  Person Amount     Date      Code
0   John      8  12/1/21   REGULAR
0   John      2  12/1/21  OVERTIME
0   John      8  12/2/21   REGULAR
0   John      3  12/2/21  OVERTIME

For applying this to many rows, you can use groupby('Person').apply(explode_special) or iterrows (with some changes to the functions).
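
For example, a rough sketch of the groupby route (the group_keys=False argument and the final zero-amount filter are extras I've added to match the desired output, and the multi-column explode needs pandas 1.3+):

result = (
    df.groupby('Person', group_keys=False)            # one single-row frame per person
      .apply(explode_special)                         # reshape each person's row
      .loc[lambda d: d['Amount'].astype(float) > 0]   # drop zero-amount combinations
      .reset_index(drop=True)
)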

CodePudding user response:

One option is to use Pandas methods, which should be faster than a for loop:

# set `person` as index; makes it easy to filter and dice the columns
temp = df.set_index('Person')

# reshape columns, splitting the numbers out
temp.columns = temp.columns.str.split('([a-zA-Z]+)', expand=True).droplevel(0)
# temporary dataframes for date, code, amount
dates = temp.filter(like='Date')
codes = temp.filter(like='Code')
amount = temp.filter(like='Amount')
# put the numbers into a column
dates = dates.stack(level=1).rename_axis(['Person', 'date'])
codes = codes.stack(level=1).rename_axis(['Person', 'code'])
amount.columns = amount.columns.droplevel(0).str.split('-', expand=True)
amount.columns.names = ['code', 'date']
amount = amount.stack(level=['code', 'date']).reset_index(name='Amount')
# join the tables and filter the final dataframe for only rows
# where amount is greater than 0
(codes.merge(amount, left_index = True, right_on = ['Person', 'code'])
      .merge(dates, left_on=['Person', 'date'], right_index=True)
      .loc[lambda df: df.Amount.gt(0), ['Person', 'Date', 'Code', 'Amount']]
)

   Person     Date      Code  Amount
0    John  12/1/21   REGULAR       8
3    John  12/1/21  OVERTIME       2
1    John  12/2/21   REGULAR       8
4    John  12/2/21  OVERTIME       3
7    Mary  12/2/21   REGULAR       8
10   Mary  12/2/21  OVERTIME       1
8    Mary  12/3/21   REGULAR       8
