I have a dataframe like the following:
Person | Date1 | Date2 | Date3 | Code1 | Code2 | Amount1-1 | Amount1-2 | Amount1-3 | Amount2-1 | Amount2-2 | Amount2-3 |
---|---|---|---|---|---|---|---|---|---|---|---|
John | 12/1/21 | 12/2/21 | | REGULAR | OVERTIME | 8 | 8 | 0 | 2 | 3 | 0 |
Mary | | 12/2/21 | 12/3/21 | REGULAR | OVERTIME | 0 | 8 | 8 | 0 | 1 | 0 |
The first number in each "Amount" column name corresponds to the number in the "Code" column name, and the second number corresponds to the number in the "Date" column name. For example, Amount1-2 holds the amount for Code1 on Date2.
I want to reshape the dataframe into the following long format:
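To illustrate the convention (this snippet is only for explanation, not part of my script), the two indices can be pulled out of a column name like this:
import re

# "Amount1-2" -> code index "1" (Code1), date index "2" (Date2)
code_idx, date_idx = re.match(r"Amount(\d+)-(\d+)", "Amount1-2").groups()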
Person | Date | Code | Amount |
---|---|---|---|
John | 12/1/21 | REGULAR | 8 |
John | 12/1/21 | OVERTIME | 2 |
John | 12/2/21 | REGULAR | 8 |
John | 12/2/21 | OVERTIME | 3 |
Mary | 12/2/21 | REGULAR | 8 |
Mary | 12/2/21 | OVERTIME | 1 |
Mary | 12/3/21 | REGULAR | 8 |
I am currently using a nested for loop to populate the new dataframe manually, but I am hoping there is a function or module in pandas that can achieve this in far less time. On a much larger file, this reshape currently takes over 4 hours.
Edit: The loop I am using is below; I have adapted it for the condensed example above. I append each entry to an empty list and then create a new dataframe from that list. In my complete dataframe there are 42 dates, 20 codes, and 840 amount columns per person. A person can have all, none, or some of those values populated in each column. I also check whether each combination of date, code, and amount has no values and, if so, ignore it, and I ignore any date/code combination that has a zero amount. It's nasty...
import pandas as pd

df = pd.DataFrame({'Person' : ['John', 'Mary'], 'Date1' : ['12/1/21', ''],
                   'Date2' : ['12/2/21', '12/2/21'], 'Date3' : ['', '12/3/21'],
                   'Code1' : ['REGULAR', 'REGULAR'], 'Code2' : ['OVERTIME', 'OVERTIME'],
                   'Amount1-1' : [8, 0], 'Amount1-2' : [8, 8], 'Amount1-3' : [0, 8],
                   'Amount2-1' : [2, 0], 'Amount2-2' : [3, 1], 'Amount2-3' : [0, 0]})

record_list = []
cols = list(df.columns)
cols.insert(0, "ignore")  # .itertuples starts its index at 1 instead of 0

for row in df.itertuples():
    for i in range(2, 5):          # Date1..Date3
        for j in range(5, 7):      # Code1..Code2
            for k in range(7, 13): # Amount1-1..Amount2-3
                # match AmountX-Y against CodeX and DateY
                if cols[i].strip("Date") == cols[k].strip("Amount").split("-")[1] \
                        and cols[j].strip("Code") == cols[k].strip("Amount").split("-")[0]:
                    # skip whitespace-only dates/codes and zero amounts
                    if not row[i].isspace() and not row[j].isspace() \
                            and float(row[k]) != 0:
                        record_list.append([row[1], row[i], row[j], row[k]])

new_columns = ['Person', 'Date', 'Code', 'Amount']
transposed_df = pd.DataFrame(record_list, columns=new_columns)
CodePudding user response:
One way is to put the needed values into lists and use explode (no idea if this is faster than your method though); here goes:
def get_num_lists(df, col_pref='Amount'):
    # pull the two index digits out of each "Amount<code>-<date>" column name
    cols = [c for c in df.columns if col_pref in c]
    code_nums = [x[-3] for x in cols]  # first digit  -> Code column
    date_nums = [x[-1] for x in cols]  # second digit -> Date column
    return code_nums, date_nums

def num_list_to_values(df, col_pref, num_list):
    # look up e.g. Date1, Date2, ... for a single-row dataframe
    return [df[col_pref + str(x)].values[0] for x in num_list]

def explode_special(df):
    df0 = pd.DataFrame()
    df0['Person'] = df['Person']
    df0['Amount'] = [df[[c for c in df.columns if 'Amount' in c]].values[0]]
    code_nums, date_nums = get_num_lists(df)
    df0['Date'] = [num_list_to_values(df, 'Date', date_nums)]
    df0['Code'] = [num_list_to_values(df, 'Code', code_nums)]
    return df0.explode(['Amount', 'Date', 'Code'])
then
explode_special(df)
 | Person | Amount | Date | Code |
---|---|---|---|---|
0 | John | 8 | 12/1/21 | REGULAR |
0 | John | 2 | 12/1/21 | OVERTIME |
0 | John | 8 | 12/2/21 | REGULAR |
0 | John | 3 | 12/2/21 | OVERTIME |
For applying this to many rows you can use groupby('Person').apply(explode_special) or iterrows (with some changes to the functions).
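For example, assuming each person occupies exactly one row (as in the sample), the groupby route might look like this rough sketch:
# apply the single-row helper to each person's slice, then flatten the index
result = (df.groupby('Person', group_keys=False)
            .apply(explode_special)
            .reset_index(drop=True))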
CodePudding user response:
One option is to use Pandas methods, which should be faster than a for loop:
# set `person` as index; makes it easy to filter and dice the columns
temp = df.set_index('Person')
# reshape columns, splitting the numbers out
temp.columns = temp.columns.str.split('([a-zA-Z]+)', expand=True).droplevel(0)
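# columns are now a two-level MultiIndex, e.g. ('Date', '1'), ('Code', '2'), ('Amount', '1-1')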
# temporary dataframes for date, code, amount
dates = temp.filter(like='Date')
codes = temp.filter(like='Code')
amount = temp.filter(like='Amount')
# put the numbers into a column
dates = dates.stack(level=1).rename_axis(['Person', 'date'])
codes = codes.stack(level=1).rename_axis(['Person', 'code'])
amount.columns = amount.columns.droplevel(0).str.split('-', expand=True)
amount.columns.names = ['code', 'date']
amount = amount.stack(level=['code', 'date']).reset_index(name='Amount')
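# amount is now in long form with columns Person, code, date, Amount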
# join the tables and filter the final dataframe for only rows
# where amount is greater than 0
(codes.merge(amount, left_index = True, right_on = ['Person', 'code'])
.merge(dates, left_on=['Person', 'date'], right_index=True)
.loc[lambda df: df.Amount.gt(0), ['Person', 'Date', 'Code', 'Amount']]
)
Person Date Code Amount
0 John 12/1/21 REGULAR 8
3 John 12/1/21 OVERTIME 2
1 John 12/2/21 REGULAR 8
4 John 12/2/21 OVERTIME 3
7 Mary 12/2/21 REGULAR 8
10 Mary 12/2/21 OVERTIME 1
8 Mary 12/3/21 REGULAR 8