Pandas: Append copy of rows changing only values in multiple columns larger than max allowed to spli-CodePudding

Problem: I have a data frame that I need to modify based on the values of particular column. If value of any column value is greater than that of maximum allowed then a new row will be created based upon distribution into equally sized bins (taking integer division between data value and max allowed value)

Table and Explanation:

Original:

Index	Data 1	Data 2	Max. Allowed
1	1	2	3
2	10	5	8
3	7	12	5

Required:

Index Values in brackets refers to the original index value

Index	Data 1	Data 2	Max. Allowed
1 (1)	1	2	3
2 (2)	8	5	8
3	2	0	8
4 (3)	5	5	5
5	2	5	5
6	0	2	5

Since original index = 2, had Data1 = 10 which is greater than max allowed = 8. This row has been broken into two rows as shown in the above table.

Attempt: I was able to find those columns which had value greater than max allowed and number of rows to be inserted. But I had a confusion wether that approach would work if both columns would have value greater than the max allowed value (as in case of index = 3). The values indicate how many more rows to be inserted for each index value for particular column.

Index	Data 1	Data 2
1	0	0
2	1	0
3	1	2

CodePudding user response：

Assuming you are willing to process the data frame row by row, then you could carry out the check for maximum value in a while loop and populate a new data frame with the new rows.

import pandas as pd
df = pd.DataFrame({"Index" : [1, 2, 3], "Data 1" : [1,10,7], "Data 2" : [2,5,12], "Max_Allowed" : [3,8,5]})
print(df)
# create a new data frame that we can populate with rows of data
dfz = pd.DataFrame(columns=("Index", "Data 1","Data 2","Max_Allowed"))

iz = 0
for(_, col1, col2, col3, col4) in df.itertuples(name=None):
    
    if col2<=col4 and col3<=col4:             
        dfz.loc[iz] = [str(iz 1) "(" str(col1) ")", col2, col3, col4]
        iz  = 1
    else:
        iz_orig = iz  # keep the index we are at currently
        while col2>0 or col3>0:
            
            if col2>col4:   # check if more than maximum value for Data 1
                col2p=col4
                col2 -= col4  # minus the maximum value from current value
            else:
                col2p=col2
                col2 = 0   # set the value to zero
                
            if col3>col4:  # check if more than maximum value for Data 2
                col3p=col4
                col3 -= col4
            else:
                col3p=col3
                col3 = 0

            if iz_orig == iz:
                # enter with the original Index in parenthesis
                dfz.loc[iz] = [str(iz 1) "(" str(col1) ")", col2p, col3p, col4]
            else:
                # enter row with just the new Index
                dfz.loc[iz] = [str(iz 1), col2p, col3p, col4]
                
            iz  = 1

print(dfz)

CodePudding user response：

Let's approach in the following steps:

Step 1: Preparation of split values:

Define custom lambda function to turn Data 1, Data 2 into lists of values split with Max. Allowed if larger than it. Hold the expanded lists in 2 new columns Data 1x, Data 2x:

f = lambda x, y, z: [z] * (x // z)   [x % z]   [0] * (max(x//z, y//z) - x//z)

df['Data 1x'] = df.apply(lambda x: f(x['Data 1'], x['Data 2'], x['Max. Allowed'])  , axis=1)
df['Data 2x'] = df.apply(lambda x: f(x['Data 2'], x['Data 1'], x['Max. Allowed'])  , axis=1)

The lambda function is designed to add 0 into the lists to make the number of elements in lists in the same row to have the same lengths.

Intermediate result:

print(df)

   Index  Data 1  Data 2  Max. Allowed    Data 1x    Data 2x
0      1       1       2             3        [1]        [2]
1      2      10       5             8     [8, 2]     [5, 0]
2      3       7      12             5  [5, 2, 0]  [5, 5, 2]

Step 2: Explode split values into separate rows:

Case 1: If your Pandas version is 1.3 or above

We use DataFrame.explode() to explode the 2 new columns: (this part of feature to explode multiple columns requires Pandas version 1.3 or above)

df = df.explode(['Data 1x', 'Data 2x'])

Case 2: For Pandas version lower than 1.3, try the following way to explode:

df = df.apply(pd.Series.explode)

Case 3: If the above 2 ways to explode don't work in your programming environment, use:

df_exp = df.explode('Data 1x')[['Index', 'Data 1', 'Data 2', 'Max. Allowed']].reset_index(drop=True)
df_1x = df.explode('Data 1x')[['Data 1x']].reset_index(drop=True)
df_2x = df.explode('Data 2x')[['Data 2x']].reset_index(drop=True)

df = df_exp.join([df_1x, df_2x])

Result:

print(df)

   Index  Data 1  Data 2  Max. Allowed Data 1x Data 2x
0      1       1       2             3       1       2
1      2      10       5             8       8       5
1      2      10       5             8       2       0
2      3       7      12             5       5       5
2      3       7      12             5       2       5
2      3       7      12             5       0       2

Step 3: Formatting to the required output:

# select and rename columns
df = (df[['Index', 'Data 1x',  'Data 2x', 'Max. Allowed']]
        .rename({'Data 1x': 'Data 1', 'Data 2x': 'Data 2'}, axis=1)
        .reset_index(drop=True)
     )

# reset the `Index` values
df['Index'] = df.index   1

Final result:

print(df)


   Index Data 1 Data 2  Max. Allowed
0      1      1      2             3
1      2      8      5             8
2      3      2      0             8
3      4      5      5             5
4      5      2      5             5
5      6      0      2             5