Home > OS >  Python: Concat two dataframe with number of times row repeats
Python: Concat two dataframe with number of times row repeats

Time:10-13

Have got two dataframe like below with index values: tab_df

        row     table_no    table_bit   
83      45          1           1
84      45          1           2
85      45          1           3
86      45          2           1
87      45          2           2
88      45          2           3
91      46          1           1
92      46          1           2
93      46          1           3

item_df

        count   Item
15      2       Apple
16      2       Apple
17      1       Lemon
18      2       Apricot
19      2       Apricot
35      2       Melon
36      2       Melon
37      1       Orange
38      1       Potato

Condition to be considered:

  1. tab_df rows along with index value should not be jumbled. It's fixed.
  2. item_df rows to be concat/joined with tab_df to attain output_df considering below rules: (item_df data will be in row-table_no-table_bit order. Each row can have any number of table_no. Each table_no will have only 3 table_bit.) 2.1. item_df rows to be arranged/joined in the same order with tab_df the number of times it repeats. [Column count gives number of times that particular Item repeats]. E.g., Item Apple repeats twice so it takes two bits from first row, then comes Lemon which takes single bit from same row and continues. 2.2. No Item should share table_bit across rows. E.g., On arranging up items with rule 2.1, at the end of row 45, after Apricot, Melon need to be arranged but that row got only single table_bit left out. So, need to check for next Item which needs only single bit and that to be arranged there and Melon should be moved to other row and carry on with the same flow as of rule 2.1 2.3. Each Item in item_df got it's respective index which should not be changed. If any item is rearranged like Melon same index need to be carried out for every item.

output_df

tab_index    row     table_no    table_bit   item_index  count   Item
83           45          1           1           15         2    Apple
84           45          1           2           16         2    Apple
85           45          1           3           17         1    Lemon
86           45          2           1           18         2    Apricot
87           45          2           2           19         2    Apricot
88           45          2           3           37         1    Orange
91           46          1           1           35         2    Melon
92           46          1           2           36         2    Melon
93           46          1           3           38         1    Potato

Looking forward. Any help will be appreciated!

CodePudding user response:

I feel like there ought to be an easier way to do this, but if I understand you correctly, the following should work.

Setup

import pandas as pd

data = {'row': {83: 45, 84: 45, 85: 45, 86: 45, 87: 45, 88: 45, 91: 46, 92: 46, 93: 46}, 
        'table_no': {83: 1, 84: 1, 85: 1, 86: 2, 87: 2, 88: 2, 91: 1, 92: 1, 93: 1}, 
        'table_bit': {83: 1, 84: 2, 85: 3, 86: 1, 87: 2, 88: 3, 91: 1, 92: 2, 93: 3}}
tab_df = pd.DataFrame(data)

data2 = {'count': {15: 2, 16: 2, 17: 1, 18: 2, 19: 2, 35: 2, 36: 2, 37: 1, 38: 1}, 
         'Item': {15: 'Apple', 16: 'Apple', 17: 'Lemon', 18: 'Apricot', 19: 'Apricot', 35: 'Melon', 36: 'Melon', 37: 'Orange', 38: 'Potato'}}
item_df = pd.DataFrame(data2)

Code

tmp = item_df.reset_index().groupby('Item', as_index=False).agg(
    {'index': list, 'count': 'first'})

tmp = tmp.loc[tmp.groupby('count').cumcount().sort_values().index].explode('index')

tmp = tmp.loc[:, ['index','count','Item']].rename(columns={'index':'item_index'})
tab_df[tmp.columns] = tmp.to_numpy()

print(tab_df)

    row  table_no  table_bit item_index count     Item
83   45         1          1         15     2    Apple
84   45         1          2         16     2    Apple
85   45         1          3         17     1    Lemon
86   45         2          1         18     2  Apricot
87   45         2          2         19     2  Apricot
88   45         2          3         37     1   Orange
91   46         1          1         35     2    Melon
92   46         1          2         36     2    Melon
93   46         1          3         38     1   Potato

Explanation

  • First, use df.reset_index and apply df.groupby on Item. Use .agg to turn index values (now a column) per group into lists (e.g. for Apple: [15, 16]), and for column count simply get .first (e.g. for Apple: 2).
  • In the next step, apply df.groupby again, this time on count (now: [2, 2, 1, 2, 1, 1] associated with ['Apple', 'Apricot', 'Lemon', 'Melon', 'Orange', 'Potato']), and get .cumcount. Sorting these values using Series.sort_values gets us a rearranged index [0, 2, 1, 4, 3, 5]. We want the original index order adjusted in this way, which we can achieve by nesting the above inside tmp.loc. Chain .explode on column index to get the lists back as rows again.
  • The row order for tmp (derived from item_df) is now correct; last step is just cosmetic: i.e. getting the columns in the desired order as well: tmp = tmp.loc[:, ['index','count','Item']], and chaining .rename to change col name index into item_index.
  • Finally, add these columns to tab_df. Use to_numpy, since we want to ignore the index of tmp.
  • Related