Home > Blockchain >  Allocate values between two pandas dataframes
Allocate values between two pandas dataframes

Time:01-27

Consider two dataframes:

>> import pandas as pd
>> df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"], "quantity": [1,2,1,2,3]})
>> print(df1)

    category    quantity
0   foo         1
1   foo         2
2   bar         1
3   bar         2
4   bar         3

>> df2 = pd.DataFrame({
            "category": ["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"], 
            "item": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
        })
>> print(df2)
      category item
0      foo      A
1      foo      B
2      foo      C
3      foo      D
4      bar      E
5      bar      F
6      bar      G
7      bar      H
8      bar      I
9      bar      J

How can I create a new column in df1 (new dataframe called df3) which joins on category column of df1 and allocates the item column in df2. So, create something like:

>> df3 = pd.DataFrame({
           "category": ["foo", "foo", "bar", "bar", "bar"], 
           "quantity": [1,2,1,2,3],
           "item": ["A", "B,C", "E", "F,G", "H,I,J"] 
})

     category  quantity   item
0      foo         1      A
1      foo         2      B,C
2      bar         1      E
3      bar         2      F,G
4      bar         3      H,I,J

CodePudding user response:

You can create helper DataFrame by repeat rows by quantity column by Index.repeat with DataFrame.loc, convert index to column for avoid lost indices and create helper column g in both DataFrames for merging by duplicated categories by GroupBy.cumcount, then use DataFrame.merge with aggregate join:

df11 = (df1.loc[df1.index.repeat(df1['quantity'])].reset_index()
           .assign(g = lambda x: x.groupby('category').cumcount()))

df22 = df2.assign(g = df2.groupby('category').cumcount())

df = (df11.merge(df22, on=['g','category'], how='left')
          .groupby(['index','category','quantity'])['item']
          .agg(lambda x: ','.join(x.dropna()))
          .droplevel(0)
          .reset_index())
print (df)
  category  quantity   item
0      foo         1      A
1      foo         2    B,C
2      bar         1      E
3      bar         2    F,G
4      bar         3  H,I,J

CodePudding user response:

You can use an iterator with itertools.islice:

from itertools import islice

# aggregate the items as iterator
s = df2.groupby('category')['item'].agg(iter)

# for each category, allocate as many items as needed and join
df1['item'] = (df1.groupby('category', group_keys=False)['quantity']
                  .apply(lambda g:
                         g.map(lambda x: ','.join(list(islice(s[g.name], x)))))
               )

Output:

  category  quantity   item
0      foo         1      A
1      foo         2    B,C
2      bar         1      E
3      bar         2    F,G
4      bar         3  H,I,J

Note that if you don't have enough items, this will just use what is available.

Example using df2 truncated after F as input:

  category  quantity item
0      foo         1    A
1      foo         2  B,C
2      bar         1    E
3      bar         2    F
4      bar         3     
  • Related