Home > Back-end >  Create cross terms in pandas dataframe
Create cross terms in pandas dataframe

Time:09-23

How should I generate the data2 from data1? The columns (start & end) of data2 should be the corss terms of data1. The corresponding values will be assigned to each cross term. Thanks in advance!

    data = pd.DataFrame([['EU', ('a b'), ('d e'), ('1 2 3 4')],
                         ['NA', ('g h i'), ('j k l'), ('11 12 13 14 15 16 17 18 19')]],
                        columns=['region', 'start', 'end', 'values'])
    
      region  start    end                      values
    0     EU    a b    d e                     1 2 3 4
    1     NA  g h i  j k l  11 12 13 14 15 16 17 18 19
  data2 = pd.DataFrame([['EU', 'a', 'd', 1], ['EU', 'a', 'e', 2], ['EU', 'b', 'd', 3], ['EU', 'b', 'e', 4],
                          ['NA', 'g', 'j', 11], ['NA', 'g', 'k', 12], ['NA', 'g', 'l', 13],
                          ['NA', 'h', 'j', 14], ['NA', 'h', 'k', 15], ['NA', 'h', 'l', 16],
                          ['NA', 'i', 'j', 17], ['NA', 'i', 'k', 18], ['NA', 'i', 'l', 19]],
                         columns=['region', 'start', 'end', 'values'])

   region start end  values
0      EU     a   d       1
1      EU     a   e       2
2      EU     b   d       3
3      EU     b   e       4
4      NA     g   j      11
5      NA     g   k      12
6      NA     g   l      13
7      NA     h   j      14
8      NA     h   k      15
9      NA     h   l      16
10     NA     i   j      17
11     NA     i   k      18
12     NA     i   l      19

CodePudding user response:

You can split the strings, explode start and end successively, then use groupby.transform to explode the values per group in place:

out = (data
 .assign(start=data['start'].str.split(),
         end=data['end'].str.split(),
         values=data['values'].str.split(),
        )
 .explode('start').explode('end')
 .assign(values=lambda d: d.groupby(level=0)['values']
                           .transform(lambda g: g.iloc[0]))
 .reset_index(drop=True)
)

NB. this requires that the number of values after splitting is equal to the product of the number of start and end values.

output:

   region start end values
0      EU     a   d      1
1      EU     a   e      2
2      EU     b   d      3
3      EU     b   e      4
4      NA     g   j     11
5      NA     g   k     12
6      NA     g   l     13
7      NA     h   j     14
8      NA     h   k     15
9      NA     h   l     16
10     NA     i   j     17
11     NA     i   k     18
12     NA     i   l     19
  • Related