How to restructure dataframe using a LOOP

Time:09-21

I posted this question in a different scenario, and now I would like to know how to do it using a loop. For larger dataframes, where I have, say, 100 columns, it becomes difficult to concatenate manually.

I have this Pandas DataFrame:

import pandas as pd

data4 = {'g_pairs':['an_jk', 'tf_ha', 'ab_rh', 'et_x2','yr_po'],
         'g_a':['en762','en72b','en925','en980','en009'],
         'g_b':['en361','en231','en666','en771','en909'],
         'epi|ap':[0.020,1,0.05,0.7,0.001],
         'ap|epi':[1,1,0.1,0.0001,1],
         'fib|mac':[0.001,0.002,0.0021,0.3,0.005],
         'mac|fib':[0.0002,0.0043,0.0067,0.0123,0.0110]}
 
df4 = pd.DataFrame(data4)
   g_pairs  g_a     g_b    epi|ap   ap|epi  fib|mac mac|fib
0   an_jk   en762   en361   0.020   1.0000  0.0010  0.0002
1   tf_ha   en72b   en231   1.000   1.0000  0.0020  0.0043
2   ab_rh   en925   en666   0.050   0.1000  0.0021  0.0067
3   et_x2   en980   en771   0.700   0.0001  0.3000  0.0123
4   yr_po   en009   en909   0.001   1.0000  0.0050  0.0110

I would like to restructure the table by combining columns whose names are opposites of each other. For example, an_jk has a value of 1 under ap|epi; if I flip the name to jk_an, I can add that value as a new row under the epi|ap column and get rid of the ap|epi column.

In essence, I would like to define a list with all the names and have a loop iterate through that list to generate the table.

names = ['ap', 'epi', 'fib', 'mac']

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        col = names[i] + '|' + names[j]   # e.g. 'ap|epi'
        col2 = names[j] + '|' + names[i]  # e.g. 'epi|ap'

Something to that effect.

Output should look like this:

 g_pairs  g_a     g_b     epi|ap  fib|mac
0   an_jk   en762   en361   0.0200  0.0010
1   jk_an   en762   en361   1.0000  0.0002
2   tf_ha   en72b   en231   1.0000  0.0020
3   ha_tf   en72b   en231   1.0000  0.0043
4   ab_rh   en925   en666   0.0500  0.0021
5   rh_ab   en925   en666   0.1000  0.0067
6   et_x2   en980   en771   0.7000  0.3000
7   x2_et   en980   en771   0.0001  0.0123
8   yr_po   en009   en909   0.0010  0.0050
9   po_yr   en009   en909   1.0000  0.0110

CodePudding user response:

Here is a solution that builds the DataFrame without an explicit loop; the only loop is the list comprehension that renames the columns, which is fast.

df = df4.copy()
# sort each 'x|y' column name so mirrored pairs share the same name
df.columns = ['|'.join(sorted(x.split('|'))) if '|' in x else x for x in df.columns]
df = df.set_index(['g_pairs', 'g_a', 'g_b'])
# a cumcount per name makes the duplicated columns unique via a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns, df.groupby(level=0, axis=1).cumcount()])
out = df.stack().reset_index()
g_pairs    g_a    g_b  level_3  ap|epi  fib|mac
0   an_jk  en762  en361        0  0.0200   0.0010
1   an_jk  en762  en361        1  1.0000   0.0002
2   tf_ha  en72b  en231        0  1.0000   0.0020
3   tf_ha  en72b  en231        1  1.0000   0.0043
4   ab_rh  en925  en666        0  0.0500   0.0021
5   ab_rh  en925  en666        1  0.1000   0.0067
6   et_x2  en980  en771        0  0.7000   0.3000
7   et_x2  en980  en771        1  0.0001   0.0123
8   yr_po  en009  en909        0  0.0010   0.0050
9   yr_po  en009  en909        1  1.0000   0.0110
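
If you also want g_pairs flipped on the duplicated rows, as in the desired output, the level_3 counter produced by reset_index can be used for that; a minimal sketch:

# reverse 'a_b' -> 'b_a' on the rows that came from the mirrored columns
flip = out['level_3'].eq(1)
out.loc[flip, 'g_pairs'] = out.loc[flip, 'g_pairs'].map(
    lambda s: '_'.join(s.split('_')[::-1]))
out = out.drop(columns='level_3')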

CodePudding user response:

You could use pivot_longer from pyjanitor to reshape the data; this works best if your columns are ordered and you know the names of the new columns:

# pip install pyjanitor
import pandas as pd
import janitor as jn
outcome = df4.pivot_longer(index = slice('g_pairs', 'g_b'), # or index = 'g*', 
                           names_to=['epi|ap', 'fib|mac'], 
                           names_pattern=['ap', 'fib'],  
                           sort_by_appearance=True)

print(outcome)

  g_pairs    g_a    g_b  epi|ap  fib|mac
0   an_jk  en762  en361  0.0200   0.0010
1   an_jk  en762  en361  1.0000   0.0002
2   tf_ha  en72b  en231  1.0000   0.0020
3   tf_ha  en72b  en231  1.0000   0.0043
4   ab_rh  en925  en666  0.0500   0.0021
5   ab_rh  en925  en666  0.1000   0.0067
6   et_x2  en980  en771  0.7000   0.3000
7   et_x2  en980  en771  0.0001   0.0123
8   yr_po  en009  en909  0.0010   0.0050
9   yr_po  en009  en909  1.0000   0.0110

You can avoid pivot_longer and build up the reshaping yourself, using the idea from @BENY's solution:

outcome = df4.set_index(['g_pairs', 'g_a', 'g_b'])
# this results in duplicated columns
cols = outcome.columns.str.split('|', expand = True).map(sorted).map(tuple) 
# the counter allows us to create unique columns,
# via MultiIndex
counter = pd.Series(cols.get_level_values(0)).groupby(cols).cumcount()
outcome.columns = pd.MultiIndex.from_arrays([cols.map("|".join), counter])
outcome = (outcome.stack(level=-1)
                  .droplevel(-1)
                  .reset_index()
                  .rename(columns={"ap|epi": "epi|ap"})
           )
outcome

  g_pairs    g_a    g_b  epi|ap  fib|mac
0   an_jk  en762  en361  0.0200   0.0010
1   an_jk  en762  en361  1.0000   0.0002
2   tf_ha  en72b  en231  1.0000   0.0020
3   tf_ha  en72b  en231  1.0000   0.0043
4   ab_rh  en925  en666  0.0500   0.0021
5   ab_rh  en925  en666  0.1000   0.0067
6   et_x2  en980  en771  0.7000   0.3000
7   et_x2  en980  en771  0.0001   0.0123
8   yr_po  en009  en909  0.0010   0.0050
9   yr_po  en009  en909  1.0000   0.0110

This works as long as the index is unique; pivot_longer covers scenarios where the index may not be unique.

To go further and flip the values in g_pairs, we can leverage str.replace after a groupby with cumcount:

First, groupby to get the cumulative count:

outcome['counter'] = outcome.groupby('g_pairs').cumcount()
print(outcome)
g_pairs    g_a    g_b  epi|ap  fib|mac  counter
0   an_jk  en762  en361  0.0200   0.0010        0
1   an_jk  en762  en361  1.0000   0.0002        1
2   tf_ha  en72b  en231  1.0000   0.0020        0
3   tf_ha  en72b  en231  1.0000   0.0043        1
4   ab_rh  en925  en666  0.0500   0.0021        0
5   ab_rh  en925  en666  0.1000   0.0067        1
6   et_x2  en980  en771  0.7000   0.3000        0
7   et_x2  en980  en771  0.0001   0.0123        1
8   yr_po  en009  en909  0.0010   0.0050        0
9   yr_po  en009  en909  1.0000   0.0110        1

Next, use Series.str.replace to flip the text where counter is equal to 1:

pat = r"(?P<first>.+)_(?P<last>.+)"

repl = lambda m: f"{m.group('last')}_{m.group('first')}"

outcome['g_pairs'] = outcome.g_pairs.where(outcome.counter.eq(0),
                                           outcome.g_pairs
                                                  .str.replace(pat,
                                                               repl,
                                                               regex=True)
                                           )

outcome.drop(columns='counter')

  g_pairs    g_a    g_b  epi|ap  fib|mac
0   an_jk  en762  en361  0.0200   0.0010
1   jk_an  en762  en361  1.0000   0.0002
2   tf_ha  en72b  en231  1.0000   0.0020
3   ha_tf  en72b  en231  1.0000   0.0043
4   ab_rh  en925  en666  0.0500   0.0021
5   rh_ab  en925  en666  0.1000   0.0067
6   et_x2  en980  en771  0.7000   0.3000
7   x2_et  en980  en771  0.0001   0.0123
8   yr_po  en009  en909  0.0010   0.0050
9   po_yr  en009  en909  1.0000   0.0110
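
Putting the pieces together, here is a sketch of a single helper that scales to many columns. The name restructure_pairs is just illustrative, and it assumes every value column uses the 'x|y' naming and that the id columns are g_pairs, g_a and g_b:

import pandas as pd

def restructure_pairs(df, id_cols=('g_pairs', 'g_a', 'g_b')):
    out = df.copy()
    # sort the two halves of every 'x|y' name so mirrored columns share a name
    out.columns = ['|'.join(sorted(c.split('|'))) if '|' in c else c
                   for c in out.columns]
    out = out.set_index(list(id_cols))
    # a 0/1 counter turns the duplicated names into unique MultiIndex columns
    counter = pd.Series(out.columns).groupby(out.columns).cumcount()
    out.columns = pd.MultiIndex.from_arrays([out.columns, counter])
    out = out.stack().reset_index()
    # flip 'a_b' -> 'b_a' on the rows that came from the mirrored columns
    flip = out.pop('level_3').eq(1)
    out.loc[flip, 'g_pairs'] = out.loc[flip, 'g_pairs'].map(
        lambda s: '_'.join(s.split('_')[::-1]))
    return out

print(restructure_pairs(df4))

The sorted names come out as ap|epi rather than epi|ap; rename them afterwards, as in the steps above, if the original orientation matters.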