I posted this question in a different scenario and now I would like to know how to do it using a loop. For larger DataFrames, where I have, say, 100 columns, it becomes difficult to concatenate manually.
I have this Pandas DataFrame:
import pandas as pd

data4 = {'g_pairs':['an_jk', 'tf_ha', 'ab_rh', 'et_x2','yr_po'],
         'g_a':['en762','en72b','en925','en980','en009'],
         'g_b':['en361','en231','en666','en771','en909'],
         'epi|ap':[0.020,1,0.05,0.7,0.001],
         'ap|epi':[1,1,0.1,0.0001,1],
         'fib|mac':[0.001,0.002,0.0021,0.3,0.005],
         'mac|fib':[0.0002,0.0043,0.0067,0.0123,0.0110]}
df4 = pd.DataFrame(data4)
g_pairs g_a g_b epi|ap ap|epi fib|mac mac|fib
0 an_jk en762 en361 0.020 1.0000 0.0010 0.0002
1 tf_ha en72b en231 1.000 1.0000 0.0020 0.0043
2 ab_rh en925 en666 0.050 0.1000 0.0021 0.0067
3 et_x2 en980 en771 0.700 0.0001 0.3000 0.0123
4 yr_po en009 en909 0.001 1.0000 0.0050 0.0110
I would like to restructure the table by merging each pair of mirrored column names into one. For example, an_jk has a value of 1 under ap|epi; if I flip the name to jk_an, I can add that value as a new row under the epi|ap column and get rid of the ap|epi column.
In essence, I would like to define a list with all names and have a loop iterate through the list to generate this table.
names = ['ap', 'epi', 'fib', 'mac']
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        col = names[i] + '|' + names[j]
        col2 = names[j] + '|' + names[i]
Something to that effect.
Output should look like this:
g_pairs g_a g_b epi|ap fib|mac
0 an_jk en762 en361 0.0200 0.0010
1 jk_an en762 en361 1.0000 0.0002
2 tf_ha en72b en231 1.0000 0.0020
3 ha_tf en72b en231 1.0000 0.0043
4 ab_rh en925 en666 0.0500 0.0021
5 rh_ab en925 en666 0.1000 0.0067
6 et_x2 en980 en771 0.7000 0.3000
7 x2_et en980 en771 0.0001 0.0123
8 yr_po en009 en909 0.0010 0.0050
9 po_yr en009 en909 1.0000 0.0110
CodePudding user response:
One solution that creates the df without a for loop; the only loop is a list comprehension that renames the columns, which is really fast.
df = df4.copy()
# sort the two names within each paired label so opposites collide: 'epi|ap' -> 'ap|epi'
df.columns = ['|'.join(sorted(x.split('|'))) if '|' in x else x for x in df.columns]
df = df.set_index(['g_pairs','g_a','g_b'])
# a per-label cumulative count keeps the now-duplicated labels unique in a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns, df.groupby(level=0, axis=1).cumcount()])
out = df.stack().reset_index()
g_pairs g_a g_b level_3 ap|epi fib|mac
0 an_jk en762 en361 0 0.0200 0.0010
1 an_jk en762 en361 1 1.0000 0.0002
2 tf_ha en72b en231 0 1.0000 0.0020
3 tf_ha en72b en231 1 1.0000 0.0043
4 ab_rh en925 en666 0 0.0500 0.0021
5 ab_rh en925 en666 1 0.1000 0.0067
6 et_x2 en980 en771 0 0.7000 0.3000
7 et_x2 en980 en771 1 0.0001 0.0123
8 yr_po en009 en909 0 0.0010 0.0050
9 yr_po en009 en909 1 1.0000 0.0110
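Here the level_3 column marks which orientation each row came from (0 = as-is, 1 = reversed), so a small follow-up sketch can flip g_pairs on the reversed rows, drop the helper column, and rename the value column to match the requested header:
flip = lambda s: '_'.join(s.split('_')[::-1])
out['g_pairs'] = out['g_pairs'].where(out['level_3'].eq(0), out['g_pairs'].map(flip))
out = out.drop(columns='level_3').rename(columns={'ap|epi': 'epi|ap'})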
CodePudding user response:
You could use pivot_longer from pyjanitor to reshape the data; this works best if your columns are ordered and you know the names of the new columns:
# pip install pyjanitor
import pandas as pd
import janitor as jn
outcome = df4.pivot_longer(index=slice('g_pairs', 'g_b'),  # or index='g*'
                           names_to=['epi|ap', 'fib|mac'],
                           names_pattern=['ap', 'fib'],
                           sort_by_appearance=True)
print(outcome)
g_pairs g_a g_b epi|ap fib|mac
0 an_jk en762 en361 0.0200 0.0010
1 an_jk en762 en361 1.0000 0.0002
2 tf_ha en72b en231 1.0000 0.0020
3 tf_ha en72b en231 1.0000 0.0043
4 ab_rh en925 en666 0.0500 0.0021
5 ab_rh en925 en666 0.1000 0.0067
6 et_x2 en980 en771 0.7000 0.3000
7 et_x2 en980 en771 0.0001 0.0123
8 yr_po en009 en909 0.0010 0.0050
9 yr_po en009 en909 1.0000 0.0110
You can avoid pivot_longer and build up the reshaping process yourself (with the idea from @BENY's solution):
outcome = df4.set_index(['g_pairs', 'g_a', 'g_b'])
# this results in duplicated columns
cols = outcome.columns.str.split('|', expand = True).map(sorted).map(tuple)
# the counter allows us to create unique columns,
# via MultiIndex
counter = pd.Series(cols.get_level_values(0)).groupby(cols).cumcount()
outcome.columns = pd.MultiIndex.from_arrays([cols.map("|".join), counter])
outcome = (outcome.stack(level=-1)
                  .droplevel(-1)
                  .reset_index()
                  .rename(columns={"ap|epi": "epi|ap"})
           )
print(outcome)
g_pairs g_a g_b epi|ap fib|mac
0 an_jk en762 en361 0.0200 0.0010
1 an_jk en762 en361 1.0000 0.0002
2 tf_ha en72b en231 1.0000 0.0020
3 tf_ha en72b en231 1.0000 0.0043
4 ab_rh en925 en666 0.0500 0.0021
5 ab_rh en925 en666 0.1000 0.0067
6 et_x2 en980 en771 0.7000 0.3000
7 et_x2 en980 en771 0.0001 0.0123
8 yr_po en009 en909 0.0010 0.0050
9 yr_po en009 en909 1.0000 0.0110
This works as long as the index is unique; pivot_longer covers scenarios where the index may not be unique.
To go further and flip the values in g_pairs, we can leverage str.replace after a groupby with cumcount:
First, groupby to get the cumulative count:
outcome['counter'] = outcome.groupby('g_pairs').cumcount()
print(outcome)
g_pairs g_a g_b epi|ap fib|mac counter
0 an_jk en762 en361 0.0200 0.0010 0
1 an_jk en762 en361 1.0000 0.0002 1
2 tf_ha en72b en231 1.0000 0.0020 0
3 tf_ha en72b en231 1.0000 0.0043 1
4 ab_rh en925 en666 0.0500 0.0021 0
5 ab_rh en925 en666 0.1000 0.0067 1
6 et_x2 en980 en771 0.7000 0.3000 0
7 et_x2 en980 en771 0.0001 0.0123 1
8 yr_po en009 en909 0.0010 0.0050 0
9 yr_po en009 en909 1.0000 0.0110 1
Next, use Series.str.replace to flip the text where counter equals 1:
pat = r"(?P<first>. )_(?P<last>. )"
repl = lambda m: f"{m.group('last')}_{m.group('first')}"
outcome['g_pairs'] = outcome.g_pairs.where(outcome.counter.eq(0),
outcome.g_pairs
.str.replace(pat,
repl,
regex = True)
)
outcome.drop(columns='counter')
g_pairs g_a g_b epi|ap fib|mac
0 an_jk en762 en361 0.0200 0.0010
1 jk_an en762 en361 1.0000 0.0002
2 tf_ha en72b en231 1.0000 0.0020
3 ha_tf en72b en231 1.0000 0.0043
4 ab_rh en925 en666 0.0500 0.0021
5 rh_ab en925 en666 0.1000 0.0067
6 et_x2 en980 en771 0.7000 0.3000
7 x2_et en980 en771 0.0001 0.0123
8 yr_po en009 en909 0.0010 0.0050
9 po_yr en009 en909 1.0000 0.0110
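Finally, if you still want the explicit loop from the question for the 100-column case, here is a minimal sketch; it assumes every paired column has its reversed twin present (e.g. epi|ap and ap|epi) and is only checked against this sample:
id_cols = ['g_pairs', 'g_a', 'g_b']
paired = [c for c in df4.columns if '|' in c]

# keep the first orientation of each pair as it appears in the frame
kept = []
for c in paired:
    a, b = c.split('|')
    if f'{b}|{a}' not in kept:
        kept.append(c)
reversed_cols = ['|'.join(c.split('|')[::-1]) for c in kept]

top = df4[id_cols + kept].copy()
bottom = df4[id_cols + reversed_cols].copy()
bottom.columns = top.columns  # align the flipped values under the kept labels
bottom['g_pairs'] = bottom['g_pairs'].map(lambda s: '_'.join(s.split('_')[::-1]))

# a stable sort on the shared row index interleaves original and flipped rows
result = pd.concat([top, bottom]).sort_index(kind='stable').reset_index(drop=True)
print(result)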