I have this dataframe which is built from the following code:
d = [{'AX':['chr=1','pos=2'], 'AVF1':[], 'HI':['chr=343', 'pos=4'], 'version_1':[]},
{'AX':[], 'AVF1':['chr=4', 'pos=454'], 'HI':[], 'version_2':[]},
{'AX':['chr=3', 'pos=32'], 'AVF1':['chr=6', 'pos=12'], 'HI':[], 'version_3':[]}]
frame = pd.DataFrame(d)
frame
cols = ['AX','AVF1','HI']
f = frame.T
lst = []
f['temp'] = f.index
for i in f.iloc[-3:, -1]:
lst.append(i)
f = f.drop(columns={'temp'})
f.columns = [lst, f.columns]
f
chrs = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['chr']]))
pos = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['pos']]))
f = pd.concat([f,chrs], axis=1).sort_index(level=0, axis=1)
f = pd.concat([f,pos], axis=1).sort_index(level=0, axis=1)
f = f.drop(f.index[[-1,-2,-3]])
f
version_1 version_2 version_3
0 chr pos 1 chr pos 2 chr pos
AX [chr=1, pos=2] NaN NaN [] NaN NaN [chr=3, pos=32] NaN NaN
AVF1 [] NaN NaN [chr=4, pos=454] NaN NaN [chr=6, pos=12] NaN NaN
HI [chr=343, pos=4] NaN NaN [] NaN NaN [] NaN NaN
I am trying to look at each column beginning with a int (0,1,2) and pattern match for the patterns beginning "chr" and "pos" up to the first comma i.e. "chr=1" or "pos=454". I am then trying to append the value to the corresponding column.
Desired output:
version_1 version_2 version_3
0 chr pos 1 chr pos 2 chr pos
AX [chr=1, pos=2] chr=1 pos=2 [] NaN NaN [chr=3, pos=32] chr=3 pos=32
AVF1 [] NaN NaN [chr=4, pos=454] chr=4 pos=454 [chr=6, pos=12] chr=6 pos=12
HI [chr=343, pos=4] chr=343 pos=4 [] NaN NaN [] NaN NaN
The real dataframe I'm doing this on has a much larger number of columns so listing each column may not be a viable option. I tried the code below but I'm not good at pattern matching.
f['0'].str.extract(pat='chr')
CodePudding user response:
From what I understand, you should start with a simpler structure. Do not create the output structure, then extract the data but rather extract first the data, then create the new columns
Starting from your complex structure, the first step would be to get the 0,1,2 columns. Then we stack
(there is actually no need for a regex as you have lists):
cols = pd.to_numeric(f.columns.get_level_values(1), errors='coerce').notna()
# array([ True, False, False, True, False, False, True, False, False])
# get a single column with the chr/pos lists
s = f.loc[:, cols].droplevel(1, axis=1).stack()
# create a 2D structure from the extracted data and replace in origincal DataFrame
f.loc[:, ~cols] = (pd.DataFrame(s.to_list(), columns=['chr', 'pos'], index=s.index)
.unstack(1).swaplevel(axis=1))
output:
cols = pd.to_numeric(f.columns.get_level_values(1), errors='coerce').notna()
# array([ True, False, False, True, False, False, True, False, False])
s = f.loc[:, cols].droplevel(1, axis=1).stack()
f.loc[:, ~cols] = (pd.DataFrame(s.to_list(), columns=['chr', 'pos'], index=s.index)
.unstack(1).swaplevel(axis=1))