pandas matching pattern and appending to column multiple times in same dataframe-CodePudding

I have this dataframe which is built from the following code:

d = [{'AX':['chr=1','pos=2'], 'AVF1':[], 'HI':['chr=343', 'pos=4'], 'version_1':[]},
      {'AX':[], 'AVF1':['chr=4', 'pos=454'], 'HI':[], 'version_2':[]},
      {'AX':['chr=3', 'pos=32'], 'AVF1':['chr=6', 'pos=12'], 'HI':[], 'version_3':[]}]

frame = pd.DataFrame(d)

frame

cols = ['AX','AVF1','HI']

f = frame.T

lst = []
f['temp'] = f.index
for i in f.iloc[-3:, -1]:
  lst.append(i)
f = f.drop(columns={'temp'})

f.columns = [lst, f.columns]
f

chrs = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['chr']]))
pos = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['pos']]))



f = pd.concat([f,chrs], axis=1).sort_index(level=0, axis=1)
f = pd.concat([f,pos], axis=1).sort_index(level=0, axis=1)

f = f.drop(f.index[[-1,-2,-3]])
f

        version_1                        version_2                                version_3
        0                 chr      pos   1                   chr        pos       2                 chr      pos
AX       [chr=1, pos=2]   NaN      NaN    []                 NaN        NaN        [chr=3, pos=32]  NaN      NaN
AVF1     []               NaN      NaN    [chr=4, pos=454]   NaN        NaN        [chr=6, pos=12]  NaN      NaN
HI       [chr=343, pos=4] NaN      NaN    []                 NaN        NaN        []               NaN      NaN

I am trying to look at each column beginning with a int (0,1,2) and pattern match for the patterns beginning "chr" and "pos" up to the first comma i.e. "chr=1" or "pos=454". I am then trying to append the value to the corresponding column.

Desired output:

        version_1                        version_2                                version_3
        0                 chr      pos   1                   chr        pos       2                 chr      pos
AX       [chr=1, pos=2]   chr=1    pos=2      []             NaN        NaN        [chr=3, pos=32]  chr=3    pos=32
AVF1     []               NaN      NaN    [chr=4, pos=454]   chr=4      pos=454    [chr=6, pos=12]  chr=6    pos=12
HI       [chr=343, pos=4] chr=343  pos=4  []                 NaN        NaN        []               NaN      NaN

The real dataframe I'm doing this on has a much larger number of columns so listing each column may not be a viable option. I tried the code below but I'm not good at pattern matching.

f['0'].str.extract(pat='chr')

CodePudding user response：

From what I understand, you should start with a simpler structure. Do not create the output structure, then extract the data but rather extract first the data, then create the new columns

Starting from your complex structure, the first step would be to get the 0,1,2 columns. Then we stack (there is actually no need for a regex as you have lists):

cols = pd.to_numeric(f.columns.get_level_values(1), errors='coerce').notna()
# array([ True, False, False,  True, False, False,  True, False, False])

# get a single column with the chr/pos lists
s = f.loc[:, cols].droplevel(1, axis=1).stack()

# create a 2D structure from the extracted data and replace in origincal DataFrame
f.loc[:, ~cols] = (pd.DataFrame(s.to_list(), columns=['chr', 'pos'], index=s.index)
                     .unstack(1).swaplevel(axis=1))

output:

cols = pd.to_numeric(f.columns.get_level_values(1), errors='coerce').notna()
# array([ True, False, False,  True, False, False,  True, False, False])

s = f.loc[:, cols].droplevel(1, axis=1).stack()

f.loc[:, ~cols] = (pd.DataFrame(s.to_list(), columns=['chr', 'pos'], index=s.index)
                     .unstack(1).swaplevel(axis=1))