subset columns based on partial match and group level in python-CodePudding

I am trying to split my dataframe based on a partial match of the column name, using a group level stored in a separate dataframe. The dataframes are here, and the expected output is below

df = pd.DataFrame(data={'a19-76': [0,1,2],
                        'a23pz': [0,1,2],
                        'a23pze': [0,1,2],
                        'b887': [0,1,2],
                        'b59lp':[0,1,2],
                        'c56-6u': [0,1,2],
                        'c56-6uY': [np.nan, np.nan, np.nan]})

ids = pd.DataFrame(data={'id':   ['a19', 'a23', 'b8', 'b59', 'c56'], 
                        'group': ['test', 'sub', 'test', 'pass', 'fail']})

desired output

test_ids = 'a19-76', 'b887'
sub_ids = 'a23pz', 'a23pze', 'c56-6u' 
pass_ids = 'b59lp'
fail_ids = 'c56-6u', 'c56-6uY'

I have written thise onliner, which assigned the group to each column name, but doesnt create two seperate lists as required above

gb = ids.groupby([[col for col in df.columns if col.startswith(tuple(i for i in ids.id))], 'group']).agg(lambda x: list(x)).reset_index()
gb.groupby('group').agg({'level_0':lambda x: list(x)})

thanks for reading

CodePudding user response：

You can use a regex generated from the values in iidf and filter:

Example with "test":

s = iddf.set_index('group')['id']

regex_test = '^(%s)' % '|'.join(s.loc['test'])
# the generated regex is: '^(a19|b8|b59)'

df.filter(regex=regex_test)

output:

   a19-76  b887  b59lp
0       0     0      0
1       1     1      1
2       2     2      2

To get a list of columns for each unique group in iidf, apply the same process in a dictionary comprehension:

{x: list(df.filter(regex='^(%s)' % '|'.join(s.loc[x])).columns)
 for x in s.index.unique()}

output:

{'test': ['a19-76', 'b887', 'b59lp'],
 'sub': ['a23pz', 'c56-6u']}

NB. this should generalize to any number of groups, however, if really there are many groups, it will be preferable to loop on the columns names rather than using filter repeatedly

CodePudding user response：

May be not what you are looking for, but anyway.

Plain python solution

Simple lists wrangling, but there may be a simpler one

from collections import defaultdict

groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
    for id, group in idsr:
        if col.startswith(id):
            groups[group].append(col)
            break
    else: # for ... else ...
        groups['UNGROUPED'].append(col)

Groups =

{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}

Then after

df.columns = pd.MultiIndex.from_tuples(sorted([(k, col) for k,id in groups.items() for col in id]))

df =

    sub          test           
  a23pz c56-6u a19-76 b59lp b887
0     0      0      0     0    0
1     1      1      1     1    1
2     2      2      2     2    2

pandas solution

Columns to dataframe
product of dataframes (join )
filtering of the resulting dataframe

df1 = ids.copy()

df2 = df.columns.to_frame(index=False)
df2.columns = ['col']


# Not tested enhancement :
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')

df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)

dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)

dfm =

   group     col
6    sub   a23pz
24   sub  c56-6u
0   test  a19-76
18  test   b59lp
12  test    b887

then after

df.columns = pd.MultiIndex.from_frame(dfm)

df =

group   sub          test           
col   a23pz c56-6u a19-76 b59lp b887
0         0      0      0     0    0
1         1      1      1     1    1
2         2      2      2     2    2