Home > Back-end >  split dataframe by column name if it starts with string in list
split dataframe by column name if it starts with string in list

Time:06-15

My question is similar to this but i'm not sure how to modify the last part for elements in a list only.

I want to split a dataframe into smaller dataframes based on how the column name starts.

For example, column names are in the format of:

df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)))
df.columns = ['P1_ATGC', 'P1_GCTA', 'P2_AACT', 'P2_CGAT']
df
   P1_ATGC  P1_GCTA  P2_AACT  P2_CGAT
0       78       86       47       78
1       22       48       22       43
2       91       12       45       10
3       83       85        9       20
4       82       26       25       71
5       13       36       53       19
6       93       15       30       28
7       24       13       55       23
8       10       49       98       45
9       85       35       77       89

and want to end up with separate df's for each PX. For example something like:

df[0]
   P1_ATGC  P1_GCTA
0       78       86
1       22       48
2       91       12
3       83       85
4       82       26
5       13       36
6       93       15
7       24       13
8       10       49
9       85       35

df[1]
   P2_AACT  P2_CGAT
0       47       78
1       22       43
2       45       10
3        9       20
4       25       71
5       53       19
6       30       28
7       55       23
8       98       45
9       77       89

i'm able to get the unique PXs with: np.unique([x.split('_')[0] for x in df.columns]) it returns:

array(['P1', 'P2'], dtype='<U2')

But how do I split the dataframes by columns ones based on the PX it belongs to?

CodePudding user response:

One way using pandas.DataFrame.groupby:

res = []
for k, d in df.groupby(df.columns.str.split("_").str[0], axis=1):
    res.append(d)
    print(d)

Output:

   P1_ATGC  P1_GCTA
0       78       86
1       22       48
2       91       12
3       83       85
4       82       26
5       13       36
6       93       15
7       24       13
8       10       49
9       85       35
   P2_AACT  P2_CGAT
0       47       78
1       22       43
2       45       10
3        9       20
4       25       71
5       53       19
6       30       28
7       55       23
8       98       45
9       77       89

CodePudding user response:

Option 1

Use dictionary comprehension with pd.DataFrame.filter:

{i:df.filter(like=i) for i in df.columns.str.split('_').str[0]}

Option 2

Let's try this by creating a dictionary of dataframes using groupby with axis=1 on string manipulation of df.coulumns:

dict_dfs = dict(tuple(df.groupby(df.columns.str.split('_').str[0], axis=1)))

dict_dfs['P1']

Output

   P1_ATGC  P1_GCTA
0       17       16
1       96       60
2       73       95
3       51       99
4       55       22
5        9        5
6       58       55
7       76       34
8       69       82
9       72       81

And,

dict_dfs['P2']

Output:

   P2_AACT  P2_CGAT
0       69       96
1       42       96
2       60       10
3       24       32
4        3       84
5       81       34
6       87       91
7       52       52
8       74       80
9       50       12

CodePudding user response:

You can do

d = dict(zip(df.columns,pd.factorize(df.columns.str.split('_').str[0])[0] 1))
dfs = {x : y for x , y in df.groupby(d,axis=1)}
dfs[1]
Out[657]: 
   P1_ATGC  P1_GCTA
0       44       47
1       67        9
2       36       87
3       88       12
4       39       87
5       81       37
6       72        9
7       69       79
8       82       99
9       29       19
  • Related