My question is similar to this but i'm not sure how to modify the last part for elements in a list only.
I want to split a dataframe into smaller dataframes based on how the column name starts.
For example, column names are in the format of:
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)))
df.columns = ['P1_ATGC', 'P1_GCTA', 'P2_AACT', 'P2_CGAT']
df
P1_ATGC P1_GCTA P2_AACT P2_CGAT
0 78 86 47 78
1 22 48 22 43
2 91 12 45 10
3 83 85 9 20
4 82 26 25 71
5 13 36 53 19
6 93 15 30 28
7 24 13 55 23
8 10 49 98 45
9 85 35 77 89
and want to end up with separate df's for each PX. For example something like:
df[0]
P1_ATGC P1_GCTA
0 78 86
1 22 48
2 91 12
3 83 85
4 82 26
5 13 36
6 93 15
7 24 13
8 10 49
9 85 35
df[1]
P2_AACT P2_CGAT
0 47 78
1 22 43
2 45 10
3 9 20
4 25 71
5 53 19
6 30 28
7 55 23
8 98 45
9 77 89
i'm able to get the unique PXs with: np.unique([x.split('_')[0] for x in df.columns])
it returns:
array(['P1', 'P2'], dtype='<U2')
But how do I split the dataframes by columns ones based on the PX it belongs to?
CodePudding user response:
One way using pandas.DataFrame.groupby
:
res = []
for k, d in df.groupby(df.columns.str.split("_").str[0], axis=1):
res.append(d)
print(d)
Output:
P1_ATGC P1_GCTA
0 78 86
1 22 48
2 91 12
3 83 85
4 82 26
5 13 36
6 93 15
7 24 13
8 10 49
9 85 35
P2_AACT P2_CGAT
0 47 78
1 22 43
2 45 10
3 9 20
4 25 71
5 53 19
6 30 28
7 55 23
8 98 45
9 77 89
CodePudding user response:
Option 1
Use dictionary comprehension with pd.DataFrame.filter
:
{i:df.filter(like=i) for i in df.columns.str.split('_').str[0]}
Option 2
Let's try this by creating a dictionary of dataframes using groupby with axis=1 on string manipulation of df.coulumns:
dict_dfs = dict(tuple(df.groupby(df.columns.str.split('_').str[0], axis=1)))
dict_dfs['P1']
Output
P1_ATGC P1_GCTA
0 17 16
1 96 60
2 73 95
3 51 99
4 55 22
5 9 5
6 58 55
7 76 34
8 69 82
9 72 81
And,
dict_dfs['P2']
Output:
P2_AACT P2_CGAT
0 69 96
1 42 96
2 60 10
3 24 32
4 3 84
5 81 34
6 87 91
7 52 52
8 74 80
9 50 12
CodePudding user response:
You can do
d = dict(zip(df.columns,pd.factorize(df.columns.str.split('_').str[0])[0] 1))
dfs = {x : y for x , y in df.groupby(d,axis=1)}
dfs[1]
Out[657]:
P1_ATGC P1_GCTA
0 44 47
1 67 9
2 36 87
3 88 12
4 39 87
5 81 37
6 72 9
7 69 79
8 82 99
9 29 19