Reproducible data:
import random
data = {'test_1_a':random.sample(range(1, 50), 7),
'test_1_b':random.sample(range(1, 50), 7),
'test_1_c':random.sample(range(1, 50), 7),
'test_2_a':random.sample(range(1, 50), 7),
'test_2_b':random.sample(range(1, 50), 7),
'test_2_c':random.sample(range(1, 50), 7),
'test_3_a':random.sample(range(1, 50), 7),
'test_4_b':random.sample(range(1, 50), 7),
'test_4_c':random.sample(range(1, 50), 7)}
df = pd.DataFrame(data)
Description:
I have a data frame similar to the example I gave above with 1000ish columns. The column name format is as following:
test_number_family so test_1_c would be a number type of 1 and the family of "c"
I want to classify the df by column names of the same "family" type. So my final output needs to be a list of lists of same family values:
Output example:
[ [a_familily values], [b_familily values],...]
it would also look like the values of columns:
[ [test_1_a, test_2_a , test_3_a ] , [test_1_b, test_2_b , test_3_b ] , ...]
What I have:
#### transfers data frame into a sorted dict (by column name) by columns as key
col_names = [ i for (i,j) in df.iteritems() ]
col_vals = [ j for (i,j) in df.iteritems() ]
df_dict = dict(zip(col_names, col_vals))
families = np.unique([ i.split("_")[2] for i in dict1.keys() ])
I have classified each column name with its associated value and extracted the unique number of groups I want to have in the final output as "families". I now am seeking help in categorizing the data frame into a length(families) number of lists identical to the output example I have given above.
I hope my explanation has been clear, thank you for your time!
CodePudding user response:
Let's keep track of the different families in a dictionary, the keys being the letters (the families) and the values being lists holding the columns from a certain family.
Since we know that each column ends with a letter related with its family, we can use that as a key in the dictionary.
from collections import defaultdict
families = defaultdict(list)
for col in df.columns:
families[col[-1]].append(df[col])
Now for example, in families["a"]
, we have:
[0 26
1 13
2 11
3 35
4 43
5 45
6 46
Name: test_1_a, dtype: int64,
0 10
1 15
2 20
3 43
4 40
5 35
6 22
Name: test_2_a, dtype: int64,
0 35
1 48
2 38
3 13
4 3
5 10
6 25
Name: test_3_a, dtype: int64]
We can easily get a per-family dataframe with concat
.
df_a = pd.concat(families["a"], axis=1)
Gets us:
test_1_a test_2_a test_3_a
0 26 10 35
1 13 15 48
2 11 20 38
3 35 43 13
4 43 40 3
5 45 35 10
6 46 22 25
If we were to create a dictionary of dataframes per family,
dfs = {f"df_{fam}" : pd.concat(families[fam], axis=1) for fam in families.keys()}
Now, the dictionary dfs
contains:
{'df_a': test_1_a test_2_a test_3_a
0 26 10 35
1 13 15 48
2 11 20 38
3 35 43 13
4 43 40 3
5 45 35 10
6 46 22 25,
'df_b': test_1_b test_2_b test_4_b
0 18 4 44
1 48 43 2
2 30 21 4
3 46 12 16
4 42 14 25
5 22 24 13
6 43 40 43,
'df_c': test_1_c test_2_c test_4_c
0 25 15 5
1 36 39 28
2 6 3 37
3 22 48 16
4 2 34 25
5 39 16 30
6 32 36 2}
CodePudding user response:
What do you think of an approach like that ? Use pd.wide_to_long
with the result of a long dataframe with splitted columns, one with the whole classification
like 1_a
, one only with the number, one with the family and their values.
df = (pd.wide_to_long(
df.reset_index(),stubnames='test_',i='index',j='classification',suffix='\d_\w')
.reset_index()
.drop('index',axis=1)
.rename(columns={'test_':'values'}))
df[['number', 'family']] = df['classification'].str.split('_', expand=True)
df = df.reindex(columns=['classification', 'number', 'family', 'values'])
print(df)
classification number family values
0 1_a 1 a 29
1 1_a 1 a 46
2 1_a 1 a 2
3 1_a 1 a 6
4 1_a 1 a 16
.. ... ... ... ...
58 4_c 4 c 30
59 4_c 4 c 23
60 4_c 4 c 26
61 4_c 4 c 40
62 4_c 4 c 39
Easy to groupby or filter for more analysis.
If you want to get dicts
or lists
of specific data, here some examples:
filter1 = df.loc[df['classification']=='1_a',:]
filter2 = df.loc[df['number']=='2','values']
filter1.to_dict(orient='list')
Output:
{'classification': ['1_a', '1_a', '1_a', '1_a', '1_a', '1_a', '1_a'],
'number': ['1', '1', '1', '1', '1', '1', '1'],
'family': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
'values': [29, 46, 2, 6, 16, 12, 38]}
filter2.tolist()
Output:
[8, 2, 43, 9, 5, 30, 28, 26, 25, 49, 3, 1, 47, 44, 16, 9, 8, 15, 24, 36, 1]