Home > Back-end >  Python: merge columns with same name, keeping minimum value
Python: merge columns with same name, keeping minimum value

Time:04-08

I have a big matrix, like this:

df:
    A    A    A    B    B  ... (column names) 
 A  2    4    5     9    2
 A  6    8    7     6    4
 A  5    2    6     4    5
 B  3    4    1     3    4
 B  4    5    3     1    4

 .
 .
(row names)

I would like to merge the columns with same name, and findig the minimum value. At the end I would like to have a matrix like this:

  df_min:
    A   B  ... (column names) 
 A  2    2
 A  6    4
 A  2    4     
 B  1    3     
 B  3    1     
 .
 .
(row names)

My intentions, afterwards (outside of the question), is to merge the rows as well. Desired outcome:

  df_min:
    A   B  ... (column names) 
 A  2    2   
 B  1    1     
 .
 .
(row names)

I tried this:

df_min= df.groupby('df.columns, axis=1').agg(np.min)

But it didn't work, it removed some rows (for example, removing entirely row A)... EDIT: Apparently, it worked fine but I had two columns with different names but whitespace at the end of the name. These methods reorder the columns, which confused me.

A snipped of the dataframe:

The data comes from the Chesss Database

CodePudding user response:

Simply groupby on the level=0 for each axis:

df.groupby(level=0, axis=1).min()

output:

   A  B
A  2  2
A  6  4
A  2  4
B  1  3
B  3  1

both axes:

df.groupby(level=0, axis=1).min().groupby(level=0).min()

output:

   A  B
A  2  2
B  1  1

Alternatively, use a single groupby trough a stack/unstack:

df.stack().groupby(level=[0,1]).min().unstack()

output:

   A  B
A  2  2
B  1  1

CodePudding user response:

EDIT

numpy only based solution


I'm assuming that you have a list associating names to column indices, e.g. for the first code sample you provided something like

column_names = ['A', 'A', 'A', 'B', 'B']

and that your data type is single-precision floating point. In this scenario, you can do something like the following:

unique_column_names = list(dict.fromkeys(column_names)) # get unique column names preserving original order
df_min = np.empty((df.shape[0], len(unique_column_names), dtype=np.float32) # allocate output array

for i, column_name in enumerate(unique_column_names): # iterate over unique column names
   column_indices = [id for id in range(df.shape[1]) if column_names[id] == column_name] # extract all column indices having the same name
   tmp = df[:, column_indices] # extract columns named as column_name
   df_min[:, i] = np.amin(tmp, axis=1)] # take min by row and save result

Then, if you want to repeat the process by row, assuming you have another list associating row indices and names named row_names

unique_row_names = list(dict.fromkeys(row_names)) # get unique row names preserving order
df_final = np.empty((len(unique_row_names), len(unique_column_names),  dtype=np.float32) # allocate final output

for j, row_name in enumerate(unique_row_names): # iterate over unique row names
    row_indices = [id for id in range(df.shape[0]) if row_names[id] == row_name] # extract rows having row_name
    tmp = df_min[row_indices, :] # extract rows named as row_name from the column-reduced matrix
    df_final[j, :] = np.amin(tmp, axis=0) # take min by column and save result

The column-name and row-name association list for the final output are unique_column_names and unique_row_names

  • Related