The Problem

I had a hard time phrasing this question but essentially I have a series of X columns that represent weights at specific points in time. Then another set of X columns that represent the names of those people that were measured.

That table looks like this (there's more than two columns, this is just a toy example):

a_weight	b_weight	a_name	b_name
10	5	John	Michael
1	2	Jake	Michelle
21	3	Alice	Bob
2	1	Ashley	Brian

What I Want

I want to have a two columns with the maximum weight and name at each point in time. I want this to be vectorized because the data is a lot. I can do it using a for loop or an .apply(lambda row: row[col]) but it is very slow.

So the final table would look something like this:

a_weight	b_weight	a_name	b_name	max_weight	max_name
10	5	John	Michael	a_weight	John
1	2	Jake	Michelle	b_weight	Michelle
21	3	Alice	Bob	a_weight	Alice
2	1	Ashley	Brian	a_weight	Ashley

What I've Tried

I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:

df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")

This returns a column that is the max_weight column in the section above. Now I run:

df["max_name_col"] = max_weight_col.str.replace("_weight","_name")

and I have this:

a_weight	b_weight	a_name	b_name	max_weight	max_name_col
10	5	John	Michael	a_weight	a_name
1	2	Jake	Michelle	b_weight	b_name
21	3	Alice	Bob	a_weight	a_name
2	1	Ashley	Brian	a_weight	a_name

I basically want to run a code similar to the one below without a for-loop:

df["max_name"] = [row[row["max_name_col"]] for row in df]

How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.

CodePudding user response：

You can do that for sure just pass to numpy argmax

v1 = df.filter(like='weight').values
v2 = df.filter(like='name').values
df['max_weight'] = v1[df.index, v1.argmax(1)]
df['max_name'] = v2[df.index, v1.argmax(1)]
df
Out[921]: 
   a_weight  b_weight  a_name    b_name  max_weight  max_name
0        10         5    John   Michael          10      John
1         1         2    Jake  Michelle           2  Michelle
2        21         3   Alice       Bob          21     Alice
3         2         1  Ashley     Brian           2    Ashley

CodePudding user response：

This would do the trick assuming you only have 2 weight columns:

df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]

CodePudding user response：

We could use idxmax to find the column names; then use factorize numpy advanced indexing to get the names:

df['max_weight'] = df.loc[:, df.columns.str.contains('weight')].idxmax(axis=1)
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
                  [np.arange(len(df)), df['max_weight'].factorize()[0]])

Output:

   a_weight  b_weight  a_name    b_name max_weight  max_name
0        10         5    John   Michael   a_weight      John
1         1         2    Jake  Michelle   b_weight  Michelle
2        21         3   Alice       Bob   a_weight     Alice
3         2         1  Ashley     Brian   a_weight    Ashley