The Problem
I had a hard time phrasing this question but essentially I have a series of X
columns that represent weights at specific points in time. Then another set of X
columns that represent the names of those people that were measured.
That table looks like this (there's more than two columns, this is just a toy example):
a_weight | b_weight | a_name | b_name |
---|---|---|---|
10 | 5 | John | Michael |
1 | 2 | Jake | Michelle |
21 | 3 | Alice | Bob |
2 | 1 | Ashley | Brian |
What I Want
I want to have a two columns with the maximum weight and name at each point in time. I want this to be vectorized because the data is a lot. I can do it using a for loop or an .apply(lambda row: row[col])
but it is very slow.
So the final table would look something like this:
a_weight | b_weight | a_name | b_name | max_weight | max_name |
---|---|---|---|---|---|
10 | 5 | John | Michael | a_weight | John |
1 | 2 | Jake | Michelle | b_weight | Michelle |
21 | 3 | Alice | Bob | a_weight | Alice |
2 | 1 | Ashley | Brian | a_weight | Ashley |
What I've Tried
I've been able to create a mirror df_subset
with just the weights, then use the idxmax
function to make a max_weight
column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a column that is the max_weight
column in the section above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
a_weight | b_weight | a_name | b_name | max_weight | max_name_col |
---|---|---|---|---|---|
10 | 5 | John | Michael | a_weight | a_name |
1 | 2 | Jake | Michelle | b_weight | b_name |
21 | 3 | Alice | Bob | a_weight | a_name |
2 | 1 | Ashley | Brian | a_weight | a_name |
I basically want to run a code similar to the one below without a for-loop:
df["max_name"] = [row[row["max_name_col"]] for row in df]
How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.
CodePudding user response:
You can do that for sure just pass to numpy
argmax
v1 = df.filter(like='weight').values
v2 = df.filter(like='name').values
df['max_weight'] = v1[df.index, v1.argmax(1)]
df['max_name'] = v2[df.index, v1.argmax(1)]
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
CodePudding user response:
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
CodePudding user response:
We could use idxmax
to find the column names; then use factorize
numpy advanced indexing to get the names:
df['max_weight'] = df.loc[:, df.columns.str.contains('weight')].idxmax(axis=1)
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
[np.arange(len(df)), df['max_weight'].factorize()[0]])
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley