Find percent difference between two columns, that share same root name, but differ in suffix-CodePudding

My question is somewhat similar to subtracting-two-columns-named-in-certain-pattern

I'm having trouble performing operations on columns that share the same root substring, without a loop. Basically I want to calculate a percentage change using columns that end with '_PY' with another column that shares the same name except for the suffix.

What's a possible one line solution, or one that doesn't involve a for loop?

url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']

PY_cols = [col for col in df.columns if col.endswith("PY")]
reg_cols = [col.split("_PY")[0] for col in PY_cols]

for k,v in zip(reg_cols,PY_cols):
    df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)
    
df

CodePudding user response：

You can use:

v = (df[df.columns[df.columns.str.endswith('_PY')]]
       .rename(columns=lambda x: x.rsplit('_', maxsplit=1)[0]))
k = df[v.columns]

out = pd.concat([df, k.sub(v).div(v).mul(100).round(2).add_suffix('_YOY%')], axis=1)

CodePudding user response：

Gotta subset the df into the columns you need. Then zip will pull the pairs you need to do the percent calculation.

url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']

df_cols = [col for col in df.columns]
PY_cols = [col for col in df.columns if col.endswith("PY")]
# find the matching column, where the names match without the suffix.
PY_use = [col for col in PY_cols if col.split("_PY")[0] in df_cols]
df_use = [col.split("_PY")[0] for col in PY_use]  

for k,v in zip(df_use,PY_use):
    df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)

CodePudding user response：

You can take advantage of numpy:

py_df_array = (df[df_use].values, df[PY_use].values)
perc_dif = np.round((py_df_array[0] - py_df_array[1]) / py_df_array[1] * 100, 2)
df_perc = pd.DataFrame(perc_def, columns=[f"{col}_YOY%" for col in df_use])
df = pd.concat([df, df_perc], axis=1)