Home > database >  Find percent difference between two columns, that share same root name, but differ in suffix
Find percent difference between two columns, that share same root name, but differ in suffix

Time:05-25

My question is somewhat similar to subtracting-two-columns-named-in-certain-pattern

I'm having trouble performing operations on columns that share the same root substring, without a loop. Basically I want to calculate a percentage change using columns that end with '_PY' with another column that shares the same name except for the suffix.

What's a possible one line solution, or one that doesn't involve a for loop?

url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']

PY_cols = [col for col in df.columns if col.endswith("PY")]
reg_cols = [col.split("_PY")[0] for col in PY_cols]

for k,v in zip(reg_cols,PY_cols):
    df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)
    
df
    

CodePudding user response:

You can use:

v = (df[df.columns[df.columns.str.endswith('_PY')]]
       .rename(columns=lambda x: x.rsplit('_', maxsplit=1)[0]))
k = df[v.columns]

out = pd.concat([df, k.sub(v).div(v).mul(100).round(2).add_suffix('_YOY%')], axis=1)

CodePudding user response:

Gotta subset the df into the columns you need. Then zip will pull the pairs you need to do the percent calculation.

url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']

df_cols = [col for col in df.columns]
PY_cols = [col for col in df.columns if col.endswith("PY")]
# find the matching column, where the names match without the suffix.
PY_use = [col for col in PY_cols if col.split("_PY")[0] in df_cols]
df_use = [col.split("_PY")[0] for col in PY_use]  

for k,v in zip(df_use,PY_use):
    df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)

CodePudding user response:

You can take advantage of numpy:

py_df_array = (df[df_use].values, df[PY_use].values)
perc_dif = np.round((py_df_array[0] - py_df_array[1]) / py_df_array[1] * 100, 2)
df_perc = pd.DataFrame(perc_def, columns=[f"{col}_YOY%" for col in df_use])
df = pd.concat([df, df_perc], axis=1)
  • Related