p1 | p2 | p3 | p4 | |
---|---|---|---|---|
0 | 1 | 4 | 3 | 40 |
1 | 2 | 5 | 70 | 8 |
2 | 3 | 6 | 6 | 500 |
I have a table that looks like this i want to apply regex = '^p\d'
to search for specific column names and create new columns name n{1}, n{2}, ...
n{1} = p{1}/ SQUARE ROOT OF (p{1}**2 p{2}**2 p{3}**2 ...)
n{2} = p{2}/ SQUARE ROOT OF (p{1}**2 p{2}**2 p{3}**2 ...)
and so on upto... n_{n}
index location of
p1, p2, p3
varies for different csv files. So I am trying search using regex to search column names.number of
p
rows also varies for different csv files. I am trying to stop creating new columns loop if the nextp
column does not exist. Not sure how to do it.
import pandas as pd
import numpy as np
data_1 = {'p1': [1, 2, 3],
'p2': [4, 5, 6],
'p3': [3, 70 ,6],
'p4': [40, 8, 500],
}
df1 = pd.DataFrame(data = data_1)
df1
CodePudding user response:
You can use python string function to achive the same, but since you asked for regex solution, here is how you can do using the same.
r = re.compile("^p\d")
qualified_columns = list(filter(r.match, df1.columns))
new_col_names = [re.sub(r'p(\d )', r'n\1', col) for col in qualified_columns]
df1[new_col_names] = df1[qualified_columns].div(df1.pow(2).sum(axis=1).pow(1./2), axis=0)
Result :
p1 p2 p3 p4 n1 n2 n3 n4
0 1 4 3 40 0.024799 0.099197 0.074398 0.991973
1 2 5 70 8 0.028304 0.070760 0.990643 0.113216
2 3 6 6 500 0.005999 0.011998 0.011998 0.999838
CodePudding user response:
The index location of the columns does not matter if you are reading in the column names correctly.
pcols = [col for col in df1.columns if col.startswith('p')]
pnums = [col.split('p')[-1] for col in pcols]
pcols_rss = df1[pcols].apply(lambda row: np.sqrt(np.sum(row**2)), axis=1)
for col, num in zip(pcols, pnums):
df1.loc[:, f'n{{{num}}}'] = df1[col] / pcols_rss
Output
p1 p2 p3 p4 n{1} n{2} n{3} n{4}
0 1 4 3 40 0.024799 0.099197 0.074398 0.991973
1 2 5 70 8 0.028304 0.070760 0.990643 0.113216
2 3 6 6 500 0.005999 0.011998 0.011998 0.999838