Home > front end >  detect column names and create new columns based on a custom function
detect column names and create new columns based on a custom function

Time:06-30

  • I am trying to make 'n' = 1, 2, 3.. number of columns named t_{n}
  • t(1, 2, 3...) = p(1, 2, 3...) - current respectively
  • example: t1 = p1 - current
  • My problem is data is kinda messy and p1, p2, p3 can lie any .iloc (index) in the dataset. but the p{n} will always be there.
  • I dont want to make a column if p lies in the name of other columns.
  • Not sure how to best proceed further.
current w p1 p2 p3 other p4 t1 t2 t3
1.765 84 1.765 1.765 1.765 ad 1.765
6.527 79 8.91 8.936 8.916 ad 8.82
9.429 79 8.91 8.936 8.916 ad 8.82
import pandas as pd
import numpy as np

data_1 = {'p1': [1, 2, 3],
          'p2': [4, 5, 6],
          'p3': [3, 70 ,6],
          'p4': [40, 8, 500],
          'current': [500, 600, 700],
          't1': ['x', 'y', 'z']
         }

df1 = pd.DataFrame(data = data_1)
df1

CodePudding user response:

Use DataFrame.filter wth regex for seelct all column with pattern p with digit, then subtract from right side by DataFrame.rsub and rename columns names:

f = lambda x: x.replace('p','t')
df = df1.filter(regex='^p\d').rsub(df1['current'], axis=0).rename(columns=f)
print (df)
    t1   t2   t3   t4
0  499  496  497  460
1  598  595  530  592
2  697  694  694  200

If need append columns to original first remove t1 column (because in sample data):

f = lambda x: x.replace('p','t')
df = df1.drop('t1',1).join(df1.filter(regex='^p\d').rsub(df1['current'], axis=0).rename(columns=f))
print (df)
   p1  p2  p3   p4  current   t1   t2   t3   t4
0   1   4   3   40      500  499  496  497  460
1   2   5  70    8      600  598  595  530  592
2   3   6   6  500      700  697  694  694  200

Details:

print (df1.filter(regex='^p\d'))
   p1  p2  p3   p4
0   1   4   3   40
1   2   5  70    8
2   3   6   6  500

EDIT: If need select only consecutive p columns use:

data_1 = {'p1': [1, 2, 3],
          'p2': [4, 5, 6],
          'p3': [3, 70 ,6],
          'p4': [40, 8, 500],
          'current': [500, 600, 700],
          'p5': [4, 8, 500],
          'p6': [40, 80, 50],
          'other': ['x', 'y', 'z'],
          'p7': [4, 8, 5],
         }

df1 = pd.DataFrame(data = data_1)
print (df1)
   p1  p2  p3   p4  current   p5  p6 other  p7
0   1   4   3   40      500    4  40     x   4
1   2   5  70    8      600    8  80     y   8
2   3   6   6  500      700  500  50     z   5

pos = df1.columns.get_indexer(df1.filter(regex='^p\d').columns)
print (pos)
[0 1 2 3 5 6 8]

s = pd.Series(pos).diff().ne(1).cumsum().duplicated(keep=False)
print (s)
0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

df = df1.iloc[:, pos[s]]
print (df)
   p1  p2  p3   p4   p5  p6
0   1   4   3   40    4  40
1   2   5  70    8    8  80
2   3   6   6  500  500  50
  • Related