Home > database >  Extract values in dataframe where index name equals to column name
Extract values in dataframe where index name equals to column name

Time:12-11

Lets say I have the following data set:

import numpy as np
import pandas as pd

d = {'column1': ['a', 'b', 'c'], 'a': [10, 8, 6], 'a1': [1, 2, 3], 'b': [4, 2, 6], 'b1': [1, 4, 8], 'c': [2, 6, 8], 'c1': [2, 1, 8] }

data_frame  = pd.DataFrame(data=d).set_index('column1')

What I want to achieve is following. Sum each row values, excluding the observations where a=a, a=a1, b=b, b=b1, and so on.

So in the final dataset I want to have something like this:

f={'total': [9, 17, 23]}
final_frame = pd.DataFrame(data=f)

Where 9 = b b1 c c1 and so on.

Obvisuly, I can achieve this by iloc[] command on every row. But my real dataframe is quite huge, and as you can see the position of ij elements which need to be dropped are not constant across the rows (so every iloc on each row will be different, but consistent sequence).

Any suggestions?

Best,

CodePudding user response:

Solution with DataFrame.melt, comapre indices by columns names without numbers and for not equal aggregate sum:

df = data_frame.melt(ignore_index=False)
df = (df.loc[df['variable'].str.extract('(\D )', expand=False).ne(df.index), 'value']
        .groupby(level=0)
        .sum()
        .reset_index(name='total'))
print (df)
  column1  total
0       a      9
1       b     17
2       c     23

Another idea:

df = data_frame.copy()
df.columns = df.columns.str.extract('(\D )', expand=False)
s = df.groupby(level=0, axis=1).sum().stack()

df = s[[a != b for a, b in s.index]].groupby(level=0).sum().reset_index(name='total')
print (df)
  column1  total
0       a      9
1       b     17
2       c     23

CodePudding user response:

A faster option, in my opinion, would be to avoid increasing the number of rows (tests are the only way to confirm this):

data_frame.columns =data_frame.columns.str.split(r'(\d )', expand = True).droplevel(-1)
temp = data_frame.set_index([np.arange(len(data_frame))], append=True).unstack(level=0)
filters = [col for col in temp if col[0] != col[-1]]
out = temp.loc[:, filters].sum(1)
out.to_frame(name='total')
   total
0   9.0
1  17.0
2  23.0
  • Related