Home > Mobile >  Grouping columns if their values are the same
Grouping columns if their values are the same

Time:04-09

I'm new to Python and had a question, would appreciate if someone could please help:

I have a dataset that looks like this:

Chem1 ChemValue1 Chem2 ChemValue2 Chem3 ChemValue3
Carbon Monoxide 32 Carbon Dioxide 32 Carbon 45
Sulfur 32 Sulfur 32 Sulfur 45
Carbon 32 Carbon 32 Iron 45
Carbon 32 Iron 32 Iron 45

If Chem1 equals Chem2 and/or Chem 3, I want to add up the corresponding ChemValue. So I get a dataset like this:

Chem1 ChemValue1 Chem2 ChemValue2 Chem3 ChemValue3
Carbon Monoxide 32 Carbon Dioxide 32 Carbon 45
Sulfur 109 - - - -
Carbon 64 - - Iron 45
Carbon 32 Iron 77 - -

Then I want to just get the Chem with the highest corresponding ChemValue (it's fine if I don't get the above table as an intemediary, just fine with getting the max aggregate Chem):

Highest Chem
Carbon Monoxide
Sulfur
Carbon
Iron

How can I do this in Python?

#here is the first dataset

data = {'Chem1': ['Carbon Monoxide','Sulfur','Carbon','Carbon'], 'ChemValue1': [32,32,32,32],'Chem2': ['Carbon Dioxide','Sulfur','Carbon','Iron'],'ChemValue2': [32,32,32,32],'Chem3': ['Carbon','Sulfur','Iron','Iron'],'ChemValue3': [45,45,45,45]}  
df = pd.DataFrame(data)

CodePudding user response:

This do need more work than we think , the 2nd df can be done while we get the 1stdf

s = pd.wide_to_long(df.reset_index(),stubnames=['Chem','ChemValue'],i = 'index',j='v').reset_index()
#use pd.wide_to_long melt the df from row to column
secdf = s.groupby(['index','Chem'])['ChemValue'].sum().reset_index(level=0).groupby('index')['ChemValue'].idxmax()
# then we can do groupby combine the same value and get the max id of index 
Out[274]:
index
0    Carbon
1    Sulfur
2    Carbon
3      Iron
Name: ChemValue, dtype: object
s = s.groupby(['index','Chem']).agg({'ChemValue':'sum','v':'first'}).reset_index(level=1).set_index('v',append=True).unstack().sort_index(axis=1,level=1)
s.columns = s.columns.map('{0[0]}{0[1]}'.format)
s
Out[288]: 
                 Chem1  ChemValue1  ...   Chem3  ChemValue3
index                               ...                    
0      Carbon Monoxide        32.0  ...  Carbon        45.0
1               Sulfur       109.0  ...     NaN         NaN
2               Carbon        64.0  ...    Iron        45.0
3               Carbon        32.0  ...     NaN         NaN
[4 rows x 6 columns]

CodePudding user response:

This should get the final result that the question asks for:

import pandas as pd
data = {'Chem1': ['Carbon Monoxide','Sulfur','Carbon','Carbon'], 'ChemValue1': [32,32,32,32],'Chem2': ['Carbon Dioxide','Sulfur','Carbon','Iron'],'ChemValue2': [32,32,32,32],'Chem3': ['Carbon','Sulfur','Iron','Iron'],'ChemValue3': [45,45,45,45]}  
df = pd.DataFrame(data)
print(df)

from collections import defaultdict
highestChem, highestChemValue = [], []
def aggregateChems(x):
    dct = defaultdict(int)
    for k, v in (('Chem1', 'ChemValue1'), ('Chem2', 'ChemValue2'), ('Chem3', 'ChemValue3')):
        dct[x[k]]  = x[v]
    highestChem.append(max(dct, key=lambda x: dct[x]))
    highestChemValue.append(dct[highestChem[-1]])
df.apply(aggregateChems, axis = 1)

df2 = pd.DataFrame({'HighestChem' : highestChem, 'ChemValue' : highestChemValue})
print(f"\n{df2}")

df3 = df2['HighestChem'].to_frame()
print(f"\n{df3}")

Output:

             Chem1  ChemValue1           Chem2  ChemValue2   Chem3  ChemValue3
0  Carbon Monoxide          32  Carbon Dioxide          32  Carbon          45
1           Sulfur          32          Sulfur          32  Sulfur          45
2           Carbon          32          Carbon          32    Iron          45
3           Carbon          32            Iron          32    Iron          45

  HighestChem  ChemValue
0      Carbon         45
1      Sulfur        109
2      Carbon         64
3        Iron         77

  HighestChem
0      Carbon
1      Sulfur
2      Carbon
3        Iron

CodePudding user response:

I took a more manual approach that works fine for 3 Chem but won't scale well

c12 = df['Chem1'].eq(df['Chem2'])
c13 = df['Chem1'].eq(df['Chem3'])
df.loc[c12, 'ChemValue1'] = df['ChemValue1']   df['ChemValue2']
df.loc[c13, 'ChemValue1'] = df['ChemValue1']   df['ChemValue3']

c23 = df['Chem2'].eq(df['Chem3'])
df.loc[c23, 'ChemValue2'] = df['ChemValue2']   df['ChemValue3']
pd.wide_to_long(df.reset_index(),stubnames=['Chem','ChemValue'],i = 'index',j='k').reset_index(drop=True)
               Chem  ChemValue
0   Carbon Monoxide         32
1            Sulfur        109
2            Carbon         64
3            Carbon         32
4    Carbon Dioxide         32
5            Sulfur         77
6            Carbon         32
7              Iron         77
8            Carbon         45
9            Sulfur         45
10             Iron         45
11             Iron         45
  • Related