I have the table below with codes. I would like to create two new columns. 1 that identifies any code with the letters YYY and another that identifies the letters WWW as seen in the intermediate table. After that, I would like to aggregate these and have a ID's columns with all the YYY codes and WWW codes and their totals.
I am a bit new to python. I am trying to get to the final table below but am stuck trying to get to the intermediate table and have provided my code below. I am receiving a KeyError: 'code':
#for YYY
def categorise(y):
if y['Code'].str.contains('YYY'):
return 1
return 0
df1['Code'] = df.apply(lambda y: categorise(y), axis=1)
#for WWW
def categorise(w):
if w['Code'].str.contains('WWW'):
return 1
return 0
df1['Code'] = df.apply(lambda w: categorise(w), axis=1)
Any help would be appreciated on this.
Current Table:
Code |
---|
001,ABC,123,YYY |
002,ABC,546,WWW |
003,ABC,342,WWW |
004,ABC,635,YYY |
Intermediate Table:
Code | Location_Y | Location_W |
---|---|---|
001,ABC,123,YYY | 1 | 0 |
002,ABC,546,WWW | 0 | 1 |
003,ABC,342,WWW | 0 | 1 |
004,ABC,635,YYY | 1 | 0 |
Final Table:
IDs | Location_Y | Location_W |
---|---|---|
001,ABC,123,YYY - 004,ABC,635,YYY | 2 | 0 |
002,ABC,546,WWW - 003,ABC,342,WWW | 0 | 2 |
Any help would be appreciated
CodePudding user response:
# assuming the string of interest is the last under the code column
df['id'] = df['Code'].str.rsplit(',', n=1, expand=True)[1]
# create columns with 1 or 0 if string exists in the Code
df['Location_Y'] = df['id'].eq('YYY').astype(int)
df['Location_W'] = df['id'].eq('WWW').astype(int)
# groupby to get the aggregates
df.groupby('id', as_index=False).agg({'Code' : ' - '.join,
'Location_Y': sum,
'Location_W': sum
})[['Code', 'Location_Y', 'Location_W']]
Code Location_Y Location_W
0 002,ABC,546,WWW - 003,ABC,342,WWW 0 2
1 001,ABC,123,YYY - 004,ABC,635,YYY 2 0
CodePudding user response:
Extract the last elements, get_dummies. Group by the elements and agggregate getting sums and joining as required. Coded below
df=df.assign(coded=df['Code'].str.split('\,').str[-1])
#intermediate
df=df.assign(coded=df['Code'].str.split('\,').str[-1])
s = df.join(pd.get_dummies(df['coded']))
Code coded WWW YYY
0 001,ABC,123,YYY YYY 0 1
1 002,ABC,546,WWW WWW 1 0
2 003,ABC,342,WWW WWW 1 0
3 004,ABC,635,YYY YYY 0 1
#Final
s.groupby('coded').agg(**{'Code':('Code', lambda x: x.str.cat(sep='-')),'Y':('YYY', 'sum'),'W':('WWW', 'sum')}).reset_index().drop(columns='coded')
Code Y W
0 002,ABC,546,WWW-003,ABC,342,WWW 0 2
1 001,ABC,123,YYY-004,ABC,635,YYY 2 0
CodePudding user response:
Well, unlike others I would like to suggest you to use regex and do it in simpler way if you are beginner.
So, for intermediate table do something like this:
import pandas as pd
import re
df = pd.read_csv('test_table.csv')
yyy = []
www = []
for index, row in df.iterrows():
val_y = re.search('YYY', row['test data'])
if val_y is None:
yyy.append(0)
else:
yyy.append(1)
val_w = re.search('WWW', row['test data'])
if val_w is None:
www.append(0)
else:
www.append(1)
df['Location_Y'] = yyy
df['Location_W'] = www
print(df)
and for Final, change that for loop like this
for index, row in df.iterrows():
val_y = row['test data'].count('YYY')
yyy.append(val_y)
val_w = row['test data'].count('WWW')
www.append(val_w)