Home > Enterprise >  Creating new column based on specific part of a string using pandas
Creating new column based on specific part of a string using pandas

Time:11-02

I have the table below with codes. I would like to create two new columns. 1 that identifies any code with the letters YYY and another that identifies the letters WWW as seen in the intermediate table. After that, I would like to aggregate these and have a ID's columns with all the YYY codes and WWW codes and their totals.

I am a bit new to python. I am trying to get to the final table below but am stuck trying to get to the intermediate table and have provided my code below. I am receiving a KeyError: 'code':

#for YYY

def categorise(y):  
    if y['Code'].str.contains('YYY'):
        return 1
    return 0

df1['Code'] = df.apply(lambda y: categorise(y), axis=1)

#for WWW

def categorise(w):  
    if w['Code'].str.contains('WWW'):
        return 1
    return 0

df1['Code'] = df.apply(lambda w: categorise(w), axis=1)

Any help would be appreciated on this.

Current Table:

Code
001,ABC,123,YYY
002,ABC,546,WWW
003,ABC,342,WWW
004,ABC,635,YYY

Intermediate Table:

Code Location_Y Location_W
001,ABC,123,YYY 1 0
002,ABC,546,WWW 0 1
003,ABC,342,WWW 0 1
004,ABC,635,YYY 1 0

Final Table:

IDs Location_Y Location_W
001,ABC,123,YYY - 004,ABC,635,YYY 2 0
002,ABC,546,WWW - 003,ABC,342,WWW 0 2

Any help would be appreciated

CodePudding user response:

# assuming the string of interest is the last under the code column

df['id'] = df['Code'].str.rsplit(',', n=1, expand=True)[1]

# create columns with 1 or 0 if string exists in the Code
df['Location_Y'] = df['id'].eq('YYY').astype(int)
df['Location_W'] = df['id'].eq('WWW').astype(int)


# groupby to get the aggregates
df.groupby('id', as_index=False).agg({'Code' : ' - '.join,
                      'Location_Y': sum,
                      'Location_W': sum
                     })[['Code', 'Location_Y', 'Location_W']]
                                 Code   Location_Y  Location_W
0   002,ABC,546,WWW - 003,ABC,342,WWW            0           2
1   001,ABC,123,YYY - 004,ABC,635,YYY            2           0

CodePudding user response:

Extract the last elements, get_dummies. Group by the elements and agggregate getting sums and joining as required. Coded below

df=df.assign(coded=df['Code'].str.split('\,').str[-1])


 #intermediate
df=df.assign(coded=df['Code'].str.split('\,').str[-1])
s = df.join(pd.get_dummies(df['coded']))

          Code      coded  WWW  YYY
0  001,ABC,123,YYY   YYY    0    1
1  002,ABC,546,WWW   WWW    1    0
2  003,ABC,342,WWW   WWW    1    0
3  004,ABC,635,YYY   YYY    0    1

#Final

s.groupby('coded').agg(**{'Code':('Code', lambda x: x.str.cat(sep='-')),'Y':('YYY', 'sum'),'W':('WWW', 'sum')}).reset_index().drop(columns='coded')


     

     Code                           Y  W
0  002,ABC,546,WWW-003,ABC,342,WWW  0  2
1  001,ABC,123,YYY-004,ABC,635,YYY  2  0

CodePudding user response:

Well, unlike others I would like to suggest you to use regex and do it in simpler way if you are beginner.

So, for intermediate table do something like this:

import pandas as pd
import re

df = pd.read_csv('test_table.csv')

yyy = []
www = []

for index, row in df.iterrows():
    val_y = re.search('YYY', row['test data'])
    if val_y is None:
        yyy.append(0)
    else:
        yyy.append(1)

    val_w = re.search('WWW', row['test data'])
    if val_w is None:
        www.append(0)
    else:
        www.append(1)

df['Location_Y'] = yyy
df['Location_W'] = www

print(df)

and for Final, change that for loop like this

for index, row in df.iterrows():
    val_y = row['test data'].count('YYY')
    yyy.append(val_y)
    val_w = row['test data'].count('WWW')
    www.append(val_w)
  • Related