Home > Back-end >  How to extract values from one column and create separate binary columns based on the targeted colum
How to extract values from one column and create separate binary columns based on the targeted colum

Time:10-07

I have a dataset as follows. reason is the only column given, other columns are the outputs I want to have

reason           business_name  name  individual_name   DOB
business name       Yes          No       No             No
name                No           Yes      No             No
business name       Yes          No       No             No
individual_name     No           No       Yes            No
DOB                 No           No       No             Yes
Business name,name  Yes          Yes      No             No

The reason field is my only column and I want to create several separate columns to store the results in a binary format.

The current code looks stupid. In real data, I have more than 10 unique values for the reason column. I created 10 keywords list to store the reason keywords, and 10 empty lists to use for append ('Yes') or ('No') Sample logic:

for comment in  df['reason'] :
    if any(x in comment for x in keywords1):
        lis1.append('Yes')
    else:
        lis1.append('No')
         .
         .

However, when scanning the value as name, 
both the business_name column and name will be yes. I think because the name both exists in keywords1 and keyword2.
keywords1=['business name'] keyword2 =['name'] 

That's not what I want actually, I want it to be separated only if reason has values: business name, name. Not sure how to solve it and to reduce manually creating 10 lists.

Thanks in advance!

CodePudding user response:

Explanations first, final code to follow

Get a list of true unique reasons. You can drop anycon NA's here by use of the dropna()

import pandas as pd
from itertools import chain

# you can probably skip this list if you already have the dataframe
reasons = [       
    'business name'  ,
    np.nan,    
    'name'   ,            
    'business name'   ,    
    'individual_name' ,    
    'DOB'     ,            
    'Business name,name']
    
    df = pd.DataFrame(reasons)
    df.columns=['reason']
    
    unique_reasons = pd.unique(df.reason.dropna()).tolist()
    
    # get any item that has a comma, and split it into separate pieces
    splits = [x.split(',') for x in unique_reasons if ',' in x]
    #take out all the items you just split from the main list
    unique_reasons =[y for y in unique_reasons if ',' not in y]
    # combine the two lists, and make sure that each item in final combined list is only in there one time
    new_list = unique_reasons   list(chain.from_iterable(splits))
    
    unique_reasons_set = set(new_list)

Make a boolean mask for every item in the unique_reason_set, if the df['reason'] contains that item as a string, then write True, otherwise, False.

import numpy as np

new_cols = []
    for item in unique_reasons_set:
        col = np.where(df['reason'].str.contains(item), True, False)
        print(col)
        new_cols.append(col)
    

Take all of those new columns and concatenate them to original dataframe

    df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
    df = pd.concat([df,df2], axis=1)

Complete Code

import pandas as pd
import numpy as np
from itertools import chain
reasons = [       
'business name'  ,
np.nan,    
'name'   ,            
'business name'   ,    
'individual_name' ,    
'DOB'     ,            
'Business name,name']

df = pd.DataFrame(reasons)
df.columns=['reason']

unique_reasons = pd.unique(df.reason.dropna()).tolist()

# get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
#take out all the items you just split from the main list
unique_reasons =[y for y in unique_reasons if ',' not in y]
# combine the two lists, and make sure that each item in final combined list is only in there one time. Need the chain to flatten a 2d list that results from split
new_list = unique_reasons   list(chain.from_iterable(splits))

unique_reasons_set = set(new_list)


new_cols = []
for item in unique_reasons_set:
    col = np.where(df['reason'].str.contains(item), True, False)
    print(col)
    new_cols.append(col)


df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df,df2], axis=1)
  • Related