How to extract values from one column and create separate binary columns based on the targeted colum-CodePudding

I have a dataset as follows. reason is the only column given, other columns are the outputs I want to have

reason           business_name  name  individual_name   DOB
business name       Yes          No       No             No
name                No           Yes      No             No
business name       Yes          No       No             No
individual_name     No           No       Yes            No
DOB                 No           No       No             Yes
Business name,name  Yes          Yes      No             No

The reason field is my only column and I want to create several separate columns to store the results in a binary format.

The current code looks stupid. In real data, I have more than 10 unique values for the reason column. I created 10 keywords list to store the reason keywords, and 10 empty lists to use for append ('Yes') or ('No') Sample logic:

for comment in  df['reason'] :
    if any(x in comment for x in keywords1):
        lis1.append('Yes')
    else:
        lis1.append('No')
         .
         .

However, when scanning the value as name, 
both the business_name column and name will be yes. I think because the name both exists in keywords1 and keyword2.
keywords1=['business name'] keyword2 =['name']

That's not what I want actually, I want it to be separated only if reason has values: business name, name. Not sure how to solve it and to reduce manually creating 10 lists.

Thanks in advance!

CodePudding user response：

Explanations first, final code to follow

Get a list of true unique reasons. You can drop anycon NA's here by use of the dropna()

import pandas as pd
from itertools import chain

# you can probably skip this list if you already have the dataframe
reasons = [       
    'business name'  ,
    np.nan,    
    'name'   ,            
    'business name'   ,    
    'individual_name' ,    
    'DOB'     ,            
    'Business name,name']
    
    df = pd.DataFrame(reasons)
    df.columns=['reason']
    
    unique_reasons = pd.unique(df.reason.dropna()).tolist()
    
    # get any item that has a comma, and split it into separate pieces
    splits = [x.split(',') for x in unique_reasons if ',' in x]
    #take out all the items you just split from the main list
    unique_reasons =[y for y in unique_reasons if ',' not in y]
    # combine the two lists, and make sure that each item in final combined list is only in there one time
    new_list = unique_reasons   list(chain.from_iterable(splits))
    
    unique_reasons_set = set(new_list)

Make a boolean mask for every item in the unique_reason_set, if the df['reason'] contains that item as a string, then write True, otherwise, False.

import numpy as np

new_cols = []
    for item in unique_reasons_set:
        col = np.where(df['reason'].str.contains(item), True, False)
        print(col)
        new_cols.append(col)

Take all of those new columns and concatenate them to original dataframe

    df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
    df = pd.concat([df,df2], axis=1)

Complete Code

import pandas as pd
import numpy as np
from itertools import chain
reasons = [       
'business name'  ,
np.nan,    
'name'   ,            
'business name'   ,    
'individual_name' ,    
'DOB'     ,            
'Business name,name']

df = pd.DataFrame(reasons)
df.columns=['reason']

unique_reasons = pd.unique(df.reason.dropna()).tolist()

# get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
#take out all the items you just split from the main list
unique_reasons =[y for y in unique_reasons if ',' not in y]
# combine the two lists, and make sure that each item in final combined list is only in there one time. Need the chain to flatten a 2d list that results from split
new_list = unique_reasons   list(chain.from_iterable(splits))

unique_reasons_set = set(new_list)


new_cols = []
for item in unique_reasons_set:
    col = np.where(df['reason'].str.contains(item), True, False)
    print(col)
    new_cols.append(col)


df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df,df2], axis=1)