I have a dataset as follows. reason is the only column given, other columns are the outputs I want to have
reason business_name name individual_name DOB
business name Yes No No No
name No Yes No No
business name Yes No No No
individual_name No No Yes No
DOB No No No Yes
Business name,name Yes Yes No No
The reason field is my only column and I want to create several separate columns to store the results in a binary format.
The current code looks stupid. In real data, I have more than 10 unique values for the reason column. I created 10 keywords list to store the reason keywords, and 10 empty lists to use for append ('Yes') or ('No') Sample logic:
for comment in df['reason'] :
if any(x in comment for x in keywords1):
lis1.append('Yes')
else:
lis1.append('No')
.
.
However, when scanning the value as name,
both the business_name column and name will be yes. I think because the name both exists in keywords1 and keyword2.
keywords1=['business name'] keyword2 =['name']
That's not what I want actually, I want it to be separated only if reason has values: business name, name. Not sure how to solve it and to reduce manually creating 10 lists.
Thanks in advance!
CodePudding user response:
Explanations first, final code to follow
Get a list of true unique reasons. You can drop anycon NA's here by use of the dropna()
import pandas as pd
from itertools import chain
# you can probably skip this list if you already have the dataframe
reasons = [
'business name' ,
np.nan,
'name' ,
'business name' ,
'individual_name' ,
'DOB' ,
'Business name,name']
df = pd.DataFrame(reasons)
df.columns=['reason']
unique_reasons = pd.unique(df.reason.dropna()).tolist()
# get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
#take out all the items you just split from the main list
unique_reasons =[y for y in unique_reasons if ',' not in y]
# combine the two lists, and make sure that each item in final combined list is only in there one time
new_list = unique_reasons list(chain.from_iterable(splits))
unique_reasons_set = set(new_list)
Make a boolean mask for every item in the unique_reason_set, if the df['reason'] contains that item as a string, then write True, otherwise, False.
import numpy as np
new_cols = []
for item in unique_reasons_set:
col = np.where(df['reason'].str.contains(item), True, False)
print(col)
new_cols.append(col)
Take all of those new columns and concatenate them to original dataframe
df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df,df2], axis=1)
Complete Code
import pandas as pd
import numpy as np
from itertools import chain
reasons = [
'business name' ,
np.nan,
'name' ,
'business name' ,
'individual_name' ,
'DOB' ,
'Business name,name']
df = pd.DataFrame(reasons)
df.columns=['reason']
unique_reasons = pd.unique(df.reason.dropna()).tolist()
# get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
#take out all the items you just split from the main list
unique_reasons =[y for y in unique_reasons if ',' not in y]
# combine the two lists, and make sure that each item in final combined list is only in there one time. Need the chain to flatten a 2d list that results from split
new_list = unique_reasons list(chain.from_iterable(splits))
unique_reasons_set = set(new_list)
new_cols = []
for item in unique_reasons_set:
col = np.where(df['reason'].str.contains(item), True, False)
print(col)
new_cols.append(col)
df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df,df2], axis=1)