I feel like I'm close but I am looking for something like this where the new column writes the company name without the city in it:
company postal_code name state city \
2000-01-01 abc gresham co 97080 john mi gresham
2000-01-01 startup llc 97080 jeff hi portland
2001-01-01 beaverton business biz 99999 sam ca beaverton
2002-01-01 andy co 92222 joey or los angeles
new_col
2000-01-01 abc co
2000-01-01 startup llc
2001-01-01 business biz
2002-01-01 andy co
This is what I have so far but it's throwing a TypeError: unhashable type: 'Series'
:
for idx in df1.index:
if df1["city"].loc[idx] in df1['company'].loc[idx]:
print("figure out how to print to new column the company name without the city included")
else:
print(df1['company'].loc[idx])
Thanks!
CodePudding user response:
Here's a solution:
df = (
df.reset_index()
.assign(new_col=df.reset_index()
.pipe(lambda x: x.assign(x=x['company'].str.split(' ')))
.explode('x')
.loc[lambda x: x['x'] != x['city'], 'x']
.groupby(level=0)
.agg(list)
.str.join(' ')
)
.set_index('index')
)
Output:
>>> df
company postal_code name state city new_col
index
2000-01-01 abc gresham co 97080 john mi gresham abc co
2000-01-01 startup llc 97080 jeff hi portland startup llc
2001-01-01 beaverton business biz 99999 sam ca beaverton business biz
2002-01-01 andy co 92222 joey or los angeles andy co
One-liner:
df = df.reset_index().assign(new_col=df.reset_index().pipe(lambda x: x.assign(x=x['company'].str.split(' '))).explode('x').loc[lambda x: x['x'] != x['city'], 'x'].groupby(level=0).agg(list).str.join(' ')).set_index('index')