I have a dataframe with the following columns: [company_name, company_sector, company_country]
There are 10 unique sectors: Business Services, Finance Services, Technology, etc.
I also have a list of keywords = ['services', 'holdings', 'group', 'manufacture'], etc.
I am looking for a way to count how many times each keyword occurs in company_name and to tally those counts per company_sector.
For example: if there is a company "Atlantic Navigation Holdings (S) Limited" and it belongs to the sector Industrials, then Industrials gets a count of 1 for the keyword holdings. (I have already converted both the keywords and the company names to lowercase.)
CodePudding user response:
You can use groupby from pandas [1] to split the dataframe by sector. For each sector, you can then count the rows whose name contains each keyword in a for loop.
I used a defaultdict [2] to collect the counts per sector before building the new dataframe.
[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
[2] https://docs.python.org/3/library/collections.html#collections.defaultdict
import pandas as pd
from collections import defaultdict
# dictionary with fake data
d = {'sector': ['one', 'one', 'two', 'two' , 'one'], 'name': ['a', 'b', 'b', 'a', 'b']}
# convert dictionary to pandas DataFrame
df = pd.DataFrame(d)
sector name
0 one a
1 one b
2 two b
3 two a
4 one b
keywords = ['a', 'b', 'c']
# create empty dictionary
new_d = defaultdict(list)
# for each sector, count the rows whose name contains each keyword
for key, group in df.groupby('sector'):
    for k in keywords:
        new_d[key].append(sum(group['name'].str.contains(k)))
pd.DataFrame(new_d, index=keywords)
one two
a 1 1
b 2 1
c 0 0
In the resulting dataframe, the keywords form the index and the sectors form the columns.
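One thing to watch with str.contains: it treats the keyword as a regular expression and matches substrings, so 'group' would also match a name like 'groupon'. If whole-word matches are what you want, a minimal sketch along these lines might help. The sample rows and the company_sector / company_name values here are made up for illustration, and the word-boundary pattern is my own assumption about what counts as a match, not something stated in the question.

```python
import re
from collections import defaultdict

import pandas as pd

# hypothetical sample rows, already lowercased as in the question
df = pd.DataFrame({
    "company_sector": ["industrials", "industrials", "technology"],
    "company_name": [
        "atlantic navigation holdings (s) limited",
        "groupon holdings",
        "blue group",
    ],
})
keywords = ["holdings", "group"]

new_d = defaultdict(list)
for sector, group in df.groupby("company_sector"):
    for k in keywords:
        # \b restricts the match to whole words; re.escape protects
        # keywords that contain regex metacharacters
        pattern = rf"\b{re.escape(k)}\b"
        matches = group["company_name"].str.contains(pattern)
        new_d[sector].append(int(matches.sum()))

# keywords as index, sectors as columns
result = pd.DataFrame(new_d, index=keywords)
print(result)
```

With these rows, 'groupon holdings' counts towards holdings but not towards group.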
CodePudding user response:
First create a new dataframe skeleton filled with 0, with one row per sector and one column per keyword:
counts_df = pd.DataFrame(0, index=df['company_sector'].unique(), columns=keywords)
Iterate over the dataframe and, whenever a keyword appears in company_name, increment the cell for that row's sector and keyword:
for _, row in df.iterrows():
    for keyword in keywords:
        if keyword in row['company_name']:
            # += 1 so repeated matches accumulate into a count
            counts_df.loc[row['company_sector'], keyword] += 1
counts_df
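If the dataframe is large, iterrows can get slow. A possible alternative, sketched here under the assumption that you want the number of companies per sector whose name contains each keyword, is to build one boolean column per keyword and sum those per sector. The sample rows are invented for illustration; only the column names and keywords come from the question.

```python
import pandas as pd

# hypothetical sample data using the question's column names
df = pd.DataFrame({
    "company_name": [
        "atlantic navigation holdings (s) limited",
        "acme business services",
        "north group holdings",
        "delta finance services group",
    ],
    "company_sector": [
        "industrials",
        "business services",
        "industrials",
        "finance services",
    ],
})
keywords = ["services", "holdings", "group", "manufacture"]

# one boolean column per keyword: True where the name contains it
# (regex=False means a plain substring test)
flags = pd.DataFrame(
    {k: df["company_name"].str.contains(k, regex=False) for k in keywords}
)

# summing the booleans per sector gives the number of matching companies
counts_df = flags.groupby(df["company_sector"]).sum()
print(counts_df)
```

Here the sectors end up as the index and the keywords as the columns, the same layout as the skeleton above.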