Looking for a way to check keyword occurrences in a dataframe sector column

I have a dataframe with the following columns: [company_name, company_sector, company_country]

There are 10 unique sectors: Business Services, Finance Services, Technology, etc.

On the other hand, I have a list of keywords, e.g. keywords = ['services', 'holdings', 'group', 'manufacture'], etc.

I am looking for a way to count how many times each keyword occurs in company_name and attribute that count to the company's company_sector, so that there is one count per sector and keyword.

Meaning: if there is a company "Atlantic Navigation Holdings (S) Limited" and it belongs to the sector Industrials, then Industrials will have a count of 1 for the keyword holdings. (I have already converted both the keywords and the company names to lowercase.)

CodePudding user response：

You can use groupby from pandas [1] to split the dataframe by sector. Within each sector group, you can then count the rows whose name contains each keyword in a for loop.

I used a defaultdict [2] to collect the counts per sector and then built the new dataframe from it.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

[2] https://docs.python.org/3/library/collections.html#collections.defaultdict

import pandas as pd

from collections import defaultdict

# dictionary with fake data
d = {'sector': ['one', 'one', 'two', 'two' , 'one'], 'name': ['a', 'b', 'b', 'a', 'b']}
# convert dictionary to pandas DataFrame
df = pd.DataFrame(d)
print(df)
#   sector name
# 0    one    a
# 1    one    b
# 2    two    b
# 3    two    a
# 4    one    b

keywords = ['a', 'b', 'c']

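# maps each sector to a list of counts, one entry per keyword
new_d = defaultdict(list)

for key, group in df.groupby('sector'):
    for k in keywords:
        # str.contains gives one boolean per row; summing counts the matching rows
        new_d[key].append(sum(group['name'].str.contains(k)))

result = pd.DataFrame(new_d, index=keywords)
print(result)
#    one  two
# a    1    1
# b    2    1
# c    0    0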

In this case the keywords form the index of the new dataframe and the sectors form the columns. Transpose the result with .T if you want the sectors as rows instead.
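The same idea can be written without the explicit dictionary. The following is only a sketch: it assumes the column names from the question (company_name, company_sector), and the sample rows are invented for illustration. Note that it counts how many company names contain a keyword, not how often the keyword repeats inside one name.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question
df = pd.DataFrame({
    'company_name': [
        'atlantic navigation holdings (s) limited',
        'acme business services',
        'northern group holdings',
        'delta financial services group',
    ],
    'company_sector': [
        'industrials',
        'business services',
        'industrials',
        'finance services',
    ],
})
keywords = ['services', 'holdings', 'group', 'manufacture']

# One boolean column per keyword: True where the name contains the keyword.
# regex=False treats the keyword as a literal substring.
flags = pd.DataFrame(
    {k: df['company_name'].str.contains(k, regex=False) for k in keywords}
)

# Summing the booleans within each sector counts the matching companies
counts = flags.groupby(df['company_sector']).sum()
print(counts)
```

Here the sectors end up as rows and the keywords as columns, which is the layout described in the question.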

CodePudding user response：

First, create a new dataframe skeleton filled with 0:

counts_df = pd.DataFrame(0, columns=keywords, index=df['company_sector'].unique())

Then iterate through the dataframe and check whether each keyword is in company_name; if it is, increment the matching cell:

for _, row in df.iterrows():
    for keyword in keywords:
        if keyword in row['company_name']:
            counts_df.loc[row['company_sector'], keyword] += 1

counts_df
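One thing to be aware of: the in test above is a substring check, so 'group' would also match a name such as 'groupon'. If whole-word matches are what you want, a possible variant (a sketch with made-up sample rows, not part of the original answer) is to test with a word-boundary regular expression instead:

```python
import re
import pandas as pd

# Hypothetical sample data; only the column names come from the question
df = pd.DataFrame({
    'company_name': ['alpha group ltd', 'groupon style media', 'beta holdings group'],
    'company_sector': ['technology', 'technology', 'industrials'],
})
keywords = ['group', 'holdings']

# Skeleton of zeros: one row per sector, one column per keyword
counts_df = pd.DataFrame(0, index=df['company_sector'].unique(), columns=keywords)

for _, row in df.iterrows():
    for keyword in keywords:
        # \b anchors the keyword at word boundaries;
        # re.escape guards against special characters in the keyword
        pattern = r'\b' + re.escape(keyword) + r'\b'
        if re.search(pattern, row['company_name']):
            counts_df.loc[row['company_sector'], keyword] += 1

print(counts_df)
```

With these rows, 'groupon style media' is not counted for 'group', while 'alpha group ltd' is.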