Home > Software design >  find total occurences of a list of words in a given column (SQL/Pandas DataFrame)
find total occurences of a list of words in a given column (SQL/Pandas DataFrame)

Time:10-31

I have a SQL database of information about the users of a certain platform. The database has two columns: username, description

I also have a list of words/expressiosn (about 200 words in total) that I want to check if they exist in the description of each user:

words = ['python', 'css', 'html', ...]

what I want to do, is to create a new column - for instance named 'total'- and then count the total number of words/expressions in the list that are used in each user's description.

In other words, this is what I want using a nested for-loop:

for user in users:
    for word in words:
        if word in user.description:
            user.total  = 1

However, the size of my data is quite big (5 million users) and I want to know if there are more efficient ways to achieve this goal. I prefer to do it with SQL, however Python Pandas default functions would also be helpful.

The end result should look something like this:

username description total
afhkjh Python Nerd, Swimming 1
vnjfnn Conservative, HTML Developer 1
af5a45 NA 0
afkjah Love working with CSS and HTML 2

CodePudding user response:

Data:

>>> import pandas as pd
>>> words = ['python', 'css', 'html']
>>> df = pd.DataFrame({'username': {0: 'afhkjh', 1: 'vnjfnn', 2: 'af5a45', 3: 'afkjah'},
 'description': {0: 'Python Nerd, Swimming',
  1: 'Conservative, HTML Developer',
  2: 'NA',
  3: 'Love working with CSS and HTML'}})
>>> df

    username    description
0   afhkjh      Python Nerd, Swimming
1   vnjfnn      Conservative, HTML Developer
2   af5a45      NA
3   afkjah      Love working with CSS and HTML

>>> df['total'] = df.description.str.lower().apply(str.split).apply(lambda x:len(set(words) & set(x)))
>>> df

    username    description                     total
0   afhkjh      Python Nerd, Swimming           1
1   vnjfnn      Conservative, HTML Developer    1
2   af5a45      NA                              0
3   afkjah      Love working with CSS and HTML  2


CodePudding user response:

count = []
for user in users:
    count.append(sum(word in user.description for word in words))
df["Total"] = Count

There could be better ways. The following block took 3.6s to run for me

words = ["quick","lazy"]
temp = []
for i in range(5000000):
    temp.append(sum(word in "The quick brown fox jumps over the lazy dog" for word in words))

You can try to switch the other way round if user.description is shorter than list of words using user.description.split(" ")

Also, consider using Set if words are heavily repeated in user.description

  • Related