I have a SQL database of information about the users of a certain platform. The database has two columns: username, description
I also have a list of words/expressiosn (about 200 words in total) that I want to check if they exist in the description of each user:
words = ['python', 'css', 'html', ...]
what I want to do, is to create a new column - for instance named 'total'- and then count the total number of words/expressions in the list that are used in each user's description.
In other words, this is what I want using a nested for-loop:
for user in users:
for word in words:
if word in user.description:
user.total = 1
However, the size of my data is quite big (5 million users) and I want to know if there are more efficient ways to achieve this goal. I prefer to do it with SQL, however Python Pandas default functions would also be helpful.
The end result should look something like this:
username | description | total |
---|---|---|
afhkjh | Python Nerd, Swimming | 1 |
vnjfnn | Conservative, HTML Developer | 1 |
af5a45 | NA | 0 |
afkjah | Love working with CSS and HTML | 2 |
CodePudding user response:
Data:
>>> import pandas as pd
>>> words = ['python', 'css', 'html']
>>> df = pd.DataFrame({'username': {0: 'afhkjh', 1: 'vnjfnn', 2: 'af5a45', 3: 'afkjah'},
'description': {0: 'Python Nerd, Swimming',
1: 'Conservative, HTML Developer',
2: 'NA',
3: 'Love working with CSS and HTML'}})
>>> df
username description
0 afhkjh Python Nerd, Swimming
1 vnjfnn Conservative, HTML Developer
2 af5a45 NA
3 afkjah Love working with CSS and HTML
>>> df['total'] = df.description.str.lower().apply(str.split).apply(lambda x:len(set(words) & set(x)))
>>> df
username description total
0 afhkjh Python Nerd, Swimming 1
1 vnjfnn Conservative, HTML Developer 1
2 af5a45 NA 0
3 afkjah Love working with CSS and HTML 2
CodePudding user response:
count = []
for user in users:
count.append(sum(word in user.description for word in words))
df["Total"] = Count
There could be better ways. The following block took 3.6s to run for me
words = ["quick","lazy"]
temp = []
for i in range(5000000):
temp.append(sum(word in "The quick brown fox jumps over the lazy dog" for word in words))
You can try to switch the other way round if user.description
is shorter than list of words using user.description.split(" ")
Also, consider using Set
if words are heavily repeated in user.description