I have a list of keywords in keyword.txt and a list of domains in domain.txt. I want to score each domain by checking it against the keywords. For example, if the keyword apple is worth 30 points, I want to search for it in every domain and assign points to each domain accordingly. How do I do that?
My code:
score_dict = {"apple": "30",
              "bananas": "50"}

def generate_score():
    with open("keyword.txt", "r") as file:
        keywords = file.read().splitlines()
    with open("domain.txt", 'r') as g:
        score = []
        for line in g:
            if any(keyword in line.strip() for keyword in keywords):
                score.append(line)
keyword.txt:
apple #if the domain contains the word apple, it will get a 30 score
domain.txt:
apple.com
redapple.com
paple.com
The output I want:
apple.com, 30
redapple.com, 30
paple.com, 0
CodePudding user response:
Here is an iterative approach that loops over the lines in domain.txt and searches each one for partial string matches against the keys of the dictionary:
score_dict = {"apple": "30",
              "bananas": "50"}
results = []
with open("domain.txt", 'r') as g:
    for line in g:  # loop over the lines in domain.txt
        domain = line.strip()
        hit = False
        for substring, score in score_dict.items():  # loop over score_dict
            if substring in domain:
                hit = True
                results.append({'domain': domain, 'substring': substring, 'score': score})
                break  # stop at the first hit to avoid unnecessary iterations
        if not hit:  # assign score 0 and no substring if there is no hit
            results.append({'domain': domain, 'substring': None, 'score': 0})
Output:
[{'domain': 'apple.com', 'substring': 'apple', 'score': '30'},
 {'domain': 'redapple.com', 'substring': 'apple', 'score': '30'},
 {'domain': 'paple.com', 'substring': None, 'score': 0}]
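To get the exact "domain, score" lines asked for in the question, the same first-match logic can be wrapped in a small function. This is only a sketch: the function name score_domains and the in-memory sample list are assumptions made for illustration (in practice the lines would come from domain.txt), and the scores are converted to integers so that every row has the same type.

```python
def score_domains(domains, score_dict):
    """Return (domain, score) pairs; the score is 0 when no keyword matches."""
    scored = []
    for raw in domains:
        domain = raw.strip()
        if not domain:
            continue  # skip blank lines
        # The first keyword (in dict order) found in the domain decides the score
        score = next(
            (int(points) for keyword, points in score_dict.items() if keyword in domain),
            0,
        )
        scored.append((domain, score))
    return scored


score_dict = {"apple": "30", "bananas": "50"}
# Hypothetical stand-in for the lines of domain.txt
sample = ["apple.com\n", "redapple.com\n", "paple.com\n"]

lines = [f"{domain}, {score}" for domain, score in score_domains(sample, score_dict)]
print("\n".join(lines))
# apple.com, 30
# redapple.com, 30
# paple.com, 0
```

To run it against the real file, pass the open file object instead of the sample list, e.g. score_domains(open("domain.txt"), score_dict).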
Note that this solution can be slow for large files. In that case you could vectorize the lookup using pandas with str.extract:
import pandas as pd
score_dict = {"apple": "30",
              "bananas": "50"}
df = pd.read_csv('domain.txt', names=['string'])  # read domain.txt as a pandas DataFrame
# build the pattern "(apple|bananas)", extract the first matching keyword, and map it to its score
df['score'] = (df['string'].str.extract('(' + '|'.join(score_dict.keys()) + ')', expand=False).map(score_dict)).fillna(0)
df.to_csv('output.csv')  # save output to csv
Results:
 | string | score
---|---|---
0 | apple.com | 30
1 | redapple.com | 30
2 | paple.com | 0
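If pandas is not available, the same alternation pattern can be used with the standard re module. The following is a minimal sketch rather than part of the original answer: the helper name score_domain is hypothetical, re.escape is added on the assumption that keywords may contain regex metacharacters, and sorting the keywords by length is an extra design choice so that a longer keyword is tried before a shorter one that is its prefix.

```python
import re

score_dict = {"apple": "30", "bananas": "50"}

# Try longer keywords first so a longer keyword is preferred over one of its prefixes
keywords = sorted(score_dict, key=len, reverse=True)
# One compiled pattern such as "bananas|apple"; re.escape guards against regex metacharacters
pattern = re.compile("|".join(re.escape(keyword) for keyword in keywords))


def score_domain(domain):
    """Return the score of the first keyword found in the domain, or 0 if none is found."""
    match = pattern.search(domain)
    return int(score_dict[match.group(0)]) if match else 0


for domain in ["apple.com", "redapple.com", "paple.com"]:
    print(f"{domain}, {score_domain(domain)}")
# apple.com, 30
# redapple.com, 30
# paple.com, 0
```

Compiling the pattern once means each domain is scanned a single time, instead of once per keyword as in the nested loop.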