The following is an example of a dataframe, a dictionary, and code that works but is extremely inefficient for huge dictionaries. One column of the dataframe contains sentences. The code takes every word of a sentence, checks whether it is in the dictionary, and assigns it the corresponding value. The mean of those values per sentence (per row) is then calculated and stored in an extra column.
import re
import pandas as pd
test_df = pd.DataFrame({
'_id': ['1a','2b','3c','4d'],
'column': ['und der in zu',
'Kompliziertereswort something',
'Lehrerin in zu [Buch]',
'Buch (Lehrerin) kompliziertereswort']})
d = {'und': 20,
'der': 10,
'in': 40,
'zu': 10,
'Kompliziertereswort': 2,
'Buch': 5,
'Lehrerin': 5}
pat = fr"\b({'|'.join(map(re.escape, d))})\b"
test_df['score'] = test_df['column'].str.extractall(pat)[0].map(d).groupby(level=0).mean()
print(test_df)
_id column score
0 1a und der in zu 20.0
1 2b Kompliziertereswort something 2.0
2 3c Lehrerin in zu [Buch] 15.0
3 4d Buch (Lehrerin) kompliziertereswort 5.0
Since a dictionary lookup is more efficient than regex matching, I believe there has to be a way to do this with a function that splits the sentences into words, checks whether each word is in the dictionary, and computes the average. I also tried transforming the dictionary into a dataframe and using explode(), but that is not efficient either.
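A minimal sketch of what I mean, for reference — plain split plus dict lookup, handling only the bracket/parenthesis punctuation that appears in my sample data:

```python
import pandas as pd

d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

def mean_score(sentence):
    # strip the punctuation seen in the sample from each token,
    # then keep only tokens found in the dict
    vals = [d[w] for w in (t.strip('[]()') for t in sentence.split()) if w in d]
    return sum(vals) / len(vals) if vals else float('nan')

test_df['score'] = test_df['column'].map(mean_score)
print(test_df)
```

This gives the same score column as the regex version on the sample data.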
CodePudding user response:
Use:
import string
test_df['score'] = (test_df['column'].str.split(expand=True)
                                     .stack()
                                     .str.strip(string.punctuation)
                                     .map(d)
                                     .groupby(level=0)
                                     .mean())
print(test_df)
_id column score
0 1a und der in zu 20.0
1 2b Kompliziertereswort something 2.0
2 3c Lehrerin in zu [Buch] 15.0
3 4d Buch (Lehrerin) kompliziertereswort 5.0
Or:
import numpy as np

f = lambda x: np.nanmean([d.get(y, np.nan) for y in x.split()])
test_df['score'] = test_df['column'].str.replace(r'[^\w\s]', '', regex=True).apply(f)
print(test_df)
_id column score
0 1a und der in zu 20.0
1 2b Kompliziertereswort something 2.0
2 3c Lehrerin in zu [Buch] 15.0
3 4d Buch (Lehrerin) kompliziertereswort 5.0
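One detail of the nanmean variant worth noting (standalone sketch with a smaller dict just for illustration): words missing from d map to NaN and are ignored by the average, and a row containing no known words yields NaN rather than raising — though numpy emits a RuntimeWarning for the all-NaN case:

```python
import numpy as np

d = {'und': 20, 'der': 10}

f = lambda x: np.nanmean([d.get(y, np.nan) for y in x.split()])

print(f('und der'))  # both words in d: mean of 20 and 10
print(f('foo bar'))  # no known words: nan (with a "Mean of empty slice" RuntimeWarning)
```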
CodePudding user response:
You can try:
test_df['score'] = test_df['column'].str.split(r'\W').explode() \
.map(d).groupby(level=0).mean()
Output:
>>> test_df
_id column score
0 1a und der in zu 20.0
1 2b Kompliziertereswort something 2.0
2 3c Lehrerin in zu [Buch] 15.0
3 4d Buch (Lehrerin) kompliziertereswort 5.0
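If you want to compare the approaches yourself, a rough timeit harness along these lines works — the repeated two-row data is only an illustration, and actual timings depend on the dictionary and text sizes:

```python
import re
import string
import timeit
import pandas as pd

d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

# small synthetic workload: two sample sentences repeated 1000 times
test_df = pd.DataFrame({'column': ['und der in zu',
                                   'Lehrerin in zu [Buch]'] * 1000})

def regex_way():
    # original approach: alternation pattern over all dict keys
    pat = fr"\b({'|'.join(map(re.escape, d))})\b"
    return test_df['column'].str.extractall(pat)[0].map(d).groupby(level=0).mean()

def split_way():
    # answer's approach: split, strip punctuation, dict lookup via map
    return (test_df['column'].str.split(expand=True)
                             .stack()
                             .str.strip(string.punctuation)
                             .map(d)
                             .groupby(level=0)
                             .mean())

print('regex:', timeit.timeit(regex_way, number=10))
print('split:', timeit.timeit(split_way, number=10))
```

Both functions return the same score series on this data, so the comparison is apples to apples.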