I have a dataframe:
import pandas as pd
test_df = pd.DataFrame({
'_id': ['1a','2b','3c','4d'],
'column': ['und der in zu',
'Kompliziertereswort something',
'Lehrerin in zu [Buch]',
'Buch (Lehrerin) kompliziertereswort']})
and a dictionary:
{'und': 20,
'der': 10,
'in': 40,
'zu': 10,
'Kompliziertereswort': 2,
'Buch': 5,
'Lehrerin': 5}
I want to add a new column to the dataframe that represents the average of the values of the words in that row. If the word is not in that dictionary, one should just ignore it.
_id column score
1a und der in zu 20
2b Kompliziertereswort something 2
3c Lehrerin in zu [Buch] 15
4d Buch (Lehrerin) kompliziertereswort 5
I did something very silly I think, i.e.: write the dataframe as a text file, read every line; I made a list of all the keys in the dictionary and then used regex to check whether that line contains the words. It didn't work because of the brackets, probably.
I also tried to split the dataframe row, but it just splits it into separate letters:
for index, values in test_df.iterrows():
pos = 1
for x in values[1]:
print(pos, x)
pos = 1
CodePudding user response:
first you should convert your dictionary to a dataframe:
d = {'und': 20, 'der': 10,'in': 40, 'zu': 10,'Kompliziertereswort': 2, 'Buch': 5,'Lehrerin': 5}
d_df = pd.DataFrame({'column':[k for k in d],'number':[d[k] for k in d]})
d_df
column number
und 20
der 10
in 40
zu 10
Kompliziertereswort 2
Buch 5
Lehrerin 5
then use explode() function of pandas to separate words of column and join it with d_df:
test_df2 = test_df.set_index(['_id']).apply(lambda x: x.str.split(' ').explode()).reset_index()
test_df2 = pd.merge(test_df2,d_df,on='column',how='left')
_id column number
1a und 20.0
1a der 10.0
1a in 40.0
1a zu 10.0
2b Kompliziertereswort 2.0
2b something nan
3c Lehrerin 5.0
3c in 40.0
3c zu 10.0
3c [Buch] nan
4d Buch 5.0
4d (Lehrerin) nan
4d kompliziertereswort nan
calculate avg per _id:
row_means = test_df2.groupby('_id')['number'].agg(['mean']).reset_index()
_id mean
1a 20.000000
2b 2.000000
3c 18.333333
4d 5.00000
now you can join row_means to main dataframe (test_df) and add mean column to it.
pd.merge(test_df,row_means,on='_id',how='left')
_id column mean
1a und der in zu 20.000000
2b Kompliziertereswort something 2.000000
3c Lehrerin in zu [Buch] 18.333333
4d Buch (Lehrerin) kompliziertereswort 5.000000
CodePudding user response:
We can construct a regular expression pattern using the keys from the dictionary, then extract all the occurrences of this pattern from each row, then map
the scores from the dictionary d
to the matching strings and take mean
on level=0
to get the average
pat = fr"\b({'|'.join(d)})\b"
test_df['score'] = test_df['column'].str.extractall(pat)[0].map(d).mean(level=0)
Result
print(test_df)
_id column score
0 1a und der in zu 20.0
1 2b Kompliziertereswort something 2.0
2 3c Lehrerin in zu [Buch] 15.0
3 4d Buch (Lehrerin) kompliziertereswort 5.0