Comparing words in a dataframe row to keys of a dictionary


I have a dataframe:

import pandas as pd
test_df = pd.DataFrame({
'_id': ['1a','2b','3c','4d'],
'column': ['und der in zu',
            'Kompliziertereswort something',
            'Lehrerin in zu [Buch]',
            'Buch (Lehrerin) kompliziertereswort']})

and a dictionary:

d = {'und': 20,
     'der': 10,
     'in':  40,
     'zu':  10,
     'Kompliziertereswort': 2,
     'Buch': 5,
     'Lehrerin': 5}

I want to add a new column to the dataframe that holds the average of the dictionary values of the words in that row. Words that are not in the dictionary should simply be ignored.

_id       column                                   score
1a       und der in zu                            20
2b       Kompliziertereswort something            2
3c       Lehrerin in zu [Buch]                    15
4d       Buch (Lehrerin) kompliziertereswort      5
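For reference, the target behaviour can be sketched in plain Python before reaching for anything fancier. Stripping surrounding brackets with `str.strip('[]()')` is an assumption added here so that `[Buch]` and `(Lehrerin)` match their dictionary keys, as the expected output requires:

```python
import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

def row_score(text):
    # strip surrounding brackets/parentheses so '[Buch]' matches 'Buch',
    # then keep only words that appear in the dictionary
    vals = [d[w] for w in (t.strip('[]()') for t in text.split()) if w in d]
    return sum(vals) / len(vals) if vals else float('nan')

test_df['score'] = test_df['column'].apply(row_score)
print(test_df)
```

This reproduces the expected column (20, 2, 15, 5); the answers below show vectorized pandas approaches.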

I did something very silly, I think: I wrote the dataframe to a text file, read it line by line, made a list of all the dictionary keys, and used a regex to check whether each line contained the words. It didn't work, probably because of the brackets.

I also tried iterating over the dataframe rows, but iterating over the cell value goes character by character rather than word by word:

for index, values in test_df.iterrows():
    pos = 1
    for x in values[1]:
        print(pos, x)
        pos += 1
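(An aside on why that happens: iterating over a string yields individual characters; `str.split()` yields whitespace-separated words instead. A minimal illustration:)

```python
# iterating over a string goes character by character;
# str.split() gives whitespace-separated words
row = 'Lehrerin in zu [Buch]'
for pos, word in enumerate(row.split(), start=1):
    print(pos, word)
# 1 Lehrerin
# 2 in
# 3 zu
# 4 [Buch]
```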

CodePudding user response:

First, convert the dictionary to a dataframe:

d = {'und': 20, 'der': 10,'in':  40, 'zu':  10,'Kompliziertereswort': 2, 'Buch': 5,'Lehrerin': 5}
d_df = pd.DataFrame(list(d.items()), columns=['column', 'number'])
d_df

column          number
und                20
der                10
in                 40
zu                 10
Kompliziertereswort 2
Buch                5
Lehrerin            5

Then use pandas' explode() to turn each row into one row per word, and join the result with d_df:

test_df2 = test_df.set_index(['_id']).apply(lambda x: x.str.split(' ').explode()).reset_index()
test_df2 = pd.merge(test_df2,d_df,on='column',how='left')

_id column            number
1a  und                20.0
1a  der                10.0
1a  in                 40.0
1a  zu                 10.0
2b  Kompliziertereswort 2.0
2b  something           nan
3c  Lehrerin            5.0
3c  in                  40.0
3c  zu                  10.0
3c  [Buch]              nan
4d  Buch                5.0
4d  (Lehrerin)          nan
4d  kompliziertereswort nan
                     

Calculate the average per _id (mean() skips NaN, so unmatched words are ignored automatically):

row_means = test_df2.groupby('_id')['number'].agg(['mean']).reset_index()

_id   mean
1a  20.000000
2b  2.000000
3c  18.333333
4d  5.00000

Now join row_means back to the main dataframe (test_df) to add the mean column. Note that this approach leaves `[Buch]` and `(Lehrerin)` unmatched, so row 3c averages 18.33 rather than the requested 15.

pd.merge(test_df,row_means,on='_id',how='left')

_id column                              mean
1a  und der in zu                       20.000000
2b  Kompliziertereswort something       2.000000
3c  Lehrerin in zu [Buch]               18.333333
4d  Buch (Lehrerin) kompliziertereswort 5.000000
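The three steps can also be written as one method chain. This is a sketch of the same logic, so like the output above it leaves bracketed words unmatched and gives 3c a mean of 18.33:

```python
import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})
d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}
d_df = pd.DataFrame(list(d.items()), columns=['column', 'number'])

result = (test_df.set_index('_id')['column']
          .str.split(' ').explode()              # one row per word
          .reset_index()
          .merge(d_df, on='column', how='left')  # look up each word's value
          .groupby('_id')['number'].mean()       # mean() skips NaN
          .rename('mean').reset_index()
          .merge(test_df, on='_id'))
print(result)
```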

CodePudding user response:

We can construct a regular expression pattern from the keys of the dictionary, extract all occurrences of this pattern from each row with str.extractall, map the matched strings to their scores via the dictionary d, and take the mean per original row index (level=0) to get the average.

pat = fr"\b({'|'.join(d)})\b"
# Series.mean(level=0) was removed in pandas 2.0; group on the row index instead
test_df['score'] = test_df['column'].str.extractall(pat)[0].map(d).groupby(level=0).mean()

Result:
print(test_df)

  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0