Assigning values of words in a dictionary to dataframe contents


The following is an example of a dataframe, a dictionary, and code that works but is extremely inefficient for huge dictionaries. One of the columns of the dataframe contains sentences. The code takes every word of each sentence, checks whether it is in the dictionary and, if so, looks up its value. The mean of those values per sentence (i.e. per row) is then stored in an extra column.

import re
import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

d = {'und': 20,
     'der': 10,
     'in': 40,
     'zu': 10,
     'Kompliziertereswort': 2,
     'Buch': 5,
     'Lehrerin': 5}

pat = fr"\b({'|'.join(map(re.escape, d))})\b"
test_df['score'] = test_df['column'].str.extractall(pat)[0].map(d).mean(level=0)

print(test_df)

  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0

Since a dictionary lookup is cheaper than regex matching, I believe there has to be a way to do this with a function that splits the sentences into words, checks whether each word is in the dictionary, and computes the average (roughly like the sketch below). I also tried transforming the dictionary into a dataframe and using explode(), but that is not efficient at all either.
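
To make the idea concrete, this is roughly the kind of function I have in mind (just a sketch; sentence_score is an illustrative name, and I strip surrounding punctuation so that '[Buch]' still matches 'Buch'):

import string

def sentence_score(sentence):
    # Split into words, strip surrounding punctuation, look each word up in d
    # and average whatever is found; NaN if nothing matches.
    hits = [d[w] for w in (t.strip(string.punctuation) for t in sentence.split())
            if w in d]
    return sum(hits) / len(hits) if hits else float('nan')

test_df['score'] = test_df['column'].apply(sentence_score)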

CodePudding user response:

Use:

import string

# Split each sentence into words, strip surrounding punctuation, map every
# word to its dictionary value and average the values per original row.
test_df['score'] = (test_df['column'].str.split(expand=True)
                                     .stack()
                                     .str.strip(string.punctuation)
                                     .map(d)
                                     .groupby(level=0)
                                     .mean())
print(test_df)
  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0
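
For what it's worth, the intermediate Series makes the groupby(level=0) step easier to follow; a small illustration with the test_df and d from the question:

import string

# After split/stack/strip, the words sit in a Series whose first index level
# is the original row number, so grouping by level 0 averages per sentence.
words = (test_df['column'].str.split(expand=True)
                          .stack()
                          .str.strip(string.punctuation))
print(words.loc[2].tolist())                 # ['Lehrerin', 'in', 'zu', 'Buch']
print(words.map(d).groupby(level=0).mean())  # one mean per original row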

Or:

import numpy as np
# Average the dict values of the words in each sentence (unknown words give NaN and are ignored).
f = lambda x: np.nanmean([d.get(y, np.nan) for y in x.split()])
test_df['score'] = test_df['column'].str.replace(r'[^\w\s]', '', regex=True).apply(f)
  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0
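
If you want to check which variant is faster on your real data, here is a rough timeit sketch (it assumes test_df, d and pat from the question are in scope; the regex version is spelled with groupby so that it also runs on pandas versions where mean(level=0) is no longer available):

import string
import timeit

def regex_score():
    return (test_df['column'].str.extractall(pat)[0]
                             .map(d).groupby(level=0).mean())

def split_score():
    return (test_df['column'].str.split(expand=True)
                             .stack()
                             .str.strip(string.punctuation)
                             .map(d).groupby(level=0).mean())

# On the tiny example the numbers mean little; run it on the real dataframe.
print(timeit.timeit(regex_score, number=100))
print(timeit.timeit(split_score, number=100))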

CodePudding user response:

You can try:

# Split on non-word characters, explode to one word per row, map to the
# dictionary and average per original row (unmatched tokens become NaN).
test_df['score'] = test_df['column'].str.split(r'\W').explode() \
                                    .map(d).groupby(level=0).mean()

Output:

>>> test_df

  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0