Remove words not in a dictionary

Time:12-09

I have a data table containing tuples of words from online reviews. It contains too many typos, so I'm trying to remove the words that are not in the dictionary. The dictionary I'm trying to use is KBBI (the Indonesian dictionary), https://pypi.org/project/kbbi/, installed and imported with...

pip install kbbi
from kbbi import KBBI

I have trouble matching my data against the dictionary because I am not familiar with its data type. The function I found in the original resource lets us search for a word and returns its definition. I only need to check whether a word exists in the dictionary (another way might be to extract all of the dictionary's text into a txt file). Here's an example of input...

# Look up "anjing" in the dictionary. "Anjing" is Indonesian for "dog".
anjing = KBBI('anjing')
print(anjing)

And its output

an.jing
1. (n)  mamalia yang biasa dipelihara untuk menjaga rumah, berburu, dan sebagainya 〔Canis familiaris〕
2. (n)  anjing yang biasa dipelihara untuk menjaga rumah, berburu, dan sebagainya 〔Canis familiaris〕

This is how I expect my result to look (notice that words that are not in the dictionary are removed)...

before: [masih, blom, cair, jugagmn, in]
after:  [masih, cair]

before: [alhmdllh, sangat, membantu, meski, bunga, cukup, besar]
after:  [alhmdllh, sangat, membantu, meski, bunga, cukup, besar]

Here is what I've tried so far...

def remove_typo(text):
    text = [word for word in text if word in KBBI]
    return text

df['after'] = df['before'].apply(lambda x: remove_typo(x))

I got an error saying "argument of type 'type' is not iterable" on the second line.

CodePudding user response:

I checked the docs for kbbi. A lookup raises an exception when the word is not found, so the solution is to use try-except:

import pandas as pd
from kbbi import KBBI, TidakDitemukan

L = [['masih', 'blom', 'cair', 'jugagmn', 'in'], 
     ['alhmdllh', 'sangat', 'membantu', 'meski', 'bunga', 'cukup', 'besar']]

df = pd.DataFrame({'before':L})

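def remove_typo(text):
    out = []
    for word in text:
        try:
            # KBBI(word) raises TidakDitemukan when the word has no entry
            if KBBI(word):
                out.append(word)
        except TidakDitemukan:
            # not in the dictionary: skip the word
            pass
    return out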

df['after'] = df['before'].apply(remove_typo)

print(df)
                                              before  \
0                   [masih, blom, cair, jugagmn, in]   
1  [alhmdllh, sangat, membantu, meski, bunga, cuk...   

                                            after  
0                                   [masih, cair]  
1  [sangat, membantu, meski, bunga, cukup, besar]  
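
Each `KBBI(word)` call appears to perform an online lookup, so reviews that repeat the same words would trigger the same request many times. A minimal sketch of caching the result per word follows. The names `make_word_checker`, `NotFound` and `fake_lookup` are hypothetical and only exist so the example runs offline; with kbbi installed, one would presumably pass `lookup=KBBI` and `not_found=TidakDitemukan` instead.

```python
from functools import lru_cache


class NotFound(Exception):
    """Stand-in for kbbi.TidakDitemukan so the sketch runs offline."""


def make_word_checker(lookup, not_found=NotFound):
    """Return a cached function telling whether lookup(word) succeeds."""

    @lru_cache(maxsize=None)
    def is_known(word):
        try:
            lookup(word)
        except not_found:
            return False
        return True

    return is_known


def remove_typo(words, is_known):
    """Keep only the words that the checker reports as known."""
    return [word for word in words if is_known(word)]


# Offline demonstration: a fake lookup that records every call it receives.
calls = []


def fake_lookup(word):
    calls.append(word)
    if word not in {"masih", "cair"}:
        raise NotFound(word)
    return word


is_known = make_word_checker(fake_lookup)
rows = [["masih", "blom", "cair"], ["masih", "cair", "blom"]]
cleaned = [remove_typo(row, is_known) for row in rows]
print(cleaned)     # filtered rows
print(len(calls))  # number of real lookups performed
```

With a DataFrame, the same checker could be applied as `df['before'].apply(lambda row: remove_typo(row, is_known))`, so each distinct word is looked up only once.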

CodePudding user response:

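# KBBI here has to be a collection of valid words (for example a set), not the KBBI class
text = [word for word in text if word in KBBI]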
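
A membership test of this form only works when the right-hand side of `in` is a container. The sketch below reproduces the TypeError from the question without kbbi, using a hypothetical stand-in class named `FakeDictionary`, and then shows the same test against a plain set of words.

```python
class FakeDictionary:
    """Hypothetical stand-in for a lookup class such as KBBI."""

    def __init__(self, word):
        self.word = word


raised = False
try:
    # The right-hand side is the class itself, not a collection of words.
    "anjing" in FakeDictionary
except TypeError as exc:
    raised = True
    print(exc)

# A set of words supports the membership test directly.
known_words = {"anjing", "masih", "cair"}
print("anjing" in known_words)
```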

CodePudding user response:

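First, make sure you really mean word in KBBI and not word in table. The error message says that the right-hand side of "in" is a type (a class), not a collection of words.

If this is correct, then the error comes from values in the Series that are not lists. You could modify your function so that it only filters list values and returns anything else unchanged: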

def remove_typo(text):
    if isinstance(text, list): 
        text = [word for word in text if word in KBBI] # should this be "table"?
        # text = [word for word in text if word in table]
    return text

df['after'] = df['before'].apply(remove_typo)

Or use an exception:

def remove_typo(text):
    try:
        return [word for word in text if word in KBBI]
    except TypeError:  # "argument of type ... is not iterable" is a TypeError
        return text

df['after'] = df['before'].apply(remove_typo)
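
If the dictionary words are available as a plain text file, the `table` mentioned above could simply be a set loaded from that file. The following is a minimal sketch under that assumption: the file name `wordlist.txt`, its one-word-per-line format and the helper `load_wordlist` are all hypothetical, and kbbi itself is not known to ship such a file.

```python
import tempfile
from pathlib import Path


def load_wordlist(path):
    """Read one word per line into a set of lowercase words."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_typo(words, vocabulary):
    """Keep only the words found in the vocabulary set."""
    return [word for word in words if word.lower() in vocabulary]


# Demonstration with a small temporary word list (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    wordlist_path = Path(tmp) / "wordlist.txt"
    wordlist_path.write_text("masih\ncair\nsangat\n", encoding="utf-8")
    vocabulary = load_wordlist(wordlist_path)

result = remove_typo(["masih", "blom", "cair", "jugagmn", "in"], vocabulary)
print(result)
```

A set gives constant-time membership tests and needs no network access, at the cost of having to obtain and maintain the word list separately.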