Fuzzy Process extract one giving different result

Time:10-23

I have a data frame, and I am trying to map the values of one of its columns to the values in a set.

The data frame is:

Name   CallType    Location
ABC     IN          SFO
DEF     OUT         LHR
PQR     INCOMING    AMS
XYZ     OUTGOING    BOM
TYR     A_IN        DEL
OMN     A_OUT       DXB

I have a constant set of call types, and each CallType value should be replaced by the matching entry in that set:

call_type = {"IN", "OUT"}

Desired data frame

Name   CallType    Location
ABC     IN         SFO
DEF     OUT        LHR
PQR     IN         AMS
XYZ     OUT        BOM
TYR     IN         DEL
OMN     OUT        DXB

I wrote code to do this, but process.extractOne sometimes returns IN for OUTGOING (which is wrong) and sometimes returns OUT for OUTGOING (which is right).

Here is my code:

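import pandas as pd
from thefuzz import process  # assumption: process comes from thefuzz (formerly fuzzywuzzy)

data = [('ABC', 'IN', 'SFO'),
        ('DEF', 'OUT', 'LHR'),
        ('PQR', 'INCOMING', 'AMS'),
        ('XYZ', 'OUTGOING', 'BOM'),
        ('TYR', 'A_IN', 'DEL'),
        ('OMN', 'A_OUT', 'DXB')]

df = pd.DataFrame(data, columns=['Name', 'CallType', 'Location'])

call_types = set(['IN', 'OUT'])

# Replace each CallType with the best fuzzy match from call_types
df['CallType'] = df['CallType'].apply(lambda x: process.extractOne(x, list(call_types))[0])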


total_rows = len(df)

for row_no in range(total_rows):
    row = df.iloc[row_no]
    print(row)  # Sometimes OUTGOING becomes OUT and sometimes IN. Shouldn't the result be consistent?

I am not sure whether there is a better way. Can someone please point out what I am missing?

CodePudding user response:

Looks like Series.str.extract is a good fit for this:

df['CallType'] = df.CallType.str.extract(r'(OUT|IN)')

print(df)

  Name CallType Location
0  ABC       IN      SFO
1  DEF      OUT      LHR
2  PQR       IN      AMS
3  XYZ      OUT      BOM
4  TYR       IN      DEL
5  OMN      OUT      DXB

Or, if you want to use call_types explicitly, you can do:

df['CallType'] = df.CallType.str.extract(fr"({'|'.join(call_types)})")

# same result
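
When the pattern is built from call_types, a slightly more defensive variant may be worth considering. The sketch below uses only the standard re module on plain strings; the helper name normalize_call_type is hypothetical and not part of the original answer. It escapes each entry, orders the alternatives longest first so the pattern does not depend on set iteration order, and returns None when nothing matches.

```python
import re

call_types = {"IN", "OUT"}

# Sort longest first (then alphabetically) so the alternation is identical on
# every run, and escape each entry in case it contains regex metacharacters.
alternatives = sorted(call_types, key=lambda s: (-len(s), s))
pattern = re.compile("|".join(re.escape(s) for s in alternatives))

def normalize_call_type(value):
    """Return the leftmost call type found in value, or None if there is none."""
    match = pattern.search(value)
    return match.group(0) if match else None

raw = ["IN", "OUT", "INCOMING", "OUTGOING", "A_IN", "A_OUT"]
print([normalize_call_type(v) for v in raw])
# ['IN', 'OUT', 'IN', 'OUT', 'IN', 'OUT']
```

On a data frame, the same helper could be applied with df['CallType'].map(normalize_call_type).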

CodePudding user response:

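A possible solution is to use difflib.get_close_matches. Its default cutoff of 0.6 rejects short targets such as 'IN' against 'INCOMING' (similarity 0.4), which would leave an empty list to index, so the cutoff is lowered here:

import difflib

df['CallType'] = df['CallType'].apply(
    # n=1 keeps only the best match; cutoff=0.0 disables the default 0.6 threshold
    lambda x: difflib.get_close_matches(x, list(call_types), n=1, cutoff=0.0)[0])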

Output:

  Name CallType Location
0  ABC       IN      SFO
1  DEF      OUT      LHR
2  PQR       IN      AMS
3  XYZ      OUT      BOM
4  TYR       IN      DEL
5  OMN      OUT      DXB
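
To see how difflib ranks the candidates, the following sketch prints the similarity ratio of each call type against the longer values. It uses only the standard library; the helper name similarity is an illustrative assumption. It also shows that, with the default cutoff of 0.6, 'INCOMING' and 'OUTGOING' get no match at all.

```python
from difflib import SequenceMatcher, get_close_matches

call_types = ["IN", "OUT"]

def similarity(candidate, value):
    # Same ratio that get_close_matches computes: 2 * matches / total length.
    return SequenceMatcher(None, candidate, value).ratio()

for value in ["INCOMING", "OUTGOING", "A_IN", "A_OUT"]:
    scores = {c: round(similarity(c, value), 3) for c in call_types}
    default = get_close_matches(value, call_types)  # default cutoff=0.6
    best = get_close_matches(value, call_types, n=1, cutoff=0.0)
    print(value, scores, default, best)
```

Because the scores for 'OUTGOING' differ (about 0.545 for 'OUT' versus 0.4 for 'IN'), the ranking here does not depend on the order of the candidates.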

Another possible solution:

import numpy as np

# Anything containing 'OUT' becomes 'OUT'; everything else is treated as 'IN'
df['CallType'] = np.where(df['CallType'].str.contains('OUT'), 'OUT', 'IN')

Output:

# same
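
On the inconsistency asked about in the question, a likely cause (assuming the default scorer gives equal credit to any choice that appears as a substring of the query) is that 'OUTGOING' contains both 'OUT' and 'IN', so the two choices tie and the winner is whichever comes first in list(call_types). The iteration order of a set of strings can change between interpreter runs because of string hash randomization. The sketch below reproduces the effect with a hypothetical stand-in scorer; toy_score is not the library's scorer, and both pick_* helpers are illustrative names.

```python
def toy_score(query, choice):
    # Hypothetical stand-in for a partial-match scorer:
    # full marks whenever the choice occurs anywhere inside the query.
    return 100 if choice in query else 0

def pick_first_best(query, choices):
    # max() returns the first item with the highest score, so ties go to
    # whichever choice happens to come first in the list.
    return max(choices, key=lambda c: toy_score(query, c))

def pick_stable(query, choices):
    # Break ties by preferring the longer choice, then alphabetical order,
    # so the result no longer depends on the order of `choices`.
    return max(sorted(choices), key=lambda c: (toy_score(query, c), len(c)))

print(pick_first_best("OUTGOING", ["IN", "OUT"]))  # IN
print(pick_first_best("OUTGOING", ["OUT", "IN"]))  # OUT
print(pick_stable("OUTGOING", ["IN", "OUT"]))      # OUT
print(pick_stable("OUTGOING", ["OUT", "IN"]))      # OUT
```

With the real library, passing sorted(call_types) instead of list(call_types) should make runs repeatable, but it may still settle on the wrong value; an exact substring rule such as the ones in the answers above avoids the tie altogether.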