How to map for range values in Python Pandas-CodePudding

I have a dataset like below:

Customer_ID Policy_No   Age Channel Status  Income
123 5432    5   DD  Lapse   300000
123 5431    1   Agency  Active  420000
234 4352    7   DD  Lapse   650000

and a master dataset:

Group   Attributes  S2SFlag
Age 1 to 5  Yes
Age 6 to 10 No
Channel DD  Yes
Channel Agency  Yes
Status  Lapse   Yes
Status  Active  No
Income  300000 to 500000    Yes
Income  500000 to 700000    No

Here, the condition is to check the associated value for the Group and its value from the dataset with the master dataset and pick the S2SFlag column value from the master dataset and create an S2SFlag column in the dataset. The original dataset is returned by querying the database using AWS Lambda.

Also, if post-checking if there is a 'No' value for any of the Group attributes/row then the Flag for that particular group would be 'No'. The expected output is as below:

I have tried the below code but it is giving me the error:

import pandas as pd
import re

df1 = pd.read_csv('')  # Master table


def to_range(s):
    m = re.match('(\d ) to (\d )$', s)
    return range(int(m.group(1)), int(m.group(2))   1) if m else s


df1 = (df1.assign(Attributes=[to_range(s) for s in df1['Attributes']])
       .explode('Attributes'))

df2 = pd.read_csv('')  # Assuming it as query results

mapping_Age = df1[df1['Group'] == 'Age'].set_index('Attributes')['S2SFlag'] == 'Yes'

mapping_Channel = df1[df1['Group'] == 'Channel'].set_index('Attributes')['S2SFlag'] == 'Yes'

mapping_Status = df1[df1['Group'] == 'Status'].set_index('Attributes')['S2SFlag'] == 'Yes'

mapping_Income = df1[df1['Group'] == 'Income'].set_index('Attributes')['S2SFlag'] == 'Yes'
mapped_Age = df2['Age'].map(mapping_Age)
mapped_Channel = df2['Channel'].map(mapping_Channel)
mapped_Status = df2['Status'].map(mapping_Status)
mapped_Income = df2['Income'].map(mapping_Income)

df2['S2SFlag'] = (mapped_Age & mapped_Status & mapped_Channel & mapped_Income).map({True: 'Yes', False: 'No'})  

Traceback (most recent call last):
  File "", line 27, in <module>
    mapped_Income = df2['Income'].map(mapping_Income)
  File "", line 4161, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "", line 842, in _map_values
    indexer = mapper.index.get_indexer(values)
  File "", line 3442, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Also, I tried something like below which also gave me another error:

import pandas as pd
import re

df1 = pd.read_csv('') #Master table
df2 = pd.read_csv('') #Dataset
df1 = df1[df1['Group'] == 'Income']

for i, j in df2['Income'].iteritems():
for k, l in zip(df1['Attributes'], df1['S2SFlag']):
    start = k.split(" to ")[0]
    end = k.split(" to ")[1]
    if j in range(int(start), int(end)):
        print("J ", j, " l ", l)
        df2['S2SFlag'] = l

Output:

  Customer_ID  Policy_No  Age Channel  Status  Income S2SFlag
0          123       5432    5      DD   Lapse  300000      No
1          123       5431    1  Agency  Active  420000      No
2          234       4352    7      DD   Lapse  650000      No

Where am I going wrong?

Update 1:

As per Raymond, everything worked like a charm except for one scenario where I am not getting the output as expected.

Master Data:

Group   Attributes  S2SFlag
Insured Age Group   1 to 20 Yes
Insured Age Group   21 to 25    Yes
Insured Age Group   26 to 30    Yes
Insured Age Group   31 to 40    Yes
Insured Age Group   41 to 50    Yes
Insured Age Group   51 to 100   Yes
Policy Status   Lapse   Yes
Policy Status   Active  Yes
Policy Status   RPU No
Policy Status   Paid Up/ PPT Over   Yes
Policy Status   Claim settled   Yes
Policy Status   Claim rejected  No
Income Group    0 to 300000 Yes
Income Group    300001 to 500000    Yes
Income Group    500001 to 700000    Yes
Income Group    700001 to 1000000   Yes
Income Group    1000001 to 1500000  Yes
Income Group    1500001 to 1600000  Yes
Marrital status Single  Yes
Marrital status Married Yes
Marrital status Divorced    Yes
Marrital status Married with a child    Yes
Marrital status Single Parent   Yes
Marrital status 2 children  Yes
Occupation  Salaried    Yes
Occupation  Daily wage  No
Occupation  Self employed with <5LPA    No
Occupation  Self employed with >5LPA    Yes
Occupation  House wife  Yes
Occupation  House Husband   Yes
Education   1 to 10 Yes
Education   11 to 12    Yes
Education   Diploma Yes
Education   Graduate    Yes
Education   Post Graduate   Yes
Education   Doctoral    Yes
Fund Value  Positive    Yes
Fund Value  Negative    No
Fund Value  Zero    No

When I tried printing the mapped values for Education, it gave me results like below:

mapped_Education = df2['Education'].map(mapping_Education)
print(mapped_Education)  

0     NaN
1     NaN
2    True
Name: Education, dtype: object

Dataset:

Customer_ID Policy_No   Age Status  Income  Marital_Status  Occupation  Education   Fund Value
123 5432    5   Lapse   300000  Single  Salaried    6   Positive
123 5431    1   Active  420000  Married Daily wage  11  Zero
234 4352    7   Lapse   650000  Divorced    Daily wage  Doctoral    Zero

Due to this, I am getting output as below:

 Customer_ID  Policy_No  Age  Status  ...  Occupation Education Fund Value S2SFlag
0          123       5432    5   Lapse  ...    Salaried         6   Positive      No
1          123       5431    1  Active  ...  Daily wage        11       Zero      No
2          234       4352    7   Lapse  ...  Daily wage  Doctoral       Zero      No

Where S2SFlag for the first record should have been 'Yes'

CodePudding user response：

The error Reindexing only valid with uniquely valued Index objects says that map requires your mapping_Income to have unique index, and the index originates from df1[df1['Group'] == 'Income'].set_index('Attributes') where the 'Attributes' for Income contains exploded values that are not unique.

Your code for exploding is range inclusive, so when you explode the two ranges for Income, 500000 will appear once in each explode, ending up with two 500000 and thus the error. Please try to modify either one of the ranges to exclude its 500000, e.g. make the second range 500001 to 700000.

For Update 1:

I suspect it has to do with data-type again. The exploded value could be integers, but all values of df2['Education'] are string.

One quick fix is to change everything into strings.

#1 add df1['Attributes'] = df1['Attributes'].astype(str) after the line for explode

#2 add df2 = df2.astype(str) before any map to make all contents strings