String search on dataframe using key/value from dict

Time:03-26

I am trying to match the strings in the 'Disease' column of the dataframe below against the keys of a dict. If a key is present in the string, the 'category' column should be set to the dict value for that key.

df =

Year  category  Pollutant  Disease           DiseaseCaseCount  Industry
2016  null      Pb         hypertension      1025              b_battery_ltd
2016  null      PM25       lung cancer       180               t_chemicals
2016  null      PM25       lung cancer       180               t_powerplant
2016  null      Cu         lung cancer       200               b_miners
2016  null      Cu         lung cancer       200               a_preservative_pvt
2016  null      PM25       acute bronchitis  367               t_chemicals
2016  null      PM25       acute bronchitis  367               t_powerplant

and a dict

my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis":3}

What I have tried till now is

for x in my_dict:
    for row in df.itertuples(index=True, name='Pandas'):
        searchText = row.text
        #print(type(searchText))
        if (searchText.str.lower().str.contains(x).any()):
            row.class = my_dict[x]
        else:
             row.class = None
  
display(df)

It throws an error :

AttributeError: 'str' object has no attribute 'str'

Final dataframe which I'm looking at is

df =

 ---- -------- --------- ---------------- ---------------- ------------------ 
|Year|category|Pollutant|         Disease|DiseaseCaseCount|          Industry|
 ---- -------- --------- ---------------- ---------------- ------------------ 
|2016|    null|       Pb|    hypertension|            1025|     b_battery_ltd|
|2016|       2|     PM25|     lung cancer|             180|       t_chemicals|
|2016|       2|     PM25|     lung cancer|             180|      t_powerplant|
|2016|       2|       Cu|     lung cancer|             200|          b_miners|
|2016|       2|       Cu|     lung cancer|             200|a_preservative_pvt|
|2016|       3|     PM25|acute bronchitis|             367|       t_chemicals|
|2016|       3|     PM25|acute bronchitis|             367|      t_powerplant|
 ---- -------- --------- ---------------- ---------------- ------------------ 

CodePudding user response:

Here's one way: a list comprehension iterates over the values in the Disease column and uses next with a generator expression to get the dict value of the first key found in each string, falling back to NaN when nothing matches:

df['category'] = [next((v for k,v in my_dict.items() if k in x), float('nan')) for x in df['Disease'].tolist()]

Output:

   Year  category Pollutant           Disease  DiseaseCaseCount              Industry
0  2016       NaN        Pb      hypertension              1025         b_battery_ltd
1  2016       2.0      PM25       lung cancer               180           t_chemicals
2  2016       2.0      PM25       lung cancer               180          t_powerplant
3  2016       2.0        Cu       lung cancer               200              b_miners
4  2016       2.0        Cu       lung cancer               200    a_preservative_pvt
5  2016       3.0      PM25  acute bronchitis               367           t_chemicals
6  2016       3.0      PM25  acute bronchitis               367          t_powerplant
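
For comparison, the loop from the question can also be made to work by applying `.str` to the whole column rather than to a single cell. The accessor exists on a Series, while `itertuples` yields plain Python strings, which is what triggered the AttributeError. The following is only a minimal sketch: it rebuilds just the `Disease` column of the sample data, and the nullable `Int64` dtype is an assumption on my part, chosen so the categories stay integers instead of becoming floats.

```python
import pandas as pd

# Only the column needed for matching is rebuilt from the sample data.
df = pd.DataFrame({
    "Disease": [
        "hypertension", "lung cancer", "lung cancer", "lung cancer",
        "lung cancer", "acute bronchitis", "acute bronchitis",
    ]
})
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}

# Assumption: nullable Int64, so unmatched rows hold <NA> and matches stay integers.
df["category"] = pd.array([pd.NA] * len(df), dtype="Int64")

for key, value in my_dict.items():
    # .str works on the whole Series; regex=False treats the key as a literal substring.
    mask = df["Disease"].str.contains(key, case=False, regex=False, na=False)
    # Fill only rows that are still empty, so the first matching key wins.
    df.loc[mask & df["category"].isna(), "category"] = value

print(df)
```

This runs one column-wide pass per dict key instead of one Python-level check per cell, which may matter when the dataframe is large and the dict is small.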

CodePudding user response:

Try pandas apply(). It is usually readable and concise. There is probably a more performant way to do this with a vectorized function, but this approach is more intuitive.

def change_class(row, my_dict={"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}):
    # Return the value of the first key that occurs in the Disease string.
    for key, value in my_dict.items():
        if key in row['Disease']:
            return value
    # No key matched: keep the existing category.
    return row['category']

df['category'] = df.apply(change_class, axis=1)
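
One candidate for the vectorized route mentioned above is to extract the matching key with a single regular expression and then map it through the dict. This is a sketch rather than a benchmarked solution. It assumes the dict keys are lowercase, rebuilds only the `Disease` column of the sample data, and converts to the nullable `Int64` dtype (my choice, not part of the question) so that matches stay integers.

```python
import re

import pandas as pd

# Only the column needed for matching is rebuilt from the sample data.
df = pd.DataFrame({
    "Disease": [
        "hypertension", "lung cancer", "lung cancer", "lung cancer",
        "lung cancer", "acute bronchitis", "acute bronchitis",
    ]
})
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}

# One alternation pattern with a single capture group;
# re.escape guards keys that contain regex metacharacters.
pattern = "(" + "|".join(re.escape(k) for k in my_dict) + ")"

# extract returns the first key found in each string (NaN if none);
# map then looks up the dict value for that key.
found = df["Disease"].str.lower().str.extract(pattern, expand=False)
df["category"] = found.map(my_dict).astype("Int64")

print(df)
```

Note that the regex picks the leftmost match within each string, whereas the loop above picks the first key in dict order. The two agree on the sample data but can differ when a string contains more than one key.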