I am trying to match the strings in the 'Disease' column of the dataframe below against the keys of a dict. If a key occurs in the string, I want to set the 'category' column to the dict value for that key.
df =
Year | category | Pollutant | Disease | DiseaseCaseCount | Industry |
---|---|---|---|---|---|
2016 | null | Pb | hypertension | 1025 | b_battery_ltd |
2016 | null | PM25 | lung cancer | 180 | t_chemicals |
2016 | null | PM25 | lung cancer | 180 | t_powerplant |
2016 | null | Cu | lung cancer | 200 | b_miners |
2016 | null | Cu | lung cancer | 200 | a_preservative_pvt |
2016 | null | PM25 | acute bronchitis | 367 | t_chemicals |
2016 | null | PM25 | acute bronchitis | 367 | t_powerplant |
and a dict
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis":3}
What I have tried so far is
for x in my_dict:
    for row in df.itertuples(index=True, name='Pandas'):
        searchText = row.text
        #print(type(searchText))
        if (searchText.str.lower().str.contains(x).any()):
            row.class = my_dict[x]
        else:
            row.class = None
display(df)
It throws an error:
AttributeError: 'str' object has no attribute 'str'
The final dataframe I'm looking for is
df =
Year | category | Pollutant | Disease | DiseaseCaseCount | Industry |
---|---|---|---|---|---|
2016 | null | Pb | hypertension | 1025 | b_battery_ltd |
2016 | 2 | PM25 | lung cancer | 180 | t_chemicals |
2016 | 2 | PM25 | lung cancer | 180 | t_powerplant |
2016 | 2 | Cu | lung cancer | 200 | b_miners |
2016 | 2 | Cu | lung cancer | 200 | a_preservative_pvt |
2016 | 3 | PM25 | acute bronchitis | 367 | t_chemicals |
2016 | 3 | PM25 | acute bronchitis | 367 | t_powerplant |
CodePudding user response:
Here's one way: a list comprehension iterates over the values in the Disease column and uses next with a generator expression to get the dict value for the first key found in each string, falling back to NaN when no key matches:
df['category'] = [next((v for k,v in my_dict.items() if k in x), float('nan')) for x in df['Disease'].tolist()]
Output:
Year category Pollutant Disease DiseaseCaseCount Industry
0 2016 NaN Pb hypertension 1025 b_battery_ltd
1 2016 2.0 PM25 lung cancer 180 t_chemicals
2 2016 2.0 PM25 lung cancer 180 t_powerplant
3 2016 2.0 Cu lung cancer 200 b_miners
4 2016 2.0 Cu lung cancer 200 a_preservative_pvt
5 2016 3.0 PM25 acute bronchitis 367 t_chemicals
6 2016 3.0 PM25 acute bronchitis 367 t_powerplant
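The category column prints as 2.0 and 3.0 because NaN forces a float dtype. Below is a minimal, self-contained sketch of the same first-match idea that keeps the values as integers by using pandas' nullable Int64 dtype. The helper name first_match and the variable matches are made up for illustration, and lowercasing the text is an assumption carried over from the question's attempt.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "Year": [2016] * 7,
    "category": [None] * 7,
    "Pollutant": ["Pb", "PM25", "PM25", "Cu", "Cu", "PM25", "PM25"],
    "Disease": ["hypertension", "lung cancer", "lung cancer", "lung cancer",
                "lung cancer", "acute bronchitis", "acute bronchitis"],
    "DiseaseCaseCount": [1025, 180, 180, 200, 200, 367, 367],
    "Industry": ["b_battery_ltd", "t_chemicals", "t_powerplant", "b_miners",
                 "a_preservative_pvt", "t_chemicals", "t_powerplant"],
})
my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}


def first_match(text, mapping):
    """Return the value of the first key found as a substring of text, else None.

    Hypothetical helper; keys are checked in the dict's insertion order.
    """
    lowered = text.lower()
    return next((v for k, v in mapping.items() if k in lowered), None)


matches = [first_match(d, my_dict) for d in df["Disease"]]

# The nullable integer dtype stores None as <NA> and keeps 2 and 3 as integers.
df["category"] = pd.array(matches, dtype="Int64")
print(df)
```

If a Disease string could contain more than one key, the key that appears first in my_dict wins, so the order of the dict matters.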
CodePudding user response:
Try pandas apply(). It's usually more readable and concise. There is probably a faster, vectorized way to do this, but this approach is more intuitive.
def change_class(row, my_dict={"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}):
    # Return the value of the first key that occurs in the Disease string.
    for key, value in my_dict.items():
        if key in row['Disease'].lower():
            return value
    # No key matched: keep the existing category.
    return row['category']

df['category'] = df.apply(change_class, axis=1)
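As for the vectorized route mentioned above, one possible sketch is to build a single regex from the dict keys, pull out the matching key with str.extract, and translate it with map. The variable names (diseases, pattern, keys, category) are illustrative, and the Series holds only disease names taken from the question.

```python
import re

import pandas as pd

my_dict = {"cancer": 2, "brain tumor": 8, "acute bronchitis": 3}
diseases = pd.Series(["hypertension", "lung cancer", "acute bronchitis"])

# One alternation pattern built from the keys; re.escape guards against
# regex metacharacters in a key.
pattern = "(" + "|".join(re.escape(k) for k in my_dict) + ")"

# str.extract returns the matched key per row (NaN when nothing matches);
# map then looks that key up in the dict.
keys = diseases.str.lower().str.extract(pattern, expand=False)
category = keys.map(my_dict).astype("Int64")
print(category)
```

One difference from the loop-based versions: when a string contains several keys, the regex returns the key that occurs leftmost in the text rather than the one that comes first in the dict.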