I have a df that looks similar to the below but with 100s of columns, where each column references a different text resolution.
Index | A/RES/73/262 | A/RES/73/263 |
---|---|---|
Issue | NaN | HR |
Description | Protection of the Palestinian civilian | Situation of human rights in Myanmar |
The "Issue" row contains mostly NaN values but occasionally has "HR" manually filled in for when "human rights" are mentioned in the description.
I would like to automate filling in the "Issue" row for a series of general terms using re.search. I am trying to figure out how to iterate my function over the columns so that whenever a term is matched in a column's description cell, the Issue cell of that column is set according to a dictionary. However, I am having trouble getting the for loop right.
Below is the code I am working with so far. It raises TypeError: expected string or bytes-like object, since I am passing a set of columns to re.search rather than a string.
issue_dict = {'human rights':'HR', 'nuclear':'NU', 'sustainable development':'SD'}
for (columnName, columnData) in issues_df.iteritems():
    for key in issue_dict.keys():
        search_keys = re.search(rf"(?i){key}", issues_df.iloc[[1]])
        if search_keys != None:
            issues_df = issues_df.replace({issues_df.iloc[[0]]: issue_dict})
        else:
            issues_df.iloc[[0]] = pd.np.NaN
CodePudding user response:
First, it looks like your dataset would benefit from being transposed:
df.set_index('Index').T
output:
Index Issue Description
A/RES/73/262 NaN Protection of the Palestinian civilian
A/RES/73/263 HR Situation of human rights in Myanmar
Then you can easily use your data as columns:
(df.set_index('Index').T
   .assign(Issue=lambda d: (d['Description'].str.contains('human rights')
                            .map({True: 'HR', False: float('nan')}))
          )
   #.T.reset_index() ## uncomment if you want the original wide format
)
output (long):
Index Issue Description
A/RES/73/262 NaN Protection of the Palestinian civilian
A/RES/73/263 HR Situation of human rights in Myanmar
output (wide):
Index A/RES/73/262 A/RES/73/263
0 Issue NaN HR
1 Description Protection of the Palestinian civilian Situation of human rights in Myanmar
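The snippet above only handles 'human rights'. A minimal sketch of how the same transposed approach could cover the whole dictionary from the question follows. It assumes the row labels live in a column named 'Index', swaps str.contains for str.extract, and adds a made-up third resolution ("A/RES/00/000") purely to exercise a second key:

```python
import re

import pandas as pd

# Sample data mirroring the question's layout
df = pd.DataFrame({
    "Index": ["Issue", "Description"],
    "A/RES/73/262": [float("nan"), "Protection of the Palestinian civilian"],
    "A/RES/73/263": ["HR", "Situation of human rights in Myanmar"],
    "A/RES/00/000": [float("nan"), "Nuclear disarmament"],  # hypothetical column
})

issue_dict = {"human rights": "HR", "nuclear": "NU", "sustainable development": "SD"}

# One capture group holding every key, escaped in case a key
# ever contains regex metacharacters
pattern = "(" + "|".join(re.escape(key) for key in issue_dict) + ")"

long_df = df.set_index("Index").T
# First matching key per description, case-insensitive; NaN when nothing matches
matched = long_df["Description"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
# Lower-case the match so it lines up with the dictionary keys
long_df["Issue"] = matched.str.lower().map(issue_dict)

print(long_df)
```

Note that this overwrites the whole Issue column, so any manually entered code whose description matches no key would be reset to NaN.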
CodePudding user response:
I think you're over-complicating the use of re.search. You could create a pattern from your dictionary keys, then look for this pattern and set the Issue row accordingly:
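import re
import numpy as np

issue_dict = {'human rights':'HR', 'nuclear':'NU', 'sustainable development':'SD'}
# One case-insensitive pattern that matches any of the dictionary keys
key_pattern = re.compile('|'.join(map(re.escape, issue_dict)), flags=re.IGNORECASE)
# Assumes the row labels are the index (not a regular column): row 0 is Issue, row 1 is Description
for i in range(len(issues_df.columns)):
    matched_key = key_pattern.search(str(issues_df.iloc[1, i]))
    # Write through the DataFrame itself; assigning to the Series yielded by
    # iteration may only change a copy (iteritems and pd.np no longer exist in pandas 2.x)
    issues_df.iloc[0, i] = issue_dict[matched_key.group().lower()] if matched_key else np.nan
print(issues_df)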
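If you would rather avoid the explicit loop, the same lookup can be sketched as a small helper applied to the Description row. This is only an illustration: the helper name issue_code is made up, and the sample frame assumes 'Issue' and 'Description' are index labels rather than values in a column:

```python
import re

import pandas as pd

# Assumed layout: the row labels 'Issue' and 'Description' form the index
issues_df = pd.DataFrame(
    {
        "A/RES/73/262": [None, "Protection of the Palestinian civilian"],
        "A/RES/73/263": [None, "Situation of human rights in Myanmar"],
    },
    index=["Issue", "Description"],
)

issue_dict = {"human rights": "HR", "nuclear": "NU", "sustainable development": "SD"}
key_pattern = re.compile("|".join(map(re.escape, issue_dict)), flags=re.IGNORECASE)


def issue_code(description):
    """Return the code of the first key found in the description, else NaN."""
    match = key_pattern.search(str(description))
    return issue_dict[match.group().lower()] if match else float("nan")


# Series.map calls the helper once per description cell; the result is
# aligned on the column labels when assigned back to the Issue row
issues_df.loc["Issue"] = issues_df.loc["Description"].map(issue_code)

print(issues_df)
```

Because the result is aligned on column labels, the order of the columns does not matter.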