How to iterate a re.search function over columns

Time:11-16

I have a df that looks similar to the below but with 100s of columns, where each column references a different text resolution.

Index        A/RES/73/262                            A/RES/73/263
Issue        NaN                                     HR
Description  Protection of the Palestinian civilian  Situation of human rights in Myanmar

The "Issue" row contains mostly NaN values, but "HR" has occasionally been filled in by hand where "human rights" is mentioned in the description.

I would like to automate filling in the "Issue" row for a series of general terms using re.search. I am trying to figure out how to iterate my function over the columns so that whenever a term is found in a column's description cell, the value in that column's Issue cell is set according to a dictionary. However, I am having trouble getting the for loop right.

Below is the code I have so far. It raises "TypeError: expected string or bytes-like object", because I am passing a slice of the DataFrame to re.search rather than a string.

issue_dict = {'human rights':'HR', 'nuclear':'NU', 'sustainable development':'SD'}

for (columnName, columnData) in issues_df.iteritems():
    for key in issue_dict.keys():      
        search_keys = re.search(rf"(?i){key}", issues_df.iloc[[1]])
        if search_keys != None:
            issues_df = issues_df.replace({issues_df.iloc[[0]]: issue_dict})   
        else:
            issues_df.iloc[[0]] = pd.np.NaN

CodePudding user response:

First, it looks like your dataset would benefit from being transposed:

df.set_index('Index').T

output:

Index        Issue                             Description
A/RES/73/262   NaN  Protection of the Palestinian civilian
A/RES/73/263    HR    Situation of human rights in Myanmar

Then you can easily use your data as columns:

(df.set_index('Index').T
   .assign(Issue=lambda d: (d['Description'].str.contains('human rights')
                            .map({True: 'HR', False: float('nan')}))
          )
   #.T.reset_index() ## uncomment if you want the original wide format
)

output (long):

Index        Issue                             Description
A/RES/73/262   NaN  Protection of the Palestinian civilian
A/RES/73/263    HR    Situation of human rights in Myanmar

output (wide):

         Index                            A/RES/73/262                          A/RES/73/263
0        Issue                                     NaN                                    HR
1  Description  Protection of the Palestinian civilian  Situation of human rights in Myanmar
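The snippet above only handles 'human rights'. A minimal sketch of how the same transposed approach might cover the whole dictionary from the question, assuming the keys are lower-case and that matching should be case-insensitive (as the question's (?i) flag suggests). The sample frame and the names long_df, pattern and matched are illustrative, not from the original post:

```python
import re
import pandas as pd

# Sample data mirroring the question's layout (wide: one column per resolution)
df = pd.DataFrame({
    "Index": ["Issue", "Description"],
    "A/RES/73/262": [float("nan"), "Protection of the Palestinian civilian"],
    "A/RES/73/263": ["HR", "Situation of human rights in Myanmar"],
})

issue_dict = {"human rights": "HR", "nuclear": "NU", "sustainable development": "SD"}
# One capture group containing every key, escaped in case a key has regex metacharacters
pattern = "(" + "|".join(map(re.escape, issue_dict)) + ")"

long_df = df.set_index("Index").T
# extract() returns the first matching key (or NaN); lower-case it before the lookup
matched = long_df["Description"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
long_df["Issue"] = matched.str.lower().map(issue_dict)
print(long_df)
```

Note that this overwrites the whole Issue column, so a manually entered code is lost if its description matches none of the keys. If a description mentions several terms, only the first match is used.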

CodePudding user response:

I think you're over-complicating the use of re.search.
You could create a pattern from your dictionary keys, then look for this pattern and set the Issue row accordingly:

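import re
import numpy as np

issue_dict = {'human rights':'HR', 'nuclear':'NU', 'sustainable development':'SD'}
# One case-insensitive pattern matching any key; re.escape guards against regex metacharacters
key_pattern = re.compile("|".join(map(re.escape, issue_dict)), flags=re.IGNORECASE)

# Assumes the row labels are the index (row 0 = Issue, row 1 = Description),
# e.g. after issues_df = issues_df.set_index('Index')
issue_row = issues_df.index[0]
for columnName, columnData in issues_df.items():  # iteritems() was removed in pandas 2.0
    # str() keeps re.search from raising TypeError on a NaN description
    matched_key = key_pattern.search(str(columnData.iloc[1]))
    # Assign through the DataFrame: columnData may be a copy, and pd.np no longer exists
    issues_df.loc[issue_row, columnName] = (
        issue_dict[matched_key.group().lower()] if matched_key else np.nan
    )

print(issues_df)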
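
As for the TypeError in the question: issues_df.iloc[[1]] uses a list indexer, so it returns a one-row DataFrame, and re.search only accepts a string (or bytes). A small sketch of the difference, using a made-up two-column frame shaped like the one in the question (the names row, cell and raised are illustrative):

```python
import re
import pandas as pd

issues_df = pd.DataFrame(
    {
        "A/RES/73/262": [float("nan"), "Protection of the Palestinian civilian"],
        "A/RES/73/263": ["HR", "Situation of human rights in Myanmar"],
    },
    index=["Issue", "Description"],
)

row = issues_df.iloc[[1]]    # list indexer -> one-row DataFrame
cell = issues_df.iloc[1, 1]  # scalar indexers -> the string in that cell

try:
    re.search("(?i)human rights", row)  # a DataFrame is not a string
    raised = False
except TypeError:
    raised = True

match = re.search("(?i)human rights", cell)  # a single cell works
print(type(row).__name__, type(cell).__name__, raised, match.group())
```

So inside the loop, the search has to be given one description cell at a time, which is what indexing into columnData does.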