Rename strings in a python list using string matching based on existing strings-CodePudding

Consider the following example of a list that contains dataframe headers based on a table I scraped:

headers = ['0 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Name  and Principal Position|', '1 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',  '2 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Year|', '3 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Year|', '4 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan', '5 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Salary| ($)|', '6 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Salary| ($)|', '7 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan', '8 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Option  Awards| ($)|', '9 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Option  Awards| ($)|', '10 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan', '11 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Non-Equity  Incentive Plan Compensation| ($)|', '12 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Non-Equity  Incentive Plan Compensation| ($)|', '13 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan', '14 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Change  in Pension Value and Nonqualified Deferred Compensation  Earnings| ($)|', '15 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Change  in Pension Value and Nonqualified Deferred Compensation  Earnings| ($)|', '16 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan', '17 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan All  Other Compensation| ($)|', '18 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan All  Other Compensation| ($)|', '19 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan', '20 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Total| ($)|', '21 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Total| ($)|']

In this form, the headers are quite messy so I would like to standardize them. I am only interested in columns that contain specific key words so I use the following Code to filter for these in a first step:

df= df.filter(regex='Name|Year|Salary|Bonus|Period')

I am using this code to rename the string in the headers list according to the key words and to set them as new header:

headers = df.columns.values.tolist()
headers = ["Name" if "Name" in ele else ele for ele in headers]
headers = ["Year" if "Year" in ele else ele for ele in headers]
headers = ["Period" if "Period" in ele else ele for ele in headers]
headers = ["Salary" if "Salary" in ele else ele for ele in headers]
headers = ["Bonus" if "Bonus" in ele else ele for ele in headers]
df.columns = headers

So whenever a header string contains the string "Year", it is simply renamed "Year".

The code works fine as long as only one of the terms "Name", "Year", "Period", "Salary" or "Bonus" appears in a given header string at a time.

In the posted headers example list however, the key word "Year" appears in every string (in every header), so that my code will rename every string as "Year".

If a heading contains more than one of the key words, e.g. the following, that contains "Year" and "Salary"

'6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|'

I would like to check which of the terms has already been set as a heading. If "Year" has already been set as a heading, "Salary" should be the new heading of the column.

CodePudding user response：

You also have the case where both have already been used as header.

Try this:

headers = df.columns.values.tolist()
words = ['Name','year','salary','bonus','Period']
for i, header in enumerate(headers):
    for word in words:
        if word in header:
            headers[i]=word
            words.remove(word)
            break
df.columns = headers

if 2 values are present and none have been used as header, the first one based on the order in words will be chosen.
if both have already been used as header then the header will be composed of both words