Consider the following example of a list that contains dataframe headers based on a table I scraped:
headers = ['0 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Name and Principal Position|', '1 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '2 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Year|', '3 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Year|', '4 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '5 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|', '6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|', '7 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '8 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Option Awards| ($)|', '9 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Option Awards| ($)|', '10 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '11 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Non-Equity Incentive Plan Compensation| ($)|', '12 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Non-Equity Incentive Plan Compensation| ($)|', '13 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '14 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Change in Pension Value and Nonqualified Deferred Compensation Earnings| ($)|', '15 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Change in Pension Value and Nonqualified Deferred Compensation Earnings| ($)|', '16 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '17 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan All Other Compensation| ($)|', '18 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan All Other Compensation| ($)|', '19 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan', '20 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Total| ($)|', '21 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Total| ($)|']
In this form, the headers are quite messy so I would like to standardize them. I am only interested in columns that contain specific key words so I use the following Code to filter for these in a first step:
df= df.filter(regex='Name|Year|Salary|Bonus|Period')
I am using this code to rename the string in the headers list according to the key words and to set them as new header:
headers = df.columns.values.tolist()
headers = ["Name" if "Name" in ele else ele for ele in headers]
headers = ["Year" if "Year" in ele else ele for ele in headers]
headers = ["Period" if "Period" in ele else ele for ele in headers]
headers = ["Salary" if "Salary" in ele else ele for ele in headers]
headers = ["Bonus" if "Bonus" in ele else ele for ele in headers]
df.columns = headers
So whenever a header string contains the string "Year", it is simply renamed "Year".
The code works fine as long as only one of the terms "Name", "Year", "Period", "Salary" or "Bonus" appears in a given header string at a time.
In the posted headers example list however, the key word "Year" appears in every string (in every header), so that my code will rename every string as "Year".
If a heading contains more than one of the key words, e.g. the following, that contains "Year" and "Salary"
'6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|'
I would like to check which of the terms has already been set as a heading. If "Year" has already been set as a heading, "Salary" should be the new heading of the column.
CodePudding user response:
You also have the case where both have already been used as header.
Try this:
headers = df.columns.values.tolist()
words = ['Name','year','salary','bonus','Period']
for i, header in enumerate(headers):
for word in words:
if word in header:
headers[i]=word
words.remove(word)
break
df.columns = headers
- if 2 values are present and none have been used as header, the first one based on the order in words will be chosen.
- if both have already been used as header then the header will be composed of both words