I have some data in a dataframe in the below format. Please see image link below
The problem I'm trying to solve is two-fold
- For the salary column and I want to separate the text and numbers and extract the value. Wherever there is a range I want to take the average
- Depending on if the salary is hourly/weekly/yearly etc I want to add a column for salary type based on if there are substring characters such as ('year','month','week','hour' etc)
The final output should look like what is in the image below
Thanks!
CodePudding user response:
This can work for you
for i in range(len(df)):
splitted_value = df["salary"].iloc[i].split()
salary_type = (splitted_value[-1] "ly").title()
if "-" in splitted_value:
ranged_salary = [int(x.replace("$","").replace(",","")) for x in splitted_value if "$" in x]
salary = sum(ranged_salary)/len(ranged_salary)
else:
salary = int(splitted_value[-3].replace("$","").replace(",",""))
df.loc[i,"salary_value"] = salary
df.loc[i,"salary_type"] = salary_type
CodePudding user response:
This is an interesting question, but next time please provide the input data as something we can copy/paste.
What you need is a function that converts the string for the salary data into the value and the salary type.
You parse over the characters in the string to find the numbers, and use a boolean switch when you encounter the -
(dash) character, in case you need to calculate an average.
lst = [
"Up to $80,000 a year",
"$8,500 - $10,500 a month",
"$25 - $40 an hour",
"$1,546 a week"
]
def convert(salary_data: str):
value = ""
value_max = ""
need_average = False
# iterate over the characters in the string
for c in salary_data:
if c.isdigit():
if need_average:
value_max = c
else:
value = c
elif c == "-":
# switch to adding to value_max after finding the dash
need_average = True
if not need_average:
# slight cheating for the f-string below
value_max = value
value = f"{(int(value) int(value_max)) / 2:.2f}"
if "hour" in salary_data:
salary_type = "hourly"
elif "week" in salary_data:
salary_type = "weekly"
elif "month" in salary_data:
salary_type = "monthly"
else:
# use this as fallback
salary_type = "yearly"
return value, salary_type
for element in lst:
value, salary_type = convert(element)
print(value, salary_type)
output
80000.00 yearly
9500.00 monthly
32.50 hourly
1546.00 weekly