How to extract specific substrings and separate text from numbers in pandas dataframe?-CodePudding

I have some data in a dataframe in the below format. Please see image link below

Current Output

The problem I'm trying to solve is two-fold

For the salary column and I want to separate the text and numbers and extract the value. Wherever there is a range I want to take the average
Depending on if the salary is hourly/weekly/yearly etc I want to add a column for salary type based on if there are substring characters such as ('year','month','week','hour' etc)

The final output should look like what is in the image below

Expected Output

Thanks!

CodePudding user response：

This can work for you

for i in range(len(df)):
    splitted_value = df["salary"].iloc[i].split()
    salary_type = (splitted_value[-1] "ly").title()
    if "-" in splitted_value:
        ranged_salary = [int(x.replace("$","").replace(",","")) for x in splitted_value if "$" in x]
        salary = sum(ranged_salary)/len(ranged_salary)
    else:
        salary = int(splitted_value[-3].replace("$","").replace(",",""))
    df.loc[i,"salary_value"] = salary
    df.loc[i,"salary_type"] = salary_type

CodePudding user response：

This is an interesting question, but next time please provide the input data as something we can copy/paste.

What you need is a function that converts the string for the salary data into the value and the salary type.

You parse over the characters in the string to find the numbers, and use a boolean switch when you encounter the - (dash) character, in case you need to calculate an average.

lst = [
    "Up to $80,000 a year",
    "$8,500 - $10,500 a month",
    "$25 - $40 an hour",
    "$1,546 a week"
]


def convert(salary_data: str):
    value = ""
    value_max = ""
    need_average = False
    # iterate over the characters in the string
    for c in salary_data:
        if c.isdigit():
            if need_average:
                value_max  = c
            else:
                value  = c
        elif c == "-":
            # switch to adding to value_max after finding the dash
            need_average = True
    if not need_average:
        # slight cheating for the f-string below
        value_max = value
    value = f"{(int(value)   int(value_max)) / 2:.2f}"
    if "hour" in salary_data:
        salary_type = "hourly"
    elif "week" in salary_data:
        salary_type = "weekly"
    elif "month" in salary_data:
        salary_type = "monthly"
    else:
        # use this as fallback
        salary_type = "yearly"
    return value, salary_type


for element in lst:
    value, salary_type = convert(element)
    print(value, salary_type)

output

80000.00 yearly
9500.00 monthly
32.50 hourly
1546.00 weekly