Replacing part of string in a column with exceptions-CodePudding

I am doing a data cleaning in Jupyter Notebook with Pandas and I am trying to get just first part of the string. But there is a catch. I can easily delete rest of the expression but some fields are actually valid. So for example in the column:

SHIPMENT_PROVIDER
Usps
Usps International
Uspsxy3
Usps7kju
Usps0by

So I want to keep Usps and Usps international. So far I used following code for simpler challenges:

orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(to_replace='(?:Usps)([a-zA-Z0-9] )$', value = 'Usps', regex = True)

But this won't work with two alternative expressions. So the idea that Usps with some random characters e.g.(Uspsxyz) should be replaced by Usps, but Usps International with some random characters e.g. (Usps Internationalxyz) should be replaced by Usps International.

CodePudding user response：

Others have posted regex solutions. How about a non-regex solution:

s = orders["SHIPMENT_PROVIDER"]
orders["SHIPMENT_PROVIDER"] = np.select(
    [s.str.startswith("Usps International"), s.str.startswith("Usps")],
    ["Usps International", "Usps"],
    s,
)

CodePudding user response：

As a pattern, you could use a capture group for the first part instead with an optional part for International

^(Usps)(?: International)?[a-zA-Z0-9] $

In the replacement use group 1.

Regex demo

import pandas as pd

pattern = r"^(Usps)(?: International)?[a-zA-Z0-9] $"

items = [
    "Usps",
    "Usps International",
    "Uspsxy3",
    "Usps7kju",
    "Usps0by",
    "Usps Internationalxyz"
]

orders = pd.DataFrame(items, columns=["SHIPMENT_PROVIDER"])
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(r"^(Usps)(?: International)?[a-zA-Z0-9] $", r"\1", regex=True)
print(orders)

Output

    SHIPMENT_PROVIDER
0                Usps
1  Usps International
2                Usps
3                Usps
4                Usps
5                Usps