I am doing a data cleaning in Jupyter Notebook with Pandas and I am trying to get just first part of the string. But there is a catch. I can easily delete rest of the expression but some fields are actually valid. So for example in the column:
SHIPMENT_PROVIDER |
---|
Usps |
Usps International |
Uspsxy3 |
Usps7kju |
Usps0by |
So I want to keep Usps and Usps international. So far I used following code for simpler challenges:
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(to_replace='(?:Usps)([a-zA-Z0-9] )$', value = 'Usps', regex = True)
But this won't work with two alternative expressions. So the idea that Usps with some random characters e.g.(Uspsxyz) should be replaced by Usps, but Usps International with some random characters e.g. (Usps Internationalxyz) should be replaced by Usps International.
CodePudding user response:
Others have posted regex solutions. How about a non-regex solution:
s = orders["SHIPMENT_PROVIDER"]
orders["SHIPMENT_PROVIDER"] = np.select(
[s.str.startswith("Usps International"), s.str.startswith("Usps")],
["Usps International", "Usps"],
s,
)
CodePudding user response:
As a pattern, you could use a capture group for the first part instead with an optional part for International
^(Usps)(?: International)?[a-zA-Z0-9] $
In the replacement use group 1.
import pandas as pd
pattern = r"^(Usps)(?: International)?[a-zA-Z0-9] $"
items = [
"Usps",
"Usps International",
"Uspsxy3",
"Usps7kju",
"Usps0by",
"Usps Internationalxyz"
]
orders = pd.DataFrame(items, columns=["SHIPMENT_PROVIDER"])
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(r"^(Usps)(?: International)?[a-zA-Z0-9] $", r"\1", regex=True)
print(orders)
Output
SHIPMENT_PROVIDER
0 Usps
1 Usps International
2 Usps
3 Usps
4 Usps
5 Usps