I have a text column containing notes. Most of the notes end with 2 to 3 capital letters after the last space, as in the two examples below. I need to extract the characters that come after the last space into a new column, in either pandas or SQL. They should be extracted only if they are capital letters; otherwise the new column should be null.
Ex 1 - 5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG
Ex 2 - 54986866 | NA | customer order returned| refund has been given | 12.4.2021 | AKS
CodePudding user response:
If those are just strings, you can use string.split(" ")[-1] to retrieve the last part:
my_str = "5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG"
my_str.split(" ")[-1]
The output is "BG". You can then use string.isupper() to check the case:
my_str.split(" ")[-1].isupper()
The output is True.
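The two checks above can be combined into one helper that returns the last chunk only when it qualifies, and None otherwise. This is a minimal sketch, not a definitive implementation: the function name `last_code` and the `min_len`/`max_len` parameters are hypothetical, and the 2 to 3 letter limit is taken from the question.

```python
from typing import Optional


def last_code(note: str, min_len: int = 2, max_len: int = 3) -> Optional[str]:
    """Return the last whitespace-separated chunk if it is 2-3 capital letters, else None."""
    tokens = note.split()  # splits on any whitespace and ignores trailing spaces
    if not tokens:
        return None
    last = tokens[-1]
    # isalpha() rejects digits and punctuation; isupper() rejects lower or mixed case
    if min_len <= len(last) <= max_len and last.isalpha() and last.isupper():
        return last
    return None


my_str = "5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG"
code = last_code(my_str)
```

On a pandas column this could be applied with df["note"].apply(last_code), assuming the column holds plain strings.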
CodePudding user response:
import pandas as pd

df = pd.DataFrame({
    "col1": ["approver is BG", " AKS"]
})
# Take the last space-separated chunk of each value
df["col2"] = df["col1"].str.split(" ").str[-1]
df is:
col1            col2
approver is BG  BG
 AKS            AKS
CodePudding user response:
You can use a regex to explicitly select 2 or 3 upper-case characters at the end of the string:
df['new'] = df['note'].str.extract(r'([A-Z]{2,3}$)')
Or, more generally, take the last chunk with rsplit (with no pattern it splits on whitespace):
df['new'] = df['note'].str.rsplit(n=1).str[-1]
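Note that the pattern [A-Z]{2,3}$ also matches the tail of a longer upper-case word, so a note ending in "JUNK" would yield "UNK". A stricter variant, sketched below, requires whitespace (or the start of the string) before the code. The sample rows are assumptions borrowed from elsewhere in this thread, and the column names `note` and `new` follow the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "note": [
        "5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG",
        "54986866 | NA | customer order returned| refund has been given | 12.4.2021 | AKS",
        "this is JUNK",
        "this is pure garbage.",
    ]
})

# (?:^|\s) requires start of string or whitespace before the code, so the
# tail of a longer word such as "JUNK" is not picked up as "UNK".
# \s*$ tolerates trailing whitespace. Rows without a match become NaN.
df["new"] = df["note"].str.extract(r"(?:^|\s)([A-Z]{2,3})\s*$", expand=False)
```

With expand=False and a single capture group, str.extract returns a Series, so it can be assigned directly to the new column.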
CodePudding user response:
If you want to do this in SQL, you can ignore how many strings there are by reversing the string before cracking it apart with OPENJSON(), then reversing it again once the last element is extracted. Checking for upper case is a little cumbersome in SQL Server as well; for big, lumpy, irregular strings like this you're almost certainly better off doing this in Python.
Still, given this data:
CREATE TABLE dbo.SomeTable(ID int IDENTITY, SomeColumn varchar(500));
INSERT dbo.SomeTable(SomeColumn) VALUES
('this is pure garbage.'),
('this is NOT'),
('this is JUNK'),
('5723452309423 | NA | customer cancelled purchase'
  + '| refund given | 12.3.2021 | approver is BG'),
('Ex 2 - 54986866 | NA | customer order returned'
  + '| refund has been given | 12.4.2021 | AKS');
This query, which I've made intentionally protective and cumbersome to illustrate why SQL Server isn't the place to do this:
;WITH x(ID, str) AS
(
  SELECT ID, REVERSE(JSON_VALUE(x.value, N'$.a'))
  FROM dbo.SomeTable AS s CROSS APPLY OPENJSON
  ('[{"a":"'
    + REPLACE(STRING_ESCAPE(REVERSE(SomeColumn), 'json'), ' ', '"},{"a":"')
    + '"}]')
  AS x WHERE [key] = 0
)
SELECT ID, str FROM x
WHERE LEN(str) IN (2,3)
  AND str COLLATE Latin1_General_BIN
    = UPPER(str) COLLATE Latin1_General_BIN;
Returns this data:
ID | str |
---|---|
2 | NOT |
4 | BG |
5 | AKS |
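For comparison, here is a rough pandas sketch of the same filter on the same five rows. It is an approximation rather than an exact port: str.isupper() also rejects chunks with no letters at all (such as "12"), which the UPPER() comparison in the query would accept. The names `last`, `mask`, and `result` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "SomeColumn": [
        "this is pure garbage.",
        "this is NOT",
        "this is JUNK",
        "5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG",
        "Ex 2 - 54986866 | NA | customer order returned| refund has been given | 12.4.2021 | AKS",
    ],
})

# Last whitespace-separated chunk of each note.
last = df["SomeColumn"].str.rsplit(n=1).str[-1]

# Roughly the conditions of the WHERE clause: length 2 or 3, and upper case.
mask = last.str.len().isin([2, 3]) & last.str.isupper()

# Keep the chunk where the mask holds; other rows become null (NaN).
df["str"] = last.where(mask)

# Only the qualifying rows, like the query's result set.
result = df.loc[mask, ["ID", "str"]]
```

Keeping df["str"] on the full frame gives the "else null" column the question asks for, while `result` mirrors the filtered rows the query returns.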
CodePudding user response:
In SQL you can use this query:
select iif(lastValue = UPPER(lastValue) COLLATE SQL_Latin1_General_CP1_CS_AS, lastValue, null)
from
(select right(rtrim(note), charindex(' ', reverse(rtrim(note)) + ' ') - 1) lastValue from yourtable) b