Extract text :after the last white space and that is in capital letters from last , ignoring dot(.)-CodePudding

I have a text column, which basically has kind of notes, and most of the notes end with 2 to 3 capital letters after the last space in the text as in below 2 examples. And I need to extract those last characters which come after the last space into a new column either in pandas or in sql. And they should be extracted only if they are in capital letters, else null.

Ex 1 - 5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG

Ex 2 - 54986866 | NA | customer order returned| refund has been given | 12.4.2021 | AKS

Ex 3 - 54986866 | NA | customer order returned| refund has been given | 12.4.2021 | KSR.

Ex 4 - 54986866 | NA | customer order returned| refund has been given | 12.4.2021 | approved by LS and processed.**

enter image description here enter image description here

CodePudding user response：

If those are just strings then you can use string.split(" ")[-1] to retrieve the last part.

my_str = "5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG"
my_str.split(" ")[-1]

output is "BG" then you can use string.isupper() to check the case.

my_str.split(" ")[-1].isupper()

output is True

CodePudding user response：

df = pd.DataFrame({
    "col1": ["approver is BG", " AKS"]
})

df["col2"] = df["col1"].str.split(" ").str[-1]

df is:

          col1 col2
approver is BG   BG
           AKS  AKS

CodePudding user response：

You can use a regex to explicitly select 2/3 upper case characters in the end:

df['new'] = df['note'].str.extract(r'([A-Z]{2,3}$)')

Or more generally for the last chunk rsplit:

df['new'] = df['note'].str.rsplit('\s ', n=1).str[-1]

CodePudding user response：

If you want to do this in SQL you can ignore how many strings there are by reversing the string before cracking it apart with OPENJSON() and then reversing it again once the last element is extracted. Checking for upper case is a little cumbersome in SQL Server as well; for big lumpy and irregular strings like this you're almost certainly better off doing this in Python.

Still, given this data:

CREATE TABLE dbo.SomeTable(ID int IDENTITY, SomeColumn varchar(500));

INSERT dbo.SomeTable(SomeColumn) VALUES
('this is pure garbage.'),
('this is NOT'),
('this is JUNK'),
('5723452309423 | NA | customer cancelled purchase'
    '| refund given | 12.3.2021 | approver is BG'),
('Ex 2 - 54986866 | NA | customer order returned'
    '| refund has been given | 12.4.2021 | AKS');

This query, which I've made intentionally protective and cumbersome to illustrate why SQL Server isn't the place to do this:

;WITH x(ID, str) AS
(
  SELECT ID, REVERSE(JSON_VALUE(x.value, N'$.a'))
  FROM dbo.SomeTable AS s CROSS APPLY OPENJSON 
    ('[{"a":"' 
        REPLACE(STRING_ESCAPE(REVERSE(SomeColumn), 'json'), ' ', '"},{"a":"') 
        '"}]') 
  AS x WHERE [key] = 0
)
SELECT ID, str FROM x
WHERE LEN(str) IN (2,3) 
  AND     str  COLLATE Latin1_General_BIN 
  = UPPER(str) COLLATE Latin1_General_BIN;

Returns this data:

ID	str
2	NOT
4	BG
5	AKS

Example db<>fiddle

CodePudding user response：

your data in table

declare @a TABLE ( note varchar(max));
INSERT @a(note) VALUES
('5723452309423 | NA | customer cancelled purchase| refund given | 12.3.2021 | approver is BG'),
('54986866 | NA | customer order returned| refund has been given | 12.4.2021 | AKS'),
('58735980490 | NA | order cancelled refund requested| refund processed | 15.4.2021 |processed by : LO'),
('983943795803 | NA |_ refund asked by account owner| refunded')

In SQL you can use this query

select note,iif(lastValue = UPPER(lastValue) COLLATE SQL_Latin1_General_CP1_CS_AS,lastValue, null ) init1 from
(select note,right(rtrim(note),charindex(' ',reverse(rtrim(note)) ' ')-1) lastValue from @a) b