Home > Back-end >  Extract substrings from irregular text in Excel cell
Extract substrings from irregular text in Excel cell

Time:08-13

I am trying to solve this problem -

If suppose I have text like this in a single column on Excel

@22-atr$$1 AM**01-May-2015&&
$21-atr@10-Jan-2007*6 PM&
&&56-atr@11 PM$$8-Jan-2016*
**4 PM#68-atr@21-Mar-2022&&

and I want to write functions to have separate columns as follows enter image description here

Can someone help me do that please?

I am trying to solve this problem and the only thing that I was able to arrive to is extracting Month by using =MID(A1,FIND("-",A1) 1,3)

CodePudding user response:

One option for formulae would be using new functions, currently available in the BETA-channel for insiders:

enter image description here

Formula in B1:

=LET(A,TEXTSPLIT(A1,{"@","$","&","*","#"},,1),B,SORTBY(A,IFERROR(MATCH(RIGHT(A),{"r","M"},0),3)),C,HSTACK(TAKE(B,,2),TEXTSPLIT(TEXT(--INDEX(B,3),"YYYY-Mmm-D"),"-")),IFERROR(--C,C))

The idea is to:

  • Use LET() throughout to store variables;
  • TEXTSPLIT() the value in column A using all available delimiters into columns and skip empty values in the resulting array;
  • Then SORTBY() the rightmost character of the resulting three elements using MATCH(). The IFERROR() will catch the data-string;
  • We can than HSTACK() the 1st and 2nd column with the result of splitting the 3rd element after we formatted to YYYY-MMM-D first;
  • Finally, the resulting array can be multiplied by a double unary. If not, we replace it with original content from the previous variable.

Notes:

  • I formatted column C to hold time-value in AM/PM.
  • I changed the text to hold dutch month-names to have Excel recognize the dates for demonstration purposes. Should work the same with English names.

For fun an UDF using regular expressions:

Public Function GetPart(inp As String, prt As Long) As Variant
    
    Dim Pat As String
    
    Select Case prt
        Case 0
            Pat = "(\d -atr)"
        Case 1
            Pat = "(\d \s*[AP]M)"
        Case 2
            Pat = "-(\d{4})"
        Case 3
            Pat = "-(\w )-"
        Case 4
            Pat = "(\d )-\w -"
        Case Else
            Pat = ""
    End Select
        
    With CreateObject("vbscript.regexp")
        .Pattern = ".*" & Pat & ".*"
        GetPart = .Replace(inp, "$1")
    End With
        
End Function

Invoke through =GetPart(0,A1). Choices ares 0-4 and in order of your column-headers.

CodePudding user response:

You can achieve what you wish by applying a few simple transformations.

  1. Replace the @,$,* and & with a common character that is guaranteed not to appear in the data sections (e.g. #)
  2. Replace all occurrences of 2 or more runs of the # character with a single #
  3. Trim the # from the start and end of the string
  4. Split the string into an array using # as the split character (vba.split)
  5. use For Each to loop over the array
  6. In the loop have a set of three tests Test 1 tests the string for the ocurrence of "-atr" Test 2 tests the string for the occurence of "-XXX-" where XXX is a three letter month - You then split the date at the - to give an array with Day/Month/Year Test 3 Tests if the string has ' AM' or ' PM'
  • Related