Home > database >  word patterns within an excel column
word patterns within an excel column

Time:08-24

I have 2 Excel data sets each comprising a column of word patterns and have been searching for a way to copy and group all instances of repetition within these columns into a new column.

This is the closest result I could find so far:

Sub Common5bis()
 Dim Joined
 Set d = CreateObject("Scripting.Dictionary")               'make dictionary
 d.CompareMode = 1                                          'not case sensitive
 a = Range("A1", Range("A" & Rows.Count).End(xlUp)).Value   'data to array
 For i = 1 To UBound(a)                                     'loop trough alle records
      If Len(a(i, 1)) >= 5 Then                             'length at least 5
           For l = 1 To Len(a(i, 1)) - 4                    'all strings withing record
                s = Mid(a(i, 1), l, 5)                      'that string
                d(s) = d(s)   1                             'increment
           Next
      End If
 Next

 Joined = Application.Index(Array(d.Keys, d.items), 0, 0)   'join the keys and the items
 With Range("D1").Resize(UBound(Joined, 2), 2)              'export range
      .EntireColumn.ClearContents                           'clear previous
      .Value = Application.Transpose(Joined)                'write to sheet
      .Sort .Range("B1"), xlDescending, Header:=xlNo        'sort descending
 End With

End Sub

Which yielded this result for the particular question: Result

This example achieves 4 of the things I'm trying to achieve:

  1. Identify repeating strings within a single column
  2. Copies these strings into a separate column
  3. Displays results in order of occurrence (in this case from least to most)
  4. Displays the quantity of repetitions (including the first instance) in an adjacent column

However, although from reading the code there are basic things I've figured out that I can adapt to my purposes, it still fails to achieve these essential tasks which I'm still trying to figure out:

  • Identify individual words rather than single characters

    • I could possibly reduce the size from 5 to 3, but for the word stings I have (lists of pronouns from larger texts) that would include "I I" repetitions but won't be so great for "Your You" etc, whilst at least 4 or 5 would miss anything starting with "I I"
  • Include an indefinite amount of values - looking at the code and the replies to the forum it comes from it looks like it's capped at 5, but I'm trying to find a way to identify all repetitions for all multiple word strings which could be something like "I I my you You Me I You my"

  • Is case sensitive - this is quite important as some words in the column have been capitalised to differentiate different uses

I'm still learning the basics of VBA but have manually typed out this example of what I'm trying to do with the code I've found above:

enter image description here

Intended outcome:

Intended Outcome

And so on

I'm a bit screwed at this point which is why I'm reaching out here (sorry if this is a stupid question, I'm brand new to VBA as my work almost never needs Excel, let alone macros) so will massively appreciate any constructive advice towards a solution!

CodePudding user response:

Because I've been working with it recently, I note that you can obtain your desired output using Power Query, available in Windows Excel 2010 and Office 365 Excel

  • Select some cell in your original table
  • Data => Get&Transform => From Table/Range or From within sheet
  • When the PQ UI opens, navigate to Home => Advanced Editor
  • Make note of the Table Name in Line 2 of the code.
  • Replace the existing code with the M-Code below
  • Change the table name in line 2 of the pasted code to your "real" table name
  • Examine any comments, and also the Applied Steps window, to better understand the algorithm and steps

First add a custom function:

  • New blank query
  • Rename per the code comment

Edits to make case-insensitive

Custom Function

//rename fnPatterns
//generate all possible patterns of two words or more
(String as text)=>

let

//split text string into individual words & get the count of words
    #"Split Words" = List.Buffer(Text.Split(String," ")),
    wordCount = List.Count(#"Split Words"),

//start position for each number of words
    starts = List.Numbers(0, wordCount-1),

//number of words for each pattern (minimum of two (2) words in a pattern
    words = List.Reverse(List.Numbers(2, wordCount-1)),

//generate patterns as index into the List and number of words
//   will be used in the List.Range function
    patterns = List.Combine(List.Generate(
        ()=>[r={{0,wordCount}}, idx=0],
        each [idx] < wordCount-1,
        each [r=List.Transform({0..starts{[idx] 1}}, (li)=> {li, wordCount-[idx]-1}), 
                idx=[idx] 1],
        each [r]
    )),

//Generate a list of all the patterns by using the List.Range function
    wordPatterns = List.Distinct(List.Accumulate(patterns, {}, (state, current)=>
        state & {List.Range(#"Split Words", current{0}, current{1})}), Comparer.OrdinalIgnoreCase)        
    
    
in
    wordPatterns

Main Function

let

//change next line to reflect data source
//if data has a column name other than "Column1", that will need to be changed also wherever referenced
    Source = Excel.CurrentWorkbook(){[Name="Table17"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),

//Create a list of all the possible patterns for each string, added as a custom column
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "Patterns", each fnPatterns([Column1]), type list),

//removed unneeded original column of strings
    #"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Column1"}),

//Expand the column of lists of lists into a column of lists
    #"Expanded Patterns" = Table.ExpandListColumn(#"Removed Columns", "Patterns"),

//convert all lists to lower case for text-insensitive comparison
    #"Added Custom" = Table.AddColumn(#"Expanded Patterns", "lower case patterns", 
        each List.Transform([Patterns], each Text.Lower(_))),

//Count number of matches for each pattern
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Count", each List.Count(List.Select(#"Added Custom"[lower case patterns], (li)=> li = [lower case patterns])), Int64.Type),

//Filter for matches of more than one (1)
//  then remove duplicate patterns based on the "lower case pattern" column
    #"Filtered Rows" = Table.SelectRows(#"Added Custom1", each ([Count] > 1)),
    #"Removed Duplicates" = Table.Distinct(#"Filtered Rows", {"lower case patterns"}),

//Remove lower case pattern column and sort by count descending
    #"Removed Columns1" = Table.RemoveColumns(#"Removed Duplicates",{"lower case patterns"}),
    #"Sorted Rows" = Table.Sort(#"Removed Columns1",{{"Count", Order.Descending}}),

//Re-construct original patterns as text
    #"Extracted Values" = Table.TransformColumns(#"Sorted Rows", 
        {"Patterns", each Text.Combine(List.Transform(_, Text.From), " "), type text})
in
    #"Extracted Values"

enter image description here

Note that you could readily implement a similar algorithm using VBA, the VBA.Split function and a Dictionary

  • Related