Extracting hashtags and sections in document in ruby

Time:10-11

I have a markdown text document with several sections; each section's hashtags appear just below its heading. The hashtags have the form #oneword# or #multiple words hashtag#.

I need to extract the sections and their hashtags in Ruby.

Example

# Section 1

#hash1# #hash tag 2# #hashtag3#

Some text

# Section 2

#hash1# #hash tag 4# #hash tag2#


Some text too

I want to get

{"Section 1"=>["#hash1#", "#hash tag 2#", "#hashtag3#"],
 "Section 2"=>["#hash1#", "#hash tag 4#", "#hash tag2#"]}

Can we get it from grep?

CodePudding user response:

My example input:

# Section 1

#hash1# #hash tag 2# #hashtag3#
#more hashes# #and more hashes#

only a # FakeSection

Some text

# Section 2

#hash1# #hash tag 4# #hash tag2#

Some text too

and this code (Ruby 3.1.2p20):

SECTION_REGEX = /^#[^#]*$/
HASH_REGEX = /#[^#]*#/

text = #...

# Iteration section key
key = nil

# Loop all the lines in the text
result = text.split("\n").each_with_object({}) do |line, memo|
  # If matches a section, set the section as the key for your result
  next key = line.delete('#').strip if line.match?(SECTION_REGEX)
  # If there is still no section to append hashes, skip until there is
  next if key.nil?

  # If the code reaches this line, the line sits between two section headings
  # Collect every hashtag on the line into an array
  matches = line.scan(HASH_REGEX)
  # Append them to the current section's array
  (memo[key] ||= []).concat(matches)
end

The result is

{
  "Section 1"=>["#hash1#", "#hash tag 2#", "#hashtag3#", "#more hashes#", "#and more hashes#"],
  "Section 2"=>["#hash1#", "#hash tag 4#", "#hash tag2#"]
}

Just be careful: I wrote these regexes myself without thinking about other markdown constructs, so they might behave unexpectedly on them. They seem to work fine with your example, though.
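One concrete case of the caveat above: a subheading line such as "## Subsection" is not a section for SECTION_REGEX, but HASH_REGEX still finds "##" in it, because it allows zero characters between the two # signs. A minimal sketch of a stricter variant follows; the method name extract_hashtags and both constant names are hypothetical, and requiring at least one character between the # signs is an assumption about what counts as a hashtag.

```ruby
# Hypothetical helper: names and the stricter hashtag regex are assumptions.
SECTION_LINE   = /^#[^#]*$/ # same pattern as SECTION_REGEX in the answer
STRICT_HASHTAG = /#[^#]+#/  # at least one character between the two #s

def extract_hashtags(text)
  key = nil
  text.each_line(chomp: true).each_with_object({}) do |line, memo|
    if line.match?(SECTION_LINE)
      # A new section starts: remember its name and give it an empty list
      key = line.delete('#').strip
      memo[key] ||= []
    elsif key
      # Any other line after the first section: collect its hashtags
      memo[key].concat(line.scan(STRICT_HASHTAG))
    end
  end
end

sample = <<~MARKDOWN
  # Section 1

  #hash1# #hash tag 2#

  ## Subsection

  Some text
MARKDOWN

# With the original pattern the subheading yields one bogus match, "##"
p "## Subsection".scan(/#[^#]*#/)

# With the stricter pattern only "#hash1#" and "#hash tag 2#" are collected
p extract_hashtags(sample)
```

Other constructs (for example # comments inside fenced code blocks) would still be misread as sections; handling those probably needs a real markdown parser rather than line regexes.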

Hope it helps!!

CodePudding user response:

When faced with a problem such as this, I tend to prefer the builder pattern. It is a little verbose, but it is normally very readable and very flexible.

The main idea is that a "reader" scans your input for "tokens" (in this case, lines), and when it finds a token it recognizes, it informs the builder. The builder builds another object based on what the reader reports. Here is an example of a "DocumentBuilder" that takes input from a "MarkdownReader" and builds the Hash that you are looking for.

class MarkdownReader
    attr_reader :builder

    def initialize(builder)
        @builder = builder
    end

    def parse(lines)
        lines.each do |line|
            case line
            when /^#[^#]+$/
                builder.convert_section(line)
            when /^#.+\#$/
                builder.convert_hashtag(line)
            end
        end
    end
end

class DocumentBuilder
    attr_reader :document

    def initialize()
        @document = {}
    end

    def convert_section(line)
        line =~ /^#(.+)$/
        @section_name = $1
        document[@section_name] = []
    end
    
    def convert_hashtag(line)
        hashtags = line.split("#").reject {_1.strip.empty?}
        document[@section_name] += hashtags
    end
end

lines = File.readlines("markdown.md")
builder = DocumentBuilder.new 
reader = MarkdownReader.new(builder)
reader.parse(lines)
p builder.document

    => {" Section 1"=>["hash1", "hash tag 2", "hashtag3"], " Section 2"=>["hash1", "hash tag 4", "hash tag2"]}