How are GitHub markdown anchor links constructed?

Time:06-08

From a GitHub Markdown header

# Söme/title-header_

GitHub's renderer creates the anchor

#sömetitle-header_

Apparently, spaces and / are removed, letters (ASCII and Unicode) are lowercased, and - and _ are preserved.

Is this correct; are there other rules?

CodePudding user response:

GitHub.com's process for converting Markdown heading text to id="" attributes for automatic #fragment links is not defined by any Markdown specification or implementation.

For example, it isn't described in the GitHub Flavored Markdown Spec.

Instead, it's something that GitHub does privately after the initial conversion from Markdown to HTML is complete. This is described in step 4 of GitHub's own README on the topic:

This library is the first step of a journey that every markup file in a repository goes on before it is rendered on GitHub.com:

  1. github-markup selects an underlying library to convert the raw markup to HTML.
  2. The HTML is sanitized, aggressively removing things that could harm you and your kin—such as script tags, inline-styles, and class or id attributes.
  3. Syntax highlighting is performed on code blocks. See github/linguist for more information about syntax highlighting.
  4. The HTML is passed through other filters that add special sauce, such as emoji, task lists, named anchors, CDN caching for images, and autolinking.
  5. The resulting HTML is rendered on GitHub.com.
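To make step 4 concrete, here is a minimal sketch of what a "named anchors" filter could look like when it runs on HTML that has already been rendered and sanitized. The function name add_heading_ids and the regex-based approach are assumptions for illustration only; GitHub's actual filter is not published, and a real filter would use an HTML parser rather than a regex.

```ruby
# Hypothetical filter for step 4: runs on HTML that has already been
# rendered and sanitized, so headings arrive without id attributes.
# Regex-based for brevity; a real filter would use an HTML parser.
def add_heading_ids(html)
  html.gsub(%r{<h([1-6])>(.*?)</h\1>}m) do
    level = Regexp.last_match(1)
    inner = Regexp.last_match(2)
    text = inner.gsub(/<[^>]+>/, "")  # crude textContent: strip nested tags
    id = yield(text)                  # the id scheme is supplied by the caller
    %(<h#{level} id="#{id}">#{inner}</h#{level}>)
  end
end

sanitized = "<h1>Intro</h1>\n<p>Text</p>\n<h2>More <em>detail</em></h2>"
puts add_heading_ids(sanitized) { |text| text.downcase.tr(" ", "-") }
```

The id scheme is passed in as a block because the point here is only the ordering: ids are added after sanitization, so they survive the stripping of id attributes in step 2.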

.md / Markdown files are processed by CommonMarker (a Ruby wrapper around libcmark), which does not include id="" attribute or #fragment generation as a built-in feature. However, the front page of CommonMarker's documentation provides a sample implementation of Markdown header id="" attributes for #fragment links, repeated below:

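require 'commonmarker'

# Parse some Markdown into a document tree to render
# (assumes the pre-1.0 commonmarker gem API)
doc = CommonMarker.render_doc("# First\n\n## Second\n")

class MyHtmlRenderer < CommonMarker::HtmlRenderer
  def initialize
    super
    @headerid = 1
  end

  # Emit each heading with a numeric id, then advance the counter
  def header(node)
    block do
      out("<h", node.header_level, " id=\"", @headerid, "\">",
               :children, "</h", node.header_level, ">")
      @headerid += 1
    end
  end
end

# this renderer prints directly to STDOUT, instead
# of returning a string
myrenderer = MyHtmlRenderer.new
print(myrenderer.render(doc))

# Print any warnings to STDERR
myrenderer.warnings.each do |w|
  STDERR.write("#{w}\n")
end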

The above generates monotonically increasing numeric id="" values, which helps prevent id="" collisions (though it is not a perfect solution), whereas, as you've observed, GitHub prefers to use the header's textContent as the basis for id="" attribute values.

...which means that GitHub is simply doing their own thing when it comes to generating id="" attributes, and there is no published specification for whatever transformation GitHub is using.
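Since there is no specification, the best one can do is approximate the observed behavior. The sketch below reproduces the example from the question: it lowercases the heading text, drops characters other than letters, digits, - and _, and keeps the rest. Two parts are assumptions rather than documented rules: the question's example contains no internal spaces, so turning spaces into hyphens is a guess, and the -1, -2 suffix for repeated headings is just one plausible way to avoid collisions between text-based ids. The names approximate_anchor and unique_anchors are hypothetical.

```ruby
# Hypothetical helper approximating the anchor observed in the question.
# This is NOT GitHub's actual implementation, which is unpublished.
def approximate_anchor(heading_text)
  heading_text
    .downcase                     # Unicode-aware lowercasing (Ruby >= 2.4)
    .gsub(/[^\p{Word}\- ]/, "")   # keep letters, digits, _, - and spaces
    .tr(" ", "-")                 # assumption: spaces become hyphens
end

# Hypothetical collision handling: the first occurrence keeps the plain
# anchor, later occurrences get -1, -2, ... appended.
def unique_anchors(headings)
  seen = Hash.new(0)
  headings.map do |heading|
    base = approximate_anchor(heading)
    count = seen[base]
    seen[base] += 1
    count.zero? ? base : "#{base}-#{count}"
  end
end

puts approximate_anchor("Söme/title-header_")
p unique_anchors(["Setup", "Usage", "Setup"])
```

For anything that must match GitHub exactly, it is safer to check the rendered page than to rely on an approximation like this one.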
