In praise of Commonmark: wrangle (R)Markdown files without regex

Author : Maëlle Salmon

You might have read my blog post analyzing the social weather of rOpenSci onboarding, based on a text analysis of GitHub issues. I extracted text out of Markdown-formatted threads with regular expressions. I basically hammered away at the issues using tools I was familiar with until it worked! Now I know there’s a much better and cleaner way, which I’ll present in this note. Read on if you want to extract insights about text, code, links, etc. from R Markdown reports, Hugo website sources, GitHub issues… without writing messy and smelly code!

Read the full post here: https://ropensci.org/technotes/2018/09/05/commonmark/

Thanks Maëlle for your blog post! It reminds me of two approaches built into pandoc.

The first one is to use this pandoc Lua filter: https://pandoc.org/lua-filters.html#extracting-information-about-links
After merging all the md files into a single all_files.md, one can run a command like this:

rmarkdown::pandoc_convert("all_files.md", to = "markdown", output = "count.md", options = "--lua-filter=count_links.lua")

where count_links.lua contains the Lua script from the link above.

The second approach is to get a JSON version of the pandoc AST:

rmarkdown::pandoc_convert("all_files.md", to = "json", output = "allfiles_ast.json")

The result is close to the XML version that commonmark produces.
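
To give an idea of what one can do with the JSON, here is a minimal sketch that walks the AST and collects link targets. It is written in Python with only the standard library, since any JSON-capable language works here. `SAMPLE_AST` and `link_targets` are hypothetical names, and the sample is a hand-written stand-in for a real allfiles_ast.json. It assumes the usual pandoc layout where a Link element stores its target as the third item of its content.

```python
import json

# Hand-written stand-in for a tiny pandoc JSON AST (hypothetical content).
SAMPLE_AST = json.loads("""
{"pandoc-api-version": [1, 22],
 "meta": {},
 "blocks": [
  {"t": "Para", "c": [
    {"t": "Str", "c": "See"},
    {"t": "Space"},
    {"t": "Link", "c": [["", [], []],
                        [{"t": "Str", "c": "rOpenSci"}],
                        ["https://ropensci.org", ""]]}
  ]},
  {"t": "BulletList", "c": [
    [{"t": "Plain", "c": [
      {"t": "Link", "c": [["", [], []],
                          [{"t": "Str", "c": "pandoc"}],
                          ["https://pandoc.org", ""]]}
    ]}]
  ]}
 ]}
""")


def link_targets(node):
    """Recursively collect the URL of every Link element under node."""
    urls = []
    if isinstance(node, dict):
        if node.get("t") == "Link":
            # Link content is [attributes, link text inlines, [url, title]].
            urls.append(node["c"][2][0])
        for value in node.values():
            urls.extend(link_targets(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(link_targets(item))
    return urls
```

On a real file, one would replace the sample with `json.load(open("allfiles_ast.json"))` and pass the result to `link_targets()`.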

Regards,

Romain

Thanks Romain, this is very interesting! :ok_hand:

I am especially interested in the JSON approach since parsing JSON is well supported (e.g. rOpenSci has a jqr package!) and something one needs to learn for other applications anyway.

Merci again! :grinning:

In lieu of Lua, you can also do

rmarkdown::pandoc_convert("all_files.md", to = "markdown", output = "count.md", options = "--filter=count_links.R")

where count_links.R is an arbitrary script that reads the JSON AST from stdin and writes JSON to stdout. In this case it’s R, but it could also be, say, a .jq script.
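
To make the stdin/stdout contract concrete, here is a rough sketch of such a filter in Python (standard library only) rather than R or jq. The names `iter_elements` and `run_filter` are made up for illustration. The sketch counts Link elements, reports the count on a separate log stream, and passes the document through unchanged, so the streams are function arguments and the logic can be tried without pandoc.

```python
import json
import sys


def iter_elements(node):
    """Yield every AST element (a dict carrying a "t" type tag) under node."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def run_filter(source, sink, log):
    """Read a JSON AST from source, report its link count, write JSON to sink."""
    doc = json.load(source)
    n_links = sum(1 for el in iter_elements(doc) if el["t"] == "Link")
    # The sink carries the document, so the report goes to the log stream.
    print(f"links: {n_links}", file=log)
    json.dump(doc, sink)
    return n_links

# To use this as an actual filter script, the file would end with:
#     run_filter(sys.stdin, sys.stdout, sys.stderr)
```

The count goes to the log stream because pandoc reads the filter's stdout as the (possibly modified) document.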

My mind is blown by all these nice ways to extract stuff from Markdown files :exploding_head:

Just for info, I tried this on an R Markdown file, then wrote the result back to Markdown, and I didn’t get exactly the input file back. :cry: It was worth trying though.

I wonder whether it is possible to get the exact input file back. Many extensions are enabled by default (native_divs for instance), which could explain these differences. The default template for the markdown writer can also have an impact.

However, markdown-to-markdown conversion can be a useful trick, as explained in the pandoc wiki here: https://github.com/jgm/pandoc/wiki/Pandoc-Tricks#from-markdown-to-markdown

Thanks, will try again soon! :nerd_face:

WIP package to modify (R)Markdown files without regex: https://github.com/ropenscilabs/tinkr