In praise of Commonmark: wrangle (R)Markdown files without regex

maelle · September 5, 2018, 7:07am

You might have read my blog post analyzing the social weather of rOpenSci onboarding, based on a text analysis of GitHub issues. I extracted text out of Markdown-formatted threads with regular expressions. I basically hammered away at the issues using tools I was familiar with until it worked! Now I know there’s a much better and cleaner way, which I’ll present in this note. Read on if you want to extract insights about text, code, links, etc. from R Markdown reports, Hugo website sources, GitHub issues… without writing messy and smelly code!

Read the full post here: https://ropensci.org/technotes/2018/09/05/commonmark/

RLesur · September 5, 2018, 8:26am

Thanks Maëlle for your blog post! It reminds me of two approaches built into pandoc.

The first one is to use this pandoc Lua filter: https://pandoc.org/lua-filters.html#extracting-information-about-links
After merging all the md files into a single all_files.md, one can run a command like this:

rmarkdown::pandoc_convert("all_files.md", to = "markdown", output = "count.md", options = "--lua-filter=count_links.lua")

where count_links.lua contains the referenced Lua script.

The second approach is to get a JSON version of the pandoc AST:

rmarkdown::pandoc_convert("all_files.md", to = "json", output = "allfiles_ast.json")

This is close to the XML representation produced by commonmark.
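To give an idea of what can be done with that JSON file, here is a minimal sketch that counts links by walking the parsed AST. It is only an illustration: count_nodes() is a made-up helper, the parsing step assumes the jsonlite package is installed, and the code assumes that each pandoc element is a JSON object whose "t" field holds its type (as in recent pandoc versions).

```r
# Hypothetical helper: recursively count elements of a given type in a
# pandoc JSON AST that has been parsed into nested R lists.
count_nodes <- function(node, type = "Link") {
  # Strings and numbers are leaves and cannot be AST elements.
  if (!is.list(node)) {
    return(0L)
  }
  # Assumption: an element is a named list with a "t" (type) field.
  here <- if (!is.null(names(node)) && identical(node[["t"]], type)) 1L else 0L
  # Add the matches found in all children.
  here + sum(vapply(node, count_nodes, integer(1), type = type))
}

# Usage on the file written by pandoc_convert() above; skipped when the
# file or jsonlite is not available.
if (file.exists("allfiles_ast.json") &&
    requireNamespace("jsonlite", quietly = TRUE)) {
  # simplifyVector = FALSE keeps the nested list structure intact.
  ast <- jsonlite::fromJSON("allfiles_ast.json", simplifyVector = FALSE)
  print(count_nodes(ast$blocks, "Link"))
}
```

Passing another type name, for example "CodeBlock", counts those elements instead.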

Regards,

Romain

maelle · September 5, 2018, 8:41am

Thanks Romain, this is very interesting!

I am especially interested in the JSON approach since parsing JSON is well supported (e.g. rOpenSci has a jqr package!) and something one needs to learn for other applications anyway.

Merci again!

noamross · September 5, 2018, 1:27pm

In lieu of Lua, you can also do

rmarkdown::pandoc_convert("all_files.md", to = "markdown", output = "count.md", options = "--filter=count_links.R")

where count_links.R is an arbitrary script that reads the JSON AST from stdin and writes JSON to stdout. In this case it’s R, but it could also be, say, a .jq script.
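As a rough idea of what such a count_links.R could contain, here is a sketch, not a tested filter. It assumes jsonlite is installed, that a Link element stores its content as list(attributes, link text, list(url, title)) as in recent pandoc versions, and that pandoc calls the filter with the output format as first argument. The function names link_targets() and run_filter() are made up for the example. The script only reports on links, so it writes the input JSON back unchanged rather than re-serializing it.

```r
# Hypothetical helper: collect the URL of every Link element found in a
# pandoc JSON AST parsed into nested R lists.
link_targets <- function(node) {
  if (!is.list(node)) {
    return(character(0))
  }
  if (identical(node[["t"]], "Link")) {
    # Assumed layout of "c": list(attributes, link text, list(url, title)).
    return(node[["c"]][[3]][[1]])
  }
  as.character(unlist(lapply(node, link_targets), use.names = FALSE))
}

# Read the AST from stdin, report the links on stderr, and pass the
# document through untouched on stdout.
run_filter <- function() {
  con <- file("stdin")
  on.exit(close(con))
  input <- paste(readLines(con, warn = FALSE), collapse = "\n")
  ast <- jsonlite::fromJSON(input, simplifyVector = FALSE)
  urls <- link_targets(ast$blocks)
  message(length(urls), " link(s) found")  # message() writes to stderr
  for (url in urls) message("  ", url)
  cat(input)                               # unchanged JSON for pandoc
}

# Only run when an argument is supplied, as is the case when pandoc
# invokes the script as a filter.
if (length(commandArgs(trailingOnly = TRUE)) > 0) {
  run_filter()
}
```

Since the report goes to stderr, it shows up in the console while count.md still receives the converted document.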

maelle · September 5, 2018, 1:30pm

My mind is blown by all these nice ways to extract stuff from Markdown files

maelle · September 6, 2018, 11:21am

Just for info, I tried this on an R Markdown file, then wrote the result back to Markdown, and I didn’t get exactly the input file back. It was worth trying though.

RLesur · September 6, 2018, 10:16pm

I wonder whether it is possible to get the exact input file back. Many extensions are enabled by default (native_divs for instance), which could explain these differences. The default template for the Markdown writer can also have an impact.

However, Markdown to Markdown conversion can be a useful trick, as explained in the pandoc wiki here: https://github.com/jgm/pandoc/wiki/Pandoc-Tricks#from-markdown-to-markdown

maelle · September 10, 2018, 9:53am

Thanks, will try again soon!

maelle · September 17, 2018, 3:28pm

WIP package to modify (R)Markdown files without regex: https://github.com/ropenscilabs/tinkr
