You might have read my blog post analyzing the social weather of rOpenSci onboarding, based on a text analysis of GitHub issues. I extracted text out of Markdown-formatted threads with regular expressions. I basically hammered away at the issues using tools I was familiar with until it worked! Now I know there's a much better and cleaner way, which I'll present in this note. Read on if you want to extract insights about text, code, links, etc. from R Markdown reports, Hugo website sources, GitHub issues… without writing messy and smelly code!
I am especially interested in the JSON approach since parsing JSON is well supported (e.g. rOpenSci has a jqr package!) and something one needs to learn for other applications anyway.
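To make this concrete, here is a minimal sketch of the JSON approach: convert Markdown to pandoc's JSON AST and query it with jqr. This assumes pandoc is on your PATH and the jqr package is installed; the sample Markdown and the jq program are just illustrations.

```r
library(jqr)

# A throwaway Markdown file for demonstration
md <- tempfile(fileext = ".md")
writeLines("Some [link](https://ropensci.org) in a sentence.", md)

# Ask pandoc for its JSON AST representation of the document
json <- paste(
  system2("pandoc", c("-f", "markdown", "-t", "json", md), stdout = TRUE),
  collapse = "\n"
)

# In the AST, a Link element looks like
# {"t": "Link", "c": [attr, inlines, [url, title]]},
# so .c[2][0] extracts the URL of each link.
jq(json, '[.. | objects | select(.t? == "Link") | .c[2][0]]')
```

The nice part is that the same jq query works no matter how the link was written in the source (inline, reference-style, autolink), because pandoc has already normalized everything into the AST.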
rmarkdown::pandoc_convert("all_files.md", to = "markdown", output = "count.md", options = "--filter=count_links.R")
Here count_links.R is an arbitrary script that reads the pandoc JSON AST from stdin and writes JSON to stdout. In this case it's an R script, but it could equally be, say, a .jq script.
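As a hypothetical illustration of what such a filter might look like, here is a hand-rolled count_links.R using jsonlite: it walks the AST counting Link elements, reports the count on stderr (stdout is reserved for the JSON that pandoc reads back), and passes the document through unchanged. The script name and its behavior are my own invention, not something from the original post.

```r
#!/usr/bin/env Rscript
# count_links.R -- a sketch of a pandoc filter in R (assumes jsonlite installed)
library(jsonlite)

# Read the full JSON AST that pandoc pipes in on stdin
ast <- fromJSON(file("stdin"), simplifyVector = FALSE)

# Recursively count elements tagged "Link" anywhere in the tree
count_links <- function(x) {
  if (!is.list(x)) return(0L)
  n <- sum(vapply(x, count_links, integer(1)))
  if (identical(x$t, "Link")) n <- n + 1L
  n
}

# Log to stderr so we don't corrupt the JSON on stdout
message(count_links(ast), " link(s) found")

# Emit the (unchanged) AST for pandoc to continue with
cat(toJSON(ast, auto_unbox = TRUE))
```

For the filter to be picked up via `--filter=count_links.R`, the script needs to be executable (or invoked through Rscript); note also that round-tripping the AST through jsonlite may need care with edge cases such as empty arrays.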
For the record, I tried converting an R Markdown file this way and then writing it back to Markdown, and the output did not exactly match the input. It was worth trying, though.
I wonder whether it is possible to recover the exact input file. Pandoc enables many extensions by default (native_divs, for instance) that could explain these differences, and the markdown writer's default template can also have an impact.
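One way to probe the round-trip question is to go through the JSON AST explicitly and compare the result to the original. The sketch below assumes pandoc is on the PATH; the file names are placeholders.

```r
# Round-trip an R Markdown file through pandoc's JSON AST
# (file names here are hypothetical placeholders)
rmarkdown::pandoc_convert("input.Rmd", to = "json", output = "ast.json")
rmarkdown::pandoc_convert("ast.json", from = "json", to = "markdown",
                          output = "roundtrip.md")

# Any differences likely come from default reader/writer extensions
# and the markdown writer's template, not from lost content
identical(readLines("input.Rmd"), readLines("roundtrip.md"))
```

Disabling individual extensions in the format string (e.g. `to = "markdown-native_divs"`) would be one way to narrow down which defaults cause the discrepancies.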