I just finished writing a package called ptstem. It is a wrapper for several stemming algorithms available in R for the Portuguese language. I was planning to submit it to rOpenSci, so I was reading the policies in the onboarding repository.
I’m in doubt about whether the package fits. I think it fits the data extraction topic:
data extraction tools that aid in retrieving data from unstructured sources such as text, images and PDFs, as well as scientific data formats and outputs from scientific equipment.
But I’m not sure, since it’s not a tool to retrieve data from any source. Before submitting, I would like to know if it fits.
What do you think?
Hi @dfalbel! Thanks for your question, and thanks for thinking about rOpenSci.
We do have some text mining tools, so it’s possible it could fit. I do have a few questions:
- Does another package already allow this? That is, does ptstem duplicate something else?
- Does this make stemming Portuguese significantly easier than existing tools do?
Thoughts, @noamross?
Thanks, @sckott!
I don’t know of any other package that does stemming specifically for the Portuguese language. You can use SnowballC for the Porter algorithm, hunspell (but you have to download a dictionary, and it returns a list of stems), or the rslp package — each with a different API.
ptstem makes stemming easier by providing a unified API. It already bundles the hunspell dictionary for Portuguese and collapses hunspell’s list of stems into a single output. It also provides a function to assess the quality of a stemmer.
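For context, here is roughly what the three existing entry points look like side by side. This is a hedged sketch: the calls follow each package’s documented interface as I remember it, and the `"pt_PT"` dictionary name is an assumption that depends on which hunspell dictionary you have installed.

```r
library(SnowballC)
library(hunspell)
library(rslp)

words <- c("livros", "livrarias")

# SnowballC: Porter-style stemmer, returns a character vector
SnowballC::wordStem(words, language = "portuguese")

# hunspell: returns a *list* of candidate stems per word, and needs a
# Portuguese dictionary ("pt_PT" here is an assumption) installed separately
hunspell::hunspell_stem(words, dict = hunspell::dictionary("pt_PT"))

# rslp: a stemming algorithm designed specifically for Portuguese
rslp::rslp(words)
```

The differing return types (vector vs. list) are exactly the kind of inconsistency a unified wrapper smooths over.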
Since it may fit, I’ll work on the submission next week.
The other editors and I had a long back-and-forth about whether this fits, and we find it’s an edge case. I’ll sum up our thoughts - others, please chime in:
- Tools that localize or extend accessibility to multiple languages for things in our core areas are good
- Data processing for text mining is in our wheelhouse (though we’ll draw the limit at actual analysis - topic modeling, for instance - because that’s a statistical topic that we don’t claim expertise in.)
But there’s another area that we haven’t expressed adequately in our policies, which is generalizability: In order to avoid a fragmented suite of packages, we much prefer packages that solve a problem generally rather than specifically, when there’s not a large gap in implementation requirements. For instance, if multiple data sources used the same API, we would ask that a package wrapping that API access all the data sources, rather than just one.
In the case of ptstem, we think it accomplishes an important but narrow task that should be easily generalizable: wrapping stemming algorithms for one language. We could easily envision many near-identical packages for other languages, and think there is a straightforward path to a general package. In this case, it would be one that, based on the language input by the user, retrieved the relevant dictionary for the hunspell-based algorithm, called the appropriate SnowballC algorithm, or called an appropriate language-specific algorithm if it exists, as in the case of rslp.
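A minimal sketch of what such a general wrapper could look like; the function name `stem_words` and the dispatch rules are hypothetical, not an existing API:

```r
library(SnowballC)
library(rslp)

# Hypothetical dispatcher: prefer a language-specific algorithm when one
# exists, otherwise fall back to the matching Snowball stemmer
stem_words <- function(words, language = "portuguese") {
  if (language == "portuguese") {
    # rslp implements a Portuguese-specific suffix-stripping algorithm
    rslp::rslp(words)
  } else if (language %in% SnowballC::getStemLanguages()) {
    SnowballC::wordStem(words, language = language)
  } else {
    stop("No stemmer available for language: ", language)
  }
}

stem_words(c("running", "runs"), language = "english")
```

A real implementation would also handle hunspell dictionary retrieval per language, but the dispatch-on-language shape is the core of the generalization.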
Based on this, I would come down on the side of this not being a good fit, but if you are interested in implementing a more general approach, we’d be happy to help.
I’ll add some language to our policies page based on this conversation once I get some more input. @dfalbel, I hope it’s not discouraging that you fell into this grey area.
@noamross Thank you very much for your answer!
I agree with you that if ptstem fits, a lot of packages like enstem would appear, and that doesn’t make sense for rOpenSci. I don’t think I’ll have time soon to implement a more general approach, but I’ll put it on my to-do list for the coming months.
It’s not discouraging at all. I admire the rOpenSci project, and I look forward to contributing.