The other editors and I had a long back-and-forth about whether this fits, and we find it’s an edge case. I’ll sum up our thoughts; others, please chime in:
- Tools that localize or extend accessibility to multiple languages for work in our core areas are a good fit
- Data processing for text mining is in our wheelhouse (though we’ll draw the line at actual analysis, such as topic modeling, because that’s a statistical topic in which we don’t claim expertise)
But there’s another criterion that we haven’t expressed adequately in our policies: generalizability. To avoid a fragmented suite of packages, we much prefer packages that solve a problem generally rather than specifically, when there’s not a large gap in implementation requirements. For instance, if multiple data sources used the same API, we would ask that a package wrapping that API access all of the data sources, rather than just one.
In the case of pstem, we think it accomplishes an important but narrow task that should be readily generalizable: wrapping stemming algorithms for a single language. We could easily envision many near-identical packages for other languages, and we think there is a straightforward path to a general package. It would be one that, based on the language the user specifies, retrieves the relevant dictionary for a hunspell-based algorithm, calls the appropriate SnowballC algorithm, or calls a language-specific algorithm where one exists, as in the case of rslp.
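To make the dispatch idea concrete, here is a rough sketch of what such a general wrapper could look like. This is purely illustrative, not a spec: the `stem()` function name and its fallback order are assumptions, and a real package would need to handle language codes, dictionary availability, and edge cases more carefully.

```r
# Illustrative sketch only: a general stemmer that dispatches on language.
# The function name and dispatch order are hypothetical.
stem <- function(words, language = "en") {
  if (language == "pt") {
    # Prefer a language-specific algorithm where one exists,
    # e.g. rslp for Portuguese
    rslp::rslp(words)
  } else if (language %in% SnowballC::getStemLanguages()) {
    # Snowball covers many languages directly
    SnowballC::wordStem(words, language = language)
  } else {
    # Otherwise fall back to a hunspell-based approach with the
    # relevant dictionary for that language
    hunspell::hunspell_stem(words, dict = hunspell::dictionary(language))
  }
}
```

A design like this would let the community contribute new language-specific backends behind one stable user-facing interface, rather than spawning a separate package per language.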
Based on this, I would come down on the side of this not being a good fit, but if you are interested in implementing a more general approach, we’d be happy to help.
I’ll add some language to our policies page based on this conversation once I get more input. @dfalbel, I hope it’s not discouraging that you fell into this grey area.