rOpenSci package or resource used*
Review template from the package development guide
Bonus: gert to automatically retrieve the commit hash of the reviewed version (`gert::git_commit_id()`)
What did you do?
Informal package review for teammates before submission to a journal.
URL or code snippet for your use case*
## Preface
This is an informal review conducted by a lab member. To ensure maximal objectivity, the [rOpenSci review template](https://devguide.ropensci.org/reviewtemplate.html) is used. This template also guarantees that this package is following the most up-to-date and strictest standards available in the R community.
The template is released under CC-BY-NC-SA and this review is therefore published under the same license.
The review was finished on 2021-07-27 and concerns the version 0.1.7.2 of scoringutils (commit de45fb7d78e1a334fe9e24cf3435f648c82a3cb8).
## Package Review
#### Documentation
The package includes all the following forms of documentation:
- [x] **A statement of need** clearly stating problems the software is designed to solve and its target audience in README
> The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values.
- [x] **Installation instructions:** for the development version of package and any non-standard dependencies in README
- [x] **Vignette(s)** demonstrating major functionality that runs successfully locally
- [x] Multiple plots are impossible to read in the CRAN-rendered version of the vignette. The plot dimensions should be adjusted.
- [x] I would refrain from using dispensable non-base packages in the package as it introduces extra cognitive load for users. In addition to learning how to use scoringutils, they now possibly have to understand how to use whichever non-base package you use. (PR \#119)
- [x] I recommend writing function names as `fct()` in the documentation. It is then easier to tell functions apart from other objects and it enables auto-linking to the function documentation on the pkgdown website. (PR \#119)
- [x] **Function Documentation:** for all exported functions
- Good job writing what is the default argument value for many functions!
- [x] I recommend using the roxygen2 markdown syntax (done in \#115)
- [x] More functions should inherit documentation chunks to reduce risks of errors / outdated docs when updating
- [x] It seems strange that `eval_forecasts_quantile()` doesn't get the same treatment as, e.g., `eval_forecasts_binary()` and at least some basic documentation.
- [x] There is some inconsistency in formatting when using mentions such as "only required if `plot = TRUE`". Sometimes it reads "only required if `plot == TRUE`", but not always. I recommend switching all these statements to the phrasing "if `plot = TRUE`", which seems more common in the R community.
- [x] There is a minor issue with the equation rendering in the pkgdown website (e.g., <https://epiforecasts.io/scoringutils/reference/abs_error.html>). The solution is probably to pass both a LaTeX/mathjax and an ASCII version of the equation to `\deqn{}`.
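For reference, `\deqn{}` accepts an optional second argument with a plain-text version that is used when LaTeX cannot be rendered. A minimal sketch for `abs_error()` (the wording is mine, not taken from the package docs):

```r
#' @details
#' The absolute error is computed as
#' \deqn{\textrm{score} = |\textrm{prediction} - \textrm{true value}|}{score = |prediction - true value|}
```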
- [x] The `metrics` argument should point to the `list_of_avail_metrics()` function so users know what their choices are.
- [x] There might be a typo in the documentation of the `summarised` argument in `eval_forecasts()`. Do you mean `summarise_by` instead of `group_by`? <https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/eval_forecasts.R#L109-L110>
- [x] I do not really understand the documentation of the `test_options` argument in `compare_two_models()`. It reads:
> list with options to pass down to `compare_two_models`
But we are already in the documentation of `compare_two_models()` and I don't see any further documentation of this argument or of which values are possible.
- [x] I don't understand the documentation of the `rel_skill_metric` argument in `eval_forecasts()`. It reads:
> If equal to "auto" (the default), then one of interval score, crps or brier score will be used where appropriate
What do you mean by "where appropriate"? How do you choose which one of these scores is used?
- [ ] Aren't some metrics missing from the `@details` in `eval_forecasts()`? For example, where is the coverage?
- [x] **Examples** (that run successfully locally) for all exported functions
Some functions are missing examples:
- [x] `compare_two_models()`
- [x] `pairwise_comparison_one_group()`
- [x] `quantile_to_long()`
- [x] `quantile_to_wide()`
- [x] `range_to_quantile()`
- [x] `quantile_to_range()`
- [x] `sample_to_range()`
- [x] `merge_pred_and_obs()`
- [x] `hist_PIT()`
- [x] `hist_PIT_quantile()`
- [x] **Community guidelines** including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
- [x] There is no CONTRIBUTING guide. Adding one could signal more explicitly that external contributions are welcome.
- [x] The link to the pkgdown website should be added in `DESCRIPTION` (alongside the URL to the GitHub repo). This helps users looking for documentation and is also necessary to enable automatic linking from other pkgdown websites (done in PR \#118).
#### Functionality
- [x] **Installation:** Installation succeeds as documented.
- [x] **Functionality:** Any functional claims of the software have been confirmed.
- I did not verify that the package works for R versions as old as 2.10 but, upon cursory glance, I could not find any use of the "newer" base R functions (listed in <https://github.com/r-lib/backports>).
- <s>**Performance:** Any performance claims of the software been confirmed.</s>
- [ ] **Automated tests:** Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Regarding continuous integration, the package uses the current best practices in the community: GitHub actions with tests on various OS and R versions.
- [ ] However, testing is the weakest point of this package. The coverage at the time of writing this review is 48.9%, when a package of this type should aim for around 90% (excluding plotting functions that are usually more difficult to integrate in tests). Additionally, this value is somewhat inflated by "weak" tests that only check that the function produces an output but do not check that output. A good practice for a package of this type would be to compare its output to "known true values" from the literature, ideally from the papers where these scores are defined.
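To make that last point concrete, here is a minimal sketch of such a "known value" test. It assumes the `brier_score(true_values, predictions)` interface; the reference values are computed by hand from the textbook definition, and `mean()` on both sides keeps the check valid whether the function returns per-forecast scores or an already averaged value.

```r
library(testthat)
library(scoringutils)

test_that("brier_score() matches the textbook definition", {
  true_values <- c(0, 1, 1)
  predictions <- c(0.25, 0.75, 0.5)
  # per-forecast Brier score: squared difference between probability and outcome
  reference <- (predictions - true_values)^2
  expect_equal(mean(brier_score(true_values, predictions)), mean(reference))
})
```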
- [x] **Packaging guidelines**: The package conforms to the rOpenSci packaging guidelines
- In terms of interface and what functionality is exposed to users, I think this package does exactly the right thing by providing access to both the low-level individual scores and to the higher-level `eval_forecasts()`.
- [x] There are some inconsistencies in naming. I mostly noticed this while creating the pkgdown reference index (PR \#113). With just the function/object names, it was difficult to infer what each function would return. I would recommend using the prefix `example_` for all example data objects: it is present in all names currently, but at a different position in many of them. Similarly, it would be helpful to have the plotting functions start with `plot_` or something of the sort.
- It would be useful to tag each release on GitHub as some users might rely on this to get informed of a new release. It is made easier with the `usethis::use_github_release()` function.
- [x] News items do not appear for some versions on <https://epiforecasts.io/scoringutils/news/index.html>. This is because the version number and the subsection have the same heading level. I submitted a fix in \#116.
- [x] The README (`README.md`) and the CHANGELOG (`NEWS.md`) should be added to the package tarball so they are available with the package on CRAN. This is done by removing these files from `.Rbuildignore` (done in \#117)
- [x] It should be easy to remove the dependency on forcats and I would be inclined to remove it, as it introduces an extra potential source of breakage for a very minimal gain (as opposed to the dependencies on data.table or ggplot2). But since all the recursive dependencies of forcats are also dependencies of ggplot2, I would understand if the lead developer decides it's not worth the trouble.
- [x] It is generally recommended to include a "Similar projects" section in the README where you can highlight the differences and strengths of this tool against the existing alternatives. In this specific case, I would be interested in hearing about the differences with scoringRules, which is one of the dependencies.
Estimated hours spent reviewing: 13h
------------------------------------------------------------------------
### Review Comments / Code Review
- [x] I noticed some occurrences of `bool == TRUE`, which doesn't really make sense and makes the code more difficult to read. If the object is already a logical, you can use it as-is in `if`. For example:
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L86-L90>
could be simplified as
continuous_predictions <- !isTRUE(all.equal(as.vector(predictions), as.integer(predictions)))
Some other occurrences:
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/eval_forecasts_continuous_integer.R#L52>
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L157>
- Good job on specifying the type of your `NA` (`NA_real_`, `NA_integer_`, etc.)!
- [x] I *think* there is some inconsistency in type / value checking before computation. For example, in `brier_score()`, there is a check that `predictions` takes values between 0 and 1 but this check is not present in, e.g., `bias()`. Would it make sense to have `check_truth()`, `check_predictions()` internal functions that you call each time?
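A minimal sketch of what such a shared checker could look like (names, messages and checks are illustrative, not the package's actual internals):

```r
check_predictions <- function(predictions, range = NULL) {
  if (!is.numeric(predictions)) {
    stop("`predictions` must be numeric.", call. = FALSE)
  }
  # optional range check, e.g. c(0, 1) for probability forecasts
  if (!is.null(range) &&
      any(predictions < range[1] | predictions > range[2], na.rm = TRUE)) {
    stop("`predictions` must lie between ", range[1], " and ", range[2], ".",
         call. = FALSE)
  }
  invisible(predictions)
}

# brier_score() would then call check_predictions(predictions, range = c(0, 1)),
# while bias() would call check_predictions(predictions) without a range.
```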
- [x] Would it be worth printing a warning when the user requests "coverage" in `eval_forecasts_sample()` instead of silently dropping it?
- [ ] As mentioned in another discussion, there is some inconsistency in the use of data.table and modifying in place vs copying. Beyond the stylistic issue, this is a possible source of bugs so I'd recommend sticking to one or the other.
- [x] Some functions have a very large number of arguments, which makes them difficult to use. Research in software engineering tends to suggest that the number of arguments should not exceed ~7. As the function is complex, it may be difficult to reduce the number of arguments, but here are some possible ways:
- [x] drop the `verbose` argument. Either the diagnostic messages are unnecessary and should be dropped entirely, or they are useful and should be printed. If users really don't want to see messages/warnings, they can use the base functions `suppressMessages()` / `suppressWarnings()`. This could also be controlled by a global switch in `options()`, as usethis does (see the sketch after this list).
- [x] plotting side effects could be removed. The primary goal of `eval_forecasts()` is to return a data.frame with the scores. Users that want the plot could call another function afterwards. This would allow the removal of the `pit_plots` argument.
- [x] remove the possibility of having either a single `data` or `forecasts`, `truth_data` and `merge_by` as inputs.
- [x] get rid of `summarised` and act as if `summarised = TRUE` if `by != summarise_by` (is this \#106?)
I understand the wish to provide flexibility but since `eval_forecasts()` is meant to be a high-level wrapper / one-liner to compute everything, I believe it's okay to provide a limited interface. Users who strive for more flexibility can always use the low-level individual scoring functions.
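A sketch of the `options()`-based switch mentioned above (the option name and helpers are hypothetical, not something scoringutils currently provides):

```r
# Hypothetical option name, in the spirit of the usethis.quiet option
scoringutils_quiet <- function() {
  isTRUE(getOption("scoringutils.quiet", default = FALSE))
}

inform <- function(...) {
  if (!scoringutils_quiet()) {
    message(...)
  }
}

# Users who never want diagnostic messages would set, once per session:
# options(scoringutils.quiet = TRUE)
```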
- [x] I *think* there is a misplaced closing parenthesis here and you mean
if (length(cols_to_delete) > 1) {
instead of the current:
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L165>
Same here:
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L220>
If this is indeed a bug, its presence twice in the same file suggests this code portion should be refactored as a function (isn't that the purpose of `delete_columns()` already?). By the way, why is it `> 1` in one case and `> 0` in the other?
- [x] I cannot say for sure because, as mentioned previously, I don't understand the documentation of the `test_options` argument in `compare_two_models()`, but this selection of the first element (`[1]`) does not seem super robust: <https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L373>
- [x] There are some occurrences where loops (`vapply()`) are used where you could rely on vectorised functions / linear algebra for much faster computation and more readable code (fixed in \#120):
- [ ] <https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L95-L99>
- [ ] <https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L106-L110>
- [ ] <https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L169-L173>
- [ ] <https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L193-L197>
predictions <- matrix(rnorm(100), ncol = 10, nrow = 10)
true_values <- rnorm(10)
microbenchmark::microbenchmark(
"vapply" = { vapply(seq_along(true_values), function(i) sum(predictions[i, ] <= true_values[i]), numeric(1)) },
"vector" = rowSums(predictions <= true_values),
check = "identical"
)
## Unit: microseconds
## expr min lq mean median uq max neval
## vapply 15.492 16.4605 17.74523 16.8215 17.302 47.741 100
## vector 5.161 5.7475 6.24385 5.9215 6.089 34.897 100
- [x] In ggplot2 plots with the `facet_wrap_or_grid` argument, I would change the default value to `c("facet_wrap", "facet_grid")` and start the function with:
facet_wrap_or_grid <- match.arg(facet_wrap_or_grid)
Currently, a very minor and inconspicuous typo such as `"facet_warp"` would make it silently switch to `facet_grid` and it would be very difficult to notice the mistake.
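A small sketch of the behaviour this buys (the function name is a placeholder, not an actual scoringutils function):

```r
plot_something <- function(facet_wrap_or_grid = c("facet_wrap", "facet_grid")) {
  facet_wrap_or_grid <- match.arg(facet_wrap_or_grid)
  facet_wrap_or_grid
}

plot_something()                 # "facet_wrap" (the first value is the default)
plot_something("facet_grid")     # "facet_grid"
# plot_something("facet_warp")   # error: 'arg' should be one of "facet_wrap", "facet_grid"
```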
- [x] All functions in `scoringRules_wrappers.R` seem to have the same checks at the beginning. It would be less error-prone to refactor this.
- [x] This is not a robust way to get the value of an argument with defaults:
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils.R#L208>
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils.R#L220>
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils_data_handling.R#L422-L430>
Instead, you should use:
comparison_mode <- match.arg(comparison_mode)
...
if (comparison_mode == "ratio") { ... }
- Deprecated functions from `utils_data_handling.R` should be categorised as such in the pkgdown reference index.
- There is no actual unit test here since this line is not wrapped in a testthat expectation:
<https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/tests/testthat/test-bias.R#L48>
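A hedged sketch of how that line could be turned into an explicit expectation (the signature and return shape of `bias()` are assumed here, so the exact expectations are illustrative):

```r
test_that("bias() returns one numeric score per observation", {
  true_values <- sample(1:10, size = 5)
  # 5 observations with 20 predictive samples each (rows = observations)
  predictions <- replicate(20, sample(1:10, size = 5))
  out <- bias(true_values, predictions)
  expect_true(is.numeric(out))
  expect_length(out, length(true_values))
})
```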
### Conclusion
This is overall a solid package that could become a widely used tool in forecast sciences. I could not see any bugs in the code and the performance looks very good on the examples I ran. The package interface is clever and can surely prove useful to a large array of users thanks to the two levels of functions (low-level scoring functions vs all-in-one `eval_forecasts()`).
Two points could slow down / reduce adoption and these should be fixed for this package to reach its full potential and attract as many users as possible:
- the package remains complex to use. This complexity is in part inherent to the task but it could nonetheless be reduced by following best practices in software engineering such as reducing the number of parameters and adopting a consistent naming scheme.
- there is no strong evidence that this package correctly implements the metrics it computes. This is especially important for fields that can have a policy impact. Test coverage should be increased and comparisons to computations via other tools / methods should be added.
Sector
academic
Field(s) of application
epidemiology
Comments
I would find it useful to provide this template as an Rmd file instead of md. It would allow adding code snippets showing, e.g., that a given pattern / structure is more performant than another (a small `microbenchmark::microbenchmark()` chunk), the current code coverage (`covr::percent_coverage(covr::package_coverage())`), or the current git commit hash (`gert::git_commit_id()`). I achieved this by adding:
---
title: "review"
output:
  rmarkdown::md_document:
    pandoc_args: [
      "--wrap=none"
    ]
---
at the beginning of the provided template.
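The values mentioned above can then be filled in with inline R chunks in the Rmd, for example (assuming the Rmd is knitted from the root of the reviewed package):

```md
The review was finished on `r Sys.Date()` and concerns commit `r gert::git_commit_id()`.
Test coverage at the time of the review: `r round(covr::percent_coverage(covr::package_coverage()), 1)`%.
```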
Twitter handle
Package authors: @nikosbosse, @seabbs, @sbfnk
Package reviewer: @grusonh