Question about `dedup` in scrubr

I am working on cleaning up herbarium specimen records before I use them for later analyses. I tried to use the dedup function in the scrubr package, but I have run into an unusual replication problem.

My dataset consists of ~18,000 rows. I pre-sorted the data by the columns “genus”, “epithet”, “recorded_by_id”, “year”, “month”, “day”, “host”, “country”, and “state_province”, in that order, so that duplicates would be pre-clustered. This makes it easier for me to gauge how well dedup performs at different tolerance levels. I then read the data into R, keeping only the following columns: genus, epithet, scientific_name, recorded_by, year, month, day, host, host_family, cultivated, native_status, continent, country, state_province.

I ran dedup to remove duplicates, using a subset of the records to fine-tune the tolerance parameter. Specifically, there are ~60 records of Basidiophora entospora collected by A. B. Seymour, but he only made 22 collections of that species. I found that `dedup(North_American_downy_mildews_tidy, how="one", tolerance=1)` correctly reduced the number of records from ~60 to 22. I then discovered that I had accidentally sorted by “recorded_by_id” instead of “recorded_by”, so I re-sorted the data and re-ran the script. The re-sort changed the order of the records only minimally. However, this time `dedup(North_American_downy_mildews_tidy, how="one", tolerance=1)` reduced the number of Basidiophora entospora records collected by A. B. Seymour from ~60 to 13!

Why would changing the order of the records produce such a difference in output from dedup?

Thanks for your question!

Can you share some of the data, including the rows that have duplicates, so that I can try to reproduce the issue? If you can’t share it publicly, you can DM me here with the file.

The work of comparing records is done by qlcMatrix::sim.strings, so I imagine I will need to look at how that works to answer this.

Yes, I can share the data; it’s public anyway. I am also willing to share the R script. Should I upload these to GitHub and provide a link? (I’m not sure of the best way to share data on here.) Do you want the whole dataset or just the rows I was using to gauge the success of dedup?

I uploaded the data and the scripts here: legendary-chainsaw/downy_midlews at main · wjdavis90/legendary-chainsaw · GitHub

Perfect, thanks. I’ll have a look soon.

Sorry it took me so long to get back to you.

I downloaded files from that repo.

I ran through both 20210209-7_script.R and 20210209-8_script.R, then subset to records with

dplyr::filter(w, cur_scientific_name == "Basidiophora entospora", recorded_by == "A B Seymour")

I included recorded_by_id, but that column is always empty, so I’m not sure how it would affect the dedup() operation. Maybe you can share a completely reproducible example with just the Basidiophora entospora / A B Seymour subset for me to test?
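
Something along these lines would be enough for me to test. This is only a sketch. The toy data frame and the `compare_orders()` helper are made up for illustration, and `unique()` stands in for the deduplication step so the snippet runs without scrubr. To reproduce the issue, swap in `function(d) scrubr::dedup(d, how = "one", tolerance = 1)` (assuming `nrow()` works on what it returns) and use the real subset instead of the toy rows.

```r
# Toy stand-in for the subset; all values are invented for illustration only.
toy <- data.frame(
  scientific_name = rep("Basidiophora entospora", 5),
  recorded_by     = rep("A B Seymour", 5),
  recorded_by_id  = rep("", 5),
  year            = c(1885, 1884, 1885, 1884, 1886),
  host            = c("Aster", "Erigeron", "Aster", "Erigeron", "Solidago"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort `df` by each set of columns, run `dedup_fun`
# on the sorted rows, and report how many rows survive under each ordering.
compare_orders <- function(df, orderings, dedup_fun) {
  vapply(orderings, function(cols) {
    sorted <- df[do.call(order, unname(as.list(df[cols]))), , drop = FALSE]
    nrow(dedup_fun(sorted))
  }, integer(1))
}

# The two sort orders to compare: one uses recorded_by_id, one recorded_by.
orderings <- list(
  by_id   = c("scientific_name", "recorded_by_id", "year", "host"),
  by_name = c("scientific_name", "recorded_by", "year", "host")
)

# unique() only drops exact duplicate rows and does not depend on row order,
# so both counts agree here; a mismatch with dedup() would show the problem.
counts <- compare_orders(toy, orderings, unique)
print(counts)
```

If the two counts differ once `dedup()` is plugged in, that gives a small, shareable case showing that the result depends on row order.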