Question about `dedup` in scrubr

I am working on cleaning up herbarium specimen records before I use them for later analyses. I tried to use the dedup function in the scrubr package, but I have run into an unusual replication problem.

My dataset consists of ~18,000 rows. I pre-sorted the data by the columns `genus`, `epithet`, `recorded_by_id`, `year`, `month`, `day`, `host`, `country`, and `state_province`, in that order, so that duplicates would be clustered together. That made it easier for me to gauge how well `dedup` performed at different tolerance levels. I then read the data into R, keeping only the following columns: `genus`, `epithet`, `scientific_name`, `recorded_by`, `year`, `month`, `day`, `host`, `host_family`, `cultivated`, `native_status`, `continent`, `country`, `state_province`.

I ran `dedup` to remove duplicates, using a subset of the records to fine-tune the `tolerance` parameter. Specifically, there are ~60 records of Basidiophora entospora collected by A. B. Seymour, but he made only 22 collections of that species. I found that `dedup(North_American_downy_mildews_tidy, how="one", tolerance=1)` correctly reduced the number of records from ~60 to 22.

However, I then discovered that I had accidentally sorted by `recorded_by_id` instead of `recorded_by`, so I re-sorted the data and re-ran the script. The re-sort produced only minimal changes in the order of the records, yet this time the same call, `dedup(North_American_downy_mildews_tidy, how="one", tolerance=1)`, reduced the number of Basidiophora entospora records collected by A. B. Seymour from ~60 to 13!
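
Before comparing the two `dedup` results, it may be worth confirming that the two inputs really contain the same rows and differ only in order. A minimal base R sketch of that check, under stated assumptions: the data frames `by_id` and `by_name`, their values, and the helper `canonical` are hypothetical stand-ins, not the real dataset or anything from scrubr.

```r
# Toy stand-ins for the two differently sorted inputs (made-up values).
by_id <- data.frame(
  recorded_by = c("A", "B", "A"),
  year        = c(1885, 1890, 1886),
  stringsAsFactors = FALSE
)
by_name <- by_id[order(by_id$recorded_by, by_id$year), ]

# Hypothetical helper: sort a data frame by all of its columns and reset
# the row names, so that only the content is compared, not the row order.
canonical <- function(df) {
  out <- df[do.call(order, unname(as.list(df))), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# TRUE means the two inputs hold the same rows and differ only in order.
same_rows <- identical(canonical(by_id), canonical(by_name))
same_rows
```

If this kind of check returns `TRUE` for the real data, the difference in output can only come from row order and not from any accidental change to the records themselves.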

Why would changing the order of the records produce such a difference in output from dedup?

Thanks for your question!

Can you share some of the data, including the rows that have duplicates, so that I can try to reproduce the issue? If you can’t share publicly, you can DM me here with the file.

The work of comparing records is done by `qlcMatrix::sim.strings`, so I imagine I will need to look at how that comparison is done to answer this.
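
In the meantime, here is a toy sketch of how row order can matter in similarity-based deduplication in general. It is an assumption-laden illustration and not scrubr's code: `greedy_dedup` is a hypothetical helper, it uses base R's `utils::adist` (edit distance) in place of `qlcMatrix::sim.strings`, and the record strings are made up. The point is only that "similar enough" is not transitive, so which record gets kept first can change what gets dropped later.

```r
# Hypothetical helper for illustration only; NOT how scrubr::dedup works internally.
greedy_dedup <- function(x, max_dist = 1) {
  kept <- character(0)
  for (s in x) {
    # Keep s only if it is more than max_dist edits away from every string kept so far.
    if (length(kept) == 0 || all(utils::adist(s, kept) > max_dist)) {
      kept <- c(kept, s)
    }
  }
  kept
}

# Made-up record strings: "coll-12" is one edit away from each of the others,
# but "coll-11" and "coll-22" are two edits apart.
order_a <- c("coll-11", "coll-12", "coll-22")
order_b <- c("coll-12", "coll-11", "coll-22")

kept_a <- greedy_dedup(order_a)  # "coll-11" "coll-22"
kept_b <- greedy_dedup(order_b)  # "coll-12"
```

The same three strings give two survivors in one order and one in the other. Whether anything like this is behind the 22 vs. 13 result would need checking against the actual data.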

Yes, I can share the data; it’s public anyway. I am also willing to share the R script. I can upload both to GitHub and provide a link. (I’m not sure of the best way to share data on here.) Do you want the whole dataset or just the rows I was using to gauge the success of `dedup`?

I uploaded the data and the scripts here: legendary-chainsaw/downy_midlews at main · wjdavis90/legendary-chainsaw · GitHub

Perfect, thanks. I’ll have a look soon.