Hello,
I am cleaning up herbarium specimen records before using them in later analyses. I tried the dedup function in the scrubr package, but I have run into an unusual reproducibility problem.
My dataset consists of ~18,000 rows. I pre-sorted the data by the columns “genus”, “epithet”, “recorded_by_id”, “year”, “month”, “day”, “host”, “country”, and “state_province”, in that order, so that duplicates would be pre-clustered. This makes it easier for me to gauge how well dedup performs at different tolerance levels. I then read the data into R, keeping only the following columns: genus, epithet, scientific_name, recorded_by, year, month, day, host, host_family, cultivated, native_status, continent, country, state_province.
I ran dedup to remove duplicates, using a subset of the records to fine-tune the tolerance parameter. Specifically, there are ~60 records of Basidiophora entospora collected by A. B. Seymour, but he made only 22 collections of that species. I found that dedup(North_American_downy_mildews_tidy, how="one", tolerance=1) correctly reduced the number of records from ~60 to 22. However, I then discovered that I had accidentally sorted by “recorded_by_id” instead of “recorded_by”, so I re-sorted the data and re-ran the script. The re-sort changed the order of the records only minimally. This time, however, dedup(North_American_downy_mildews_tidy, how="one", tolerance=1) reduced the number of Basidiophora entospora records collected by A. B. Seymour from ~60 to 13!
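To help narrow this down, here is a minimal base-R sketch of an order-independent baseline that the two dedup counts could be compared against. It assumes exact matching on all columns via duplicated(). It does not reproduce what dedup does internally. The toy data frame, its values, and the object names (records, sorted_a, sorted_b) are made up for illustration and are not my real data or script.

```r
# Toy stand-in for the real table; every value here is invented for illustration.
records <- data.frame(
  genus       = c("Basidiophora", "Basidiophora", "Basidiophora", "Plasmopara"),
  epithet     = c("entospora", "entospora", "entospora", "viticola"),
  recorded_by = c("A. B. Seymour", "A. B. Seymour", "A. B. Seymour", "A. B. Seymour"),
  year        = c(1884, 1884, 1885, 1884),
  stringsAsFactors = FALSE
)

# Two different sort orders over the same rows.
sorted_a <- records[order(records$genus, records$epithet,
                          records$recorded_by, records$year), ]
sorted_b <- records[order(records$year, records$genus,
                          records$epithet, records$recorded_by), ]

# Exact-match deduplication: the count of unique rows cannot depend on row order.
n_unique_a <- sum(!duplicated(sorted_a))
n_unique_b <- sum(!duplicated(sorted_b))

# Both sort orders must give the same baseline count.
stopifnot(n_unique_a == n_unique_b)
```

If the exact-match count is stable across both sort orders while the dedup count is not, that would suggest the difference comes from how dedup compares rows rather than from the data itself.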
Why would changing the order of the records produce such a difference in the output of dedup?