Question about `dedup` in scrubr

wjdavis90 · February 10, 2021, 8:09pm

Hello,
I am working on cleaning up herbarium specimen records before I use them for later analyses. I tried to use the dedup function in the scrubr package, but I have run into an unusual replication problem.

My dataset consists of ~18,000 rows. I pre-sorted the data by the columns “genus”, “epithet”, “recorded_by_id”, “year”, “month”, “day”, “host”, “country”, and “state_province”, in that order, so that duplicates would already be clustered together. This would make it easier for me to gauge how well dedup performed at different tolerance levels. I then read the data into R, keeping only the following columns: genus, epithet, scientific_name, recorded_by, year, month, day, host, host_family, cultivated, native_status, continent, country, state_province.

I ran dedup to remove duplicates, using a subset of the records to fine-tune the tolerance parameter. Specifically, there are ~60 records of Basidiophora entospora collected by A. B. Seymour, but he only made 22 collections of that species. I found that `dedup(North_American_downy_mildews_tidy, how="one", tolerance=1)` correctly reduced the number of records from ~60 to 22. However, I then discovered that I had accidentally sorted by “recorded_by_id” instead of “recorded_by”. So I re-sorted the data and re-ran the script. The re-sort changed the order of the records only minimally. However, this time the same call, `dedup(North_American_downy_mildews_tidy, how="one", tolerance=1)`, reduced the number of Basidiophora entospora records collected by A. B. Seymour from ~60 to 13!

Why would changing the order of the records produce such a difference in output from dedup?

sckott · February 10, 2021, 8:56pm

Thanks for your question!

Can you share some of the data, including the rows that have duplicates, so that I can try to reproduce the issue? If you can’t share it publicly, you can DM me here with the file.

The work of comparing records is done by qlcMatrix::sim.strings, so I imagine I’ll need to look at how that is done to answer this.
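As a general illustration of how row order can change the result of similarity-based deduplication, here is a toy sketch in base R. It is not scrubr's or qlcMatrix's actual code: the function `greedy_dedup`, the `max_dist` threshold, and the example strings are all hypothetical, and it uses edit distance via `utils::adist` rather than `sim.strings`. The point is only that "similar enough" is not transitive, so a keep-the-first-match strategy can give different answers for different orderings.

```r
# Toy greedy de-duplication (hypothetical, NOT scrubr's implementation):
# keep a string only if it is more than `max_dist` edits away from
# every string that has already been kept.
greedy_dedup <- function(x, max_dist = 1) {
  kept <- character(0)
  for (s in x) {
    if (length(kept) == 0 || all(utils::adist(s, kept) > max_dist)) {
      kept <- c(kept, s)
    }
  }
  kept
}

# Made-up records: the 1st and 2nd differ by one edit, the 2nd and 3rd
# differ by one edit, but the 1st and 3rd differ by two edits.
recs <- c("aaaa", "aaab", "aabb")

order_1 <- greedy_dedup(recs)               # original order
order_2 <- greedy_dedup(recs[c(2, 1, 3)])   # first two rows swapped

length(order_1)
length(order_2)
```

In this sketch the original order keeps two strings while the swapped order keeps one, because in the second case the "middle" string is seen first and absorbs both neighbours. Whether anything like this happens inside dedup() would need checking against the real code.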

wjdavis90 · February 11, 2021, 2:56pm

Thanks!
Yes, I can share the data; it’s public anyway. I am also willing to share the R script. Could I upload these to GitHub and provide a link? (I’m not sure of the best way to share data on here.) Do you want the whole dataset or just the rows I was using to gauge the success of dedup?

wjdavis90 · February 11, 2021, 3:37pm

Hello,
I uploaded the data and the scripts here: legendary-chainsaw/downy_midlews at main · wjdavis90/legendary-chainsaw · GitHub

sckott · February 11, 2021, 11:42pm

Perfect, thanks. I’ll have a look soon.

sckott · March 17, 2021, 9:37pm

Sorry it took me so long to get back to you.

I downloaded files from that repo.

I ran through both 20210209-7_script.R and 20210209-8_script.R, then subset to the records with:

dplyr::filter(w, cur_scientific_name == "Basidiophora entospora", recorded_by == "A B Seymour")

I included recorded_by_id, but that column is always empty, so I’m not sure how it would affect the dedup() operation. Could you share a completely reproducible example, using just the Basidiophora entospora / A B Seymour subset, for me to test?
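One possible shape for such an example, as a rough sketch only: the data frame `w` below is a made-up stand-in with invented values (the real columns and rows will differ), and `sub` is a hypothetical name. The idea is to subset to the records in question and post the `dput()` output, so the exact same rows, in the exact same order, can be recreated on another machine.

```r
# Hypothetical stand-in for the full data frame; values are invented
# purely to make this snippet self-contained.
w <- data.frame(
  cur_scientific_name = c("Basidiophora entospora",
                          "Basidiophora entospora",
                          "Some other species"),
  recorded_by = c("A B Seymour", "A B Seymour", "A B Seymour"),
  year = c(1887, 1887, 1890),
  stringsAsFactors = FALSE
)

# Subset to the records of interest (base R equivalent of the
# dplyr::filter() call above).
sub <- w[w$cur_scientific_name == "Basidiophora entospora" &
           w$recorded_by == "A B Seymour", ]

# Print a text representation that can be pasted into a post and
# evaluated to rebuild `sub` exactly, preserving row order.
dput(sub)

# The call under test would then be run on the rebuilt subset, e.g.:
# scrubr::dedup(sub, how = "one", tolerance = 1)
```

Posting one `dput()` output for each of the two sort orders would make it possible to compare the two dedup() results directly.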
