Counting almost-duplicates in very long lists – Six Colors

Feuding Families and the Upgradies

In the past couple of weeks, two different projects of mine have been released that were powered, at least in part, by a Python script that eliminated enormous amounts of labor from a process that used to take hours of drudgery.

Both The Upgradies and Feuding Families rely on compiling a list of most common answers from hundreds of submissions into a free-entry box in a Google form. As you might expect, this leads to some pretty inconsistent data entry. Poll people about their favorite Apple product of the year and you’ll get Mini, Mac mini, M4 Mac mini, The Mini, The Mac Mini, The New Mac Mini, and even things like Macmini and Mca mini and Macini.

I started down the path of automating because I just thought computers would do a better job of counting identical input than humans would. And that’s true, but the more I thought about it, the more I wanted the tool to go beyond counting identical entries—I wanted it to throw all the similar entries into the count as well. Why not?

And so I created the first iteration of this tool, which Myke Hurley and I have been using for our projects for a year or two. It’s a Python script that’s just inserted into a one-line Shortcut (run shell script, since you can choose Python from a list of shells) for convenience’s sake.

All the original script does is read the clipboard, put everything in title case (thereby avoiding differences in capitalization), and then strip out extraneous spaces and any added “The” via regular expressions.
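Here is a minimal sketch of that kind of cleanup. It is not the original script: the function name normalize and the two regular expressions are assumptions chosen to illustrate the idea, and it only handles a leading “The”.

```python
import re

def normalize(entry: str) -> str:
    """Hypothetical helper: tidy one free-text answer before counting."""
    # Squeeze runs of whitespace into single spaces and trim the ends
    text = re.sub(r"\s+", " ", entry).strip()
    # Drop a leading "the", whatever its capitalization
    text = re.sub(r"^the\s+", "", text, flags=re.IGNORECASE)
    # Title case hides differences in capitalization
    return text.title()

raw = ["The Mac Mini", "mac  mini ", "MAC MINI"]
print([normalize(r) for r in raw])
```

All three sample entries come out as the same string, so a plain count would treat them as one answer.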

Once all of that is normalized a bit, it’s run through a pretty amazing Python class called Counter, found in the collections module. It takes a list and returns a dictionary-like object that maps each item to the number of times it appears. My script processes that using the most_common method, which sorts the items from most to least frequent, and formats it fancy for export. And with that, a data set of one, two, three, four, four, four, five, one becomes:

Four 3
One 2
Two 1
Three 1
Five 1
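For the curious, the counting step can be sketched in a few lines. This assumes the list has already been normalized to title case; the variable names are mine, not the script’s.

```python
from collections import Counter

answers = ["One", "Two", "Three", "Four", "Four", "Four", "Five", "One"]
counts = Counter(answers)

# most_common() sorts by count, highest first; ties keep first-seen order
lines = [f"{item} {count}" for item, count in counts.most_common()]
print("\n".join(lines))
```

Run as written, this prints the same five lines shown above.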

So that’s good, but it still requires quite a bit of manual merging of items that aren’t quite close enough to be caught and normalized by my small stack of regular expressions. Enter Six Colors member Adrian, who suggested using Levenshtein distance (the number of single-character edits needed to turn one string into another) to match similar strings to one another. There’s even a Python package that does the trick.

So I decided to recreate my script using Adrian’s suggested approach. After normalizing the list by removing articles, punctuation, extra spaces, and different cases, my script loops through the list and uses the Levenshtein ratio to decide if a string is close enough to be considered part of the larger group.
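The grouping idea can be sketched roughly like this. To keep the example free of third-party installs, it uses a hand-rolled edit distance rather than the Levenshtein package, so the similarity scores here may differ from what the package’s ratio returns. The function names, the greedy first-match strategy, and the 0.8 threshold are all assumptions for illustration, not what the real script does.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Stand-in for a library ratio: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

def group_similar(entries, threshold=0.8):
    """Count each entry under the first existing label it is close enough to."""
    counts = Counter()
    for entry in entries:
        match = next(
            (label for label in counts if similarity(entry, label) >= threshold),
            None,
        )
        counts[match if match is not None else entry] += 1
    return counts

entries = ["Boba Fett", "Bobba Fett", "Boba Fet", "Bossk", "Boba Fett", "Greedo"]
print(group_similar(entries).most_common())
```

One design caveat with this greedy version: whichever spelling shows up first becomes the label for its group, so a typo at the top of the list would name the whole group. Picking the most frequent spelling in each group as the label would be a sensible refinement.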

Then I brought in ChatGPT to do the dirty work of formatting and sorting the output in a way that was pleasing to me. The result is that a data set of one, two, three, threee, five, one hundred, a hundred, one hundreed, theree, four, fore, five, one becomes:

Three 3
One Hundred 3
One 2
Five 2
Four 2
Two 1

I tested this all with actual Feuding Families data. Here’s the result of a real list of nearly 700 poll submissions answering the question “Name a bounty hunter in Star Wars”:

Boba Fett 335
Ig-88 86
The Mandalorian 41
Bossk 36
Mando 34
Greedo 27
Jango Fett 27
Ig-11 13

Not perfect: I’m going to have to manually merge The Mandalorian with Mando, but that’s it! And it managed to merge Baba Fet, Babo Fett, Bob A Fett, Bob Fett, Boba Fet, Boba Fety, Bobafett, Bobba Fet, Bobba Fett, Bobs Fett, Boda Feta, Bona Fett, and Bubba Fett together into a single “Boba Fett” answer.

Finally, the Mac-friendly integration: Because the script accepts input and generates output, I can save it as a Shortcut, opt for it to appear in the Services menu, and then use it in pretty much any Mac text editor. I just select the text of the list, choose Count Duplicates in List from the Services submenu under the application menu, and the uncounted list is replaced with a counted one.
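A script that works this way only has to read text in and write text out. Here is a rough sketch of that shape, assuming the Shortcut’s Run Shell Script action is set to pass its input on standard input and that the list has one answer per line. The names count_lines and run are hypothetical, and the normalization and fuzzy matching are left out to keep it short.

```python
import io
from collections import Counter

def count_lines(text: str) -> str:
    """Turn one-answer-per-line text into 'answer count' lines."""
    answers = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(f"{item} {n}" for item, n in Counter(answers).most_common())

def run(source, sink):
    """Read the whole list from source and write the counted list to sink."""
    sink.write(count_lines(source.read()) + "\n")

# In a Shortcut this would be run(sys.stdin, sys.stdout);
# StringIO objects stand in for them here so the example runs anywhere.
demo_out = io.StringIO()
run(io.StringIO("Bossk\nGreedo\nBossk\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Whatever the script writes to standard output becomes the Shortcut’s result, which is what lets the service swap the selected text for the counted list.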

Anyway, if you ever find yourself with the very specific need to process a bunch of similar, but not identical, answers, the script is available as a gist on GitHub.
