{"id":7083,"date":"2024-08-24T03:00:00","date_gmt":"2024-08-24T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=7083"},"modified":"2024-08-16T12:33:39","modified_gmt":"2024-08-16T16:33:39","slug":"check-spelling-at-the-command-line","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=7083","title":{"rendered":"Check spelling at the command line"},"content":{"rendered":"<p class=\"wp-block-paragraph\">I like to write a lot of articles about Linux, FreeDOS, programming, and open source software. I am pretty confident of my spelling and grammar, but sometimes it\u2019s nice to run a quick spell-check to see if I have any spelling errors in my document. When I write documents in LibreOffice, I can use the spell-checker that\u2019s built into LibreOffice, or I can rely on the \u201cred squiggle underline\u201d to catch spelling errors as I type.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"770\" height=\"572\" src=\"https:\/\/www.both.org\/wp-content\/uploads\/2024\/08\/typo.png\" alt=\"\" class=\"wp-image-7084\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">But I write most of my articles in Markdown. If I\u2019m using Vim, I can turn on the automatic inline spell-checker using this Vim command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>:set spell spelllang=en_us<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And that does a great job of catching spelling errors as I type. 
But I prefer to type without distraction, and capture my thoughts quickly. Going back to fix typos <em>while I type<\/em> is very distracting for me. Instead, I wanted to have a command line tool where I could run a quick command to check for any spelling errors in my document <em>after I\u2019d finished writing it<\/em>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"original-spell-checkers\">Original spell-checkers<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The original Unix introduced the <code>typo<\/code> command in Unix 3rd Edition. The <code>spell<\/code> command was added in its place starting with Unix 6th Edition. Each command does basically the same thing: checks every word in a document, and prints a sorted list of unique misspelled words.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When I was an undergraduate, our campus lab of Sun computers had <code>ispell<\/code>, which was an interactive version of <code>spell<\/code> that showed any misspelled words <em>in context<\/em> in the text document, and suggested correct spellings. These days, <a href=\"http:\/\/aspell.net\/\">GNU Aspell<\/a> replaces <code>ispell<\/code> for checking spelling at the command line on Linux and other Unix-like systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"building-your-own-spell-checker\">Building your own spell-checker<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">You can build your own version of the original <code>typo<\/code> command using a few Linux commands. At a high level, you can do this by breaking apart a document into <em>words<\/em>, sorting that list of words and removing duplicates, then comparing that list to a list of <em>correctly spelled words<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It\u2019s easiest to convert all words to lowercase, and that\u2019s how the original <code>typo<\/code> command worked. 
To convert all text to lowercase, use the <code>tr<\/code> (transliterate) command, replacing all uppercase letters with lowercase letters:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tr 'A-Z' 'a-z'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, remove all punctuation from the input. You can also use <code>tr<\/code> to do this, with the <code>-d<\/code> (delete) option:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tr -d '.,:;\"?!@()'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then, break up the text so that each word appears on its own line. A simple way to do this is to use the <code>tr<\/code> command to convert spaces to newlines:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tr ' ' '\\n'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">From there, you can use <code>sort<\/code> to sort the list, and <code>uniq<\/code> to remove any duplicates:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sort | uniq<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The last step uses the <code>comm<\/code> (common lines) command to compare two files: the <em>list of words from the document<\/em> with <em>another list of correctly spelled words<\/em>. The <code>comm<\/code> program assumes both lists are sorted in the same way, and it produces output where the lines unique to the first file are in one column, lines unique to the second file are in a second column, and the lines common to both appear in a third column.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When comparing lists of words, that means the correctly spelled words (words that appear in <em>both<\/em> the document <em>and<\/em> the list of correctly spelled words) will be in column 3, while misspelled words (words that appear <em>only<\/em> in the document) will be in column 1. 
To display only column 1 (misspelled words) we need to disable columns 2 and 3 with the <code>-2<\/code> and <code>-3<\/code> options:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>comm -2 -3 - $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To put that all together, the full command line looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cat \"$@\" | tr 'A-Z' 'a-z' | tr -d '.,:;\"?!@()' | tr ' ' '\\n' | sort | uniq | comm -2 -3 - $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This requires a sorted list of correctly spelled words. Every Unix-like system should have this list saved as <code>\/usr\/share\/dict\/words<\/code>, but the list may not be sorted in the same way that the <code>sort<\/code> command would generate, so I like to work with a local copy. My full <code>typo<\/code> script looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\nwords=$HOME\/lib\/words.tmp\n\n&#91; -f $words ] || sort \/usr\/share\/dict\/words &gt; $words\n\ncat \"$@\" | tr 'A-Z' 'a-z' | tr -d '.,:;\"?!@()' | tr ' ' '\\n' | sort | uniq | comm -2 -3 - $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s test it! Let\u2019s say I had this one-line document called <code>test.md<\/code> that had a single misspelled word:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>This is a sample document with a mspelled word.<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If I run the <code>typo<\/code> script against this file, I get the one misspelled word as the only output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ typo test.md\nmspelled<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-streamlined-version\">A streamlined version<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">That <code>typo<\/code> script works well for me, but I\u2019ve experimented with other ways to implement it. 
The basic steps remain the same, but I wanted to use the <em>character class<\/em> model from GNU <code>tr<\/code> to do the same thing. One way to start is with the <code>-c<\/code> (complement) option to convert any character that is <em>not<\/em> a letter into a newline character:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tr -c '&#91;:alpha:]' '\\n'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then I wanted to immediately reduce the work for following steps by removing the blank lines. The <code>grep<\/code> command can do this easily:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>grep -v '^$'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The remaining steps are much as they were in the original. The script converts uppercase letters to lowercase letters, this time using the <code>[:upper:]<\/code> and <code>[:lower:]<\/code> character classes instead of <code>A-Z<\/code> and <code>a-z<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tr '&#91;:upper:]' '&#91;:lower:]'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, sort the words with <code>sort<\/code> and remove duplicates with <code>uniq<\/code> before comparing the output with <code>comm<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sort | uniq | comm -2 -3 - $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The full command line looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cat \"$@\" | tr -c '&#91;:alpha:]' '\\n' | grep -v '^$' | \\\n tr '&#91;:upper:]' '&#91;:lower:]' | sort | uniq | comm -2 -3 - $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I followed the same basic steps to prepare the sorted list of correctly spelled words, based on the <code>\/usr\/share\/dict\/words<\/code> file. 
To accommodate any words that I use but aren\u2019t in the system list, such as when I write about FreeDOS, I combine the system list with a list of my own words saved in <code>mywords<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cat \/usr\/share\/dict\/words $HOME\/lib\/mywords | \\\n tr '&#91;:upper:]' '&#91;:lower:]' | sort | uniq &gt; $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">My new <code>typo<\/code> script looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\nwords=$HOME\/lib\/words.tmp\n\nif &#91; ! -f $words ] ; then\n cat \/usr\/share\/dict\/words $HOME\/lib\/mywords | \\\n  tr '&#91;:upper:]' '&#91;:lower:]' | sort | uniq &gt; $words\nfi\n\ncat \"$@\" | tr -c '&#91;:alpha:]' '\\n' | grep -v '^$' | \\\n tr '&#91;:upper:]' '&#91;:lower:]' | sort | uniq | comm -2 -3 - $words<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The script does <em>mostly<\/em> the same job as the previous <code>typo<\/code> script. For example, it finds the same misspelled word from the previous example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ typo test.md \nmspelled<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The limitation in this \u201cimproved\u201d version is that the <code>tr -c<\/code> command removes hyphens and apostrophes, so words like <code>hadn't<\/code> will get split up into <code>hadn<\/code> and <code>t<\/code>, resulting in <code>hadn<\/code> being identified as a \u201cmisspelled\u201d word, despite the original word being in the list of correctly spelled words. 
However, as a quick spell-check tool, this works well enough for me.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Check spelling at the command line like old-school Unix with this cool script.<\/p>\n","protected":false},"author":33,"featured_media":3868,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-7083","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"David Both","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/7083","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7083"}],"version-history":[{"count":4,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/7083\/revisions"}],"predecessor-version":[{"id":7089,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/7083\/revisions\/7089"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3868"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7083"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7083"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7083"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templa
ted":true}]}}