{"id":13156,"date":"2026-01-07T03:00:00","date_gmt":"2026-01-07T08:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=13156"},"modified":"2025-12-29T18:59:58","modified_gmt":"2025-12-29T23:59:58","slug":"counting-words-from-online-articles","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=13156","title":{"rendered":"Counting words from online articles"},"content":{"rendered":"<p>I recently wrote a week-long series of articles for &#8220;DOScember,&#8221; highlighting some things you can do with FreeDOS such as editing files with an Emacs-like editor, programming in BASIC, and listening to music. When I was done, I was curious: how much did I write?<\/p>\n\n\n\n<p>One way that I could tally how much I wrote in that article series is to copy and paste <em>each article<\/em> into a word processor like LibreOffice, and add up the word counts for each article to get a total. And for just seven articles, that might not take too long. But I prefer to follow the Linux philosophy and <a href=\"https:\/\/www.both.org\/?p=7228\">automate things where I can<\/a>. 
Here&#8217;s how I did it using the command line.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"start-with-a-list\">Start with a list<\/h2>\n\n\n\n<p>The first step was the only manual one: I copied the URLs for each article and pasted them into a text file, with each URL on a separate line:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">https:\/\/www.both.org\/?p=12944<br>https:\/\/www.both.org\/?p=12955<br>https:\/\/www.both.org\/?p=12967<br>https:\/\/www.both.org\/?p=12971<br>https:\/\/www.both.org\/?p=12976<br>https:\/\/www.both.org\/?p=12978<br>https:\/\/www.both.org\/?p=12986<\/pre>\n\n\n\n<p>I saved it as <code>list<\/code>, which is a very plain and obvious filename.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"convert-to-plain-text\">Convert to plain text<\/h2>\n\n\n\n<p>Linux includes the <strong>wc<\/strong> (&#8220;word count&#8221;) command, which is a standard Unix utility that counts words, lines, and characters in a text file. You can process one or more files at a time; if you examine more than one file, <strong>wc<\/strong> will also print a total word count, which is what I wanted.<\/p>\n\n\n\n<p>But websites transmit data as HTML, which includes a lot of extra markup that will skew my word count. Instead, I preferred to process the articles in plain text so I could get a more accurate word count for each.<\/p>\n\n\n\n<p>For that, I used <strong>pandoc<\/strong>, an open source <a href=\"https:\/\/pandoc.org\/\">universal document converter<\/a> that can convert from and to all kinds of formats, including <em>from HTML<\/em> and <em>to plain text<\/em>. 
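<\/p>\n\n\n\n<p>Before moving on to <strong>pandoc<\/strong>, here is a minimal sketch of that <strong>wc<\/strong> &#8220;total&#8221; behavior, using two throwaway files (the <code>sample1.txt<\/code> and <code>sample2.txt<\/code> names are just placeholders for this example):<\/p>

```shell
# Create two small sample files (hypothetical names for this sketch)
printf 'one two three\n' > sample1.txt
printf 'four five\n' > sample2.txt

# With more than one file, wc prints a per-file word count
# plus a final "total" line; here the last line reads "5 total"
wc -w sample1.txt sample2.txt
```

<p>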
If you don&#8217;t have <strong>pandoc<\/strong> installed on your system already, you can install it using your distribution&#8217;s package manager, such as <strong>dnf<\/strong> on Fedora:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ sudo dnf install pandoc<\/code><\/pre>\n\n\n\n<p>A neat feature in <strong>pandoc<\/strong> is that it can read directly from a website, without having to fetch the HTML document separately. For example, I can use this command to convert just one of the URLs to plain text:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ pandoc --from html --to plain https:\/\/www.both.org\/?p=12986 -o f.txt<\/code><\/pre>\n\n\n\n<p>This saves the content of the <a href=\"https:\/\/www.both.org\/?p=12986\">Edit text with this Emacs-like editor<\/a> article into a plain text file called <code>f.txt<\/code> with 971 words in it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -w f.txt\n971 f.txt<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"count-the-overhead\">Count the overhead<\/h2>\n\n\n\n<p>The plain text file is a complete copy of the text components from the web page, including the website&#8217;s header and footer. You can see the extra &#8220;header&#8221; text by using the <strong>head<\/strong> command to view the first 20 lines:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ head -20 f.txt\nHome\n\nEdit text with this Emacs-like editor\n\n&#91;]\n\nFreeDOS Fun Text Editors\n\nEdit text with this Emacs-like editor\n\n&#91;]\n\nJim Hall\n\nDecember 27, 2025December 16, 2025\n\nOn Linux, I often use the GNU Emacs editor to write the source code for\nnew programs. 
I learned GNU Emacs long ago when I was an undergraduate\nstudent, and I still have the \"finger memory\" for all the keyboard\nshortcuts.<\/code><\/pre>\n\n\n\n<p>Similarly, you can see the extra &#8220;footer&#8221; text by using the <strong>tail<\/strong> command, although I found it easier to use the <strong>less<\/strong> command to view the file interactively. I discovered that <em>my text<\/em> was between the &#8220;December 27, 2025&#8221; article date and the &#8220;Bluesky&#8221; link, which is the first link in the website&#8217;s footer.<\/p>\n\n\n\n<p>Using the <strong>awk<\/strong> command, I stripped out the extra header and footer text to count just the text from my article:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>awk '\/December 27, 2025\/ {p=1} \/Bluesky\/ {p=0} p==1 {print}' f.txt | wc -w\n790<\/code><\/pre>\n\n\n\n<p>This is a three-part <strong>awk<\/strong> script written on one line: each instruction is a <a href=\"https:\/\/www.both.org\/?p=10802\">pattern-action pair<\/a>. The first pair sets the variable <code>p<\/code> to the value 1 when it matches the text &#8220;December 27, 2025&#8221; on a line. The second pair sets <code>p<\/code> to 0 when it finds the line with &#8220;Bluesky&#8221; in it. The third pair prints any line whenever the value of <code>p<\/code> is 1.<\/p>\n\n\n\n<p>This prints just my text from the article, then pipes it to the <strong>wc<\/strong> command to count the words. In the end, my article&#8217;s word count was about 790 words. Subtracting 790 from 971 (the word count for the entire file) means the &#8220;overhead&#8221; is about 181 words:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ expr 971 - 790\n181<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"count-all-files-at-once\">Count all files at once<\/h2>\n\n\n\n<p>Now that I know the &#8220;overhead&#8221; for each article, I can tally the total words across all of my articles in the &#8220;DOScember&#8221; series. 
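<\/p>\n\n\n\n<p>As an aside, the <strong>awk<\/strong> trick from the previous section works with any pair of marker patterns, not just my article date and the &#8220;Bluesky&#8221; link. Here is a minimal sketch using hypothetical START and STOP markers:<\/p>

```shell
# Keep only the lines from the START marker up to (but not including)
# the STOP marker, using the same pattern-action flag technique
printf 'header\nSTART\nbody one\nbody two\nSTOP\nfooter\n' |
  awk '/START/ {p=1} /STOP/ {p=0} p==1 {print}'
```

<p>Note that the starting marker line itself is printed, because the <code>\/START\/ {p=1}<\/code> action runs before the <code>p==1 {print}<\/code> check on the same line.<\/p>\n\n\n\n<p>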
I used a <strong>for<\/strong> loop at the Bash prompt to convert each article in the list to a plain text file, then used <strong>wc<\/strong> to count the words:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ n=1; for url in $(cat list); do pandoc --from html --to plain $url -o article$((n++)).txt; done\n$ wc -w article?.txt\n  973 article1.txt\n  976 article2.txt\n 1019 article3.txt\n 1013 article4.txt\n 1009 article5.txt\n 1804 article6.txt\n  971 article7.txt\n 7765 total<\/code><\/pre>\n\n\n\n<p>The Bash line has several neat features that I should explain:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <code>$( )<\/code> for <em>command substitution<\/em>, which replaces a command with its output. For example, <code>$(cat list)<\/code> expands to the list of URLs in the file named <code>list<\/code><\/li>\n\n\n\n<li>Use <code>$(( ))<\/code> for <em>arithmetic expansion<\/em>. Inside it, Bash variables are expanded without having to use an extra <code>$<\/code>, so <code>$((n++))<\/code> means &#8220;print the current value of <code>n<\/code>, then increment it by one,&#8221; which allows the <strong>for<\/strong> loop to increment the value of the <code>n<\/code> variable as it processes each URL<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"adjust-the-count-by-the-overhead\">Adjust the count by the &#8220;overhead&#8221;<\/h2>\n\n\n\n<p>The total word count is actually off by about 181 words for each article. To calculate the total without the &#8220;overhead,&#8221; I can use the <strong>wc<\/strong> command in a loop, and subtract the extra 181 from each article&#8217;s word count while working out the total:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ total=0; for f in article?.txt; do count=$(wc -w --total=only $f); total=$((total + count - 181)); done; echo $total\n6498<\/code><\/pre>\n\n\n\n<p>Again, this Bash line uses command substitution and arithmetic expansion. 
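<\/p>\n\n\n\n<p>To see the post-increment behavior in isolation, here is a minimal sketch you can paste at a Bash prompt (the <code>article<\/code> filenames are just illustrative):<\/p>

```shell
# $((n++)) expands to the current value of n, then adds one to it,
# so each expansion produces the next filename in sequence
n=1
echo "article$((n++)).txt"   # prints article1.txt; n is now 2
echo "article$((n++)).txt"   # prints article2.txt; n is now 3

# Command substitution captures a command's output into a variable;
# inside $(( )), the variable is used without a leading $
count=$(echo 'three words here' | wc -w)
echo "$((count + 1))"        # prints 4
```

<p>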
The <code>$(wc -w --total=only $f)<\/code> command substitution runs the <strong>wc<\/strong> command to count words from the file in the <code>$f<\/code> variable; the <code>--total=only<\/code> option is a GNU <strong>wc<\/strong> extension to only print the total, not the filename. The running total uses <code>$((total + count - 181))<\/code> to add the word count for each article, minus the extra &#8220;overhead.&#8221;<\/p>\n\n\n\n<p>In the end, I found that I had written about 6,498 words for all seven articles. Doing this from the command line was very easy, requiring only the <strong>pandoc<\/strong> command to convert the web pages to plain text, and <strong>wc<\/strong> to count the words. A few clever Bash commands later, and I had the total.<\/p>\n\n\n\n<p>I had my answer in about a minute, including the time to copy and paste the list of URLs, and figuring out what commands to run. In contrast, copying and pasting <em>each article<\/em> into a word processor, counting the words in each, and using a calculator or spreadsheet to tally the total word counts would have taken much longer.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s how I automated counting words from a series of articles.<\/p>\n","protected":false},"author":33,"featured_media":4464,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-13156","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/13156","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13156"}],"version-history":[{"count":7,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/13156\/revisions"}],"predecessor-version":[{"id":13164,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/13156\/revisions\/13164"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/4464"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13156"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13156"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13156"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}