Counting words from online articles
I recently wrote a week-long series of articles for “DOScember,” highlighting some things you can do with FreeDOS such as editing files with an Emacs-like editor, programming in BASIC, and listening to music. When I was done, I was curious: how much did I write?
One way that I could tally how much I wrote in that article series is to copy and paste each article into a word processor like LibreOffice, and add up the word counts for each article to get a total. And for just seven articles, that might not take too long. But I prefer to follow the Linux philosophy and automate things where I can. Here’s how I did it using the command line.
Start with a list
The first step was the only manual one: I copied the URLs for each article and pasted them into a text file, with each URL on a separate line:
https://www.both.org/?p=12944
https://www.both.org/?p=12955
https://www.both.org/?p=12967
https://www.both.org/?p=12971
https://www.both.org/?p=12976
https://www.both.org/?p=12978
https://www.both.org/?p=12986
I saved it as list, which is a very plain and obvious filename.
Convert to plain text
Linux includes the wc (“word count”) command, which is a standard Unix utility that counts words, lines, and characters in a text file. You can process one or more files at a time; if you examine more than one file, wc will also print a total word count, which is what I wanted.
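To see this behavior quickly, here is a throwaway demo (the file names are arbitrary, made up just for this sketch):

```shell
# Two small sample files, created only for the demo
printf 'one two three\n' > a.txt
printf 'four five\n' > b.txt

# With more than one file, wc prints per-file counts and a final "total" line
wc -w a.txt b.txt

rm -f a.txt b.txt
```

The last line of the output is the combined total, which is exactly the number I was after.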
But websites transmit data as HTML, which includes a lot of extra markup that will skew my word count. Instead, I preferred to process the articles in plain text so I could get a more accurate word count for each.
For that, I used pandoc, an open source universal document converter that can convert from and to all kinds of formats, including from HTML and to plain text. If you don’t have pandoc installed on your system already, you can install it using your distribution’s package manager, such as dnf on Fedora:
$ sudo dnf install pandoc
A neat feature in pandoc is that it can read directly from a website, without having to fetch the HTML document separately. For example, I can use this command to convert just one of the URLs to plain text:
$ pandoc --from html --to plain https://www.both.org/?p=12986 -o f.txt
This saves the content of the “Edit text with this Emacs-like editor” article into a plain text file called f.txt, with 971 words in it:
$ wc -w f.txt
971 f.txt
Count the overhead
The plain text file is a complete copy of the text components from the web page, including the website’s header and footer. You can see the extra “header” text by using the head command to view the first 20 lines:
$ head -20 f.txt
Home
Edit text with this Emacs-like editor
[]
FreeDOS Fun Text Editors
Edit text with this Emacs-like editor
[]
Jim Hall
December 27, 2025December 16, 2025
On Linux, I often use the GNU Emacs editor to write the source code for
new programs. I learned GNU Emacs long ago when I was an undergraduate
student, and I still have the "finger memory" for all the keyboard
shortcuts.
Similarly, you can see the extra “footer” text by using the tail command, although I found it easier to use the less command to view the file interactively. I discovered that my text was between the “December 27, 2025” article date and the “Bluesky” link, which is the first link in the website’s footer.
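As a quick self-contained illustration of tail (on a generated file rather than the real article page):

```shell
# Generate a 30-line demo file, then peek at its end
seq 1 30 > demo.txt
tail -5 demo.txt   # shows lines 26 through 30
rm -f demo.txt
```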
Using the awk command, I stripped out the extra header and footer text, to count just the text from my article:
$ awk '/December 27, 2025/ {p=1} /Bluesky/ {p=0} p==1 {print}' f.txt | wc -w
790
This is a three-part awk script written on one line: each instruction is a pattern-action pair. The first pair sets the variable p to 1 when it matches the text “December 27, 2025” on a line. The second pair sets p back to 0 when it finds the line with “Bluesky” in it. The third pair prints any line whenever the value of p is 1.
This effectively prints just my text from the article, and pipes it to the wc command to count the words. In the end, my article’s word count was about 790 words. Subtracting 790 from 971 (the word count of the whole file) means the “overhead” is about 181 words:
$ expr 971 - 790
181
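This pattern-action trick generalizes to any text with known start and end markers. A minimal sketch, using made-up START and END markers in place of the date line and the “Bluesky” line:

```shell
# Sample input where START/END stand in for the date line and the "Bluesky" line
printf 'site header\nSTART\nthe article text itself\nEND\nsite footer\n' > sample.txt

# p=1 switches printing on at the start marker; p=0 switches it off at the end
# marker (the start-marker line itself is printed and counted, as in the article)
awk '/START/ {p=1} /END/ {p=0} p==1 {print}' sample.txt | wc -w

rm -f sample.txt
```

This prints 5: the START word plus the four words of “article” text, with the header, footer, and END line all excluded.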
Count all files at once
Now that I know the “overhead” for each article, I can tally the total words across all of my articles in the “DOScember” series. I used a for loop at the Bash prompt to convert each article in the list to a plain text file, then used wc to count the words:
$ n=1; for url in $(cat list); do pandoc --from html --to plain "$url" -o "article$((n++)).txt"; done
$ wc -w article?.txt
973 article1.txt
976 article2.txt
1019 article3.txt
1013 article4.txt
1009 article5.txt
1804 article6.txt
971 article7.txt
7765 total
The Bash line has several neat features that I should explain:
- Use $( ) for command substitution, which expands a command into its output. For example, $(cat list) expands to the list of URLs in the file named list
- The $(( )) syntax performs arithmetic expansion. Inside it, Bash variables are expanded without having to use an extra $, so $((n++)) means “expand to the current value of n, then increment it by one,” which allows the for loop to increment the n variable as it processes each URL
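A tiny self-contained Bash demo of both features together, with no files involved (the words here are arbitrary):

```shell
# $(echo ...) is command substitution; $((n++)) is arithmetic expansion
n=1
for word in $(echo alpha beta gamma); do
    echo "item$((n++)): $word"
done
# item1: alpha
# item2: beta
# item3: gamma
```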
Adjust the count by the “overhead”
The total word count is actually off by about 181 words for each article. To calculate the total without the “overhead,” I can use the wc command in a loop, and subtract the extra 181 from each article’s word count while working out the total:
$ total=0; for f in article?.txt; do count=$(wc -w --total=only $f); total=$((total + count - 181)); done; echo $total
6498
Again, this Bash line uses command substitution and arithmetic expansion. The $(wc -w --total=only $f) command substitution runs the wc command to count words from the file in the $f variable; the --total=only option is a GNU wc extension to only print the total, not the filename. The running total uses $((total + count - 181)) to add the word count for each article, minus the extra “overhead.”
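The same running-total pattern can be tried on throwaway files, with a pretend overhead of 2 words per file (file names and counts here are made up for the demo):

```shell
# Two throwaway "articles" of 5 and 7 words
printf 'a b c d e\n' > demo1.txt
printf 'a b c d e f g\n' > demo2.txt

# wc -w < "$f" is a portable alternative to the GNU-only --total=only option
total=0
for f in demo?.txt; do
    count=$(wc -w < "$f")
    total=$((total + count - 2))   # subtract the pretend 2-word overhead
done
echo $total    # (5-2) + (7-2) = 8

rm -f demo1.txt demo2.txt
```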
In the end, I found that I had written about 6,498 words for all seven articles. Doing this from the command line was very easy, requiring only the pandoc command to convert the web pages to plain text, and wc to count the words. A few clever Bash commands later, and I had the total.
I had my answer in about a minute, including the time to copy and paste the list of URLs, and figuring out what commands to run. In contrast, copying and pasting each article into a word processor, counting the words in each, and using a calculator or spreadsheet to tally the total word counts would have taken much longer.