Counting files and words from the command line
On another article-based website that I manage, all of the content is stored as files. Using files for everything was a convenience when I first set up the website several years ago, but it has turned out to be a fast and secure way to run the website. There is no database to hack, and the server overhead is very low when it’s just serving a collection of HTML-formatted files.
This also makes it really easy to calculate how much I wrote over any given year. With a few commands, I can count how many articles I wrote and how many words are in those articles. Let’s demonstrate by looking back at 2025 to see what I did:
How many articles
First, I need to separate what I wrote from what others wrote. Because the website uses files for everything, I can use standard Linux commands to figure this out.
To separate the articles, it’s important to know a little about how the data is organized on the system. Each article is saved in a separate directory in a date-formatted path. For example, for an article that ran on July 1, the path to that article would be 2025/07/01/article. The article directory contains a few other files, including a file called 2025/07/01/article/content.html that contains the article text, and a file called 2025/07/01/article/author that lists the authors who contributed to an article.
The author file is almost always a single line, because articles usually have just one author. But sometimes multiple people contribute to an article, which means we need to cite more than one author. Each author’s username is listed on a separate line in the author file. That means I can use this file to tell what I wrote apart from what others wrote.
To get a list of articles that I wrote, I can run the find command to look for all of the author files, and use grep to search for my username. If my name is there, then I wrote it; if not, someone else wrote it.
$ find 2025/ -type f -name author -exec grep -q jhall {} \; -print > jhall.list
If you haven’t used find before, it may seem there’s a lot going on in this command line, so let’s walk through it:
- The -type f option says to look for files
- Adding the -name author option says to look for files called author
- The -exec option tells find what to do when it matches a file; in this case, it runs grep with the -q option, which prints nothing and only reports whether it found a match
- The {} braces are a placeholder for the matching filename
- Use ; to terminate the -exec statement (because ; is a special character to Bash, I’ve “escaped” it as \;)
- The -print option prints the matching filename; since this comes after an -exec statement, the filename will only be printed if the grep command succeeds
After this command, the jhall.list file has a list of entries; each is a separate path to an author file that contains my username. I can use the wc command with the -l option to count the lines in this file, to see that I wrote 34 articles:
$ wc -l jhall.list
34 jhall.list
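To try the find command without touching real data, here is a minimal sketch that builds a tiny article tree in a temporary directory and runs the same search. The dates, directory names, and the username "other" are made up for illustration; only the find command itself comes from the steps above.

```shell
# Build a tiny, made-up article tree in a temporary directory.
# The dates, directory names, and the username "other" are hypothetical.
demo=$(mktemp -d)
cd "$demo" || exit 1

mkdir -p 2025/07/01/article 2025/07/02/article 2025/07/03/article
printf 'jhall\n' > 2025/07/01/article/author
printf 'other\n' > 2025/07/02/article/author
printf 'jhall\n' > 2025/07/03/article/author

# Same find command as above: print each author file
# only if grep finds the username inside it.
find 2025/ -type f -name author -exec grep -q jhall {} \; -print > jhall.list

# Count the matches; reading from stdin keeps the filename out of the output.
mine=$(wc -l < jhall.list | tr -d ' ')
echo "$mine"
```

In this sample tree, two of the three author files contain jhall, so the script prints 2. Note that grep matches substrings, so a username such as jhallman would also match; adding grep's -x option to match whole lines is one way to tighten the search.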
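I can run the same find command with grep -v to “invert” the search. Because -v matches lines that do not contain my username, this prints every author file that has at least one line naming someone else; as long as I did not co-author any articles, these are the articles that others wrote: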
$ find 2025/ -type f -name author -exec grep -q -v jhall {} \; -print > others.list
$ wc -l others.list
43 others.list
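One caveat with grep -v is that it inverts the match line by line, so an author file that lists my username and a second author would show up in both lists. The sketch below compares grep -v with an alternative that negates the whole -exec test using find's ! operator. This is a swapped-in technique rather than the method used above, and the sample tree and the username "other" are hypothetical.

```shell
# Made-up tree: one solo article, one co-authored article, one guest article.
demo=$(mktemp -d)
cd "$demo" || exit 1

mkdir -p 2025/07/01/solo 2025/07/02/shared 2025/07/03/guest
printf 'jhall\n'        > 2025/07/01/solo/author
printf 'jhall\nother\n' > 2025/07/02/shared/author
printf 'other\n'        > 2025/07/03/guest/author

# grep -v succeeds if ANY line lacks the username, so the shared
# article is listed here even though jhall co-wrote it.
find 2025/ -type f -name author -exec grep -q -v jhall {} \; -print > others_v.list

# Negating the -exec test lists only files with no matching line at all.
find 2025/ -type f -name author ! -exec grep -q jhall {} \; -print > others_not.list

with_v=$(wc -l < others_v.list | tr -d ' ')
with_not=$(wc -l < others_not.list | tr -d ' ')
echo "grep -v: $with_v  negated -exec: $with_not"
```

On this sample data, grep -v lists two files and the negated test lists one. When no article is co-authored, both forms produce the same list.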
How many words
I’m also curious to see how much I wrote. For that, I need to examine the content.html file for each article. Counting words in this file will not give the exact word count for the article, because the file also contains HTML markup. But for articles with paragraphs and simple formatting, the count should be pretty close, and for my needs, this is close enough.
To count the words that I wrote, I need to run the wc command for every article written by me. I don’t have that list of article content, but I can get it by editing the list I already have.
The body text for each article is stored in the content.html file. The jhall.list file contains a list of paths to the author files, for articles that I wrote. For example, this might be 2025/07/01/article/author for an article published on July 1. If we replace the word author with content.html, we will end up with a list of the HTML content files. The sed command can make that replacement for us, using the s edit instruction to replace or “swap” the string author at the end of each line (the $ means “at the end of a line”) with content.html:
$ sed -e 's/author$/content.html/' jhall.list > jhall_content.list
$ sed -e 's/author$/content.html/' others.list > others_content.list
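The $ anchor matters when the word author appears elsewhere in a path. As a small sketch, the two paths below are made up, and the second one has author in a directory name as well as in the filename; only the final author is replaced.

```shell
# Two hypothetical paths; the second has "author" in a directory name too.
result=$(printf '%s\n' \
    '2025/07/01/article/author' \
    '2025/07/02/author-interview/author' |
    sed -e 's/author$/content.html/')

# Each line should now end in content.html, with the directory name untouched.
echo "$result"
```

Without the $, sed would replace the first author on each line, which would turn the directory name in the second path into content.html-interview.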
To process each content.html file with the wc command, I could type the whole list of files on the wc command line. But for a very long list, this might “overload” the command line with too many files. Instead, use the xargs command, which reads the filenames from the list and runs wc with those files as arguments:
$ xargs wc -w --total=only < jhall_content.list
40611
$ xargs wc -w --total=only < others_content.list
29474
The --total=only option is an extension in recent versions of GNU wc to only print the total, and nothing else. Without it, wc would also print the word count for each file in the list.
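On a system where wc does not support the total option, one alternative is to let xargs run cat on the listed files and pipe everything into a single wc. This is a sketch of a different approach, not the command used above; the two content files and the content.list name are hypothetical.

```shell
# Two made-up content files with 3 and 4 whitespace-separated words.
demo=$(mktemp -d)
cd "$demo" || exit 1

mkdir -p a b
printf '<p>one two three</p>\n' > a/content.html
printf '<p>four five <b>six</b> seven</p>\n' > b/content.html
printf 'a/content.html\nb/content.html\n' > content.list

# Concatenate every listed file, then count once. Even if xargs has to
# split a very long list into several cat runs, wc still sees one stream
# and prints a single total.
total=$(xargs cat < content.list | wc -w | tr -d ' ')
echo "$total"
```

This prints 7 for the sample files. A side benefit is that the result stays a single number no matter how long the list is, whereas xargs may run wc more than once on a very long list and report a separate total for each run.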
Using the command line
With just the find and wc commands, I can see that we ran 77 articles on that website. I wrote 34 of the articles, or just under half; other contributors wrote 43 articles. And by adding the sed and xargs commands, I can see that I wrote a total of over 40,000 words in 34 articles, while others wrote a total of about 29,500 words across 43 articles. The higher word count in my articles makes sense because many of my articles included source code samples, and the source code gets included in the word count.
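The totals above can be turned into averages with shell arithmetic. This sketch uses the counts reported earlier; the variable names are my own, and integer division rounds down.

```shell
# Counts reported by wc in the steps above.
my_articles=34
my_words=40611
other_articles=43
other_words=29474

total_articles=$((my_articles + other_articles))
total_words=$((my_words + other_words))

# Integer division: results are truncated, not rounded.
my_avg=$((my_words / my_articles))
other_avg=$((other_words / other_articles))
my_share=$((my_words * 100 / total_words))

echo "articles: $total_articles, words: $total_words"
echo "average words per article: mine $my_avg, others $other_avg"
echo "my share of all words: ${my_share}%"
```

That works out to roughly 1,194 words per article for me and 685 for other contributors, which fits with the source code samples adding to my counts.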