{"id":13044,"date":"2025-12-30T03:00:00","date_gmt":"2025-12-30T08:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=13044"},"modified":"2025-12-22T13:48:41","modified_gmt":"2025-12-22T18:48:41","slug":"counting-files-and-words-from-the-command-line","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=13044","title":{"rendered":"Counting files and words from the command line"},"content":{"rendered":"<p>On another article-based website that I manage, all of the content is stored as <em>files<\/em>. Using files for everything was a convenience when I first set up the website several years ago, but it has turned out to be a fast and secure way to run the website. There is no database to hack, and the server overhead is very low when it\u2019s just serving a collection of HTML-formatted files.<\/p>\n\n\n\n<p>This also makes it really easy to calculate how much I wrote over any given year. With a few commands, we can count how many articles I wrote, and how many words in those articles. Let\u2019s demonstrate by looking back at 2025 to see what I did:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-many-articles\">How many articles<\/h2>\n\n\n\n<p>First, I need to separate <em>what I wrote<\/em> from <em>what others wrote<\/em>. Because the website uses files for everything, I can use standard Linux commands to figure this out.<\/p>\n\n\n\n<p>To separate the articles, it\u2019s important to know a little about how the data is organized on the system. 
Each article is saved in a separate directory in a date-formatted path. For example, for an article that ran on July 1, the path to that article would be <code>2025\/07\/01\/article<\/code>. The <code>article<\/code> directory contains a few files, including a file called <code>2025\/07\/01\/article\/content.html<\/code> that contains the article text, and a file called <code>2025\/07\/01\/article\/author<\/code> that lists the authors who contributed to the article.<\/p>\n\n\n\n<p>The <code>author<\/code> file is almost always a single line, because articles usually have just one author. But sometimes multiple people might contribute to an article, which means we need to cite more than one author. Each author\u2019s username is listed as a separate line in the <code>author<\/code> file. That means I can use this file to distinguish <em>what I wrote<\/em> from <em>what others wrote<\/em>.<\/p>\n\n\n\n<p>To get a list of articles that <em>I wrote<\/em>, I can run the <strong>find<\/strong> command to look for all of the <code>author<\/code> files, and use <strong>grep<\/strong> to search for my username. 
If my name is there, then I wrote it; if not, someone else wrote it.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 2025\/ -type f -name author -exec grep -q jhall {} \\; -print &gt; jhall.list<\/code><\/pre>\n\n\n\n<p>If you haven\u2019t used <strong>find<\/strong> before, it may seem like there\u2019s a lot going on in this command line, so let\u2019s walk through it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>-type f<\/strong> option says to look for <em>files<\/em><\/li>\n\n\n\n<li>Adding the <strong>-name author<\/strong> option says to look for files called <code>author<\/code><\/li>\n\n\n\n<li>The <strong>-exec<\/strong> option tells <strong>find<\/strong> what to do when it matches a file; in this case, it runs <strong>grep<\/strong> with the <strong>-q<\/strong> option, which prints nothing and only reports whether it found a match<\/li>\n\n\n\n<li>The <strong>{}<\/strong> braces are a placeholder for the matching filename<\/li>\n\n\n\n<li>Use <strong>;<\/strong> to terminate the <strong>-exec<\/strong> statement (because this is a special character to Bash, I\u2019ve \u201cescaped\u201d it with a backslash)<\/li>\n\n\n\n<li>The <strong>-print<\/strong> option prints the matching filename; since this comes after an <strong>-exec<\/strong> statement, the filename will only be printed if the <strong>grep<\/strong> command succeeds<\/li>\n<\/ul>\n\n\n\n<p>After this command, the <code>jhall.list<\/code> file has a list of entries; each is a separate path to an <code>author<\/code> file that has my username in it. 
I can use the <strong>wc<\/strong> command with the <strong>-l<\/strong> option to count the <em>lines<\/em> in this file, to see that I wrote 34 articles:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -l jhall.list \n34 jhall.list<\/code><\/pre>\n\n\n\n<p>I can run the same command with <strong>grep -v<\/strong> to \u201cinvert\u201d the search, so that <strong>grep<\/strong> succeeds when an <code>author<\/code> file has a line that <em>does not<\/em> contain my username. Because <strong>grep -v<\/strong> works line by line, an article that I co-wrote with someone else would appear in both lists. With that caveat, these are articles that <em>others wrote<\/em>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 2025\/ -type f -name author -exec grep -q -v jhall {} \\; -print &gt; others.list\n$ wc -l others.list\n43 others.list<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-many-words\">How many words<\/h2>\n\n\n\n<p>I\u2019m also curious to see <em>how much I wrote<\/em>. For that, I need to examine the <code>content.html<\/code> file for each article. Counting words in this file gives a number that is <em>close<\/em> to the article\u2019s actual word count, although not exact, because the HTML markup is counted too. For articles with paragraphs and simple formatting, the difference is small, and for my needs, this is <em>close enough<\/em>.<\/p>\n\n\n\n<p>To count the words <em>that I wrote<\/em>, I need to run the <strong>wc<\/strong> command for every article written by me. I don\u2019t have that list of article content, but I can get it by editing the list I already have.<\/p>\n\n\n\n<p>The body text for each article is stored in the <code>content.html<\/code> file. The <code>jhall.list<\/code> file contains a list of paths to the <code>author<\/code> files, for articles that I wrote. For example, this might be <code>2025\/07\/01\/article\/author<\/code> for an article published on July 1. If we replace the word <code>author<\/code> with <code>content.html<\/code>, we will end up with a list of the HTML content files. 
The <strong>sed<\/strong> command can make that replacement for us, using the <strong>s<\/strong> edit instruction to replace or \u201cswap\u201d the string <code>author<\/code> at the end of a line (that is what the <strong>$<\/strong> means) with <code>content.html<\/code>, for each line in the file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ sed -e 's\/author$\/content.html\/' jhall.list  &gt; jhall_content.list\n$ sed -e 's\/author$\/content.html\/' others.list  &gt; others_content.list<\/code><\/pre>\n\n\n\n<p>To process each <code>content.html<\/code> file with the <strong>wc<\/strong> command, I can run <strong>wc<\/strong> against the list of files. But for a very long list, this might \u201coverload\u201d the command line with too many files. Instead, use the <strong>xargs<\/strong> command to read the filenames from the list and pass them to <strong>wc<\/strong> as arguments:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ xargs wc -w --total=only &lt; jhall_content.list\n40611\n$ xargs wc -w --total=only &lt; others_content.list\n29474<\/code><\/pre>\n\n\n\n<p>The <strong>--total=only<\/strong> option is a GNU <strong>wc<\/strong> extension to only print the total, and nothing else. Without it, <strong>wc<\/strong> would also print the word count for each file in the list.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"using-the-command-line\">Using the command line<\/h2>\n\n\n\n<p>With just the <strong>find<\/strong> and <strong>wc<\/strong> commands, I can see that we ran 77 articles on that website. I wrote 34 of the articles, or just under half; other contributors wrote 43 articles. And by adding the <strong>sed<\/strong> and <strong>xargs<\/strong> commands, I can see that I wrote a total of over 40,000 words in 34 articles, while others wrote a total of over 29,000 words across 43 articles. 
The word count in my articles makes sense because many of my articles included source code samples, and the source code will get included in the word count.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here\u2019s a practical example of how I use the command line to tally how much I wrote this year.<\/p>\n","protected":false},"author":33,"featured_media":3293,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-13044","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"Jim Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/13044","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13044"}],"version-history":[{"count":2,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/13044\/revisions"}],"predecessor-version":[{"id":13046,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/13044\/revisions\/13046"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3293"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13044"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13044"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/ind
ex.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13044"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}