{"id":9089,"date":"2025-01-02T06:16:08","date_gmt":"2025-01-02T11:16:08","guid":{"rendered":"https:\/\/www.both.org\/?p=9089"},"modified":"2025-01-02T06:16:08","modified_gmt":"2025-01-02T11:16:08","slug":"processing-files-with-find-and-xargs","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=9089","title":{"rendered":"Processing files with &#8216;find&#8217; and &#8216;xargs&#8217;"},"content":{"rendered":"<p>I manage several websites, including <a href=\"https:\/\/technicallywewrite.com\/\">Technically We Write<\/a> and <a href=\"https:\/\/coachingbuttons.com\/\">Coaching Buttons<\/a>, and recently I wanted to see how much I had written over the last year. I wanted more than just a count of articles; I was also curious to know how many words I had written for each article, and in total.<\/p>\n\n\n\n<p>I manage these websites using a static website generator, which means the website content is saved in plain text files. That makes the website very fast, but it also lets me use standard Linux commands to examine the content, including the word count. Here&#8217;s how I counted words for the articles I wrote in 2024:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Finding my articles<\/h2>\n\n\n\n<p>The website content is stored in a directory for the year, such as <strong>2024<\/strong> for all articles published during 2024. 
Every article is saved in its own directory, which also contains some plain text metadata; one file is called <strong>author<\/strong> and contains the author&#8217;s username. My username is <strong>jhall<\/strong>.<\/p>\n\n\n\n<p>With this information, I could look for all files called <strong>author<\/strong> under the <strong>2024<\/strong> directory that contained the text <strong>jhall<\/strong>, using this <strong>find<\/strong> command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 2024 -type f -name author -exec grep -q 'jhall' {} \\; -print<\/code><\/pre>\n\n\n\n<p>The <strong>find<\/strong> command searches one or more starting points, usually a directory tree, for files and directories that match a set of tests. In this command, <strong>find<\/strong> looks for all regular files (<code>-type f<\/code>) with a specific name (<code>-name author<\/code>). For each matching file, <strong>find<\/strong> runs the <strong>grep<\/strong> command to look for text in the file (<code>-exec grep -q 'jhall' {} \\;<\/code>).<\/p>\n\n\n\n<p>The <strong>grep<\/strong> command runs silently (<code>-q<\/code>) and returns a &#8220;success&#8221; value if it finds <strong>jhall<\/strong> in the file. Note that <strong>find<\/strong> uses <code>{}<\/code> as a placeholder for whatever filename matches the earlier pattern, and the command executed by the <code>-exec<\/code> action must end with a semicolon, which I&#8217;ve protected from Bash interpretation by using a backslash (<code>\\;<\/code>).<\/p>\n\n\n\n<p>Putting everything together, I can use <strong>find<\/strong> to locate every <strong>author<\/strong> file that contains <strong>jhall<\/strong> and print its filename (the <code>-print<\/code> action). Because <strong>find<\/strong> evaluates its expression from left to right, <code>-print<\/code> runs only for files where <strong>grep<\/strong> succeeded. 
With this <strong>find<\/strong> command, and the <strong>wc<\/strong> command to count lines in the output, I can see that I wrote a little under half of the articles for Technically We Write in 2024:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cd technicallywewrite\n\n$ find 2024 -type f -name author -exec grep -q 'jhall' {} \\; -print | wc -l\n47\n\n$ find 2024 -type f -name author -print | wc -l\n104<\/code><\/pre>\n\n\n\n<p>Running the same commands for the other website shows that I wrote about a third of the articles for Coaching Buttons:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cd coachingbuttons\n\n$ find 2024 -type f -name author -exec grep -q 'jhall' {} \\; -print | wc -l\n24\n\n$ find 2024 -type f -name author -print | wc -l\n76<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Counting my words<\/h2>\n\n\n\n<p>For each matching <strong>author<\/strong> file, I wanted to count the words in the article. Each article&#8217;s main content is stored in an HTML file called <strong>content.html<\/strong>, saved in the same directory as the <strong>author<\/strong> metadata file. 
To get a list of all content files, I first generate a list of the matching <strong>author<\/strong> files, then replace the <strong>author<\/strong> text with <strong>content.html<\/strong> on each line of the list.<\/p>\n\n\n\n<p>For example, running this command in the Technically We Write website prints all <strong>author<\/strong> files that contain my username, then uses <strong>sed<\/strong> to change <strong>author<\/strong> (but only at the end of a line) to <strong>content.html<\/strong>, before saving the list in a plain text file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 2024 -type f -name author -exec grep -q 'jhall' {} \\; -print | sed -e 's\/author$\/content.html\/' > ~\/tww.list<\/code><\/pre>\n\n\n\n<p>I can also run the same command from the Coaching Buttons website:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 2024 -type f -name author -exec grep -q 'jhall' {} \\; -print | sed -e 's\/author$\/content.html\/' > ~\/cb.list<\/code><\/pre>\n\n\n\n<p>To count the total words, I only need to run the <strong>wc<\/strong> command against each file in the list. One way to do that is with Bash command substitution, <code>$()<\/code>, which inserts the contents of the list of filenames as arguments to the <strong>wc<\/strong> command. This shows that I wrote almost 59,600 words in 47 articles for Technically We Write in 2024:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -l &lt; ~\/tww.list \n47\n\n$ wc -w $(cat ~\/tww.list) | tail -1\n 59592 total<\/code><\/pre>\n\n\n\n<p>But running a command with a long list of files (especially where each file might have a long path) can exceed the system&#8217;s limit on the length of a command line. To avoid this, the more typical way to run a command with a list from a file is the <strong>xargs<\/strong> command. This runs a command as though you specified each filename on the command line. 
If the command line gets too long, <strong>xargs<\/strong> automatically breaks up the list and runs the command multiple times.<\/p>\n\n\n\n<p>To accommodate possibly running <strong>wc<\/strong> more than once (which will result in multiple <strong>total<\/strong> output lines), I&#8217;ll add the <code>--total=never<\/code> command line option (available in GNU <strong>wc<\/strong> from coreutils 9.1 onward) to suppress the total, and pass the output through <strong>gawk<\/strong> to print the sum of the word counts:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -l &lt; ~\/tww.list \n47\n\n$ xargs wc -w --total=never &lt; ~\/tww.list | gawk '{tot += $1} END {print tot}'\n59592<\/code><\/pre>\n\n\n\n<p>Running the same commands from the Coaching Buttons website shows that I wrote over 22,000 words in 24 articles during 2024:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ wc -l &lt; ~\/cb.list \n24\n\n$ xargs wc -w --total=never &lt; ~\/cb.list | gawk '{tot += $1} END {print tot}'\n22047<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Processing files with &#8216;find&#8217; and &#8216;xargs&#8217;<\/h2>\n\n\n\n<p>A core tenet of the Linux Philosophy is to store everything in <a href=\"https:\/\/www.both.org\/?p=8003\">plain text files<\/a>. This makes it easy to work with them using the Linux command line, which provides a ton of useful utilities to process text. Two powerful commands that I used here are <strong>find<\/strong> to locate matching files and directories and print the results as a list, and the <strong>xargs<\/strong> command to run a command against a list of files.<\/p>\n\n\n\n<p>If you look at how I&#8217;ve written my commands, you can see this in action. I used <strong>find<\/strong> to match files that contained my username, and saved the results to a file. Then I used that list with <strong>xargs<\/strong> to count the words in all the files, and print the result. 
This is made possible because each command <em>does one thing<\/em> and <em>operates on plain text<\/em>, making the overall process a series of small steps.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s how I used &#8216;find&#8217; and &#8216;xargs&#8217; to locate the articles I wrote and count the words.<\/p>\n","protected":false},"author":33,"featured_media":2818,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-9089","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"David Both","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9089","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9089"}],"version-history":[{"count":2,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9089\/revisions"}],"predecessor-version":[{"id":9091,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9089\/revisions\/9091"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/2818"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9089"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9089"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www
.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9089"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}