{"id":7565,"date":"2024-09-15T03:00:00","date_gmt":"2024-09-15T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=7565"},"modified":"2024-09-10T20:31:13","modified_gmt":"2024-09-11T00:31:13","slug":"searching-text-files-from-the-command-line","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=7565","title":{"rendered":"Searching text files from the command line"},"content":{"rendered":"<div class=\"pld-like-dislike-wrap pld-template-1\">\r\n    <div class=\"pld-like-wrap  pld-common-wrap\">\r\n    <a href=\"javascript:void(0)\" class=\"pld-like-trigger pld-like-dislike-trigger  \" title=\"\" data-post-id=\"7565\" data-trigger-type=\"like\" data-restriction=\"cookie\" data-already-liked=\"0\">\r\n                        <i class=\"fas fa-thumbs-up\"><\/i>\r\n                <\/a>\r\n    <span class=\"pld-like-count-wrap pld-count-wrap\">    <\/span>\r\n<\/div><\/div>\n<p class=\"wp-block-paragraph\">I like to write articles about a variety of topics, and I am a regular contributor to several websites. I also manage a few websites; one is about IT leadership, another is about open source software. In these websites, the web content management system is <em>file-based<\/em>, which means it stores articles and some metadata as files under a special <em>data area<\/em>. The files are combined into web pages when a reader views an article. Storing the content as files has several advantages, including the ability to view the files directly on the server, and to use standard Linux tools to examine them.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And recently, I did exactly that. I was asked to contribute an article from one of the websites as a \u201cchapter\u201d in a book. We have a lot of articles on the website, but I wanted to examine only the articles <em>written by me<\/em>, and to identify the articles that had the longest word count. If I could generate the list of my longest articles <em>as a list of URLs<\/em>, I could click on each to read it; then I could decide which article to contribute to the book. Here\u2019s how I did that.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-the-files-are-stored\">How the files are stored<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <em>data area<\/em> of the content management system is organized more or less in parallel to how users access the website\u2014although it\u2019s a different area that\u2019s not directly accessible on the web. Articles are sorted by date and topic, each as directories.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the \u201ctopic\u201d directory, the content management system keeps a few metadata files, including a file called <code>author<\/code> that lists who wrote the article. The article content itself is saved in a file called <code>content.html<\/code>. That means an article called \u201cleadership\u201d that was published on August 1, 2022 will be stored in a path like <code>2022\/08\/01\/leadership<\/code> and will have a file called <code>contents.html<\/code> and another called <code>author<\/code>, plus other metadata files.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"finding-files\">Finding files<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For my example, I wanted to generate a list of URLs to the longest articles I had written on the website. That meant I needed to search the <code>author<\/code> for my email address, and then count the words in the accompanying <code>content.html<\/code> file.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Searching for a list of all <code>author<\/code> files is pretty straightforward using the <strong>find<\/strong> command. This is an extremely flexible tool to identify files in your filesystem that match certain rules. The basic usage is:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find {path} {rules} {actions}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>find<\/strong> command has a bunch of rules you can apply, but I only needed a few of them. The <code>-type f<\/code> rule will match regular files, while the <code>-name author<\/code> rule will match an entry called <code>author<\/code>. Use both together to find regular files named <code>author<\/code>. The default action is to print the entries. For example, to simply list the files in the <em>data area<\/em> named <code>author<\/code>, I might run this command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find data\/ -type f -name author<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"counting-words\">Counting words<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">However, I wanted to take a second action: I needed to count the words in the <code>contents.html<\/code> file. I can\u2019t do that directly using the <strong>find<\/strong> command, but I figured the easiest way around that was to write a short Bash script. This script accepted the name of an <code>author<\/code> file, and searched it for my email address. If that matched, then it used the path to the <code>author<\/code> file to find the <code>contents.html<\/code> file, and used <strong>wc<\/strong> to count the words. I saved this short script as <code>wc.bash<\/code> in my home directory:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\nd=$(dirname $1)\ngrep -q jhall $1\nif &#91; $? -eq 0 ] ; then\n    words=$(cat $d\/content.html | wc -w)\n    echo $words https:\/\/www.example.com\/$d\nfi<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>dirname<\/strong> command prints the <em>path<\/em> to a file, which allows the script to locate the matching <code>content.html<\/code> file that goes with the <code>author<\/code> file.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I used <strong>wc<\/strong> indirectly, using <strong>cat<\/strong>, because I wanted to print a URL to the article on the website. That required a separate <strong>echo<\/strong> command to print the word count plus the URL.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With this script, I was able to update my <strong>find<\/strong> command with a new action: for each real file named <code>author<\/code>, execute the <code>wc.bash<\/code> script with the <code>author<\/code> file. Note that the <code>-exec<\/code> action uses <code>{}<\/code> as a placeholder for the matching entry, and requires <code>;<\/code> to terminate the action. Because <code>;<\/code> is also a special shell character (to separate commands on a single command line) I needed to protect it by writing it as <code>\\;<\/code> instead.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 202&#91;34] -type f -name author -exec bash $HOME\/wc.bash {} \\; | sort -nr | head -5<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This <strong>find<\/strong> command line looks in both the 2023 and 2024 directories for any real files called <code>author<\/code>, then runs the <code>wc.bash<\/code> script. The script checks the <code>author<\/code> file for my email address; if I\u2019m the author, the script then uses <strong>wc<\/strong> to count the words in <code>content.html<\/code> and prints the result as a URL.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The output of the <strong>find<\/strong> actions then get sent to the <strong>sort<\/strong> command to sort by numbers (<code>-n<\/code>) in <em>reverse order<\/em> (<code>-r<\/code>), and print only the first 5 results. These are the 5 longest articles written by me, with the word count and a URL to the article:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ find 202&#91;34] -type f -name author -exec bash $HOME\/wc.bash {} \\; | sort -nr | head -5\n1985 https:\/\/www.example.com\/2024\/02\/12\/workplans\n1712 https:\/\/www.example.com\/2024\/05\/20\/resume\n1706 https:\/\/www.example.com\/2024\/06\/17\/interview\n1668 https:\/\/www.example.com\/2024\/06\/10\/digitaltwins\n1461 https:\/\/www.example.com\/2024\/02\/05\/smallbusiness<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"searching-text-files\">Searching text files<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>find<\/strong> command is a powerful and flexible tool to locate files under a path. You can use it just to print the matching filenames, or use <code>-exec<\/code> like I did to perform some secondary actions. Every Unix-like system should include <strong>find<\/strong>, as it\u2019s a core part of Unix since 1st Ed (November \u201971). Learn more about <strong>find<\/strong> in the online manual page, in section 1:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ man 1 find<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The find command is a powerful and flexible tool to locate files under a path.<\/p>\n","protected":false},"author":33,"featured_media":2973,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":"","_members_access_role":[],"_members_access_error":""},"categories":[100,5],"tags":[104,564,91],"class_list":["post-7565","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-find","tag-linux"],"modified_by":"David Both","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/7565","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7565"}],"version-history":[{"count":2,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/7565\/revisions"}],"predecessor-version":[{"id":7567,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/7565\/revisions\/7567"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/2973"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7565"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7565"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7565"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}