{"id":10802,"date":"2025-05-30T03:00:00","date_gmt":"2025-05-30T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=10802"},"modified":"2025-05-28T20:47:42","modified_gmt":"2025-05-29T00:47:42","slug":"extracting-text-with-awk","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=10802","title":{"rendered":"Extracting text with awk"},"content":{"rendered":"<p class=\"wp-block-paragraph\">The <strong>awk<\/strong> script interpreter is a very handy tool for systems administrators, and anyone else who uses Linux at the command line. With awk, you can solve a tricky problem with a 1-line script.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Recently, I needed to extract content from an HTML document. The details of the HTML page are not important, only that I wanted to get just the <em>body text<\/em> from the document. 
In HTML, the <em>body<\/em> is defined by the <code>&lt;body&gt;<\/code> and <code>&lt;\/body&gt;<\/code> tags, in a larger document that might look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;!DOCTYPE html&gt;\n&lt;html lang=\"en\"&gt;\n  &lt;head&gt;\n    &lt;title&gt;...&lt;\/title&gt;\n    ...\n  &lt;\/head&gt;\n  &lt;body&gt;\n\n  ...\n\n  &lt;\/body&gt;\n&lt;\/html&gt;<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-sample-html-file\">A sample HTML file<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s say I have a very simple HTML document that contains only a few lines of text in the body. I can simulate this with the <strong>pandoc<\/strong> command, which generates over 170 lines of text (mostly stylesheet information):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | wc -l\n177<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">But the <code>&lt;body&gt;<\/code> section is at the end of that long file. 
We can preview it by sending the output through the <strong>tail<\/strong> command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | tail\n    .display.math{display: block; text-align: center; margin: 0.5rem auto;}\n  &lt;\/style&gt;\n&lt;\/head&gt;\n&lt;body&gt;\n&lt;header id=\"title-block-header\"&gt;\n&lt;h1 class=\"title\"&gt;Hello there&lt;\/h1&gt;\n&lt;\/header&gt;\n&lt;p&gt;Hello there!&lt;\/p&gt;\n&lt;\/body&gt;\n&lt;\/html&gt;<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"extracting-the-text\">Extracting the text<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In my case, I wanted to print only the lines <em>between<\/em> the <code>&lt;body&gt;<\/code> and <code>&lt;\/body&gt;<\/code> tags. I decided to write a short awk script to do this. To write it, I applied a simple pattern-and-action pairing:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>When awk finds <code>&lt;body&gt;<\/code> or <code>&lt;\/body&gt;<\/code>, increment a counter variable called <code>body<\/code>.<\/li>\n\n\n\n<li>Because awk treats an uninitialized variable as zero, <code>body<\/code> is zero for any content before <code>&lt;body&gt;<\/code>.<\/li>\n\n\n\n<li><code>body<\/code> has the value <strong>1<\/strong> for content between <code>&lt;body&gt;<\/code> and <code>&lt;\/body&gt;<\/code>.<\/li>\n\n\n\n<li><code>body<\/code> has the value <strong>2<\/strong> for any content after <code>&lt;\/body&gt;<\/code>.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">To print the content between <code>&lt;body&gt;<\/code> and <code>&lt;\/body&gt;<\/code>, the awk script needs to print lines when the variable <code>body<\/code> is equal to <strong>1<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ echo 'Hello there!' 
| pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk '\/&lt;\\\/?body&gt;\/ {body++} body==1 {print}'\n&lt;body&gt;\n&lt;header id=\"title-block-header\"&gt;\n&lt;h1 class=\"title\"&gt;Hello there&lt;\/h1&gt;\n&lt;\/header&gt;\n&lt;p&gt;Hello there!&lt;\/p&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">For what I needed to do, that was good enough. I didn\u2019t mind that the script did not print the trailing <code>&lt;\/body&gt;<\/code> tag: awk runs the rules in order for each line, so <code>body<\/code> is already <strong>2<\/strong> by the time the <code>print<\/code> rule tests the <code>&lt;\/body&gt;<\/code> line. Unfortunately, there\u2019s no simple way to modify <em>this<\/em> version of the script to either print <em>both<\/em> tags, or print <em>no<\/em> tags. If we swap the <code>body++<\/code> and <code>print<\/code> rules, we won\u2019t print the <code>&lt;body&gt;<\/code> tag but will print <code>&lt;\/body&gt;<\/code> at the end:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ echo 'Hello there!' | pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk 'body==1 {print} \/&lt;\\\/?body&gt;\/ {body++}'\n&lt;header id=\"title-block-header\"&gt;\n&lt;h1 class=\"title\"&gt;Hello there&lt;\/h1&gt;\n&lt;\/header&gt;\n&lt;p&gt;Hello there!&lt;\/p&gt;\n&lt;\/body&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you don\u2019t want to include the <code>&lt;body&gt;<\/code> and <code>&lt;\/body&gt;<\/code> tags, you\u2019ll need to modify this awk script to detect the opening and closing tags separately. Let\u2019s modify this script to set <code>body<\/code> to <strong>1<\/strong> when it finds the <code>&lt;body&gt;<\/code> line, and back to zero when it finds the <code>&lt;\/body&gt;<\/code> line. Placing the <code>print<\/code> rule after the reset and before the set keeps both tags out of the output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ echo 'Hello there!' 
| pandoc --from markdown --to html --standalone --metadata title='Hello there' | awk '\/&lt;\\\/body&gt;\/ {body=0} body==1 {print} \/&lt;body&gt;\/ {body=1}'\n&lt;header id=\"title-block-header\"&gt;\n&lt;h1 class=\"title\"&gt;Hello there&lt;\/h1&gt;\n&lt;\/header&gt;\n&lt;p&gt;Hello there!&lt;\/p&gt;<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"awk-is-a-versatile-tool\">Awk is a versatile tool<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This is just one way to use awk to extract text from a document. I find that awk is a versatile tool that I can apply to most problems that involve text. If I need to print or manipulate text based on a pattern, awk is usually the right tool for the job.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The awk script interpreter is a very handy tool for systems administrators, and anyone else who uses Linux<\/p>\n","protected":false},"author":33,"featured_media":3579,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-10802","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10802","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=10802"}],"version-history":[{"count":1,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10802\/revisions"}],"predecessor-version":[{"id":10803,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10802\/revisions\/10803"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3579"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=10802"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=10802"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=10802"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}