{"id":10906,"date":"2025-06-16T02:00:00","date_gmt":"2025-06-16T06:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=10906"},"modified":"2025-06-10T11:14:46","modified_gmt":"2025-06-10T15:14:46","slug":"i-need-a-list-of-urls","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=10906","title":{"rendered":"I need a list of URLs"},"content":{"rendered":"<p>I teach a few university courses, and a former student recently emailed me to ask if she could get a copy of all the links from our &#8220;course links&#8221; page. She forgot to save the links during the Spring semester, and now that the course has ended, she can&#8217;t access the course to get to the links.<\/p>\n\n\n\n<p>My course had a lot of online readings, so this was not a small request. Fortunately, I was able to solve it by typing a few Unix commands. Here&#8217;s how I did it:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"break-up-the-text-into-words\">Break up the text into words<\/h2>\n\n\n\n<p>The first step was pretty easy: save a copy of the &#8220;course links&#8221; page as an HTML file, which I stored temporarily as <code>\/tmp\/f<\/code>. This gave me the content for the page, which included weekly reminders, topic overviews, and links to online resources (the &#8220;weekly readings&#8221;).<\/p>\n\n\n\n<p>I only wanted to get a list of the URLs in this page. 
In HTML, each link is contained in the <code>&lt;a&gt;<\/code> &#8220;anchor&#8221; tag, with a hypertext reference (<code>href<\/code>) attribute that holds the URL, so the link actually looks like:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;a href=\"https:\/\/example.com\/\"&gt;link text&lt;\/a&gt;<\/code><\/pre>\n\n\n\n<p>Our course links were a bit more complicated, because Canvas adds some other information to links, so each link actually looked like:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;a class=\"inline_disabled\" href=\"https:\/\/example.com\/\" target=\"_blank\" rel=\"noopener\"&gt;link text&lt;\/a&gt;<\/code><\/pre>\n\n\n\n<p>I only needed the <code>href<\/code> part for each <code>&lt;a&gt;<\/code> tag, because that has the URL for each link. One way to do this would be to process the HTML with a one-off program to print just the tags (anything that starts with <code>&lt;<\/code> and ends with <code>&gt;<\/code>). But I realized there was an easier way: break up the text into words by turning each space character into a newline character. Then I would have each &#8220;word&#8221; on a line by itself. More importantly, each link would become a series of separate lines, like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;a\nclass=\"inline_disabled\"\nhref=\"https:\/\/example.com\/\"\ntarget=\"_blank\"\nrel=\"noopener\"&gt;link\ntext&lt;\/a&gt;<\/code><\/pre>\n\n\n\n<p>This task calls for the <strong>tr<\/strong> command, which <em>translates<\/em> one character into another character:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ tr ' ' '\\n' &lt; \/tmp\/f<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"find-the-links\">Find the links<\/h2>\n\n\n\n<p>With every word from the HTML file on a separate line, I can search for any line that starts with <code>href=<\/code>. These should be the URLs for each link in the page. 
In theory, this might catch some other text in the file, but it&#8217;s unlikely because I wrote the page and I <em>know<\/em> I don&#8217;t have any other text that would contain <code>href=<\/code>.<\/p>\n\n\n\n<p>The <strong>grep<\/strong> command is made for this task. It searches its input and prints only the lines that match a pattern, which can be as simple as a plain string. For example, this will print all lines that have <code>href=<\/code> at the start of a line:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ tr ' ' '\\n' &lt; \/tmp\/f | grep '^href='<\/code><\/pre>\n\n\n\n<p>However, that will return lines that contain not just the URL, but also the <code>href=<\/code> and the quotes around the URL. That means each URL will be printed like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>href=\"https:\/\/example.com\/\"\n<\/code><\/pre>\n\n\n\n<p>I wanted the list to look nice for my former student, which meant I wanted to remove the unneeded text.<\/p>\n\n\n\n<p>The <strong>awk<\/strong> command is a simple but flexible scripting language that makes it easy to pull apart a line into <em>fields<\/em> that you can act upon. Normally, the field <em>separator<\/em> is any kind of white space, like a tab or space character. But you can also specify your own field separator with the <code>-F<\/code> option.<\/p>\n\n\n\n<p>If I assume the quote mark is a field separator, then the line splits into three fields: <code>href=<\/code>, <code>https:\/\/example.com\/<\/code>, and an empty field at the end.<\/p>\n\n\n\n<p>That was my solution: for any line that starts with <code>href=<\/code>, use the quote mark as the field separator, and print the second field:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ tr ' ' '\\n' &lt; \/tmp\/f | awk -F\\\" '\/^href=\/ {print $2}'<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"print-only-unique-urls\">Print only unique URLs<\/h2>\n\n\n\n<p>To make the list easy to read, I decided to sort the output for my student. 
To do that, you can use the <strong>sort<\/strong> command, which does what the name suggests:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ tr ' ' '\\n' &lt; \/tmp\/f | awk -F\\\" '\/^href=\/ {print $2}' | sort<\/code><\/pre>\n\n\n\n<p>Sorting also revealed another issue: sometimes, I listed the same link in different weeks on the &#8220;course links&#8221; page. I wanted to make the list as short as possible, which meant I only wanted to print <em>one instance<\/em> of each unique link. The last step was to pass the output through the <strong>uniq<\/strong> command, which examines a sorted list and omits repeated lines, giving a list of unique lines:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ tr ' ' '\\n' &lt; \/tmp\/f | awk -F\\\" '\/^href=\/ {print $2}' | sort | uniq<\/code><\/pre>\n\n\n\n<p>This gave the list of URLs that I could share with my former student. This was quite a long list, but as I said at the beginning: <em>my course had a lot of online readings<\/em>. We can see the list has 65 URLs by adding the <strong>wc<\/strong> command, which counts characters, words, and lines; the <code>-l<\/code> option prints only the line count:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ tr ' ' '\\n' &lt; \/tmp\/f | awk -F\\\" '\/^href=\/ {print $2}' | sort | uniq | wc -l\n65<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"just-a-few-commands\">Just a few commands<\/h2>\n\n\n\n<p>This task is a tough one if you try to do it the &#8220;manual&#8221; way, by reading through the page, then right-clicking on each link, selecting &#8220;Copy&#8221; from the pop-up menu, and pasting that URL into an email. For 65 links, that&#8217;s going to take a long time. Let&#8217;s make the math easy and say it takes ten seconds to <em>find a link<\/em>, evaluate if I&#8217;ve copied the link before, then <em>copy<\/em> and <em>paste<\/em> the link into an email. 
That&#8217;s 650 seconds to copy all of the links on the page, or over ten minutes.<\/p>\n\n\n\n<p>But by starting with the tenet that <a href=\"https:\/\/www.both.org\/?p=8003\">everything should be in an open format<\/a> like HTML, and <a href=\"https:\/\/www.both.org\/?p=7030\">embracing the command line<\/a> by typing a few standard Unix commands, this becomes an easy task: <strong>tr<\/strong> splits lines into words, which I can then process with <strong>awk<\/strong> to print URLs, then sort them with <strong>sort<\/strong>, and strip out repeats with <strong>uniq<\/strong>. I used the command line to do in <em>seconds<\/em> what would have needed more than ten minutes to do by hand.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I used the command line to do in seconds what would have needed more than ten minutes to do by hand.<\/p>\n","protected":false},"author":33,"featured_media":3293,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[100,5],"tags":[104,91],"class_list":["post-10906","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","tag-command-line","tag-linux"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10906","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=10906"}],"version-history":[{"count":1,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10906\/revisions"}],"predecessor-version":[{"id":10907,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10906\/revisions\/10907"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3293"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=10906"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=10906"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=10906"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}