
I need a list of URLs
I teach a few university courses, and a former student recently emailed me to ask if she could get a copy of all the links from our “course links” page. She forgot to save the links during the Spring semester, and now that the course has ended, she can’t access the course to get to the links.
My course had a lot of online readings, so this was not a small request. Fortunately, I was able to solve it by typing a few Unix commands. Here’s how I did it:
Break up the text into words
The first step was pretty easy: save a copy of the “course links” page as an HTML file, which I stored in a temporary file called /tmp/f. This gave me the content of the page, which included weekly reminders, topic overviews, and links to online resources (the “weekly readings”).
I only wanted a list of the URLs in this page. In HTML, each link is contained in an <a> “anchor” tag with an href (“hypertext reference”) attribute, so a link actually looks like this:
<a href="https://example.com/">link text</a>
Our course links were a bit more complicated, because Canvas adds some other information to links, so each link actually looked like:
<a class="inline_disabled" href="https://example.com/" target="_blank" rel="noopener">link text</a>
I only needed the href part of each <a> tag, because that holds the URL for each link. One way to do this would be to process the HTML with a one-off program that prints just the tags (anything that starts with < and ends with >). But I realized there was an easier way: break up the text into words by turning each space character into a newline character. Then I would have each “word” on a line by itself. More importantly, each link would become a series of separate lines, like this:
<a
class="inline_disabled"
href="https://example.com/"
target="_blank"
rel="noopener">link
text</a>
This task calls for the tr command, which translates one character into another character:
$ tr ' ' '\n' < /tmp/f
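If you haven’t used tr before, you can see what this translation does by trying it on a sample string (the echo text here is just an illustration):
$ echo '<a href="https://example.com/">link text</a>' | tr ' ' '\n'
<a
href="https://example.com/">link
text</a>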
Find the links
With every word from the HTML file on a separate line, I could search for any line that starts with href=. Those lines should hold the URLs for each link in the page. In theory, this might catch some other text in the file, but it’s unlikely, because I wrote the page and I know I don’t have any other text that would contain href=.
The grep command is made for this task. It searches its input and prints only the lines that match a pattern, which can be as simple as a plain string. For example, this command prints every line that has href= at the start of the line:
$ tr ' ' '\n' < /tmp/f | grep '^href='
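If you want to check how the ^ anchor works, you can run grep against a couple of sample lines (the printf text here is just for illustration); only the line that begins with href= gets printed:
$ printf 'href="https://example.com/"\nrel="noopener">link\n' | grep '^href='
href="https://example.com/"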
However, that will return lines that contain not just the URL but also the href= prefix and the quotes around it. That means each URL will be printed like this:
href="https://example.com/"
I wanted the list to look nice for my former student, which meant I wanted to remove the unneeded text.
The awk command runs a simple but flexible scripting language that makes it easy to pull apart a line into fields you can act on. Normally, the field separator is any kind of white space, like a tab or a space character, but you can specify your own field separator with the -F option.
If I use the quote mark as the field separator, then a line like href="https://example.com/" has three fields: href= before the first quote, https://example.com/ between the quotes, and an empty field at the end.
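You can see this field splitting in action by running awk on a single sample line (again, the echo text is just an illustration); with the quote mark as the separator, the second field is the bare URL:
$ echo 'href="https://example.com/"' | awk -F\" '{print $2}'
https://example.com/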
That was my solution: for any line that starts with href=, use the quote mark as the field separator and print the second field. And since awk can match the pattern itself, I no longer needed the separate grep step:
$ tr ' ' '\n' < /tmp/f | awk -F\" '/^href=/ {print $2}'
Print only unique URLs
To make the list easy to read, I decided to sort the output for my student. To do that, you can use the sort command, which does as the name suggests:
$ tr ' ' '\n' < /tmp/f | awk -F\" '/^href=/ {print $2}' | sort
Sorting the list also revealed another issue: I sometimes listed the same link in different weeks on the “course links” page. I wanted to make the list as short as possible, which meant printing only one instance of each link. The last step was to pass the output through the uniq command, which examines a sorted list and omits repeated lines, leaving only the unique lines:
$ tr ' ' '\n' < /tmp/f | awk -F\" '/^href=/ {print $2}' | sort | uniq
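As a small aside, sort also has a -u option that prints only unique lines, so you could fold the uniq step into sort if you prefer a slightly shorter pipeline:
$ tr ' ' '\n' < /tmp/f | awk -F\" '/^href=/ {print $2}' | sort -u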
This gave me the list of URLs to share with my former student. It was quite a long list, but as I said at the beginning, my course had a lot of online readings. We can see that the list has 65 URLs by adding the wc command, which counts lines, words, and characters; the -l option prints only the line count:
$ tr ' ' '\n' < /tmp/f | awk -F\" '/^href=/ {print $2}' | sort | uniq | wc -l
65
Just a few commands
This task is a tough one if you try to do it the “manual” way: reading through the page, right-clicking on each link, selecting “Copy” from the pop-up menu, and pasting that URL into an email. For 65 links, that takes a long time. Let’s make the math easy and say it takes ten seconds to find a link, check whether I’ve already copied it, and then copy and paste it into an email. That’s 650 seconds to copy all of the links on the page, or over ten minutes.
But by starting with the tenet that everything should be in an open format like HTML, and by embracing the command line with a few standard Unix commands, this becomes an easy task: tr breaks the text into one word per line, awk picks out and prints the URLs, sort puts them in order, and uniq strips out the repeats. I used the command line to do in seconds what would have taken more than ten minutes to do by hand.