
Straight quotes from pandoc
I like to write the first draft of anything using Markdown. For example, I might write an article (like this one) in Markdown, then later I can convert my Markdown to something else using pandoc.
When writing articles, I usually want to convert my output to HTML so I can easily copy and paste the generated HTML into the web content management system. The standard way to do this is with this pandoc command line:
$ pandoc --from markdown --to html file.md -o file.html
This creates a “bare” version of an HTML document, without the HTML structure like <html> or <body> to make it complete. If I want to generate a technically valid HTML document, I need to add the --standalone (or -s) option to pandoc like this:
$ pandoc -s --from markdown --to html file.md -o file.html
Unfortunately, pandoc writes its output as UTF-8 and, by default, turns any straight quotes or apostrophes in my Markdown input into “curly” quotes or apostrophes in the output. You can see this for yourself if you start with this sample input file, which I’ve saved as spaceballs.md:
Dark Helmet said, "I am your father's brother's nephew's cousin's former roommate."
If you process this into HTML with pandoc, you’ll see “curly” quotes and apostrophes:
$ pandoc --from markdown --to html spaceballs.md
<p>Dark Helmet said, “I am your father’s brother’s nephew’s cousin’s
former roommate.”</p>
If you want to only use plain ASCII for the output, add the --ascii=true option to pandoc. This generates numeric HTML entities in place of the non-ASCII characters:
$ pandoc --ascii=true --from markdown --to html spaceballs.md > spaceballs.htm
$ cat spaceballs.htm
<p>Dark Helmet said, &#8220;I am your father&#8217;s brother&#8217;s nephew&#8217;s cousin&#8217;s
former roommate.&#8221;</p>
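A quick way to confirm that a file really contains only ASCII is to count the bytes that fall outside the ASCII range. The following is a minimal sketch; the helper name count_non_ascii is made up for illustration and is not part of pandoc or any standard tool.

```shell
# count_non_ascii (hypothetical helper): read standard input and print
# the number of bytes that are not 7-bit ASCII.
count_non_ascii() {
    # Delete every ASCII byte, then count whatever is left over.
    LC_ALL=C tr -d '\001-\177' | wc -c
}

# Entity-encoded text is pure ASCII, so this reports 0 leftover bytes.
printf '%s\n' 'father&#8217;s' | count_non_ascii

# A UTF-8 curly apostrophe takes three bytes, so this reports 3.
printf 'father\342\200\231s\n' | count_non_ascii
```

To check a generated file, redirect it into the function, for example count_non_ascii < spaceballs.htm.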
When copying HTML into my web content system, I prefer to use plain old quotes and apostrophes instead of these HTML entities. There isn’t a way (that I know of) to do that directly from pandoc, but this is easily done using a separate tool. The sed command is a standard Unix command, and is available on every Linux system. sed is a stream editor, which means it reads data from a file or standard input, prints to standard output, and edits as it goes. This makes it ideal to use with pandoc to convert the HTML entities to plain quotes.
To use sed, you provide it with one or more edit commands to execute, each introduced by -e. sed supports several editing commands, but the one we need is s, which substitutes text. The s command takes a regular expression pattern that sed should match and the text to replace it with. For example, to change the word true to false in the file input.txt and save the edits as output.txt, you would write:
$ sed -e "s/true/false/g" input.txt > output.txt
The g at the end makes this global, so the edit applies to every matching instance of true on a line rather than only the first.
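To see what the g flag changes, it helps to run the same substitution with and without it on a line that contains the word twice. This is a small sketch; the sample text and variable names are made up for illustration.

```shell
# Made-up sample line that contains "true" twice.
sample='true or not true'

# Without g, sed replaces only the first match on the line.
first_only=$(printf '%s\n' "$sample" | sed -e 's/true/false/')

# With g, sed replaces every match on the line.
every_match=$(printf '%s\n' "$sample" | sed -e 's/true/false/g')

printf '%s\n' "$first_only"     # false or not true
printf '%s\n' "$every_match"    # false or not false
```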
Using this, we can match any instance of &#8217; as a “curly” apostrophe and replace it with a straight apostrophe like this:
$ sed -e "s/\&#8217;/'/g" spaceballs.htm
<p>Dark Helmet said, &#8220;I am your father's brother's nephew's cousin's
former roommate.&#8221;</p>
The backslash before the ampersand in the matching pattern tells sed to look for an actual ampersand in the input. An ampersand has special meaning only in the replacement text, where it stands for the matched text, so in the pattern the backslash is just a precaution.
Use the same process to change the “curly” quotes to straight quotes:
$ sed -e "s/\&#8217;/'/g" -e 's/\&#8220;/"/g' -e 's/\&#8221;/"/g' spaceballs.htm
<p>Dark Helmet said, "I am your father's brother's nephew's cousin's
former roommate."</p>
Each statement uses a different kind of quote to get around the single quote or double quote in its replacement text. The shell normally uses single quotes and double quotes to enclose text that might have spaces in it. Without the quotes around each statement, the single quote in s/\&#8217;/'/g wouldn’t “pair” with another single quote, so the command wouldn’t work. We need to enclose the single quote edit with double quotes, and the double quote edit with single quotes.
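To see what the shell does with the two kinds of quotes, it can help to print the arguments exactly as sed would receive them. This is a small sketch; show_args is a made-up helper, and the entity codes assume the decimal numeric form.

```shell
# show_args (hypothetical helper): print each argument on its own line,
# exactly as the shell passes it along after removing the outer quotes.
show_args() {
    printf '%s\n' "$@"
}

# Double quotes protect the single quote inside the first expression;
# single quotes protect the double quote inside the second one.
show_args -e "s/&#8217;/'/g" -e 's/&#8220;/"/g'
```

The outer quotes are gone by the time the arguments are printed, and each expression arrives as a single argument with its inner quote intact.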
Using sed is a simple but flexible way to edit text. I use it whenever I need to make frequent but simple changes to a text file, such as this example to change “curly” quotes and apostrophes to plain ASCII quotes and apostrophes. To make this easier to use, consider writing a one-line script to combine both pandoc and the sed command:
#!/bin/bash
pandoc --ascii=true --from markdown --to html "$@" | sed -e "s/\&#8217;/'/g" -e 's/\&#8220;/"/g' -e 's/\&#8221;/"/g'
I’ve saved this on my system as markdown-html. Then, whenever I need to convert Markdown to HTML, I just run this script to get my output in plain ASCII:
$ markdown-html spaceballs.md
<p>Dark Helmet said, "I am your father's brother's nephew's cousin's
former roommate."</p>
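If the one-line script grows, one possible next step is to split it into small functions and add a usage check. The following is only a sketch under stated assumptions: the function names entities_to_ascii and markdown_html are made up, the entities are assumed to be decimal numeric codes, and the extra rule for &#8216; (the opening single curly quote) is an addition that the one-line script does not have. The ampersands are left unescaped here, which is fine in a sed pattern.

```shell
#!/bin/bash
# Sketch of a more defensive markdown-html; the function names are made up.

# Replace the numeric entities for curly quotes with plain ASCII quotes.
# Assumes decimal entities; the &#8216; rule is an addition for illustration.
entities_to_ascii() {
    sed -e "s/&#8216;/'/g" -e "s/&#8217;/'/g" \
        -e 's/&#8220;/"/g' -e 's/&#8221;/"/g'
}

# Convert the given Markdown files to HTML with straight quotes.
markdown_html() {
    if [ "$#" -eq 0 ]; then
        echo 'usage: markdown-html file.md [more.md ...]' >&2
        return 2
    fi
    pandoc --ascii=true --from markdown --to html "$@" | entities_to_ascii
}

# Run the conversion only when the script is given files to process.
if [ "$#" -gt 0 ]; then
    markdown_html "$@"
fi
```

Keeping the sed step in its own function makes it easy to try on sample text without running pandoc at all.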