{"id":10877,"date":"2025-06-09T02:00:00","date_gmt":"2025-06-09T06:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=10877"},"modified":"2025-06-04T13:42:33","modified_gmt":"2025-06-04T17:42:33","slug":"straight-quotes-from-pandoc","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=10877","title":{"rendered":"Straight quotes from pandoc"},"content":{"rendered":"<p class=\"wp-block-paragraph\">I like to write the first draft of anything using Markdown. For example, I might write an article (like this one) in Markdown, then later I can convert my Markdown to something else using <strong>pandoc<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When writing articles, I usually want to convert my output to HTML so I can easily copy and paste the generated HTML into the web content management system. The standard way to do this is with this <strong>pandoc<\/strong> command line:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ pandoc --from markdown --to html file.md -o file.html<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This creates a \u201cbare\u201d version of an HTML document, without the HTML structure like <code>&lt;html&gt;<\/code> or <code>&lt;body&gt;<\/code> to make it <em>complete<\/em>. 
If I want to generate a <em>technically valid<\/em> HTML document, I need to add the <code>--standalone<\/code> (or <code>-s<\/code>) option to <strong>pandoc<\/strong> like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ pandoc -s --from markdown --to html file.md -o file.html<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Unfortunately, <strong>pandoc<\/strong> applies \u201csmart\u201d typography when it reads Markdown and writes its output as UTF-8. That means that any straight quotes or apostrophes in my input will be turned into \u201ccurly\u201d quotes or apostrophes in the output. You can see this for yourself if you start with this sample input file, which I\u2019ve saved as <code>spaceballs.md<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Dark Helmet said, \"I am your father's brother's nephew's cousin's former roommate.\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you process this into HTML with <strong>pandoc<\/strong>, you\u2019ll see \u201ccurly\u201d quotes and apostrophes:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ pandoc --from markdown --to html spaceballs.md\n&lt;p&gt;Dark Helmet said, \u201cI am your father\u2019s brother\u2019s nephew\u2019s cousin\u2019s\nformer roommate.\u201d&lt;\/p&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to use only plain ASCII for the output, add the <code>--ascii=true<\/code> option to <strong>pandoc<\/strong>. This writes each non-ASCII character as a numeric HTML entity:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ pandoc --ascii=true --from markdown --to html spaceballs.md &gt; spaceballs.htm\n\n$ cat spaceballs.htm\n&lt;p&gt;Dark Helmet said, &amp;#x201C;I am your father&amp;#x2019;s brother&amp;#x2019;s nephew&amp;#x2019;s cousin&amp;#x2019;s\nformer roommate.&amp;#x201D;&lt;\/p&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">When copying HTML into my web content system, I prefer to use plain old quotes and apostrophes instead of these HTML entities. 
There isn\u2019t a way (that I know of) to do that directly from <strong>pandoc<\/strong>, but this is easily done using a separate tool. The <strong>sed<\/strong> command is a standard Unix command, and is available on every Linux system. <strong>sed<\/strong> is a <em>stream editor<\/em>, which means it reads data from a file or <em>standard input<\/em>, prints to <em>standard output<\/em>, and edits as it goes. This makes it ideal to use with <strong>pandoc<\/strong> to convert the HTML entities to plain quotes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To use <strong>sed<\/strong>, you provide it with one or more edit commands to execute (<code>-e<\/code>). <strong>sed<\/strong> supports several editing commands, but the one we need is <strong>s<\/strong> (substitute) to make simple changes to text. The syntax for the <strong>s<\/strong> command requires you to provide a <em>regular expression<\/em> pattern that <strong>sed<\/strong> should match, and the text to replace it with. For example, to change the word <code>true<\/code> to <code>false<\/code> in the file <code>input.txt<\/code> and save the edits as <code>output.txt<\/code>, you would write:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ sed -e \"s\/true\/false\/g\" input.txt &gt; output.txt<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>g<\/strong> at the end makes this <em>global<\/em> so that the edit statement replaces every matching instance of <code>true<\/code> on a line, not just the first one.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using this, we can match any instance of <code>&amp;#x2019;<\/code>, the entity for a \u201ccurly\u201d apostrophe, and replace it with a straight apostrophe like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ sed -e \"s\/\\&amp;#x2019;\/'\/g\" spaceballs.htm\n&lt;p&gt;Dark Helmet said, &amp;#x201C;I am your father's brother's nephew's cousin's\nformer roommate.&amp;#x201D;&lt;\/p&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We add the backslash before 
the ampersand in the matching pattern as a precaution, because an ampersand has special meaning to <strong>sed<\/strong> in replacement text, where it stands for the matched text. Using the backslash tells <strong>sed<\/strong> to look for an actual ampersand in the input.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use the same process to change the \u201ccurly\u201d quotes to straight quotes, again with <strong>g<\/strong> so that every quote on a line is replaced:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ sed -e \"s\/\\&amp;#x2019;\/'\/g\" -e 's\/\\&amp;#x201C;\/\"\/g' -e 's\/\\&amp;#x201D;\/\"\/g' spaceballs.htm\n&lt;p&gt;Dark Helmet said, \"I am your father's brother's nephew's cousin's\nformer roommate.\"&lt;\/p&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The different kinds of quotes around each statement work around the single quotes and double quotes in the replacement text. The shell uses single quotes and double quotes to enclose text that might have spaces or other special characters in it. Without the quotes around each statement, the single quote in <code>s\/\\&amp;#x2019;\/'\/g<\/code> wouldn\u2019t \u201cpair\u201d with another single quote, so the command wouldn\u2019t work. We need to enclose the <em>single quote edit<\/em> with double quotes, and the <em>double quote edit<\/em> with single quotes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using <strong>sed<\/strong> is a simple but flexible way to edit text. I use it whenever I need to make frequent but simple changes to a text file, such as this example to change \u201ccurly\u201d quotes and apostrophes to plain ASCII quotes and apostrophes. To make this easier to use, consider writing a short script to combine both <strong>pandoc<\/strong> and the <strong>sed<\/strong> command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\n# Convert Markdown to ASCII-only HTML, then turn quote entities into straight quotes\npandoc --ascii=true --from markdown --to html \"$@\" | sed -e \"s\/\\&amp;#x2019;\/'\/g\" -e 's\/\\&amp;#x201C;\/\"\/g' -e 's\/\\&amp;#x201D;\/\"\/g'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I\u2019ve saved this on my system as <code>markdown-html<\/code>. 
Then, whenever I need to convert Markdown to HTML, I just run this script to get my output in plain ASCII:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ markdown-html spaceballs.md \n&lt;p&gt;Dark Helmet said, \"I am your father's brother's nephew's cousin's\nformer roommate.\"&lt;\/p&gt;<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Change pandoc&#8217;s curly quotes into straight quotes with this script.<\/p>\n","protected":false},"author":33,"featured_media":3683,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":"","_members_access_role":[],"_members_access_error":""},"categories":[100,5,590],"tags":[104,91],"class_list":["post-10877","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-command-line","category-linux","category-pandoc","tag-command-line","tag-linux"],"modified_by":"Jim Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=10877"}],"version-history":[{"count":2,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10877\/revisions"}],"predecessor-version":[{"id":10879,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/10877\/revisions\/10879"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/3683"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=10877"}],"wp:term":[{"taxonomy"
:"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=10877"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=10877"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}