{"id":5340,"date":"2024-06-01T02:30:00","date_gmt":"2024-06-01T06:30:00","guid":{"rendered":"https:\/\/www.both.org\/?p=5340"},"modified":"2024-05-17T12:36:34","modified_gmt":"2024-05-17T16:36:34","slug":"digging-into-odt-file-contents","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=5340","title":{"rendered":"Digging into ODT file contents"},"content":{"rendered":"<p>I love that open source is built on open standards. One example is LibreOffice. If you aren\u2019t familiar with LibreOffice, it has an <a href=\"https:\/\/www.libreoffice.org\/about-us\/libreoffice-timeline\/\">interesting history<\/a>, which I\u2019ll describe briefly:<\/p>\n\n\n\n<p>In the 1980s, a German company called StarDivision released Star-Writer, a word processor for the CP\/M operating system that was later ported to DOS. Over the years, StarWriter (they dropped the hyphen in 1991) added more features and functionality, even providing compatibility with Microsoft Word files. In 1996, they released StarOffice 3.1, which was the first version to support Linux.<\/p>\n\n\n\n<p>I bought StarOffice in 1997, and it was great! It allowed me to do work on my Linux machine, and remain compatible with the Microsoft Office files at the office.<\/p>\n\n\n\n<p>In 1999, Sun Microsystems purchased StarDivision and released StarOffice for free. Later, they released it as open source software, to become OpenOffice.org. That\u2019s where the Open Document Format (ODF) came from, in 2005. 
However, after Oracle acquired Sun Microsystems in 2009, developers forked OpenOffice.org to become LibreOffice, supported by a foundation called The Document Foundation. LibreOffice has remained under active development since then, and maintains its open roots &#8211; including the ODF open file format.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"whats-in-an-odt-file\">What\u2019s in an ODT file?<\/h2>\n\n\n\n<p>ODF comes in several \u201cflavors\u201d, the most common of which are: <strong>ODT<\/strong> for word processor files (Open Document: Text), <strong>ODS<\/strong> for spreadsheet files (Open Document: Spreadsheet), and <strong>ODP<\/strong> for presentation files (Open Document: Presentation). These are all just zip file containers with XML data and metadata. And that means we can explore them using the <code>unzip<\/code> command line tool; let\u2019s experiment with a sample ODT file.<\/p>\n\n\n\n<p>I saved this one-line document in LibreOffice Writer, called <code>sample.odt<\/code>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"950\" height=\"704\" src=\"https:\/\/www.both.org\/wp-content\/uploads\/2024\/05\/libreoffice-file.png\" alt=\"screenshot of a 1-line test file in LibreOffice Writer\" class=\"wp-image-5341\"\/><\/figure>\n\n\n\n<p>The <code>zipinfo<\/code> tool shows the internal structure of this file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ zipinfo sample.odt \nArchive:  sample.odt\nZip file size: 9479 bytes, number of entries: 17\n-rw----     2.0 fat       39 b- stor 24-May-17 14:38 mimetype\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/accelerator\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/images\/Bitmaps\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/toolpanel\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/floater\/\n-rw----     2.0 fat        0 b- stor 24-May-17 
14:38 Configurations2\/statusbar\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/toolbar\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/progressbar\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/popupmenu\/\n-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2\/menubar\/\n-rw----     2.0 fat    12782 bl defN 24-May-17 14:38 styles.xml\n-rw----     2.0 fat      899 bl defN 24-May-17 14:38 manifest.rdf\n-rw----     2.0 fat     3878 bl defN 24-May-17 14:38 content.xml\n-rw----     2.0 fat      975 bl defN 24-May-17 14:38 meta.xml\n-rw----     2.0 fat    13831 bl defN 24-May-17 14:38 settings.xml\n-rw----     2.0 fat     1220 b- stor 24-May-17 14:38 Thumbnails\/thumbnail.png\n-rw----     2.0 fat     1061 bl defN 24-May-17 14:38 META-INF\/manifest.xml\n17 files, 34685 bytes uncompressed, 7383 bytes compressed:  78.7%<\/code><\/pre>\n\n\n\n<p>Notice that the <em>first<\/em> file in the archive is called <code>mimetype<\/code> and is saved uncompressed (<code>stor<\/code> indicates it is \u201cstored,\u201d which is not compressed). According to the standard, <code>mimetype<\/code> must be the first file in the archive, and must be uncompressed. This allows any tool to verify that this is an ODT file by reading the start of the zip archive:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The first two bytes of a zip file will be <code>PK<\/code> (because the zip file format was defined by Phil Katz at PKWare in the 1980s)<\/li>\n\n\n\n<li>Skip ahead 28 more bytes (the rest of the fixed 30-byte zip local file header)<\/li>\n\n\n\n<li>Find the string \u201cmimetypeapplication\/vnd.oasis.opendocument.text\u201d which is the file name <code>mimetype<\/code> followed immediately by the one-line uncompressed contents of the <code>mimetype<\/code> file<\/li>\n<\/ol>\n\n\n\n<p>If you are interested in programming, you can write your own program that uses this method to examine files to determine if they are valid ODT files. 
One such implementation might look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;string.h&gt;\n\nchar buf&#91;47]; \/* global *\/\n\nint magic(FILE *in)\n{\n  \/* read magic number *\/\n\n  fread(buf, 1, 2, in);\n\n  if (strncmp(buf, \"PK\", 2) == 0) {\n    puts(\"PK: this is a zip file\");\n    return 1; \/* yes *\/\n  }\n\n  puts(\"not a zip file\");\n  return 0; \/* no *\/\n}\n\nint skip28(FILE *in)\n{\n  \/* skip 28 more bytes *\/\n\n  fread(buf, 1, 28, in);\n  return 1; \/* success *\/\n}\n\nint mimetype(FILE *in)\n{\n  \/* read \"mimetype\" *\/\n\n  fread(buf, 1, 47, in);\n\n  if (strncmp(buf, \"mimetypeapplication\/vnd.oasis.opendocument.text\", 47) == 0) {\n    puts(\"ODT: mimetype found\");\n    return 1; \/* yes *\/\n  }\n\n  puts(\"didn't find mimetype\");\n  return 0; \/* no *\/\n}\n\nvoid test_odt(FILE *in)\n{\n  if (!magic(in)) {\n    return;\n  }\n\n  if (feof(in)) {\n    puts(\"unexpected EOF\");\n    return;\n  }\n\n  skip28(in);\n\n  if (feof(in)) {\n    puts(\"unexpected EOF\");\n    return;\n  }\n\n  mimetype(in);\n\n  if (feof(in)) {\n    puts(\"unexpected EOF\");\n  }\n\n  return;\n}\n\nint main(int argc, char **argv)\n{\n  FILE *odt;\n  int i;\n\n  for (i = 1; i &lt; argc; i++) {\n    odt = fopen(argv&#91;i], \"rb\");\n\n    if (odt) {\n      puts(\"-----\");\n      puts(argv&#91;i]);\n      test_odt(odt);\n      fclose(odt);\n    }\n    else {\n      fputs(\"cannot open file: \", stdout);\n      puts(argv&#91;i]);\n    }\n  }\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>If I save this as <code>testodt.c<\/code> and compile it, I can demonstrate that the <code>sample.odt<\/code> file has the structure described above:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ gcc -Wall -o testodt testodt.c\n\n$ .\/testodt sample.odt \n-----\nsample.odt\nPK: this is a zip file\nODT: mimetype found<\/code><\/pre>\n\n\n\n<p>The horizontal line makes it easier to see the output if you test several files at 
once &#8211; although I\u2019ve only tested one file here.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"unzipping-the-odt-file\">Unzipping the ODT file<\/h2>\n\n\n\n<p>We can use the <code>unzip<\/code> command to extract the contents of the sample ODT file to examine it further. I\u2019ll save my copy in a new directory called <code>sample_odt<\/code> so it\u2019s named similarly to the <code>sample.odt<\/code> file I saved from LibreOffice Writer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ unzip sample.odt -d sample_odt\nArchive:  sample.odt\n extracting: sample_odt\/mimetype     \n   creating: sample_odt\/Configurations2\/accelerator\/\n   creating: sample_odt\/Configurations2\/images\/Bitmaps\/\n   creating: sample_odt\/Configurations2\/toolpanel\/\n   creating: sample_odt\/Configurations2\/floater\/\n   creating: sample_odt\/Configurations2\/statusbar\/\n   creating: sample_odt\/Configurations2\/toolbar\/\n   creating: sample_odt\/Configurations2\/progressbar\/\n   creating: sample_odt\/Configurations2\/popupmenu\/\n   creating: sample_odt\/Configurations2\/menubar\/\n  inflating: sample_odt\/styles.xml   \n  inflating: sample_odt\/manifest.rdf  \n  inflating: sample_odt\/content.xml  \n  inflating: sample_odt\/meta.xml     \n  inflating: sample_odt\/settings.xml  \n extracting: sample_odt\/Thumbnails\/thumbnail.png  \n  inflating: sample_odt\/META-INF\/manifest.xml  <\/code><\/pre>\n\n\n\n<p>To locate the contents of an ODT file, we need to first examine the <code>manifest.xml<\/code> file, located in the <code>META-INF<\/code> directory. 
This is an XML document, so it is saved as plain text, which we can display using the <code>cat<\/code> command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cat sample_odt\/META-INF\/manifest.xml \n&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;manifest:manifest xmlns:manifest=\"urn:oasis:names:tc:opendocument:xmlns:manifest:1.0\" manifest:version=\"1.3\" xmlns:loext=\"urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0\"&gt;\n &lt;manifest:file-entry manifest:full-path=\"\/\" manifest:version=\"1.3\" manifest:media-type=\"application\/vnd.oasis.opendocument.text\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"Configurations2\/\" manifest:media-type=\"application\/vnd.sun.xml.ui.configuration\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"styles.xml\" manifest:media-type=\"text\/xml\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"manifest.rdf\" manifest:media-type=\"application\/rdf+xml\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"content.xml\" manifest:media-type=\"text\/xml\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"meta.xml\" manifest:media-type=\"text\/xml\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"settings.xml\" manifest:media-type=\"text\/xml\"\/&gt;\n &lt;manifest:file-entry manifest:full-path=\"Thumbnails\/thumbnail.png\" manifest:media-type=\"image\/png\"\/&gt;\n&lt;\/manifest:manifest&gt;<\/code><\/pre>\n\n\n\n<p>This file contains the \u201cmaster\u201d metadata for the ODT file, and indicates where everything is saved. The entries with a <code>text\/xml<\/code> media type are the XML files that make up the document, and the one named <code>content.xml<\/code> is where our content is stored. Since that entry\u2019s path doesn\u2019t include a directory, the file is in the \u201croot\u201d of the ODT file.<\/p>\n\n\n\n<p>Again, the <code>content.xml<\/code> file is a plain text XML file. However, it\u2019s quite long; in this sample, it is a single line of over 3,800 characters. 
So I don\u2019t want to display it with <code>cat<\/code> or I\u2019ll fill up my screen with an XML file that\u2019s hard for humans to read. Instead, let\u2019s break up the XML tags using <code>xmllint<\/code> to add some extra spaces with the <code>--format<\/code> option:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ xmllint --format sample_odt\/content.xml \n&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;office:document-content xmlns:css3t=\"http:\/\/www.w3.org\/TR\/css3-text\/\" xmlns:grddl=\"http:\/\/www.w3.org\/2003\/g\/data-view#\" xmlns:xhtml=\"http:\/\/www.w3.org\/1999\/xhtml\" xmlns:xsi=\"http:\/\/www.w3.org\/2001\/XMLSchema-instance\" xmlns:xsd=\"http:\/\/www.w3.org\/2001\/XMLSchema\" xmlns:xforms=\"http:\/\/www.w3.org\/2002\/xforms\" xmlns:dom=\"http:\/\/www.w3.org\/2001\/xml-events\" xmlns:script=\"urn:oasis:names:tc:opendocument:xmlns:script:1.0\" xmlns:form=\"urn:oasis:names:tc:opendocument:xmlns:form:1.0\" xmlns:math=\"http:\/\/www.w3.org\/1998\/Math\/MathML\" xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\" xmlns:ooo=\"http:\/\/openoffice.org\/2004\/office\" xmlns:fo=\"urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0\" xmlns:ooow=\"http:\/\/openoffice.org\/2004\/writer\" xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xmlns:drawooo=\"http:\/\/openoffice.org\/2010\/draw\" xmlns:oooc=\"http:\/\/openoffice.org\/2004\/calc\" xmlns:dc=\"http:\/\/purl.org\/dc\/elements\/1.1\/\" xmlns:calcext=\"urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0\" xmlns:style=\"urn:oasis:names:tc:opendocument:xmlns:style:1.0\" xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\" xmlns:of=\"urn:oasis:names:tc:opendocument:xmlns:of:1.2\" xmlns:tableooo=\"http:\/\/openoffice.org\/2009\/table\" xmlns:draw=\"urn:oasis:names:tc:opendocument:xmlns:drawing:1.0\" xmlns:dr3d=\"urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0\" xmlns:rpt=\"http:\/\/openoffice.org\/2005\/report\" 
xmlns:formx=\"urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0\" xmlns:svg=\"urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0\" xmlns:chart=\"urn:oasis:names:tc:opendocument:xmlns:chart:1.0\" xmlns:officeooo=\"http:\/\/openoffice.org\/2009\/office\" xmlns:table=\"urn:oasis:names:tc:opendocument:xmlns:table:1.0\" xmlns:meta=\"urn:oasis:names:tc:opendocument:xmlns:meta:1.0\" xmlns:loext=\"urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0\" xmlns:number=\"urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0\" xmlns:field=\"urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0\" office:version=\"1.3\"&gt;\n  &lt;office:scripts\/&gt;\n  &lt;office:font-face-decls&gt;\n    &lt;style:font-face style:name=\"Liberation Sans\" svg:font-family=\"'Liberation Sans'\" style:font-family-generic=\"swiss\" style:font-pitch=\"variable\"\/&gt;\n    &lt;style:font-face style:name=\"Liberation Serif\" svg:font-family=\"'Liberation Serif'\" style:font-family-generic=\"roman\" style:font-pitch=\"variable\"\/&gt;\n    &lt;style:font-face style:name=\"Noto Sans CJK SC\" svg:font-family=\"'Noto Sans CJK SC'\" style:font-family-generic=\"system\" style:font-pitch=\"variable\"\/&gt;\n    &lt;style:font-face style:name=\"Noto Sans Devanagari\" svg:font-family=\"'Noto Sans Devanagari'\" style:font-family-generic=\"swiss\"\/&gt;\n    &lt;style:font-face style:name=\"Noto Sans Devanagari1\" svg:font-family=\"'Noto Sans Devanagari'\" style:font-family-generic=\"system\" style:font-pitch=\"variable\"\/&gt;\n    &lt;style:font-face style:name=\"Noto Serif CJK SC\" svg:font-family=\"'Noto Serif CJK SC'\" style:font-family-generic=\"system\" style:font-pitch=\"variable\"\/&gt;\n  &lt;\/office:font-face-decls&gt;\n  &lt;office:automatic-styles&gt;\n    &lt;style:style style:name=\"P1\" style:family=\"paragraph\" style:parent-style-name=\"Standard\"&gt;\n      &lt;style:text-properties officeooo:rsid=\"00157cf3\" 
officeooo:paragraph-rsid=\"00157cf3\"\/&gt;\n    &lt;\/style:style&gt;\n  &lt;\/office:automatic-styles&gt;\n  &lt;office:body&gt;\n    &lt;office:text&gt;\n      &lt;text:sequence-decls&gt;\n        &lt;text:sequence-decl text:display-outline-level=\"0\" text:name=\"Illustration\"\/&gt;\n        &lt;text:sequence-decl text:display-outline-level=\"0\" text:name=\"Table\"\/&gt;\n        &lt;text:sequence-decl text:display-outline-level=\"0\" text:name=\"Text\"\/&gt;\n        &lt;text:sequence-decl text:display-outline-level=\"0\" text:name=\"Drawing\"\/&gt;\n        &lt;text:sequence-decl text:display-outline-level=\"0\" text:name=\"Figure\"\/&gt;\n      &lt;\/text:sequence-decls&gt;\n      &lt;text:p text:style-name=\"P1\"&gt;This is a LibreOffice file.&lt;\/text:p&gt;\n    &lt;\/office:text&gt;\n  &lt;\/office:body&gt;\n&lt;\/office:document-content&gt;<\/code><\/pre>\n\n\n\n<p>There\u2019s a lot of overhead in the XML structure, including some style definitions. But the content is easy enough to find: my document\u2019s one line of content is in an XML element called <code>text:p<\/code> that carries a <code>text:style-name<\/code> attribute with the value <code>P1<\/code> (which is the name of a style defined a few lines earlier in the file).<\/p>\n\n\n\n<p>In fact, we can extract just the file\u2019s paragraph contents by filtering the output with <code>grep<\/code> to find just the <code>text:p<\/code> elements:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ xmllint --format sample_odt\/content.xml | grep 'text:p'\n      &lt;text:p text:style-name=\"P1\"&gt;This is a LibreOffice file.&lt;\/text:p&gt;<\/code><\/pre>\n\n\n\n<p>You can do the same to find other content stored in any ODT file you have, such as headings, which are saved as <code>text:h<\/code> elements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"odt-files-are-open-data\">ODT files are open data<\/h2>\n\n\n\n<p>Not every file format is like this; for example, some other word processors (especially earlier 
systems before \u201copen source\u201d became the norm) essentially saved a file by dumping the contents of memory into a file. This provided a fast way to save and load data, but meant the file format remained closed and made the files more difficult to import into other programs that didn\u2019t have the same internal memory structures.<\/p>\n\n\n\n<p>ODT, like every other file type in the Open Document Format (ODF) family, is an open file format that any program can read. This avoids \u201cvendor lock-in\u201d because the open nature of ODT means you can always convert your ODT files to another format if you wish, even without using LibreOffice.<\/p>\n\n\n\n<p><em>This article is adapted from <a href=\"https:\/\/technicallywewrite.com\/2024\/05\/20\/libreofficeodt\">What\u2019s inside a LibreOffice ODT file<\/a> by Jim Hall, and is republished with the author&#8217;s permission.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>All LibreOffice files are zip file containers with XML data and metadata.<\/p>\n","protected":false},"author":33,"featured_media":2813,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[29,237,150],"tags":[133,152],"class_list":["post-5340","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-information","category-libreoffice","category-programming","tag-libreoffice","tag-programming"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5340","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5340"}],"version-history":[{"count":1,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5340\/revisions"}],"predecessor-version":[{"id":5342,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/5340\/revisions\/5342"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/2813"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5340"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5340"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5340"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}