{"id":9990,"date":"2025-03-20T03:00:00","date_gmt":"2025-03-20T07:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=9990"},"modified":"2025-03-20T08:44:09","modified_gmt":"2025-03-20T12:44:09","slug":"reading-a-whole-file-at-once","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=9990","title":{"rendered":"Reading a whole file at once"},"content":{"rendered":"<p>Most of the programs that I write are filter-like utilities: the program starts up, processes data as it goes, then ends. Usually, these programs don\u2019t have to store a lot of data in memory. If I can find a way to only read data a character or a block at a time, I\u2019ll do it that way. This is probably because I learned C programming on DOS, and DOS systems usually have very limited memory. From the start, my programming practice has been <em>load only what you need into memory.<\/em><\/p>\n\n\n\n<p>Recently, I\u2019ve started working on a project that requires loading the contents of a data file into memory, and working with the copy in memory. The data file is a text file created by the user, so it could be a few lines long, or a hundred lines. You can approach this using two methods, depending on the system. 
Because this kind of programming problem comes up all the time in larger projects, I wanted to share an example of how to load a file into memory all at once.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-traditional-method\">The traditional method<\/h2>\n\n\n\n<p>One classic way to load a complete file is to get the size of the file, allocate enough memory to store it, then read the file into memory. At a high level, this requires four steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><code>open()<\/code> to open the file<\/li>\n\n\n\n<li><code>filelength()<\/code> or <code>fstat()<\/code> to get the size of the file in bytes<\/li>\n\n\n\n<li><code>calloc()<\/code> to allocate the memory<\/li>\n\n\n\n<li><code>read()<\/code> to read the file into memory<\/li>\n<\/ol>\n\n\n\n<p>Let\u2019s say I wanted to load the contents of a file into an array so I could work on it. I might create a function called <code>open_file()<\/code> that reads a file into memory; to keep my demonstration simple, let\u2019s only deal with one file at a time, and load the contents into a global string array called <code>Fdata<\/code>, of size <code>Fsize<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>char *Fdata;\nsize_t Fsize;\n\nsize_t open_file(const char *filename)\n{\n    int fd;\n    int nread;\n\n    fd = open(filename, O_RDONLY);\n    if (fd &lt; 0) {\n        puts(\"cannot open file\");\n        return 0;                      \/* nothing to close: open failed *\/\n    }\n\n    Fsize = filelength(fd);\n\n    Fdata = calloc(sizeof(char), Fsize);\n    if (Fdata == NULL) {\n        puts(\"out of memory\");\n        close(fd);\n        return 0;\n    }\n\n    nread = read(fd, Fdata, Fsize);\n    close(fd);\n\n    if (nread &lt; 0) {                   \/* read failed: release the buffer *\/\n        puts(\"cannot read file\");\n        free(Fdata);\n        Fdata = NULL;\n        return 0;\n    }\n\n    Fsize = nread;                     \/* bytes actually stored *\/\n    return Fsize;\n}<\/code><\/pre>\n\n\n\n<p>This function takes a single argument: the name of a file to load into memory. 
It uses <code>fd = open(filename, O_RDONLY)<\/code> to open the file in read-only mode, and stores the file descriptor as <code>fd<\/code>. I wrote this sample program on DOS, so I used <code>Fsize = filelength(fd)<\/code> to get the size of the file in bytes, and store it in <code>Fsize<\/code>. Note that <code>filelength()<\/code> is a DOS extension, not standard C; on Linux, you might use the <code>fstat()<\/code> function to get the file statistics, including file size.<\/p>\n\n\n\n<p>The function uses <code>Fdata = calloc(sizeof(char), Fsize)<\/code> to allocate enough memory in <code>Fdata<\/code> to store the full contents of the file, then calls <code>nread = read(fd, Fdata, Fsize)<\/code> to read the first <code>Fsize<\/code> bytes (the full contents of the file) into the <code>Fdata<\/code> array; this also saves the number of bytes read from the file in <code>nread<\/code>.<\/p>\n\n\n\n<p>When working with text files on a DOS system, <code>nread<\/code> will usually be less than <code>Fsize<\/code> because, in text mode, the Carriage Return + New Line pairs that DOS uses for line endings get converted from <code>\\r\\n<\/code> to just <code>\\n<\/code>. So this method actually allocates a little more memory than needed, but that\u2019s often an acceptable tradeoff.<\/p>\n\n\n\n<p>What\u2019s great about this method is that it stores the full contents of a file into a <code>char<\/code> array, which is just a giant string that\u2019s big enough to keep the entire file in memory. After loading the file, you can access the <code>Fdata<\/code> string as you would any array. 
Just don\u2019t forget to release the memory when the program is done with it.<\/p>\n\n\n\n<p>Let\u2019s demonstrate this method by writing a full program that uses <code>open_file()<\/code> to load a data file into memory, then print the contents of the array:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;stdlib.h&gt;                    \/* calloc, free *\/\n\n#include &lt;io.h&gt;                        \/* open, read, close, filelength *\/\n#include &lt;fcntl.h&gt;                     \/* O_RDONLY *\/\n\nchar *Fdata;\nsize_t Fsize;\n\nsize_t open_file(const char *filename)\n{\n  ...\n}\n\nint main()\n{\n    size_t i;\n\n    if (open_file(\"data.dat\") == 0) {\n        puts(\"failed\");\n        return 1;\n    }\n\n    puts(\"file data:\");\n    for (i = 0; i &lt; Fsize; i++) {\n        printf(\"%c&lt;%d&gt;\", Fdata&#91;i], Fdata&#91;i]);\n    }\n    puts(\"EOF\");\n\n    free(Fdata);\n\n    return 0;\n}<\/code><\/pre>\n\n\n\n<p>The program uses a loop to iterate through the data, and print each byte as both a regular character and its ASCII value. 
I did this to demonstrate that the Carriage Return (ASCII 13) + New Line (ASCII 10) pair gets translated to just a single New Line.<\/p>\n\n\n\n<p>In one example, the program might load a very short file like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>K 4\nK 1 ; K 2 ; K 3 ; K 4\n201 + 202 + 203 + 204 \/ 101<\/code><\/pre>\n\n\n\n<p>For my sample data file, the program prints this output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>file data:\nK&lt;75&gt; &lt;32&gt;4&lt;52&gt;\n&lt;10&gt;K&lt;75&gt; &lt;32&gt;1&lt;49&gt; &lt;32&gt;;&lt;59&gt; &lt;32&gt;K&lt;75&gt; &lt;32&gt;2&lt;50&gt; &lt;32&gt;;&lt;59&gt; &lt;32&gt;K&lt;75&gt; &lt;32&gt;3&lt;51&gt; &lt;32&gt;;&lt;59&gt; &lt;32&gt;K&lt;75&gt; &lt;32&gt;4&lt;52&gt;\n&lt;10&gt;2&lt;50&gt;0&lt;48&gt;1&lt;49&gt; &lt;32&gt;+&lt;43&gt; &lt;32&gt;2&lt;50&gt;0&lt;48&gt;2&lt;50&gt; &lt;32&gt;+&lt;43&gt; &lt;32&gt;2&lt;50&gt;0&lt;48&gt;3&lt;51&gt; &lt;32&gt;+&lt;43&gt; &lt;32&gt;2&lt;50&gt;0&lt;48&gt;4&lt;52&gt; &lt;32&gt;\/&lt;47&gt; &lt;32&gt;1&lt;49&gt;0&lt;48&gt;1&lt;49&gt;\n&lt;10&gt;EOF<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-modern-method\">The modern method<\/h2>\n\n\n\n<p>If you\u2019re working on Linux or another modern Unix-like system, there\u2019s another, more efficient method to load an entire data file into memory. The <code>mmap()<\/code> system call \u201cmaps\u201d a file into memory, while providing memory protection; with a private mapping, any changes to the copy in memory don\u2019t get saved back to the file. In my case, my program only needs to read the file, and possibly modify the copy that\u2019s stored in memory. I don\u2019t want to alter the file that\u2019s saved on disk.<\/p>\n\n\n\n<p>And you can use <code>mmap()<\/code> in exactly this way. 
At a high level, this requires basically the same steps as before, but without allocating memory, and with <code>mmap()<\/code> replacing the <code>read()<\/code> system call:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><code>open()<\/code> to open the file<\/li>\n\n\n\n<li><code>fstat()<\/code> to get the size of the file in bytes<\/li>\n\n\n\n<li><code>mmap()<\/code> to map the file into memory<\/li>\n<\/ol>\n\n\n\n<p>Let\u2019s keep the sample program more or less the same, so it\u2019s easy to compare the two methods. To load the contents of a file into an array, I might create a function called <code>open_file()<\/code> that reads a file into a global string array called <code>Fdata<\/code> of size <code>Fsize<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>char *Fdata;\nsize_t Fsize;\n\nsize_t open_file(const char *filename)\n{\n    int fd;\n    struct stat inf;\n\n    fd = open(filename, O_RDONLY);\n    if (fd &lt; 0) {\n        puts(\"cannot read file\");\n        return 0;\n    }\n\n    if (fstat(fd, &amp;inf) != 0) {\n        puts(\"cannot stat file\");\n        close(fd);\n        return 0;\n    }\n\n    Fsize = (size_t) inf.st_size;\n\n    Fdata = mmap(NULL, Fsize, PROT_READ, MAP_PRIVATE, fd, 0);\n\n    if (Fdata == MAP_FAILED) {\n        puts(\"cannot mmap file\");\n        close(fd);\n        return 0;\n    }\n\n    close(fd);\n    return Fsize;\n}<\/code><\/pre>\n\n\n\n<p>The <code>mmap()<\/code> system call is a bit tricky, so let\u2019s look at the options. The general usage of <code>mmap()<\/code> looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\">\n<li><code>addr<\/code> is the address to use for the mapping. Set this to <code>NULL<\/code> to let the kernel choose an address for the mapping (this is recommended).<\/li>\n\n\n\n<li><code>len<\/code> is the size of the region that should be mapped. 
I\u2019ve used the size of the file, so it maps the full file.<\/li>\n\n\n\n<li><code>prot<\/code> provides the memory protections to use, like <code>PROT_READ<\/code> for read-only. See the <code>mmap<\/code>(2) manual page for other protections.<\/li>\n\n\n\n<li><code>flags<\/code> indicates whether updates to the map should be visible to other processes, or if updates to the map should be saved back to the file. Using <code>MAP_PRIVATE<\/code> makes this a private copy-on-write mapping.<\/li>\n\n\n\n<li><code>fd<\/code> is the file descriptor to read from.<\/li>\n\n\n\n<li><code>offset<\/code> is the starting point. Use 0 for the start of the file.<\/li>\n<\/ol>\n\n\n\n<p>You may have noticed the function closes the file after mapping it into memory. The <code>mmap<\/code>(2) manual page says that after the <code>mmap()<\/code> system call has returned, the file descriptor can be closed immediately without invalidating the mapping.<\/p>\n\n\n\n<p>Mapping a file into memory is more efficient, and often faster, but still makes the full contents of a file available as a <code>char<\/code> array. After mapping the file, access the <code>Fdata<\/code> string as you would any array. 
Just don\u2019t forget to end the mapping when the program is done working on the file.<\/p>\n\n\n\n<p>Let\u2019s demonstrate this method by writing a full program that uses <code>open_file()<\/code> to load a data file into memory, then print the contents of the array:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n\n#include &lt;fcntl.h&gt;                     \/* open *\/\n#include &lt;unistd.h&gt;                    \/* close *\/\n\n#include &lt;sys\/stat.h&gt;                  \/* stat *\/\n#include &lt;sys\/mman.h&gt;                  \/* mmap *\/\n\nchar *Fdata;\nsize_t Fsize;\n\nsize_t open_file(const char *filename)\n{\n  ...\n}\n\nint main()\n{\n    if (open_file(\"data.dat\") == 0) {\n        puts(\"failed\");\n        return 1;\n    }\n\n    puts(\"File data:\");\n    for (size_t i = 0; i &lt; Fsize; i++) {\n        printf(\"%c&lt;%d&gt;\", Fdata&#91;i], Fdata&#91;i]);\n    }\n    puts(\"EOF\");\n\n    munmap(Fdata, Fsize);\n    return 0;\n}<\/code><\/pre>\n\n\n\n<p>Processing the same data file on my Linux system, but using Unix line endings, generates this output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>File data:\nK&lt;75&gt; &lt;32&gt;4&lt;52&gt;\n&lt;10&gt;K&lt;75&gt; &lt;32&gt;1&lt;49&gt; &lt;32&gt;;&lt;59&gt; &lt;32&gt;K&lt;75&gt; &lt;32&gt;2&lt;50&gt; &lt;32&gt;;&lt;59&gt; &lt;32&gt;K&lt;75&gt; &lt;32&gt;3&lt;51&gt; &lt;32&gt;;&lt;59&gt; &lt;32&gt;K&lt;75&gt; &lt;32&gt;4&lt;52&gt;\n&lt;10&gt;2&lt;50&gt;0&lt;48&gt;1&lt;49&gt; &lt;32&gt;+&lt;43&gt; &lt;32&gt;2&lt;50&gt;0&lt;48&gt;2&lt;50&gt; &lt;32&gt;+&lt;43&gt; &lt;32&gt;2&lt;50&gt;0&lt;48&gt;3&lt;51&gt; &lt;32&gt;+&lt;43&gt; &lt;32&gt;2&lt;50&gt;0&lt;48&gt;4&lt;52&gt; &lt;32&gt;\/&lt;47&gt; &lt;32&gt;1&lt;49&gt;0&lt;48&gt;1&lt;49&gt;\n&lt;10&gt;EOF<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Things to know<\/h2>\n\n\n\n<p>There are some limitations on both of these implementations. 
For example, mmap is only available on Linux and other Unix-like systems; you cannot use mmap on DOS. Instead, DOS programs can only load files using the first method, by storing the file in memory. But DOS has limited memory, so many DOS programmers are either careful about how much data they need to store, or they load only the parts they need from the file.<\/p>\n\n\n\n<p>The first method of reading a file into memory is available on all platforms, but is less efficient on systems like Linux, where mmap is available. With mmap, you&#8217;re actually mapping your file access to memory, so the operating system only loads what it needs to. If your program needs to read a file and store its contents in memory like an array, using mmap is probably the better option on Linux.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Two methods to load a data file into memory. Use mmap on Linux systems.<\/p>\n","protected":false},"author":33,"featured_media":2818,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[340,5,150],"tags":[267,91,152],"class_list":["post-9990","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-freedos","category-linux","category-programming","tag-freedos","tag-linux","tag-programming"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9990","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9990"}],"version-history":[{"count":4,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9990\/revisions"}],"predecessor-version":[{"id":10008,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9990\/revisions\/10008"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/2818"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9990"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9990"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9990"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}