{"id":12023,"date":"2025-10-17T02:00:00","date_gmt":"2025-10-17T06:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=12023"},"modified":"2025-09-26T13:20:48","modified_gmt":"2025-09-26T17:20:48","slug":"how-to-parse-text-strings-in-c","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=12023","title":{"rendered":"How to parse text strings in C"},"content":{"rendered":"<p>Some programs can just process an entire file at once, and other programs need to examine the file line by line. In the latter case, you likely need to parse data in each line. Fortunately, the standard C library has a function to do just that.<\/p>\n\n\n\n<p>The <code>strtok<\/code> function breaks up a line of data according to &#8220;delimiters&#8221; that divide each field. It provides a streamlined way to parse data from an input string. Let&#8217;s dig further into how to use <code>strtok<\/code> to parse a string in C.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"read-the-first-token\">Read the first token<\/h2>\n\n\n\n<p>Suppose your program needs to read a data file, where each line is separated into different fields with a semicolon. For example, one line from the data file might look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>102*103;K1.2;K0.5<\/code><\/pre>\n\n\n\n<p>For this example, assume that line is stored in a string variable. You might have read this string into memory using any number of methods. 
For example, you might have a program that reads input from a file using <code>fgets<\/code> like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>fgets(str, MAX_STR, file);<\/code><\/pre>\n\n\n\n<p>Once you have the line in a string, you can use <code>strtok<\/code> to pull out &#8220;tokens.&#8221; Each token is part of the string, up to the next delimiter. The basic call to <code>strtok<\/code> looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>char *strtok(char *string, const char *delim);<\/code><\/pre>\n\n\n\n<p>The first call to <code>strtok<\/code> reads the string, replaces the first delimiter after the token with a zero (also called a &#8220;null&#8221; or literal <code>\\0<\/code> value) character, then returns a pointer to the first token. If the string contains no tokens (for example, it is empty or holds only delimiters), <code>strtok<\/code> returns <code>NULL<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;string.h&gt; \/* strtok *\/\n\nint main()\n{\n  char string&#91;] = \"102*103;K1.2;K0.5\";\n  char *token;\n\n  token = strtok(string, \";\");\n\n  if (token == NULL) {\n    puts(\"empty string!\");\n    return 1;\n  }\n\n  puts(token);\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>This sample program pulls the first token from the string, prints it, and exits. If you compile this program and run it, you should see this output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>102*103<\/code><\/pre>\n\n\n\n<p><code>102*103<\/code> is the first part of the input string, up to the first semicolon. That&#8217;s the first token in the string.<\/p>\n\n\n\n<p>Note that calling <code>strtok<\/code> modifies the string you are examining. If you want the original string preserved, make a copy before using <code>strtok<\/code>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-next-tokens\">The next tokens<\/h2>\n\n\n\n<p>Separating the rest of the string into tokens requires calling <code>strtok<\/code> multiple times until all tokens are read. 
After parsing the first token with <code>strtok<\/code>, any further calls to <code>strtok<\/code> must use <code>NULL<\/code> in place of the string variable. Passing <code>NULL<\/code> tells <code>strtok<\/code> to continue from its internal pointer to the next position in the same string.<\/p>\n\n\n\n<p>Modify the sample program to read the rest of the string as tokens. Use a <code>while<\/code> loop to call <code>strtok<\/code> multiple times until you get <code>NULL<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;string.h&gt;\n\nint main()\n{\n  char string&#91;] = \"102*103;K1.2;K0.5\";\n  char *token;\n\n  token = strtok(string, \";\");\n\n  if (token == NULL) {\n    puts(\"empty string!\");\n    return 1;\n  }\n\n  while (token) {\n    \/* print the token *\/\n    puts(token);\n\n    \/* get the next token from the same string *\/\n    token = strtok(NULL, \";\");\n  }\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>By adding the <code>while<\/code> loop, you can parse the rest of the string, one token at a time. If you compile and run this sample program, you should see each token printed on a separate line, like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>102*103\nK1.2\nK0.5<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"be-careful-about-delimiters\">Be careful about delimiters<\/h2>\n\n\n\n<p>Using <code>strtok<\/code> provides a quick and easy way to break up a string into just the parts you&#8217;re looking for. You can use <code>strtok<\/code> to parse all kinds of data, from plain text files to complex data. 
However, be aware that <code>strtok<\/code> treats multiple delimiters next to each other as a single delimiter.<\/p>\n\n\n\n<p>For example, if you were reading CSV data (comma-separated values, such as data from a spreadsheet), you might expect a list of four numbers to look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1,2,3,4<\/code><\/pre>\n\n\n\n<p>But if the third &#8220;column&#8221; in the data was empty, the CSV might instead look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1,2,,4<\/code><\/pre>\n\n\n\n<p>This is where you need to be careful with <code>strtok<\/code>. Because adjacent delimiters count as one, the empty third field is skipped entirely. You can see this by modifying the sample program to call <code>strtok<\/code> with a comma delimiter:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;string.h&gt;\n\nint main()\n{\n  char string&#91;] = \"1,2,,4\";\n  char *token;\n\n  token = strtok(string, \",\");\n\n  if (token == NULL) {\n    puts(\"empty string!\");\n    return 1;\n  }\n\n  while (token) {\n    puts(token);\n    token = strtok(NULL, \",\");\n  }\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>If you compile and run this new program, you&#8217;ll see <code>strtok<\/code> interprets the <code>,,<\/code> <em>the same as<\/em> a single comma and parses the data as three numbers:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1\n2\n4<\/code><\/pre>\n\n\n\n<p>Knowing this limitation in <code>strtok<\/code> can save you hours of debugging. If you need to parse data that might have empty fields between adjacent delimiters, you might consider writing your own parser to break it up in some other way. It&#8217;s up to you. 
More complex programs might need this extra feature, but many do not.<\/p>\n\n\n\n<p>For example, programs that only need to parse &#8220;words&#8221; from a string, where each word is separated by one or more spaces, can use this feature in <code>strtok<\/code> to parse words one at a time. But programs that need to be careful about the delimiters, such as reading a CSV file where commas are significant, might need to do it another way.<\/p>\n\n\n\n<p>Let&#8217;s keep things simple and stay focused on using <code>strtok<\/code> in the default way.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"using-multiple-delimiters\">Using multiple delimiters<\/h2>\n\n\n\n<p>You might wonder why the <code>strtok<\/code> function uses a string for the delimiter instead of a single character. That&#8217;s because <code>strtok<\/code> can look for different delimiters in the string. For example, a string of text might have spaces and tabs between each word. In this case, you would use each of those &#8220;whitespace&#8221; characters as delimiters:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;string.h&gt;\n\nint main()\n{\n  char string&#91;] = \"  hello \\t world\";\n  char *token;\n\n  token = strtok(string, \" \\t\");\n\n  if (token == NULL) {\n    puts(\"empty string\");\n    return 1;\n  }\n\n  while (token) {\n    puts(token);\n    token = strtok(NULL, \" \\t\");\n  }\n\n  return 0;\n}<\/code><\/pre>\n\n\n\n<p>Each call to <code>strtok<\/code> uses both a space and tab character as the delimiter string, allowing <code>strtok<\/code> to parse the line correctly into two tokens:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>hello\nworld<\/code><\/pre>\n\n\n\n<p>This feature is very helpful when parsing &#8220;words&#8221; from a string. 
Instead of just using spaces as the delimiter between words, a program can use any white space, including spaces, tabs, vertical tabs, and new lines.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"using-strtok\">Using strtok<\/h2>\n\n\n\n<p>The <code>strtok<\/code> function is a handy way to read and interpret data from strings. Use it in your next project to simplify how you read data into your program.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>This article is adapted from <a href=\"https:\/\/opensource.com\/article\/22\/4\/parsing-data-strtok-c\">Parsing data with strtok in C<\/a> by Jim Hall, and is republished with the author&#8217;s permission.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Use strtok to parse a string using &#8216;tokens&#8217; in your next C program.<\/p>\n","protected":false},"author":33,"featured_media":2949,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[5,150],"tags":[91,152],"class_list":["post-12023","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-linux","category-programming","tag-linux","tag-programming"],"modified_by":"Jim 
Hall","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/12023","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=12023"}],"version-history":[{"count":1,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/12023\/revisions"}],"predecessor-version":[{"id":12024,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/12023\/revisions\/12024"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/2949"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=12023"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=12023"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=12023"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}