Are All Delimited-String Parsers Created Equal?

By Jeffry Schwartz | Function

Feb 14

Overview

In short, no.  However, for many years there was no alternative to the T-SQL option, and even when other viable options appeared, it was unclear whether changing existing code to take advantage of the new technologies was worthwhile.  With the addition of the string_split function in SQL Server 2016, this topic is worth revisiting.  This article discusses three different methods (T-SQL, C# CLR, and string_split) for breaking delimited strings into result sets and compares their performance across lists of varying lengths.  Three different T-SQL functions are tested along with the C# CLR and string_split methods.  The examples employ comma-delimited numerical lists and return either strings or integers, to determine whether converting the numbers from strings to integers causes a significant performance difference.

Testing Strategy

To ensure as much consistency as possible, a series of random numbers was generated and merged into lists of 20, 50, 75, 100, 150, 500, 1000, 1500, 2500, and 5000 integers.  Although the list from which the test strings were derived contained random numbers, the same list was used as input for every test, i.e., the first 20 elements of all tests were identical.  Query results were discarded because returning them badly distorted the final run times, increasing them by at least a factor of FOUR, even when the results were written to a file!  An example of a 20-element list is as follows:  select * from dbo.fcn_SplitCSV ('7503, 87355, 74205, 3985, 5811, 1286, 94488, 33989, 8642, 17592, 80938, 48701, 84713, 430, 54960, 46492, 9916, 38679, 89117, 5703').  The full T-SQL invocation code for the 20-element test of fcn_SplitCSV is shown below.  As that code shows, each test was performed 2,500 times to minimize the probability that individual variations would skew the final results.
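A minimal sketch of such a test harness follows.  This is an illustrative reconstruction, not the author's actual invocation code; the temp table, column name, and loop variable are assumptions, and the temp table exists solely so that no rows are returned to the client (per the result-discarding note above):

```sql
-- Hypothetical sketch: run the 20-element split 2,500 times, discarding results.
CREATE TABLE #Discard (Element varchar(100));

DECLARE @i int = 1;
DECLARE @List varchar(max) =
    '7503, 87355, 74205, 3985, 5811, 1286, 94488, 33989, 8642, 17592, 80938, 48701, 84713, 430, 54960, 46492, 9916, 38679, 89117, 5703';

WHILE @i <= 2500
BEGIN
    -- Insert into the temp table so the result set is never sent to the client.
    INSERT INTO #Discard (Element)
    SELECT * FROM dbo.fcn_SplitCSV(@List);

    TRUNCATE TABLE #Discard;
    SET @i += 1;
END;

DROP TABLE #Discard;
```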

In addition to the three major methods cited above, two other variations were tested:  converting the resulting strings to numbers, and specifying the delimiter with a string variable instead of a string constant.  Each test was performed in its own batch on an otherwise idle system.  Run-time performance was captured using an Extended Events session, which was configured to capture sqlserver.sql_batch_completed events ONLY.  The session definition, start, and stop code are shown below.  This session collects information for completed batches that ran longer than 10 microseconds and were executed from database ID 7.
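An Extended Events session matching that description generally takes the following shape.  The session name and file target below are assumptions, not the author's actual definition; the duration predicate is in microseconds, matching the 10-microsecond threshold described above:

```sql
-- Hypothetical sketch: capture only sql_batch_completed, for batches from
-- database_id 7 that ran longer than 10 microseconds.
CREATE EVENT SESSION [SplitTestTiming] ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (WHERE duration > 10 AND sqlserver.database_id = 7)
ADD TARGET package0.event_file (SET filename = N'SplitTestTiming.xel');
GO

ALTER EVENT SESSION [SplitTestTiming] ON SERVER STATE = START;
-- ... run the test batches ...
ALTER EVENT SESSION [SplitTestTiming] ON SERVER STATE = STOP;
```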

The Code

Three different T-SQL functions were used for testing.  Each uses different T-SQL logic to parse the string, and one of them was modified to accept the delimiter as a variable rather than a string constant.  The code for the three T-SQL functions, fn_CSVToTable, fcn_SplitCSV, and fcn_SplitStringListDelimiter, is shown below:
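As a point of reference, a typical loop-based T-SQL splitter has the following general shape.  This is an illustrative sketch only, not the body of any of the three functions above; the function name, column name, and lengths are assumptions:

```sql
-- Hypothetical loop-based splitter, illustrating the general T-SQL approach.
CREATE FUNCTION dbo.fcn_SplitCSV_Sketch (@List varchar(max))
RETURNS @Result TABLE (Element varchar(100))
AS
BEGIN
    DECLARE @Pos int = CHARINDEX(',', @List);

    WHILE @Pos > 0
    BEGIN
        -- Emit everything before the next comma, trimming leading spaces.
        INSERT INTO @Result (Element) VALUES (LTRIM(LEFT(@List, @Pos - 1)));
        SET @List = SUBSTRING(@List, @Pos + 1, LEN(@List));
        SET @Pos = CHARINDEX(',', @List);
    END;

    -- Emit the final element after the last comma.
    IF LEN(@List) > 0
        INSERT INTO @Result (Element) VALUES (LTRIM(@List));

    RETURN;
END;
```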

The C# CLR function, tvf_SplitString_Multi, is shown below, followed by its T-SQL definition:
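The T-SQL side of registering a CLR table-valued function like this generally follows the pattern below.  The assembly name, DLL path, class name, and return column are assumptions based on the function name, not the author's actual definition:

```sql
-- Hypothetical registration of the CLR splitter; names and path are assumed.
CREATE ASSEMBLY SplitStringAssembly
FROM 'C:\CLR\SplitString.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.tvf_SplitString_Multi (@List nvarchar(max), @Delimiter nchar(1))
RETURNS TABLE (Element nvarchar(4000))
AS EXTERNAL NAME SplitStringAssembly.[UserDefinedFunctions].tvf_SplitString_Multi;
```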

The final piece of the testing puzzle is the simple invocation of the string_split function, which is documented at this link:  https://docs.microsoft.com/en-us/sql/t-sql/functions/string-split-transact-sql?view=sql-server-2016.
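Its invocation is a single SELECT; string_split returns one row per element in a column named value.  For example (using a shortened list for readability):

```sql
-- Basic invocation: one row per element, as strings.
SELECT value
FROM string_split('7503, 87355, 74205, 3985, 5811', ',');

-- Variant returning integers, matching the string-to-number conversion tests
-- described earlier; LTRIM removes the space after each comma before the cast.
SELECT CAST(LTRIM(value) AS int) AS Element
FROM string_split('7503, 87355, 74205, 3985, 5811', ',');
```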

Test Results

Although the author expected that converting strings to integers, or specifying the delimiter via a variable instead of a string constant, might produce a discernible performance difference, neither hypothesis was supported by the test results.  However, significant performance differences were observed between the T-SQL and compiled-code (C# CLR and string_split) solutions.  Interestingly, all the solutions performed similarly until the number of elements exceeded 150, at which point they diverged significantly.

The first two execution-time graphs below, one with normal scaling and the other with truncated scaling, highlight the divergence of the two groups of solutions.  As these graphs illustrate, the compiled-code solutions scaled extremely well all the way up to 1,500 elements.  Their times increased slightly beyond that point, but performance remained quite good, peaking at three milliseconds.  The same cannot be said for the T-SQL solutions, all of which peaked at over 70 milliseconds.  Clearly, if these routines are executed only a few times or parse very small lists, any of these solutions is acceptable.  However, if the parsing routine is executed thousands of times or the lists are long, the choice becomes very important.  To emphasize this point, 2,500 executions of fn_CSVToTable took a total of 33 seconds for a list of 1,500 elements, whereas the string_split solution required only 0.45 seconds to complete the same amount of work.  This disparity grows even greater when the parsed list reaches 5,000 elements:  fn_CSVToTable required 180 seconds to complete 2,500 iterations, whereas string_split required only eight seconds.

Two other metrics were noteworthy:  logical reads and row counts.  As shown in Figure 3, the two compiled-code solutions performed almost no logical reads, whereas the T-SQL ones performed thousands to millions.  Secondly, as shown in Figure 4, the row counts of the T-SQL methods were approximately five times higher than those of the compiled-code solutions.  The execution plans for the five routines explain why the metrics differed so greatly.  As shown in Figures 5 through 9, even though each T-SQL routine is different, all of them generated table scans in addition to the table-valued function calls, whereas the compiled-code solutions generated ONLY the table-valued function calls.

Figure 1: Average Execution Time – Normal Scaling

Figure 2: Average Execution Time – Truncated Scaling

Figure 3: Logical Reads

Figure 4: Row Counts

Figure 5: fcn_SplitCSV Execution Plan

Figure 6: fcn_SplitStringListDelimiter Execution Plan

Figure 7: fn_CSVToTable Execution Plan

Figure 8: tvf_SplitString_Multi Execution Plan

Figure 9: String_Split Execution Plan

Conclusion

In summary, existing T-SQL code need not be replaced unless it is expected to parse lists of more than 150 elements or to be executed thousands of times.  However, if possible, new development that will run on SQL Server 2016 or higher should use the string_split function.

For more information about blog posts, concepts and definitions, further explanations, or questions you may have…please contact us at SQLRx@sqlrx.com.  We will be happy to help!  Leave a comment and feel free to track back to us.  We love to talk tech with anyone in our SQL family!
