Tips for well-performing SQL code
When designing a SQL statement, several aspects should be considered to ensure that it runs at optimum performance. Topics that can affect SQL query performance are:
- Connecting data tables
- Filtering and converting data
- Ordering data
- Use of loops
Connecting data tables
When connecting two or more data tables, it is recommended to use an explicit SQL JOIN clause. Listing the tables in the FROM clause and connecting them in the WHERE clause might perform equally well, but when one table contains optional information, the OUTER JOIN form produces far less complex SQL: with a FROM-WHERE construction, optional connections require nested SELECT statements.
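The difference can be sketched with SQLite via Python's sqlite3 module; the tables and columns ("machines", "maintenance") are hypothetical and stand in for a mandatory and an optional data source:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE machines (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE maintenance (machine_id INTEGER, note TEXT);
INSERT INTO machines VALUES (1, 'Press'), (2, 'Lathe');
INSERT INTO maintenance VALUES (1, 'oil change');
""")

# Mandatory connection: an explicit INNER JOIN. A FROM-WHERE construction
# would perform the same here, but the JOIN states the intent clearly.
inner = con.execute("""
    SELECT m.name, x.note
    FROM machines AS m
    JOIN maintenance AS x ON x.machine_id = m.id
""").fetchall()

# Optional connection: a LEFT OUTER JOIN keeps machines without maintenance
# entries; a FROM-WHERE construction would need a nested SELECT instead.
outer = con.execute("""
    SELECT m.name, x.note
    FROM machines AS m
    LEFT JOIN maintenance AS x ON x.machine_id = m.id
    ORDER BY m.id
""").fetchall()

print(inner)  # [('Press', 'oil change')]
print(outer)  # [('Press', 'oil change'), ('Lathe', None)]
```

The OUTER JOIN variant differs from the inner one only in the join keyword, which is what keeps the SQL simple even when one side of the connection is optional.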
Filtering and converting data
Filtering data should always be the first step when gathering data in a SQL statement, as it keeps the amount of data that has to be processed as low as possible. If the data would need to be converted to be comparable to the filter criteria, performance can be improved significantly by instead converting the filter criteria to be comparable to the data. This approach allows the query optimizer to potentially use an index on the column being filtered. Reducing the number of conversion function calls also improves query performance.
This example shows how to apply the filtering and conversion best practices: a data table contains columns for variable name, timestamp (as unixtime) and value. The SQL statement receives time filter parameters as UTC timestamps and must return columns for variable name, timestamp (as UTC timestamp) and value. Applying the best practices, the following steps are necessary:
- Convert the filter criteria from UTC timestamp to unixtime before reading the table (2 conversion function calls)
- Filter the timestamp column of the source table based on the converted time filter criteria in the WHERE clause of the statement (no conversion function calls, an index on the timestamp column in the source table can potentially be used)
- Convert the timestamp column from unixtime to UTC timestamp in the SELECT clause of the statement (1 conversion function call for each row that matches the filter criteria)
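The steps above can be sketched with SQLite via Python's sqlite3 module; the table "measurements" and its columns are hypothetical placeholders for the data table described in the example:

```python
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (name TEXT, ts INTEGER, value REAL)")
con.executemany("INSERT INTO measurements VALUES (?, ?, ?)", [
    ("temp", 1700000000, 21.5),
    ("temp", 1700003600, 22.0),
    ("temp", 1700007200, 22.4),
])

# Step 1: convert the UTC filter criteria to unixtime ONCE, before reading
# the table (two conversion calls in total).
start = int(datetime(2023, 11, 14, 22, 30, tzinfo=timezone.utc).timestamp())
end = int(datetime(2023, 11, 14, 23, 30, tzinfo=timezone.utc).timestamp())

# Step 2: the WHERE clause compares the raw ts column, so an index on ts
# could be used. Step 3: the unixtime-to-UTC conversion happens only in the
# SELECT clause, once per matching row.
rows = con.execute("""
    SELECT name, datetime(ts, 'unixepoch') AS utc_ts, value
    FROM measurements
    WHERE ts BETWEEN ? AND ?
""", (start, end)).fetchall()

print(rows)  # [('temp', '2023-11-14 23:13:20', 22.0)]
```

The anti-pattern would be `WHERE datetime(ts, 'unixepoch') BETWEEN ? AND ?`, which converts every row and prevents index use on `ts`.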
Ordering data
Inside a SQL statement, the order of the data is irrelevant (except when iterating through data in certain kinds of loops; see the next section). Ordering the data should therefore always be the last part of a SQL statement, as this step can consume large amounts of CPU time and memory when much data has to be ordered.
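A minimal sketch of this principle, again using SQLite via Python's sqlite3 module with a hypothetical "events" table: the WHERE clause comes first, so only the rows that survive the filter are sorted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (machine TEXT, prio INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("A", 2), ("B", 1), ("A", 1)])

# Filter first (WHERE), order last (ORDER BY): the sort only touches the
# filtered subset, not the whole table.
rows = con.execute("""
    SELECT machine, prio
    FROM events
    WHERE machine = 'A'
    ORDER BY prio
""").fetchall()

print(rows)  # [('A', 1), ('A', 2)]
```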
Use of loops
Generally, the query optimizer cannot reorganize a loop to increase performance. Loops should therefore be avoided whenever possible to achieve optimum performance.
Here is an example: an event list (a table or table variable) receives an entry whenever a machine switches into a certain operation mode; each entry contains a start-timestamp but no end-timestamp. For further calculations, an end-timestamp is required for every entry.
The first approach could look like this:
- Create a cursor for iterating through the data.
- Sort the data descending by the start-timestamp in the SELECT statement of the cursor.
- Create a local variable for storing a timestamp and initialize it with the end of the time filter.
- Iterate through the data in the cursor in a loop:
a. Update the current row and set the end-timestamp to the value stored in the local variable.
b. Update the local variable with the start-timestamp of the current row.
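The loop approach above can be simulated with a Python loop standing in for the T-SQL cursor; "modes" is a hypothetical event table, and the timestamps are arbitrary unixtime values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE modes (mode TEXT, start_ts INTEGER, end_ts INTEGER)")
con.executemany("INSERT INTO modes (mode, start_ts) VALUES (?, ?)",
                [("idle", 100), ("run", 200), ("stop", 300)])

# Local variable, initialized with the end of the time filter.
carried_end = 400

# Iterate descending by start-timestamp, like the cursor in the steps above.
for (start_ts,) in con.execute(
        "SELECT start_ts FROM modes ORDER BY start_ts DESC").fetchall():
    # a. set this row's end-timestamp to the carried value
    con.execute("UPDATE modes SET end_ts = ? WHERE start_ts = ?",
                (carried_end, start_ts))
    # b. carry this row's start-timestamp for the next (earlier) row
    carried_end = start_ts

rows = con.execute(
    "SELECT mode, start_ts, end_ts FROM modes ORDER BY start_ts").fetchall()
print(rows)  # [('idle', 100, 200), ('run', 200, 300), ('stop', 300, 400)]
```

This produces the correct result, but every row costs a separate round of loop overhead and an individual UPDATE, which the optimizer cannot reorganize.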
The following set-based approach generates more complex SQL code, but offers better performance:
- Create a second table or table variable that contains an additional column for the end-timestamp which is nullable.
- Create a SELECT statement that reads all rows from the original data table (called “T1”) and makes an OUTER JOIN to the very same table (called “T2”). The JOIN condition is that the start-timestamp of T2 is greater than the start-timestamp of T1.
- Every entry in T1 is now connected to all entries in T2 that have a higher timestamp (all operation mode changes that have been made after the operation mode change in the current row of T1). The last operation mode change has no such entry, which is why you should use an OUTER JOIN.
- Create a GROUP BY clause that groups on all columns read from T1.
- In the SELECT clause, include all columns of T1 and the minimum start-timestamp from T2.
- Add an INSERT clause to the statement and store the data in the table or table variable created in the first step.
- As the last operation mode change has NULL as its end-timestamp, create an UPDATE statement that sets the end-timestamp in that table or table variable to the end of the time filter wherever the column is NULL.
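The whole set-based sequence can be sketched as follows, again in SQLite via Python's sqlite3 module with the same hypothetical "modes" table and timestamps as before:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE modes (mode TEXT, start_ts INTEGER);
INSERT INTO modes VALUES ('idle', 100), ('run', 200), ('stop', 300);

-- Second table with an additional, nullable end-timestamp column.
CREATE TABLE modes_with_end (mode TEXT, start_ts INTEGER, end_ts INTEGER);

-- Self OUTER JOIN: every T1 row meets all later mode changes in T2; the
-- latest change has no partner, which is why a LEFT (OUTER) JOIN is needed.
-- GROUP BY covers all columns read from T1; MIN(t2.start_ts) picks the
-- next change, which becomes this row's end-timestamp.
INSERT INTO modes_with_end (mode, start_ts, end_ts)
SELECT t1.mode, t1.start_ts, MIN(t2.start_ts)
FROM modes AS t1
LEFT JOIN modes AS t2 ON t2.start_ts > t1.start_ts
GROUP BY t1.mode, t1.start_ts;

-- The latest change got NULL; set its end to the end of the time filter.
UPDATE modes_with_end SET end_ts = 400 WHERE end_ts IS NULL;
""")

rows = con.execute(
    "SELECT mode, start_ts, end_ts FROM modes_with_end ORDER BY start_ts"
).fetchall()
print(rows)  # [('idle', 100, 200), ('run', 200, 300), ('stop', 300, 400)]
```

The result is identical to the loop version, but the work is done in two set-based statements that the optimizer can plan as a whole.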
Avoiding loops thus increases performance, but decreases SQL code readability. To keep the SQL code understandable, add comments in regions where the code is complex for performance reasons.