Date range queries going too slow for you? SQL Performance got you down? I have one weird trick that might just help you out! (SQL Developers hate this!)*
If you have a class that records a timestamp when each row is added, that data will be in sequence with your IDKEY values - that is, TimeStamp1 < TimeStamp2 if and only if ID1 < ID2 for all IDs and timestamp values in the table - and you can use this knowledge to improve the performance of queries against timestamp ranges. Consider the following table:
Class User.TSOrder Extends %Persistent
{
Property TS As %TimeStamp;
Property Data As %String(MAXLEN = 100, MINLEN = 200);
Index TSIdx On TS;
Index Extent [ Extent, Type = bitmap ];
}
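To set the stage, here is a minimal population sketch. The method below is illustrative (it is not from the original post): it steps the clock strictly forward on every insert, which is what keeps TS in the same order as ID.

ClassMethod Populate(count As %Integer = 30000000) As %Status
{
    // Start 30 days ago and only ever move the clock forward, so that
    // TS order matches ID order - the property this whole trick relies on.
    // (For the article's 1,000,000 rows/day you'd step in sub-second
    // increments; seconds keep the sketch simple.)
    Set day = +$Horolog - 30, sec = 0
    For i = 1:1:count {
        Set sec = sec + 1 + $Random(3)                 // advance 1-3 seconds
        If sec >= 86400 { Set day = day + 1, sec = sec - 86400 }
        Set order = ##class(User.TSOrder).%New()
        Set order.TS = $ZDateTime(day _ "," _ sec, 3)  // ODBC timestamp format
        Set order.Data = "Row " _ i
        Set sc = order.%Save()
        Return:$$$ISERR(sc) sc
    }
    Return $$$OK
}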
Populating this with 30,000,000 random rows with dates over the last 30 days will give you 1,000,000 rows per day. Now, if we want to query the information for a given day, you might write the following:
SELECT ID, TS, Data
FROM TSOrder
WHERE
TS >= '2016-07-01 00:00:00.00000' AND
TS <= '2016-07-01 23:59:59.999999'
A reasonable query, to be sure. On my system, however, this took 2,172,792 global references and 7.2 seconds. But, knowing that the IDs and TimeStamps are in the same order, we can use the TimeStamps to get an ID range. Consider the following query:
SELECT ID, TS, Data
FROM TSOrder
WHERE
ID >= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 00:00:00.00000' ORDER BY TS ASC) AND
ID <= (SELECT TOP 1 ID FROM TSOrder WHERE TS <='2016-07-01 23:59:59.999999' ORDER BY TS DESC)
The new query completes in 5.1 seconds and takes only 999,985 global references**!
This technique can be applied even more profitably to tables with more indexed fields and to queries with multiple conditions in the WHERE clause. The ID range generated from the subqueries can be put into bitmap format, generating blazing speed when you get a multi-index solution. The Ens.MessageHeader table is a great example where you can put this trick to work.
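For instance, here is a sketch of the same trick against Ens.MessageHeader (TimeCreated and SourceConfigName are both indexed there; the service name 'MyService' is a hypothetical stand-in):

SELECT ID, TimeCreated, SourceConfigName
FROM Ens.MessageHeader
WHERE
ID >= (SELECT TOP 1 ID FROM Ens.MessageHeader WHERE TimeCreated >= '2016-07-01 00:00:00' ORDER BY TimeCreated ASC) AND
ID <= (SELECT TOP 1 ID FROM Ens.MessageHeader WHERE TimeCreated <= '2016-07-01 23:59:59' ORDER BY TimeCreated DESC) AND
SourceConfigName = 'MyService'

The ID range and the indexed SourceConfigName condition can then be combined, which is the multi-index win described above.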
Let's be clear - this is an EXAMPLE of a win. If you have many conditions in the WHERE clause in the same table (and they are indexed, duh!), then this technique can give you BIGGER wins! Try it out on your queries!
* SQL Developers don't really hate this, but if the internet has taught us anything, it's that catchy blurbs get more traffic.
** When testing queries that return this many rows, the SMP can't keep up, and most of the time goes to displaying the data. The proper way to test is with embedded or dynamic SQL, running through the results but not outputting them for the timing, and using the SQL Shell for your global counts. You could also use SQL Stats for that.
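A minimal dynamic-SQL timing sketch (the method name and wrapper are mine, not from the post):

ClassMethod TimeQuery(sql As %String)
{
    Set stmt = ##class(%SQL.Statement).%New()
    Set sc = stmt.%Prepare(sql)
    If $$$ISERR(sc) { Do $System.Status.DisplayError(sc) Quit }
    Set start = $ZHorolog
    Set rs = stmt.%Execute()
    Set rows = 0
    While rs.%Next() { Set rows = rows + 1 }    // fetch every row, display nothing
    Write rows, " rows in ", ($ZHorolog - start), "s", !
}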
+1
This is what the Ensemble message viewer does when the search criteria include a date/time range.
Hi, Kyle!
Thank you, great stuff!
The guys in the Russian forum say there is a bug in this line:
Property Data As %String(MAXLEN = 100, MINLEN = 200);
How can MINLEN be higher than MAXLEN?
The premise is that when you have a timestamp property that’s set at the time of row insert, there will be a guarantee of those timestamps being “in order” with respect to Row IDs.
At first glance that sounds reasonable, and is probably almost always true, but I’m not sure it’s guaranteed. Isn’t there a race condition around saving the timestamp and generating the new row ID?
That is, couldn’t you have a flow like this:
Process 1, Step 1: Call %Save. In %OnSave: set ..CreatedAt = $zts (let’s say this gives me 2018-06-01 16:00:00.000)
Process 2, Step 1: Call %Save. In %OnSave: set ..CreatedAt = $zts (let’s say this gives me 2018-06-01 16:00:00.010) << +10ms
Process 2, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 1)
Process 1, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 2)
Is that likely? Definitely not, but I don't think it's impossible.
Actually, it might be fairly likely in an ECP environment where multiple ECP Clients are inserting data into the same table, one reason being that system clocks could be out of sync by a few milliseconds.
Does that make sense, or am I missing something? For example, would this all be okay unless I did something dumb with Concurrency? If so, would that still be the case in an ECP environment?
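For concreteness, the pattern I'm describing looks something like this (a hypothetical class, not code from the thread; nothing in it ties taking the timestamp to allocating the ID):

Class Demo.Order Extends %Persistent
{
Property CreatedAt As %TimeStamp;

Method %OnSave(insert As %Boolean) As %Status [ Private, ServerOnly = 1 ]
{
    // The timestamp is taken here, but the new row ID is $Increment'd later,
    // inside the filing code itself - two separate steps that can interleave
    // across processes exactly as in the flow above.
    If insert { Set ..CreatedAt = $ZDateTime($ZTimeStamp, 3, 1, 3) }
    Quit $$$OK
}
}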
Clayton,
I agree with you, especially for fast-inserting multi-server environments.
In "slow" environments there is less risk.
It depends on where you set your timestamp.
So %OnBeforeSave might provide the smallest possible gap.
Here is my solution, which has a number of advantages; see my article for details: Indexing of non-atomic attributes.
The assumption here is that the timestamps and IDs are in the same order; in that case, you don't need the ORDER BY clause in the subqueries.
SELECT ID, TS, Data FROM TSOrder
WHERE ID >= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 00:00:00.00000')
AND ID <= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 23:59:59.999999')
The purpose of the `order by` in the subqueries is to ensure you get the right ID back - otherwise you'll get whichever ID happens to match the criteria. Alternatively you could change the query to use `MIN(ID)` / `MAX(ID)` instead of `TOP 1 ID`; then you can drop the `order by` clause (that may be faster too for older SQL versions... looking at the current version, it seems SQL generates the same query plan for both, so the engine has spotted this optimisation internally).
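For reference, the MIN/MAX variant of the original query would look like this (equivalent under the same in-order assumption, and no ORDER BY needed):

SELECT ID, TS, Data
FROM TSOrder
WHERE
ID >= (SELECT MIN(ID) FROM TSOrder WHERE TS >= '2016-07-01 00:00:00.00000') AND
ID <= (SELECT MAX(ID) FROM TSOrder WHERE TS <= '2016-07-01 23:59:59.999999')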
There are risks with this approach, though - say you introduce an additional filter: you'd need to include it in both your main query and your subqueries to ensure things match up.
Additionally, you rely on the ID and date always being in sync... Sometimes, even if that should be true, it isn't (e.g. maybe someone's imported data from another system, keeping the dates from the original system but without updating all of the IDs, because they're just IDs and should have no functional meaning)... Now your assumption's wrong, and you've got a really hard-to-trace bug for anyone unaware of this trick.
A cleaner approach may be to create an index over your date column (with additional columns in there if they're also used in the query). The additional overhead of an index will generally be pretty minimal, especially if it's only on the date field, so new rows will always be added at the end. The benefit will be significant when you're doing simple `>=`, `<=` or `between` type operations. You may even find it performs better: whilst the ID is typically the clustered index (and so lives in the same file as the table's data), this custom index can be put on separate storage (i.e. a separate file from the table's data), so it may have less contention / better performance.
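In class-definition terms, that suggested index might look like the following, added to the User.TSOrder class from the article (the index name is mine; the Data keyword stores a copy of Data inside the index, so a TS-range query selecting Data never has to touch the master map):

Index TSDataIdx On TS [ Data = Data ];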