Date range queries going too slow for you? SQL Performance got you down? I have one weird trick that might just help you out! (SQL Developers hate this!)*
If you have a class that records a timestamp when each row is added, that data will be in sequence with your IDKEY values - that is, TimeStamp1 < TimeStamp2 if and only if ID1 < ID2 for all IDs and timestamp values in the table - and you can use this knowledge to improve the performance of queries against timestamp ranges. Consider the following table:
Class User.TSOrder Extends %Persistent
{
Property TS As %TimeStamp;
Property Data As %String(MAXLEN = 100, MINLEN = 200);
Index TSIdx On TS;
Index Extent [ Extent, Type = bitmap ];
}
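To set the stage, here is a minimal population sketch. The method below is illustrative (it is not from the original post): it steps the clock strictly forward on every insert, which is what keeps TS in the same order as ID.

ClassMethod Populate(count As %Integer = 30000000) As %Status
{
    // Start 30 days ago and only ever move the clock forward, so that
    // TS order matches ID order - the property this whole trick relies on.
    // (For the article's 1,000,000 rows/day you'd step in sub-second
    // increments; seconds keep the sketch simple.)
    Set day = +$Horolog - 30, sec = 0
    For i = 1:1:count {
        Set sec = sec + 1 + $Random(3)                 // advance 1-3 seconds
        If sec >= 86400 { Set day = day + 1, sec = sec - 86400 }
        Set order = ##class(User.TSOrder).%New()
        Set order.TS = $ZDateTime(day _ "," _ sec, 3)  // ODBC timestamp format
        Set order.Data = "Row " _ i
        Set sc = order.%Save()
        Return:$$$ISERR(sc) sc
    }
    Return $$$OK
}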
Populating this with 30,000,000 random rows with dates over the last 30 days will give you 1,000,000 rows per day. Now, if we want to query the information for a given day, you might write the following:
SELECT ID, TS, Data
FROM TSOrder
WHERE
TS >= '2016-07-01 00:00:00.00000' AND
TS <= '2016-07-01 23:59:59.999999'
A reasonable query, to be sure. On my system, however, this took 2,172,792 global references and 7.2 seconds. But, knowing that the IDs and TimeStamps are in the same order, we can use the TimeStamps to get an ID range. Consider the following query:
SELECT ID, TS, Data
FROM TSOrder
WHERE
ID >= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 00:00:00.00000' ORDER BY TS ASC) AND
ID <= (SELECT TOP 1 ID FROM TSOrder WHERE TS <='2016-07-01 23:59:59.999999' ORDER BY TS DESC)
The new query completes in 5.1 seconds and takes only 999,985 global references**!
This technique can be applied even more profitably to tables with more indexed fields and to queries with multiple conditions in the WHERE clause. The ID range generated from the subqueries can be put into bitmap format, generating blazing speed when you get a multi-index solution. The Ens.MessageHeader table is a great example where you can put this trick to work.
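For instance, here is a sketch of the same trick against Ens.MessageHeader (TimeCreated and SourceConfigName are both indexed there; the service name 'MyService' is a hypothetical stand-in):

SELECT ID, TimeCreated, SourceConfigName
FROM Ens.MessageHeader
WHERE
ID >= (SELECT TOP 1 ID FROM Ens.MessageHeader WHERE TimeCreated >= '2016-07-01 00:00:00' ORDER BY TimeCreated ASC) AND
ID <= (SELECT TOP 1 ID FROM Ens.MessageHeader WHERE TimeCreated <= '2016-07-01 23:59:59' ORDER BY TimeCreated DESC) AND
SourceConfigName = 'MyService'

The ID range and the indexed SourceConfigName condition can then be combined, which is the multi-index win described above.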
Let's be clear - this is an EXAMPLE of a win. If you have many conditions in the WHERE clause in the same table (and they are indexed, duh!), then this technique can give you BIGGER wins! Try it out on your queries!
* SQL Developers don't really hate this, but if the internet has taught us anything, it's that catchy blurbs get more traffic.
** When testing queries that return this many rows, the SMP can't keep up, and most of the time goes to displaying the data. The proper way to test is with embedded or dynamic SQL, running through the results but not outputting them for the timing, and using the SQL Shell for your global counts. You could also use SQL Stats for that.
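A minimal dynamic-SQL timing sketch (the method name and wrapper are mine, not from the post):

ClassMethod TimeQuery(sql As %String)
{
    Set stmt = ##class(%SQL.Statement).%New()
    Set sc = stmt.%Prepare(sql)
    If $$$ISERR(sc) { Do $System.Status.DisplayError(sc) Quit }
    Set start = $ZHorolog
    Set rs = stmt.%Execute()
    Set rows = 0
    While rs.%Next() { Set rows = rows + 1 }    // fetch every row, display nothing
    Write rows, " rows in ", ($ZHorolog - start), "s", !
}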
+1
This is what the Ensemble message viewer does when the search criteria include a date/time range.
Hi, Kyle!
Thank you, great stuff!
The guys in the Russian forum say there is a bug in this line:
Property Data As %String(MAXLEN = 100, MINLEN = 200);
How can MINLEN be higher than MAXLEN?
The premise is that when you have a timestamp property that’s set at the time of row insert, there will be a guarantee of those timestamps being “in order” with respect to Row IDs.
At first glance that sounds reasonable, and is probably almost always true, but I’m not sure it’s guaranteed. Isn’t there a race condition around saving the timestamp and generating the new row ID?
That is, couldn’t you have a flow like this:
Process 1, Step 1: Call %Save. In %OnSave: set ..CreatedAt = $zts (let’s say this gives me 2018-06-01 16:00:00.000)
Process 2, Step 1: Call %Save. In %OnSave: set ..CreatedAt = $zts (let’s say this gives me 2018-06-01 16:00:00.010) << +10ms
Process 2, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 1)
Process 1, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 2)
Is that likely? Definitely not, but I don't think it's impossible.
Actually, it might be fairly likely in an ECP environment where multiple ECP Clients are inserting data into the same table, one reason being that system clocks could be out of sync by a few milliseconds.
Does that make sense, or am I missing something? For example, would this all be okay unless I did something dumb with Concurrency? If so, would that still be the case in an ECP environment?
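For concreteness, the pattern I'm describing looks something like this (a hypothetical class, not code from the thread; nothing in it ties taking the timestamp to allocating the ID):

Class Demo.Order Extends %Persistent
{
Property CreatedAt As %TimeStamp;

Method %OnSave(insert As %Boolean) As %Status [ Private, ServerOnly = 1 ]
{
    // The timestamp is taken here, but the new row ID is $Increment'd later,
    // inside the filing code itself - two separate steps that can interleave
    // across processes exactly as in the flow above.
    If insert { Set ..CreatedAt = $ZDateTime($ZTimeStamp, 3, 1, 3) }
    Quit $$$OK
}
}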
Clayton,
I agree with you, especially for fast-inserting multi-server environments.
In "slow" environments there is less risk.
It depends on where you set your timestamp.
So %OnBeforeSave might provide the smallest possible gap.
Here is my solution, which has a number of advantages; see my article for details: Indexing of non-atomic attributes.
The assumption here is that the timestamps and IDs are in the same order; in that case, you don't need the ORDER BY clause in the subqueries.
SELECT ID, TS, Data FROM TSOrder
WHERE ID >= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 00:00:00.00000')
AND ID <= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 23:59:59.999999')
The purpose of the `order by` in the subqueries is to ensure you get the right ID back - otherwise you'll get whichever ID happens to match the criteria. Alternatively you could change the query to use `MIN(ID)` / `MAX(ID)` instead of `TOP 1 ID`; then you can drop the `order by` clause (that may be faster too for older SQL versions... looking at the current version, it seems SQL generates the same query plan for both, so the engine has spotted this optimisation internally).
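For reference, the MIN/MAX variant of the original query would look like this (equivalent under the same in-order assumption, and no ORDER BY needed):

SELECT ID, TS, Data
FROM TSOrder
WHERE
ID >= (SELECT MIN(ID) FROM TSOrder WHERE TS >= '2016-07-01 00:00:00.00000') AND
ID <= (SELECT MAX(ID) FROM TSOrder WHERE TS <= '2016-07-01 23:59:59.999999')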
There are risks with this approach, though - say you introduce an additional filter: you'd need to include it in both your main query and your subqueries to ensure things match up.
Additionally, you rely on the ID and date always being in sync... Sometimes, even if that should be true, it isn't (e.g. maybe someone's imported data from another system, keeping the dates from the original system but without updating all of the IDs, because they're just IDs and should have no functional meaning)... Now your assumption's wrong, and you've got a really hard-to-trace bug for anyone unaware of this trick.
A cleaner approach may be to create an index over your date column (with additional columns in there if they're also used in the query). The additional overhead of an index will generally be pretty minimal, especially if it's only on the date field, so new rows will always be added at the end. The benefit will be significant when you're doing simple `>=`, `<=` or `between` type operations. You may even find it performs better: whilst the ID is typically the clustered index (and so lives in the same file as the table's data), this custom index can be put on separate storage (i.e. a separate file from the table's data), so it may have less contention / better performance.
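In class-definition terms, that suggested index might look like the following, added to the User.TSOrder class from the article (the index name is mine; the Data keyword stores a copy of Data inside the index, so a TS-range query selecting Data never has to touch the master map):

Index TSDataIdx On TS [ Data = Data ];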