AR query performance

Programmers (and ORMs) usually optimize their query-strategy for as few queries and roundtrips as possible - we sometimes forget that the database server may in fact be on a different physical machine, and suddenly the data-volume may influence the response-time more than the number of queries.

Join-statements merely for the sake of getting all the data in one round-trip actually don’t make much sense, when you really think about it - you’re often transferring and traversing loads of duplicate data, putting more load on the network and adding memory overhead.

That’s not really why we have join-statements in the first place - they exist for situations where you need to obtain a product of some sort - e.g. something involving sum, avg, count, min or max functions. Join queries are not really an optimization-tool, but most ORMs treat them as such.

But there is in fact a way to execute multiple queries in one round-trip, at least with the mysqli-extension:
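A sketch of what that could look like with mysqli::multi_query - connection details, table and column names here are made up for illustration:

```php
<?php
// Hypothetical example: batch two independent queries in one round-trip
// using mysqli::multi_query (host, credentials and schema are assumptions).
$db = new mysqli('db-host', 'user', 'password', 'shop');

$sql = "SELECT * FROM orders WHERE status = 'open';"
     . "SELECT * FROM customers WHERE id IN "
     . "(SELECT customer_id FROM orders WHERE status = 'open')";

if ($db->multi_query($sql)) {
    do {
        // store_result() buffers the current result-set, if there is one
        if ($result = $db->store_result()) {
            while ($row = $result->fetch_assoc()) {
                // ... hydrate $row into a model object here
            }
            $result->free();
        }
    } while ($db->more_results() && $db->next_result());
}
```

Both result-sets come back in one round-trip, with no duplicated customer rows.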

This does not appear to be possible with PDO, which, I’m guessing, is why nobody uses it.

But think about it - if you have 3 customers and 50 orders, and you join those, selecting all the fields, you’re actually sending each customer’s data more than 16 times on average. This data has to be serialized by the database-server, transported over the network (which is worse if your database server is on a different physical machine from your application-server), then unserialized by the database-client, and subsequently you have to traverse all that data in PHP (or some of it, but all of it has to go into memory) and weed out the non-unique customer records.

Does that sound efficient?

Much more efficient and simple would be a batch of two queries that select the 50 orders and 3 customers separately.

Do you think that’s possible?

Quick note without having read your entire post: At least MySQL allows the traffic to be compressed, which is most useful if client and server are not on the same (physical) machine.
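With mysqli that’s just a connection flag, if I remember right - host and credentials here are placeholders:

```php
<?php
// Enable protocol compression between client and server.
// MYSQLI_CLIENT_COMPRESS is the mysqli constant; the connection
// details are placeholders.
$db = mysqli_init();
$db->real_connect('db-host', 'user', 'password', 'shop', 3306, null,
    MYSQLI_CLIENT_COMPRESS);
```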

Ok, got it. I think a lot of this is already covered by the lazy-loading approach Yii’s current AR implementation is using. So the JOINs will mostly kick in if you explicitly force Yii to use eager loading via ->with(…).

As for the data duplication: I do not think this can be entirely avoided. But it sounds like you would profit from identity maps.
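For the record, a minimal identity-map sketch - not Yii code, just the idea: each record is hydrated at most once per request, so duplicate rows resolve to the same object:

```php
<?php
// Minimal identity map: at most one object instance per (class, primary key).
class IdentityMap
{
    private $instances = array();

    // Return the already-hydrated object for this key, or create and cache it.
    public function get($class, $id, array $row)
    {
        $key = $class . ':' . $id;
        if (!isset($this->instances[$key])) {
            $object = new $class();
            foreach ($row as $field => $value) {
                $object->$field = $value;
            }
            $this->instances[$key] = $object;
        }
        return $this->instances[$key];
    }
}

class Customer
{
    public $id;
    public $name;
}

$map = new IdentityMap();
$a = $map->get('Customer', 1, array('id' => 1, 'name' => 'Acme'));
$b = $map->get('Customer', 1, array('id' => 1, 'name' => 'Acme'));
var_dump($a === $b); // duplicate rows collapse to the same instance
```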

That’s sort of my point. You can choose between lazy-loading with many round-trips - or eager-loading with data-duplication.

With the correct APIs, this could be avoided - you could have one round-trip (eager loading) but no data-duplication. Unfortunately, the multi-query facility does not seem to be the answer, as you can’t predict what else you need to load without another round-trip - since the queries are independent queries with individual result-sets. At least not as far as I can figure.

This brings me back to what I’ve been thinking for some time now - that DBMS APIs really are designed for database administrators (or DBA tools like graphical front-ends) and not for software. Where are the APIs that were designed for software? Mainstream DBMSs do not have any… Odd, right? Considering that DBMSs are much more frequently used as back-end persistence for software than for DBA front-end tools…

Doesn’t the default mechanism that the current Yii uses already address your data throughput issue?

Unless you use the "together" flag on a with query:
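Something like this, if I remember the Yii 1 API right - the model and relation names are made up:

```php
<?php
// Yii 1.x eager loading. With 'together' => true, Yii forces a single
// JOINed query (duplicated customer columns per order row).
// Model and relation names here are assumptions for illustration.
$customers = Customer::model()->with(array(
    'orders' => array('together' => true),
))->findAll();

// Default eager loading (no 'together'): typically a separate query per
// relation - one for customers, one for the matching orders - so there
// is no duplication of the customer data.
$customers = Customer::model()->with('orders')->findAll();
```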

So, true, it takes 2 or maybe 3 queries to get your results instead of the one you’re talking about, but there is no duplication of data, and you’re avoiding the N+1 query problem.

Well, not quite - I’m talking also about the number of roundtrips. You send one request, and you have to wait for the response before you can construct the next query and send that - and so forth.

The waiting is pointless - you already know your data-access pattern in advance; unfortunately DBMS, their APIs, and SQL at large, were not designed for software, but for database administrators and DBA tools. Actually I saw somebody else pointing that out in an article today:

Although I thought this article was largely garbage, this particular observation is one I’ve been pointing out a lot lately. Your ORM is actually a code-generator. What’s worse, your ORM generates SQL, and your DBMS then parses that code and compiles it into an “execution plan”, which is really a program. But the truth of the matter is, the entire process is really pointless, because your software does not sporadically come up with new queries on its own - your applications have predictable data-access patterns, with the possible exception of a few applications that allow arbitrary query-construction. Applications such as DBA tools. Heh.

Hm, I’m not sure that actually holds … What will be sent over the wire is not the raw “SELECT * FROM …” but a binary protocol containing already (pre-)parsed queries. Then, there are prepared statements that further reduce parsing time. Predictable access patterns are covered by some quite good query caches etc. Honestly, I don’t quite see where you’re going here. Your charge against SQL sounds really like a charge against n-tier application design.