What the real limitation of ActiveRecord?

Hello. Now, I’m using Yii for DB with alot of records. And i can’t load 200K of ‘items’. I tried:

  1. pure ActiveRecord using ‘with’ and other relative feature (but i needed only 2-3 fields actually) - no succes. 2) After that i tried FindAllBySQL to get pure data and also no success

in first case i got ‘not enough memory’, although i have 512Mb limitation in my php.ini settings

in second case it loads about few minutes, although raw query via phpMyAdmin - only few seconds.

Any recommendation? I don’t want to replace all FindAll for FindAllBySQL and i would like to load it as fast as possible.

It depends on how big the table is. With this many items, using DAO together with array will definitively be more practical.

Do you mean use CModel and after loading via DAO fills it as arrays (model->attributes)?!

No. Just use DAO without creating any models.

This seems like a common problem.

I’m into my fifth Yii project, and this has come up two or three times by now, usually when generating complicated reports, exporting data, performing batch processes such as mass e-mail, etc.

I’ve seen this problem mentioned many times on the forum, and it’s something that everyone is going to encounter again and again, for sure…

There are many reasons why using DAO rather than AR could be extremely impractical - in some cases impossible.

For one, your views (particularly partials) are probably written so they expect model objects. You may be relying on other model-specific features as well, such as behaviors, calculated attributes, etc.

Getting to your data probably is really the least of your concerns - by the time you go around the models and hardcode all your queries get to your data, you may realize that you have to duplicate business logic that was embedded in your models.

The bottom line is, the functionality provided by your models is usually very important, and usually, the rest of your application is designed to expect model objects.

For another, you may already be well into your project - having written hundreds or thousands of lines of code, before you realize that AR can’t cope. You may even have a running project that runs fine for a while, and as you accumulate more data, your application slowly outgrows the memory limitations imposed by AR.

What’s really required here, is a way to iterate through large numbers of records without constructing them all in advance - that’s not how a relational database works, so AR is imposing an artificial limitation.

I would propose one of two approaches:

  1. An alternative query method that gives you an iterator, rather than an array of objects - you can then (in many cases) simply pass the iterator to your view, where the objects will be loaded, rendered, and destroyed, one at a time, rather than loaded and constructed in advance.

  2. Have the existing methods return iterators, rather than actual arrays. Iterators can largely emulate arrays - for example, you can count() them without loading the result set. Random access would be hard to implement, but would probably very rarely be used (can’t think of an example).

Either approach would make Yii considerably less memory intensive, overall - eliminating a lot of unnecessary memory fragmentation and increasing server performance.

Either approach puts a limitation on the ‘index’ feature of AR, where you can specify which column you want to use as the index for the returned array - you can’t do that, because you don’t know what the indexes are going to be.

In the light of this, if the second approach was chosen, maybe it should be implemented so that you have to specifically ask for progressive loading - e.g. an iterator - rather than having it be the default.

So let’s say you have this:




$users = User::model()->findAll(); // preloads and gives you objects, as usual



With delayed loading, it might look like this:




$users = User::model()->delayed()->findAll(); // gives you an iterator instead



When you foreach() through your $users with this iterator, PHP can dispose of each object as it falls out of scope, so the memory overhead is for 1 record only.

The delayed() method would need to be supported by both CActiveRecord and CActiveFinder.

On another note, a related issue is the inability to cache model instances in-memory. This is sort of the "opposite" problem of the one described above.

Let’s say you’re going to load and display a page of 50 user records - each of them have a related country record, but there are only, say, five countries in my database.

Every time I access the country relation of user record, it’s going to query over and over again for the same records.

An in-memory caching feature would keep the country records somewhere, statically, so that they only load once each - if you findByPk() repeatedly, it’s going to return the same object.

Again, this feature would need to be something you turn on explicitly, so for example:




$users = User::model()->with(array(

  'country'=>array('cache'=>true)

)->findAll();



This should cause the country records to lazy-load by default, since you can’t (or shouldn’t) load them eagerly, generating duplicates, if you expect them to stay in memory. It also would not work with delayed(), since those objects are expected to fall out of scope and be destroyed progressively.

In a sense, AR is currently optimized for an “average” number of records - not too many, and not too few. Unfortunately, you’ll find that there are many cases where “average” is not what you’re dealing with, and AR seems to fall short in many of these cases…

I am wondering under what circumstance you would use AR to deal with large amount of data.

It is fine for non-relational AR queries to return an iterator instead of an array of all records. Developers should also pay attention to the circular reference problem.

It is not so for relational AR queries (eager-loading) because to collect related objects of a particular record, all query results need to be examined.

Qiang, why you are not using CList instead array (on findAll()) and use lazy initialization of objects (get pointer on CDbDataReader and instantiating of CActiveRecord model when offsetGet() on CList is called)?! :rolleyes:

[list=1]

[*]CLazyActiveRecordList

[*]if we replace PDO by mysqli - it would be better

* mysqli_result (OR array( PDO::ATTR_CURSOR => PDO::CURSOR_SCROLL ) + FETCH_LAZY?) provides seek on result (it helps to call populateRecord() once, but not (N+1) times, where N - is not initialized record offset in dataset, N>0)

[/list]

[b]UPD (PDO mysql didn’t supports CURSOR_SCROLL natively - very BAD, but data_seek is presented like in mysql, as in mysqli too, it means, that need db layer, that would be use mysql/mysqli native functions), code without supporting seek function:

[/b]




class CActiveRecordList extends CList {

	/**

	 * @var int

	 */

	private $_c=0;

	/**

	 * @var CActiveRecord

	 */

	private $_model;

	/**

	 * @var CDbDataReader

	 */

	private $_dataReader;


	/**

	 * @param CActiveRecord $model

	 * @param CDbDataReader $dataReader

	 */

	public function __construct($model,$dataReader){

		$this->_model=$model;

		$this->_dataReader=$dataReader;

		$this->_c=$dataReader->getRowCount();

		$this->setReadOnly(true);

	}


	public function itemAt($index){

		if(($c=parent::getCount())<=$index && $this->_c>$index)

			$this->populateRecords(-~($index-$c));

		return parent::itemAt($index);

	}


	/**

	 * @param int $index

	 */

	private function populateRecords($count){

		$this->setReadOnly(false);

		while($count--)

			$this->add($this->_model->populateRecord($this->_dataReader->read()));

		$this->setReadOnly(true);

	}


	/**

	 * @return CListIterator

	 */

	public function getIterator(){

		return new CListIterator($this);

	}


	/**

	 * @return array

	 */

	public function toArray(){

		if(($c=parent::getCount())!=$this->_c)

			$this->populateRecords($this->_c-$c);

		return parent::toArray();

	}


	/**

	 * @return int

	 */

	public function getCount(){

		return $this->_c;

	}

}



UPD: in case if we use cursor - we can keep 10-20 items in collection and load other items if we need it

As I said, it is possible to implement this feature for non-relational queries. However, this may introduce some side effects (e.g. buffered reading because you hold an active cursor).

qiang, vamp, there are other DBs like sqlite where there is no cursor support in PDO at all so implementing cursors will involve going no-PDO way.

mindplay, you’ve raised some good points about instantiating AR objects on demand. It seems it can’t be done transparently without breaking existing code behavior. Adding another method could probably do the job but it will require a lot of effort.

samdark, also native sqlite supports seeking on recordset (sqlite_seek), as for mysql/mysqli this feature is not available via PDO driver

qiang, maybe some AR settings need to be provided in configuration, lazyInstatiate etc!? :rolleyes:

vamp

Yes, I know. This looks like a good feature but, I think, it is not good enough to drop PDO and start from scratch.

I think we will not support this. If needed, the following code can be used as a workaround:




foreach($dataReader as $data)

{

     $model=MyModel::model()->populateRecord($data);

}



samdark, not drop, but create PDO-based class for native sqlite/mysql/mysqli adapters

qiang, this workaround is not bad, but not lazy populating in some case - is not good too ::)

I’m new to yii and I didn’t realize this was an issue. I am using both sqlite and mysql. So far I like how I use AR and the functionality it provides. However, I don’t want to wind up with limitations with my data access. The iterator suggestions sounds good, but it doesn’t sound like it works with sqlite. Are there any examples of using an alternative for AR with yii? If I suspect I am going to run into this issue what is suggested I use instead of AR? Is there an object oriented solution for php that addresses this? Sorry, but I’m new to php as well as yii so some of these answers may be widely known but I’m just unaware of them.

@zillabyte:

As mentioned above, you’ll only get issues when dealing with really large result sets. Something which is IMO very unlekily in a web application.

If you need to do these kind of things (like manipulate many datasets in a single request), you can still rewrite that parts of your code to use DAO instead.

Mike,

thanks for the reply. Now for a really stupid question… What is DAO? I’m guessing it stands for Data Access Object but I suspect it is referring to a specific library that you can use? Sorry for my newbiness…

Read more here

http://www.yiiframework.com/doc/guide/database.dao

/Tommy

See tri’s link. ActiveRecord is built on top of DAO which again is built on top of PDO. So you should be fine to start with AR. If some operations cause issues later, you can still try to find a better solution for that specific problem on a deeper level. But usually it’s good to stick by the principle: “Don’t try to solve problems you don’t have yet!” ;)

Well, I say “large”, but really we’re talking recordsets of maybe 1000-2000 results, which isn’t really a lot. But when those records have lots of related records, and you need to examine most of them, this takes up a lot of memory.

Not necessarily. By ordering the results correctly, you should be able to iterate through a subset of the records returned and construct the necessary objects:




while (more_results)

{

  read record

  if (data belongs to current record)

    populate related objects and add them to the object

  else

    construct the next object, break, return current object

    // we have started populating the next object, which will stay

    // in a temporary place until we come back to this method to

    // continue populating it's related objects

}



So in other words, we populate objects and related objects as usual, but every time we start into a new set of records, we need to lay that object away for a while, and return the completed object.

By ordering the results correctly, you should be able to implement this. So, if someone is querying for users with, say, cities, and they want the returned users ordered by user-name, then by city-name, you would need to add to the order-by clause, e.g. "ORDER BY user.name, city.name" gets extended to "ORDER BY user.name, user.id, city.name, city.id". This ensures the records arrive in an order where the workflow described above can be implemented.

Well, sure - but I think the overhead of this side effect is negligible compared to the overhead of holding 2000 records in memory rather than loading and disposing of them one at a time. And as said, this feature would need to be something you turn on explicitly, so presumably, if you turn it on, it’s because you’ve weighed your alternatives and you know what you’re doing…

The expanded ordering does solve the simple case as you described. It doesn’t work for more complicated cases. For example, assume A has many B and has many C. And we want to query A with both B and C. Now, if you do foreach($a->b as $B), followed by foreach($a->c as $c), you will have trouble.

By side effect, I don’t mean performance problem. I mean this feature requires buffered reading capabilities because you may hold several active cursors at the same time. Not all DBMS support this. And if supported, they also need special switch to toggle on this capability.