String Gets Truncated On Insert

I have a really weird problem I hope someone can help me with:

I’m trying to load a webpage, and save the text from the page into a MySQL database. Before saving I’m replacing all control characters and most special characters with a single space. The text string comes out just right after processing, but whenever I try to save it to my database, it gets truncated. To show the problem in code:




$sql = "INSERT INTO " . RawHtml::model()->tableName() . "(raw_text) VALUES (:raw_text)";

$cmd = Yii::app()->db->createCommand($sql);

$data[':raw_text'] = $model->fulltext;

// START DEBUG

$data[':raw_text'] = "";

$str = "ab' -";

for($i=0; $i<10000; $i++) {

    $data[':raw_text'] .= $str[rand(0,4)];

}

$data[':raw_text'] .= "xxx";

// END DEBUG

$cmd->execute($data);

die($data[':raw_text']);



Here I’m inserting some random characters into $data[’:raw_text’] for debugging. This works fine, and the string is of the appropriate length in the database and is printed in full in the die(). However, when I remove the debugging code, the string inserted into the db gets truncated to about 4000-5000 characters. The die() still prints the whole string, without any truncation. The db column is a LONGTEXT.

Does anyone have any idea what’s happening here? :blink:

What is $model->fulltext?

$model is an instance of the model this function is in (called Sources), instantiated with


$model = new Sources();

$fulltext is


public $fulltext

in Sources. The attribute is populated in:




    private function fetchSite($url) {

        mb_internal_encoding( 'UTF-8'); 

        mb_regex_encoding( 'UTF-8');

        $cu = curl_init();

        $timeout = 5;

        curl_setopt($cu, CURLOPT_URL, $url);

        curl_setopt($cu, CURLOPT_RETURNTRANSFER, 1);

        curl_setopt($cu, CURLOPT_CONNECTTIMEOUT, $timeout);

        $data = curl_exec($cu);

        curl_close($cu);

        $cu = null;

        if ($data === false || strlen($data) < 500) return false;

        $data = preg_replace('~<\s*\bscript\b[^>]*>(.*?)<\s*\/\s*script\s*>~is', '', $data);

        $data = preg_replace('~<\s*\bstyle\b[^>]*>(.*?)<\s*\/\s*style\s*>~is', '', $data);

        $data = preg_replace('/<\/option><option(.)+>/', " ", $data);

        $data = trim(strip_tags($data));

        $data = preg_replace('/([\xA7-\xDF\xE4-\xFF\x1-\x20]+)/', ' ', $data);

        //$data = preg_replace('/(?:\s\s+|\n|\t|\r)/', " ", $data);

        //$data = preg_replace('/\s+/', " ", $data);

        $this->fulltext = $data;

    }



that’s called earlier in the code. $model->fulltext is not modified elsewhere. And when outputting $data[’:raw_text’] which is a copy of $model->fulltext in die() the value is not truncated, so something is happening while executing the query.

I would debug it this way:

  1. Try to submit test string (for example, 10000 letters “a”), not a real text. Data can be truncated on some weird char, that you’ve missed during replacement.

  2. Check post_max_size

And, btw, here’s your answer I suppose: http://stackoverflow.com/questions/11645385/mysql-pdo-truncates-the-data

This is a step in the right direction, but still didn’t solve the issue. I’ve set emulatePrepare to false in the main config, and I’m also disabling emulates with




$db = Yii::app()->db;

$db->emulatePrepare = false;



No effect. The character that seems to be the culprit is the pound sign (£), so I thought that I’ll try removing it for debugging, but even that doesn’t work. I’ve tried using preg_replace, str_replace, but the pound sign just wont budge. The source text is UTF-8, my script is UTF-8, php internal mysql table encoding is UTF-8…

Also, I tried inserting some pound signs into the db from the same script, but that doesn’t work any better. Any ideas?

Just to clarify, this script is being run as a console command. And yes, I’m using console.php as the config file, but the same db configuration exists also in main.php.

Ok, have you tried to submit a test string instead of real text? what’s the result?

Yes:




$sql = "INSERT INTO " . RawHtml::model()->tableName() . "(raw_text) VALUES (:raw_text)";

$db = Yii::app()->db;

$db->emulatePrepare = false;

$cmd = $db->createCommand($sql);            

$data[':raw_text'] = "";

$str = "äÖ' £";

for($i=0; $i<10000; $i++) {

    $data[':raw_text'] .= $str[rand(0,4)];

}

$data[':raw_text'] .= "xxx";            

$cmd->execute($data);



It doesn’t work. All multibyte characters seem to break the insert. Outputting the string to console comes up with “¤’ÃÃäÃÃäÅ”.

That’s crap.

Ok, checklist

And try to do it without Yii (just php + pdo).

It isn’t working even without Yii. Here’s the code:




mb_internal_encoding('UTF-8'); 

mb_regex_encoding('UTF-8');

mb_http_output('UTF-8');

$sql = "SET NAMES 'utf8'; INSERT INTO tbl_raw_html (raw_text) VALUES (:raw_text)";

$db = new PDO("mysql:host=localhost;dbname=...;charset=utf8", ...);

$data[':raw_text'] = "";

$str = "äÖ' £";

for($i=0; $i<10000; $i++) {

    $data[':raw_text'] .= $str[rand(0,4)];

}

$data[':raw_text'] .= "xxx";

$stmt = $db->prepare($sql);

$stmt->execute($data);

die($data[':raw_text']);



Same behaviour as before. I’ve also done everything on the checklist. I’m losing hope. :-[

You may be loading a different version of php.ini when launching from the console.

See if any of the information on this page helps.

Overkill


$dbHandle = new PDO(

    "mysql:host= localhost;dbname=...;charset=utf8", ...user, ...pwd,

    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'")

);

$dbHandle -> exec("SET CHARACTER SET utf8");

Btw what’s the encoding of source files?

Thanks! Now we’re getting somewhere! Indeed the php.ini is different when launching from the console, so I added these lines to my script:




ini_set("mbstring.func_overload", "7");

ini_set("mbstring.internal_encoding", "UTF-8");



Then I changed my test string generator to:




$data[':raw_text'] = "ASD";

$str = "asdÖ£ä";

for($i=0; $i<10000; $i++) {

    $data[':raw_text'] = $data[':raw_text'].mb_substr($str, rand(0,5), 1);

}



And this works! However the actual data I want to insert still isn’t working, even though the data is UTF-8 encoded to begin with, but I’ve tried converting it to UTF-8 too just to be sure, to no avail. All my source files are UTF-8 encoded.

I figured it out! In fact, it was my own fault it wasn’t working. When removing control characters from the source data, I was using this:


$data = preg_replace('/([\xA7-\xDF\xE4-\xFF\x1-\x20]+)/', ' ', $data);

And apparently that was the mistake. Removing that line makes it work.