[nycphp-talk] iterating through a multibyte string
Dan Cech
dcech at phpwerx.net
Wed Jan 13 13:54:33 EST 2010
Rob Marscher wrote:
> On Jan 13, 2010, at 12:44 PM, John Campbell wrote:
>> You forgot
>> mb_internal_encoding("UTF-8");
>>
>> without that, mb_substr is just an alias for substr
>
> Thanks, John. I thought I had that set in my php.ini - but I must have overwritten my php.ini with a new install since then.
Good catch! I missed that too...
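For anyone following along, here is a minimal sketch of why the encoding setting matters. The sample string is my own, not from the benchmark; it only shows that mb_substr counts characters once the internal encoding is UTF-8, while substr always counts bytes.

```php
<?php
// Hypothetical sample: "é" is two bytes (C3 A9) in UTF-8.
$str = "h\xC3\xA9llo";

// Set the encoding at runtime instead of relying on php.ini.
mb_internal_encoding('UTF-8');

$char = mb_substr($str, 1, 1);  // one character: "é", two bytes long
$byte = substr($str, 1, 1);     // one byte: only the first half of "é"

// Passing the encoding explicitly avoids depending on global state at all.
$explicit = mb_substr($str, 1, 1, 'UTF-8');
```

Passing the encoding as the fourth argument is probably the safer habit if php.ini can be overwritten by a reinstall.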
>> my results look like:
>>
>> normal iteration took 0.64724087715149
>> mb_substr method took 16.471849918365
>> mb_substr method with shortening the string took 21.613878965378
>> preg_split method took 1.927277803421
>>
>> Dan is the winner. preg_split always runs in linear time. Both of
>> the mb_substr are O(N^2), because the first step in mb_substr is
>> splitting the string into array. It is not as intelligent as I
>> initially assumed.
>
> Thanks for the analysis! I got similar results on the new run too.
I worked up a quick alternative that avoided mb_substr for calculating
$rest:
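for ($i = 0; $i < $repeats; $i++) {
    $length = mb_strlen($str);  // not used below
    $newStr = '';
    $rest = $str;
    // test the byte length, not truthiness: while ($rest) stops early on "0"
    while (strlen($rest) > 0) {
        $c = mb_substr($rest, 0, 1);        // first character, multibyte-aware
        $newStr .= $c;
        $rest = substr($rest, strlen($c));  // drop that character's bytes
    }
}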
As long as you don't have mbstring.func_overload enabled (it would replace
substr and strlen with their mb_ versions), it is much more efficient than
shortening the string using mb_substr:
normal iteration took 0.95997190475464
mb_substr method took 19.002305984497
mb_substr method with shortening the string took 25.623261928558
mb_substr method with shortening the string using substr took 6.5963559150696
preg_split method took 2.5313749313354
It still can't beat preg_split, though, most likely because of the overhead
involved in overwriting $rest on every pass through the loop.
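For comparison, a minimal sketch of the preg_split approach. The exact pattern used in the benchmark is not shown in this message, so the empty pattern with the /u modifier is an assumption, and utf8_chars is a hypothetical helper name.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars($str) {
    // An empty pattern with /u matches between code points;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false on invalid UTF-8; fall back to an empty array.
    return $chars === false ? array() : $chars;
}

// Rebuild the string one character at a time, as the benchmark loops do.
$newStr = '';
foreach (utf8_chars("h\xC3\xA9llo 0") as $c) {
    $newStr .= $c;
}
```

The string is split once up front, so nothing has to be copied or re-scanned on each pass through the loop.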
Dan