
I have a file that's too large to fit into memory, from which I need to strip certain characters (control characters to be precise). My current function looks like this:

$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');

while (!feof($old)) {
    // strip control characters, but keep \t, \r and \n
    fwrite($new, preg_replace('/[^\P{Cc}\t\r\n]/u', '', fgets($old)));
}

fclose($old);
fclose($new);

rename($tmpFile, $file);

This works fine in most cases. A possible problem though is that fgets reads an entire line. Some files I process are literally huge one-liners, which would still cause memory issues.

This can be fixed using fread with a chunk size of, say, 8192. However, the text I feed to preg_replace could now contain cut-off multibyte characters.
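To illustrate the failure mode with a toy sketch (assuming UTF-8 input): when a chunk boundary splits a multibyte character, the /u modifier makes preg_replace reject the invalid UTF-8 and return NULL, silently discarding the whole chunk.

// "é" is 0xC3 0xA9 in UTF-8; pretend an fread boundary split it
$chunk = substr('é', 0, 1); // just the lead byte 0xC3

// With /u, preg_replace rejects the invalid UTF-8 and returns NULL
var_dump(preg_replace('/[^\P{Cc}\t\r\n]/u', '', $chunk)); // NULL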

I've been thinking about how I can fread while preserving multibyte characters, but I haven't found a good solution yet. Any help would be awesome.

Possible solution

While I've solved the problem in a different way, I'm still curious about my original question: how to do a mb-safe fread? I think a function like this could work:

  1. Read a chunk of bytes with fread
  2. Inspect the last byte, check if it's part of a multibyte sequence. If not, stop here.
  3. Keep reading bytes until the last byte is not part of a multibyte sequence, or ends the current sequence.

Step 2 could probably use some byte-inspection logic, but I'm not experienced enough with Unicode to know exactly how.
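A minimal sketch of those three steps, assuming UTF-8 input (the function name is made up, and other encodings would need different lead-byte tests):

function fread_utf8_safe($handle, $length)
{
    // Step 1: read a chunk of bytes
    $chunk = fread($handle, $length);
    if ($chunk === false || $chunk === '') {
        return $chunk;
    }

    // Step 2: walk back over trailing continuation bytes (10xxxxxx)
    $i = strlen($chunk) - 1;
    $trailing = 0;
    while ($i > 0 && $trailing < 3 && (ord($chunk[$i]) & 0xC0) === 0x80) {
        $i--;
        $trailing++;
    }

    // The lead byte encodes how many bytes its sequence should have
    $byte = ord($chunk[$i]);
    if (($byte & 0x80) === 0x00) {
        $expected = 1; // single-byte (ASCII), always complete
    } elseif (($byte & 0xE0) === 0xC0) {
        $expected = 2;
    } elseif (($byte & 0xF0) === 0xE0) {
        $expected = 3;
    } elseif (($byte & 0xF8) === 0xF0) {
        $expected = 4;
    } else {
        $expected = $trailing + 1; // invalid UTF-8: pass through untouched
    }

    // Step 3: read the continuation bytes that were cut off, if any
    $missing = $expected - 1 - $trailing;
    if ($missing > 0 && !feof($handle)) {
        $chunk .= fread($handle, $missing);
    }

    return $chunk;
}

This works because UTF-8 is self-synchronizing: continuation bytes always match 10xxxxxx, and a lead byte's high bits tell you the full sequence length, so at most three extra bytes ever need to be read.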

Peter Kruithof
  • I don't know how optimal this is, but you could use fgetc() to read in numChars. That way you'll be chunking by character instead of by byte. – Chad Oct 16 '14 at 16:01
  • If this file has lines in size that won't fit into memory - this is your primary problem. Go over the line, and write first script that will break the large lines into something that will actually fit without losing internal integrity. – Tymoteusz Paul Oct 16 '14 at 22:27
  • @cwscribner `fgetc` is binary-safe, not multibyte-safe. It will still break on multibyte characters. – Peter Kruithof Oct 17 '14 at 06:44
  • @Puciek I disagree on it being the primary problem: PHP is perfectly capable of buffered reading, just not in a mb manner like this. It would be a solution, but not one I'd prefer as I don't want to make assumptions about the file contents (such as splitting on certain characters, etc.) – Peter Kruithof Oct 17 '14 at 07:17

4 Answers


I can't post comments yet, but an option would be to read the data in chunks like you said and use unpack('C*', $chunk). From there you can iterate the array of bytes and find a match for your character, depending on the byte sequence in the array. If you find a match, replace or remove those bytes and pack() the string back.

P.S.: remember to reread the last several bytes in the next chunk, so you won't have any consistency issues with the final replaced string.
I don't know if my unpack example fits your preferences, but you can read more here: unpack doc

Here is another pointer on how UTF-8 encoding works, in case you are using UTF-8: utf-8 encoding
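A minimal sketch of that byte-array approach, assuming UTF-8 and the control-character filter from the question (the function name is made up):

function strip_control_bytes($chunk)
{
    $bytes = unpack('C*', $chunk); // 1-indexed array of byte values
    $kept = [];
    foreach ($bytes as $byte) {
        // In UTF-8, lead and continuation bytes are always >= 0x80, so any
        // byte below 0x20 (or equal to 0x7F) is a genuine control character
        $isControl = ($byte < 0x20 || $byte === 0x7F)
            && $byte !== 0x09 && $byte !== 0x0A && $byte !== 0x0D; // keep \t \n \r
        if (!$isControl) {
            $kept[] = $byte;
        }
    }
    return pack('C*', ...$kept);
}

Since the bytes being removed here are single bytes that can never appear inside a multibyte sequence, no rereading across chunk boundaries is needed for this particular filter, which is also what the comment thread below concludes.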

Geo
  • This is interesting, and I think it would work. Although I'm not sure why I would need to reread bytes, since I'm not touching the original string/file? – Peter Kruithof Oct 17 '14 at 08:29
  • @PeterKruithof Yes, you won't have to reread the past chunk's last 4 bytes if you are interpreting the bits as in the UTF-8 specification (or whatever encoding you use). If something needed to build a character is missing from the last bytes of a chunk, just continue the parsing in the next chunk. I was saying to reread the last bytes so there would be continuity in the whole file string. – Geo Oct 17 '14 at 16:42

My solution was fairly simple, in the end. The problem was running preg_replace on possibly cut-off multibyte characters, which resulted in botched chunks.

Since I only needed to strip away control characters, which are in the ASCII range and thus single-byte, I can just as easily do a str_replace, which leaves the other bytes alone.

My working solution now looks like this:

$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');

// list the control characters, but leave out \t (9), \n (10) and \r (13)
$chars = array_map('chr', range(0, 31));
$chars[] = chr(127);
unset($chars[9], $chars[10], $chars[13]);

while (!feof($old)) {
    // control characters are single bytes, so a byte-level str_replace
    // can never split a multibyte character
    fwrite($new, str_replace($chars, '', fread($old, 8192)));
}

fclose($old);
fclose($new);

rename($tmpFile, $file);

While it does not answer my original question (which is how to do a mb-safe fread), it does solve my problem.

Peter Kruithof
  • In that case you should maybe think of the poor people who come here googling for an answer to _really_ doing multibyte freads and change the title of the question or something like that. ;) – scy Nov 07 '14 at 15:23
  • Well, poor people can try the possible solution I posted, and see if that works. They might even post an answer if it does! ;) – Peter Kruithof Nov 08 '14 at 17:03

I've spent a fair number of hours in the last few days searching for a multi-byte-safe version of PHP's fread(), fgetc(), file_get_contents(), etc.

Unfortunately, I don't think one exists, especially for very large files. So, I wrote my own (for better or worse):

Jstewmc\Chunker\File::getChunk()

Hopefully it's not awful, it helps someone besides me, and I don't look like a self-aggrandizing jerk on SO, haha.


Untested. Too much to fit in a comment, but this is the gist of what I was getting at.

$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');

while (!feof($old)) {
    // Your search subject
    $subject = '';

    // Get up to $numChars characters, one at a time
    for ($x = 0, $numChars = 100; $x < $numChars; $x++) {
        $char = fgetc($old);
        if ($char === false) { // stop at EOF instead of appending false
            break;
        }
        $subject .= $char;
    }

    // Replace and write to $new
    fwrite($new, preg_replace('/[^\P{Cc}\t\r\n]/u', '', $subject));
}

fclose($old);
fclose($new);

rename($tmpFile, $file);
Chad