
I have a file that's too large to fit into memory, from which I need to strip certain characters (control characters to be precise). My current function looks like this:

$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');

while (!feof($old)) {
    // strip control characters, but keep \t, \r and \n
    fwrite($new, preg_replace('/[^\P{Cc}\t\r\n]/u', '', fgets($old)));
}

fclose($old);
fclose($new);

rename($tmpFile, $file);

This works fine in most cases. A possible problem though is that fgets reads an entire line. Some files I process are literally huge one-liners, which would still cause memory issues.

This can be fixed using fread with a chunk size of, say, 8192. However, the text I feed to preg_replace could now contain cut-off multibyte characters.
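To illustrate the failure mode with a toy sketch (assuming UTF-8 input): when a chunk boundary splits a multibyte character, the /u modifier makes preg_replace reject the invalid UTF-8 and return NULL, silently discarding the whole chunk.

// "é" is 0xC3 0xA9 in UTF-8; pretend an fread boundary split it
$chunk = substr('é', 0, 1); // just the lead byte 0xC3

// With /u, preg_replace rejects the invalid UTF-8 and returns NULL
var_dump(preg_replace('/[^\P{Cc}\t\r\n]/u', '', $chunk)); // NULL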

I've been thinking about how I can fread while preserving multibyte characters, but I haven't found a good solution yet. Any help would be awesome.

Possible solution

While I've solved the problem in a different way, I'm still curious about my original question: how to do a mb-safe fread? I think a function like this could work:

  1. Read a chunk of bytes with fread
  2. Inspect the last byte, check if it's part of a multibyte sequence. If not, stop here.
  3. Keep reading bytes until the last byte is not part of a multibyte sequence, or ends the current sequence.

Step 2 could probably use some byte-inspection logic, but I'm not experienced enough with Unicode to know exactly how.
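A minimal sketch of those three steps, assuming UTF-8 input (the function name is made up, and other encodings would need different lead-byte tests):

function fread_utf8_safe($handle, $length)
{
    // Step 1: read a chunk of bytes
    $chunk = fread($handle, $length);
    if ($chunk === false || $chunk === '') {
        return $chunk;
    }

    // Step 2: walk back over trailing continuation bytes (10xxxxxx)
    $i = strlen($chunk) - 1;
    $trailing = 0;
    while ($i > 0 && $trailing < 3 && (ord($chunk[$i]) & 0xC0) === 0x80) {
        $i--;
        $trailing++;
    }

    // The lead byte encodes how many bytes its sequence should have
    $byte = ord($chunk[$i]);
    if (($byte & 0x80) === 0x00) {
        $expected = 1; // single-byte (ASCII), always complete
    } elseif (($byte & 0xE0) === 0xC0) {
        $expected = 2;
    } elseif (($byte & 0xF0) === 0xE0) {
        $expected = 3;
    } elseif (($byte & 0xF8) === 0xF0) {
        $expected = 4;
    } else {
        $expected = $trailing + 1; // invalid UTF-8: pass through untouched
    }

    // Step 3: read the continuation bytes that were cut off, if any
    $missing = $expected - 1 - $trailing;
    if ($missing > 0 && !feof($handle)) {
        $chunk .= fread($handle, $missing);
    }

    return $chunk;
}

This works because UTF-8 is self-synchronizing: continuation bytes always match 10xxxxxx, and a lead byte's high bits tell you the full sequence length, so at most three extra bytes ever need to be read.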

Peter Kruithof
  • I don't know how optimal this is, but you could use fgetc() to read in numChars. That way you'll be chunking by character instead of by byte. – Chad Oct 16 '14 at 16:01
  • If this file has lines in size that won't fit into memory - this is your primary problem. Go over the line, and write first script that will break the large lines into something that will actually fit without losing internal integrity. – Tymoteusz Paul Oct 16 '14 at 22:27
  • @cwscribner `fgetc` is binary-safe, not multibyte-safe. It will still break on multibyte characters. – Peter Kruithof Oct 17 '14 at 06:44
  • @Puciek I disagree on it being the primary problem: PHP is perfectly capable of buffered reading, just not in a mb manner like this. It would be a solution, but not one I'd prefer as I don't want to make assumptions about the file contents (such as splitting on certain characters, etc.) – Peter Kruithof Oct 17 '14 at 07:17

4 Answers


I can't post comments yet, but an option would be to read the data in chunks like you said and use unpack('C*', $chunk). From there you can iterate the array of bytes and find a match for your character, depending on the byte sequence in the array. If you find a match, replace or remove those bytes and pack() the string back.

P.S.: remember to reread the last several bytes in the next chunk, so you won't have any consistency issues with the final replaced string.
I don't know if my unpack example fits your preferences, but you can read more here: unpack doc

Here is another pointer on how UTF-8 encoding works, in case you are using UTF-8: utf-8 encoding
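A minimal sketch of that byte-array approach, assuming UTF-8 and the control-character filter from the question (the function name is made up):

function strip_control_bytes($chunk)
{
    $bytes = unpack('C*', $chunk); // 1-indexed array of byte values
    $kept = [];
    foreach ($bytes as $byte) {
        // In UTF-8, lead and continuation bytes are always >= 0x80, so any
        // byte below 0x20 (or equal to 0x7F) is a genuine control character
        $isControl = ($byte < 0x20 || $byte === 0x7F)
            && $byte !== 0x09 && $byte !== 0x0A && $byte !== 0x0D; // keep \t \n \r
        if (!$isControl) {
            $kept[] = $byte;
        }
    }
    return pack('C*', ...$kept);
}

Since the bytes being removed here are single bytes that can never appear inside a multibyte sequence, no rereading across chunk boundaries is needed for this particular filter, which is also what the comment thread below concludes.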

Geo
  • This is interesting, and I think it would work. Although I'm not sure why I would need to reread bytes, since I'm not touching the original string/file? – Peter Kruithof Oct 17 '14 at 08:29
  • @PeterKruithof Yes, you won't have to reread the past chunk's last 4 bytes if you are interpreting the bits as in the UTF-8 specification (or whatever encoding you use). If something needed to build a character is missing from the last bytes of a chunk, just continue the parsing in the next chunk. I was saying to reread the last bytes so there would be continuity in the whole file string. – Geo Oct 17 '14 at 16:42

My solution was fairly simple, in the end. The problem was running preg_replace on possibly cut-off multibyte characters, which resulted in botched chunks.

Since I only needed to strip away control characters, which are in the ASCII range and thus single-byte, I can just as easily do a str_replace, which leaves the other bytes alone.

My working solution now looks like this:

$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');

// list the control characters, but leave out \t (9), \n (10) and \r (13)
$chars = array_map('chr', range(0, 31));
$chars[] = chr(127);
unset($chars[9], $chars[10], $chars[13]);

while (!feof($old)) {
    // control characters are single bytes, so a byte-level str_replace
    // can never split a multibyte character
    fwrite($new, str_replace($chars, '', fread($old, 8192)));
}

fclose($old);
fclose($new);

rename($tmpFile, $file);

While it does not answer my original question (which is how to do a mb-safe fread), it does solve my problem.

Peter Kruithof
  • In that case you should maybe think of the poor people who come here googling for an answer to _really_ doing multibyte freads and change the title of the question or something like that. ;) – scy Nov 07 '14 at 15:23
  • Well, poor people can try the possible solution I posted, and see if that works. They might even post an answer if it does! ;) – Peter Kruithof Nov 08 '14 at 17:03

I've spent a fair number of hours in the last few days searching for a multi-byte-safe version of PHP's fread(), fgetc(), file_get_contents(), etc.

Unfortunately, I don't think one exists, especially for very large files. So, I wrote my own (for better or worse):

Jstewmc\Chunker\File::getChunk()

Hopefully it's not awful, it helps someone besides me, and I don't look like a self-aggrandizing jerk on SO, haha.


Untested. Too much to fit in a comment, but this is the gist of what I was getting at.

$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');

while (!feof($old)) {
    // Your search subject
    $subject = '';

    // Get up to $numChars characters, one at a time
    for ($x = 0, $numChars = 100; $x < $numChars; $x++) {
        $char = fgetc($old);
        if ($char === false) { // stop at EOF instead of appending false
            break;
        }
        $subject .= $char;
    }

    // Replace and write to $new
    fwrite($new, preg_replace('/[^\P{Cc}\t\r\n]/u', '', $subject));
}

fclose($old);
fclose($new);

rename($tmpFile, $file);
Chad