I have a file that's too large to fit into memory, from which I need to strip certain characters (control characters to be precise). My current function looks like this:
$old = fopen($file, 'r');
$new = fopen($tmpFile, 'w');
while (!feof($old)) {
    fwrite($new, preg_replace('/[^\P{Cc}\t\r\n]/u', '', fgets($old)));
}
// Close both handles before renaming; rename() on an open file
// fails on some platforms (e.g. Windows).
fclose($old);
fclose($new);
rename($tmpFile, $file);
This works fine in most cases. A possible problem, though, is that fgets reads an entire line. Some files I process are literally huge one-liners, which would still cause memory issues.
This could be fixed by using fread with a chunk size of, say, 8192 bytes. However, a chunk read that way can end in the middle of a multibyte character, so the text I feed to preg_replace could contain cut-off multibyte sequences.
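To illustrate the failure mode (a minimal sketch with a hard-coded string rather than an actual fread chunk): the u modifier makes PCRE validate the subject as UTF-8, and preg_replace returns NULL on invalid input, so a chunk that splits a character doesn't just lose bytes, the whole replacement fails:

// "é" is the two-byte UTF-8 sequence 0xC3 0xA9; cutting the string
// after the lead byte leaves invalid UTF-8 behind.
$text  = "caf\xC3\xA9";        // "café"
$chunk = substr($text, 0, 4);  // "caf" plus the lone lead byte 0xC3
var_dump(preg_replace('/[^\P{Cc}\t\r\n]/u', '', $chunk)); // NULL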
I've been thinking about how I can fread while preserving multibyte characters, but I haven't found a good solution yet. Any help would be awesome.
Possible solution
While I've solved the problem in a different way, I'm still curious about my original question: how do you do an mb-safe fread? I think a function like this could work:
1. Read a chunk of bytes with fread.
2. Inspect the last byte and check whether it's part of a multibyte sequence. If not, stop here.
3. Keep reading bytes until the last byte is not part of a multibyte sequence, or ends the current sequence.
Step 2 could probably use some logic like this, but I'm not experienced enough with Unicode to know how to write it. A sketch of what I have in mind follows below.
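Something like this, perhaps (an untested sketch; fread_mb_safe is just a name I made up, and it assumes the file is valid UTF-8 and the stream is seekable). It relies on the fact that in UTF-8 every continuation byte has the bit pattern 10xxxxxx (0x80-0xBF), so a chunk ends on a character boundary exactly when the next byte in the stream is not a continuation byte:

// Hypothetical helper: read up to $length bytes without splitting
// a UTF-8 character across the chunk boundary.
function fread_mb_safe($handle, $length) {
    $chunk = fread($handle, $length);
    if ($chunk === false || $chunk === '') {
        return $chunk;
    }
    // Peek at the next byte: while it's a continuation byte (10xxxxxx),
    // the chunk ends mid-sequence, so append it (step 3). As soon as the
    // next byte is ASCII or a lead byte, un-read it and stop (step 2):
    // the chunk now ends on a character boundary.
    while (!feof($handle)) {
        $byte = fgetc($handle);
        if ($byte === false) {
            break;
        }
        if ((ord($byte) & 0xC0) !== 0x80) {
            fseek($handle, -1, SEEK_CUR); // not a continuation byte: put it back
            break;
        }
        $chunk .= $byte;
    }
    return $chunk;
}

The loop from my original snippet would then become:

while (!feof($old)) {
    fwrite($new, preg_replace('/[^\P{Cc}\t\r\n]/u', '', fread_mb_safe($old, 8192)));
}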