I am looking for a PHP function to sanitize strings into safe and valid file names with no directory separators (slashes).

Ideally it should be reversible, and it should not scramble the name more than necessary.

Of course I want to prevent intentional directory traversal attacks. But I also want to prevent subfolders being created.

I figured that urlencode() would work, but I wonder if this is sufficient, and/or if there is something better or more popular.

It would also be good if the solution worked equally well on Windows (backslash as directory separator), so that it is portable.

Use case / scenario:

As part of a data import, I want to download files from remote urls into the local filesystem. The urls are from a csv file. Most of them are ok, but they may contain more slashes than expected.

E.g. most of them are like this:
https://files.example.com/pdf/12345.pdf

But then individual files might be like this:
https://files.example.com/pdf/1/2345.pdf

The files should all go into the same directory, e.g. https://files.example.com/pdf/12345.pdf -> /destination/dir/12345.pdf

A file like 1/2345.pdf should not result in a subdirectory. Instead, the / should be escaped in some (reversible) way. E.g. with urlencode() this would be 1%2F2345.pdf.
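For illustration, the urlencode() idea above could be sketched roughly as follows. The helper names encodeFileName() and decodeFileName() and the fixed "/pdf/" prefix are assumptions made for this example. It uses rawurlencode() rather than urlencode(), so a space becomes %20 instead of +. One caveat: neither function touches dots, so "." and ".." pass through unchanged and are escaped by hand here.

```php
<?php
// Hypothetical helper names; not part of PHP itself.

/** Encode an arbitrary string into a single path component. */
function encodeFileName(string $name): string
{
    if ($name === '') {
        throw new InvalidArgumentException('Empty file name');
    }
    // rawurlencode() leaves only letters, digits and "-_.~" as-is,
    // so "/" becomes %2F and "\" becomes %5C.
    $encoded = rawurlencode($name);
    // "." and ".." contain nothing rawurlencode() would change,
    // so escape the dots by hand to avoid directory traversal.
    if ($encoded === '.' || $encoded === '..') {
        $encoded = str_replace('.', '%2E', $encoded);
    }
    return $encoded;
}

/** Reverse of encodeFileName(). */
function decodeFileName(string $fileName): string
{
    return rawurldecode($fileName);
}

$url  = 'https://files.example.com/pdf/1/2345.pdf';
$path = (string) parse_url($url, PHP_URL_PATH); // "/pdf/1/2345.pdf"
$tail = substr($path, strlen('/pdf/'));         // assumes the known "/pdf/" prefix
$file = encodeFileName($tail);                  // "1%2F2345.pdf"
```

This sketch does not deal with Windows reserved device names such as CON or NUL, or with file name length limits.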

donquixote
  • Do you have any other requirements? Readability? Two-way conversion? (i.e. get original URL back from filename?) Url encoding is used by many others, too. – Kaii Feb 28 '16 at 22:02
  • "Ideally it should be reversible, and it should not scramble the name more than necessary" - part of the original question :) – donquixote Feb 28 '16 at 23:24
  • A good answer is one that is useful not just for me but for other visitors. It could focus on the reversible case first (where urlencode() might be the solution of choice) and then suggest one or more alternatives for people with slightly different requirements. – donquixote Feb 28 '16 at 23:27
  • And, just saying: A "reversible" solution also has the advantage that it prevents name clashes. – donquixote Feb 28 '16 at 23:28
  • just updated my answer, take a look – Muhammed Mar 02 '16 at 03:26

3 Answers

You could define a set of replacements. For example, the / character in a file name could be represented by something else, such as "(slash)". Simply use str_replace() to convert in either direction: from URL to file name when saving, and from file name back to URL when looking a file up. This is only one example.
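A minimal sketch of such a replacement table, under the assumption that it should be reversible: a plain token like "(slash)" cannot be undone reliably if the original name already contains that token, so this version also escapes the escape character itself. The names ESCAPES, escapeSeparators() and unescapeSeparators() are made up for the example, and strtr() is used instead of str_replace() because it replaces in a single pass.

```php
<?php
// Hypothetical helpers illustrating a reversible replacement table.
const ESCAPES = [
    '%'  => '%25', // the escape character itself is escaped too, so decoding is unambiguous
    '/'  => '%2F',
    '\\' => '%5C',
];

function escapeSeparators(string $name): string
{
    // strtr() with an array makes one pass and never re-scans its own output.
    return strtr($name, ESCAPES);
}

function unescapeSeparators(string $fileName): string
{
    // Flip the table: '%25' => '%', '%2F' => '/', '%5C' => '\'.
    return strtr($fileName, array_flip(ESCAPES));
}

$file = escapeSeparators('1/2345.pdf');   // "1%2F2345.pdf"
$back = unescapeSeparators($file);        // "1/2345.pdf"
```

Everything other than %, / and \ is left untouched, so names stay readable. Note that "." and ".." are not handled by this sketch.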

Jake Psimos

This should help you.

Input: https://files.example.com/pdf/1/2345.pdf

Output: pdf_1_2345.pdf

$url = 'https://files.example.com/pdf/1/2345.pdf';
$parse = parse_url($url);

//get path, remove first slash
//$path: pdf/1/2345.pdf
$path = substr($parse['path'],1);

//result becomes: pdf_1_2345.pdf
$result = str_replace('/','_',$path);

EDIT: The best bet is to store the remote file URL in a database, hash it (using md5 or similar), save the file locally under that hash, and store the hashed name in the database too.

This way you always know which remote file corresponds to which local file, and vice versa, and you don't have to deal with the original file names locally: local names can be whatever you want, as long as you keep them unique.

Database table:
+----+----------------------------+-----------------+
| id | remote_url                 | local_name      |
+----+----------------------------+-----------------+
| 1  | http://example/.../123.pdf | sdflkfd..dl.pdf |
+----+----------------------------+-----------------+

You get the idea.
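As a rough sketch of the hashing idea (not a definitive implementation): the helper name localNameForUrl() is hypothetical, sha256 is used here in place of md5, and a plain PHP array stands in for the database table.

```php
<?php
// Hypothetical helper; sha256 is used here in place of md5.
function localNameForUrl(string $url): string
{
    $hash = hash('sha256', $url); // 64 hex characters, no separators
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Keep the extension only if it is plain alphanumeric.
    return ctype_alnum($ext) ? "$hash.$ext" : $hash;
}

// Stand-in for the database table: remote_url => local_name.
$map = [];
$url = 'https://files.example.com/pdf/1/2345.pdf';
$map[$url] = localNameForUrl($url);

// Reverse lookup: local name back to remote URL.
$remote = array_search($map[$url], $map, true);
```

The hash itself cannot be reversed, so the mapping has to be stored somewhere; that is the role of the table above.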

Muhammed
  • str_replace() is not reversible, but it is a valid solution. I don't know who was first with this, so +1 to both. – donquixote Feb 28 '16 at 23:31
  • if you like my answer, please accept it. – Muhammed Feb 28 '16 at 23:37
  • oh I see, for reversible, simply use a unique string instead of _, and if that unique string is present in the file name, you choose another one automatically. Choose something like _=DIR=_ , I am sure no filename will have that :)) But it is a valid name. – Muhammed Feb 28 '16 at 23:41
  • What makes any one string more unique than another? And then what do I do if the second string is also present in the file? I am teasing, but if you follow through on this you will get to something similar to the already-existing reversible string transformations. – donquixote Feb 28 '16 at 23:56
  • no one will have a string like the one I posted above. If you are super crazy, once you have all your urls in place, you can check against that string; if it exists, you simply choose another one and check again (by adding any arbitrary character to the end of that string). I am pretty sure you won't have some crazy _=DIR=_ inside any filename. – Muhammed Feb 28 '16 at 23:59
  • I cannot alter the program depending on which data is thrown at it. Imagine I want to publish this as a library for others to use. Also imagine I have some data that was converted with an older version of this, and some other data that was converted with the new version. Now I have to be careful when I reverse / decode the data. – donquixote Feb 29 '16 at 00:13
  • My point is, reversible string transformations do already exist, but what you are describing is not one of them. If I want reversible, I would rather use something that works 100% instead of just 99.9%. If I don't want something reversible, then str_replace() seems good enough. urlencode() seems ok for the reversible case, but I was wondering if there are any reasons to use something else instead. – donquixote Feb 29 '16 at 00:18
  • And of course, the string `_=DIR=` is already a lot more unpleasant to look at than what I would get with `urlencode()`.. – donquixote Feb 29 '16 at 00:19
  • Well, the database solution is still more complex / has more dependencies than just doing urlencode(). An accepted answer would have to explain why a proposed solution is better than urlencode(). Or, of course, simply confirm that urlencode() is the way to go (and optionally explain why). There can be situations where one would prefer this database solution, but I do not think one can generally call this "the best bet". – donquixote Mar 02 '16 at 05:41
  • Renaming files will always involve complications if you are not binding the renamed file to a remote url somehow. That's where the database comes in. In your situation you may want to go without a database, but saving files under urlencoded names locally is the last option I personally would prefer. Something will break eventually and you will be wondering what that file is. A database is not necessarily a complex dependency, but rather peace of mind and a reliable solution. – Muhammed Mar 02 '16 at 05:43
  • Hashing is the only foolproof way I know to do this. Using replace or something like that relies on hackers not outwitting your replace function. Given time, they ultimately will. – Jamie Marshall Dec 14 '20 at 22:02

You can use this function, it replaces all directory separators with an underscore.

function secureFilePath($str)
{
    $str = str_replace('/', '_', $str);
    $str = str_replace('\\', '_', $str);
    $str = str_replace(DIRECTORY_SEPARATOR, '_', $str); // In case it does not equal the standard values
    return $str;
}
cnmicha