55

I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).

(It doesn't matter to me whether the function is Cocoa, ObjC, Python, etc.)


Of course, there might be infinitely many characters that count as strange. Thus, a blacklist that keeps growing over time is not really a solution.

I could use a whitelist. However, I don't really know how to define it. [a-zA-Z0-9 .] is a start, but I also want to accept Unicode characters that can be displayed in a normal way.
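One way such a whitelist could be defined is through Unicode general categories. The following is only a rough sketch: the choice of categories (letters, combining marks, numbers), the extra allowed ASCII characters and the helper name keep_displayable are all assumptions for illustration, not an established definition of "sane".

```python
import unicodedata

# Assumption: "sane" means letters (L*), combining marks (M*) and
# numbers (N*), plus a few explicitly allowed ASCII characters.
ALLOWED_MAJOR_CATEGORIES = {"L", "M", "N"}
EXTRA_ALLOWED = set(" ._-")


def keep_displayable(text):
    """Drop every character outside the whitelist (hypothetical helper)."""
    # NFC composes base letter + combining mark where a composed form exists.
    text = unicodedata.normalize("NFC", text)
    return "".join(
        c
        for c in text
        if c in EXTRA_ALLOWED
        or unicodedata.category(c)[0] in ALLOWED_MAJOR_CATEGORIES
    )


print(keep_displayable("ad\nbla'{-+\\)(ç?"))
```

Control characters such as the line break (category Cc) and format characters such as a zero-width space (category Cf) fall outside the whitelist and are dropped. Look-alike letters from different scripts still pass, so this does not address the homograph problem.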

Peter Hosey
Albert
  • Am I correct in understanding that you want this to be internationalizable? – N_A Sep 13 '11 at 18:10
  • @mydogisbox: No, just a single (unicode) filename from the input. – Albert Sep 13 '11 at 18:24
  • 4
    “no "strange" characters… but I also want to accept unicode chars which can be displayed in a normal way.” The problem is that there's an intersection between those sets. For example, if a user writes an article about [Феликс Дзержинский](http://en.wikipedia.org/wiki/Feliks_Dzerzhinsky), is that ‘р’ a Latin ‘p’ or a Cyrillic ‘p’? (Yes, they really are two different characters. Paste into UnicodeChecker to see.) – Peter Hosey Sep 13 '11 at 18:44
  • 2
    … As for why that's a “strange” character, a few years ago, there was a flurry of news and analysis reports about how phishing scammers had started using characters like that to make fake but real-looking domain names (“paypal.com”, for a made-up-just-now example). Browsers such as Safari now render such domains as “Punycode” (bit like half-base64 half-ASCII) for that reason. So, that character and the many others like it can be used for good **or** evil—and that's the problem. – Peter Hosey Sep 13 '11 at 18:51
  • Since this isn't a one-to-one character mapping, it sounds like you'll also need to check for duplicate filenames. – octern May 21 '12 at 00:27
  • 3
    Duplicate: http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python – jmetz Mar 12 '14 at 13:36
  • -1. I don't think this question is well defined at all. "Sane" and "strange" mean nothing. Either accept anything that the filesystem actually accepts (in which case this question is a duplicate), or accept a clearly defined subset of ascii (in which case this question is trivial). – Clément Dec 12 '15 at 01:29
  • @Clément: Ofc it's not well defined. The question was also in the sense if there maybe is some straight-forward answer, so your comment is kind of the answer "no, there is not" - but I don't know that. Maybe Unicode defines something like invisible (strange) chars, or canonical chars or so. I don't know. Anyway, the accepted answer is kind of straight-forward and I'm happy with it now. And it's neither the two cases you describe, it's much better. – Albert Dec 12 '15 at 13:31

11 Answers

76

Python:

"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()

This accepts Unicode characters but removes line breaks, etc.

Example:

filename = u"ad\nbla'{-+\)(ç?"

gives: adblaç

Edit: str.isalnum() checks for alphanumeric characters in one step (see the comment from queueoverflow below). danodonovan suggested also keeping the dot:

    keepcharacters = (' ','.','_')
    "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
wallyk
Remi
  • 1
    Oh cool, yeah, I didn't know `str.isalpha()` also works for such Unicode chars. – Albert Sep 13 '11 at 18:25
  • Doesn't this also omit spaces? – Peter Hosey Sep 13 '11 at 18:37
  • It does actually... Is that a problem here for @Albert? Otherwise just add `or x==' '`. The overhead is small because it will be the last thing to look for. – Remi Sep 13 '11 at 18:45
  • @Peter: Yea, but from this answer, it was easy enough to make my own function exactly fitting my needs. `c.isalpha()` is close enough to what I searched for. Of course, it's still not perfect (and you gave a good example in your comment on the question about different "p"s). – Albert Sep 13 '11 at 19:42
  • 2
    This isn't safe on Windows. First off, you need to protect against legacy device filenames like CON and NUL. Second, what about case sensitivity? You might overwrite another file on accident. Third, filenames with spaces at the end aren't handled correctly by Python on Windows. That's at least three ways to break it off the top of my head. – Antimony Oct 01 '12 at 14:41
  • For your 3rd remark, I added `rstrip()`. As for CON and NUL etc., perhaps the desired file can be checked to end only with one of a fixed list of allowed file extensions? As for case sensitivity and file overwrite: the filename is at least a valid name; the next step should be checking whether the file already exists before you overwrite it (e.g. use `os.path.exists()`) – Remi Oct 03 '12 at 09:29
  • There even is `str.isalnum()` which does alphanumeric on one step. – Martin Ueding Dec 08 '12 at 21:24
  • 2
    To *not* strip out the period (full stop) `.` try ` "".join(c for c in filename if c.isalnum() or c in [' ', '.']).rstrip()` – danodonovan Apr 12 '13 at 11:42
  • Unicode characters can cause problems on some older filesystems - it's probably best to use unidecode or similar to convert characters to safe ASCII characters. Also, it [might be a good idea to remove spaces](http://stackoverflow.com/a/2306003/210945). – naught101 Apr 29 '14 at 01:28
  • A tiny subset of the potentially **dangerous** file names this would pass through: a 5 gigabyte-long file name, `.......`, `nul`, `dir.exe`, the empty string. – Bob Stein Nov 25 '15 at 11:34
12

My requirements were conservative (the generated filenames needed to be valid on multiple operating systems, including some ancient mobile OSs). I ended up with:

    import re
    "".join([c for c in text if re.match(r'\w', c)])

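That whitelists the alphanumeric characters (a-z, A-Z, 0-9) and the underscore. Note that this holds for Python 2 byte strings, or in Python 3 when the re.ASCII flag is passed; by default, Python 3 treats \w as Unicode-aware, so it also matches non-ASCII letters and digits. The regular expression can be compiled and cached for efficiency if there are a lot of strings to be matched. For my case, it wouldn't have made any significant difference.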
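A minimal sketch of the compiled variant, assuming Python 3 and that an ASCII-only result is wanted. The names _NON_WORD and ascii_word_chars_only are made up for this example.

```python
import re

# re.ASCII restricts \w (and therefore \W) to the ASCII definition,
# i.e. \w is [a-zA-Z0-9_]. Without the flag, Python 3 also accepts
# Unicode letters and digits.
_NON_WORD = re.compile(r"\W", re.ASCII)


def ascii_word_chars_only(text):
    """Remove everything except a-z, A-Z, 0-9 and the underscore."""
    return _NON_WORD.sub("", text)


print(ascii_word_chars_only("héllo wörld_42!"))
```

Compiling once at module level avoids looking the pattern up again for every string, and a single sub() call replaces the per-character re.match() loop.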

Ngure Nyaga
7

There are a few reasonable answers here, but in my case I want to take a string that might contain spaces and punctuation and, rather than just removing those characters, replace them with underscores. Even though spaces are allowed in filenames on most OSes, they are problematic. Also, if the original string contained a period, I didn't want it to pass through into the filename, or it would generate "extra extensions" that I might not want (I'm appending the extension myself).

def make_safe_filename(s):
    def safe_char(c):
        if c.isalnum():
            return c
        else:
            return "_"
    return "".join(safe_char(c) for c in s).rstrip("_")

print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : die!!!" ) + ".gif")

prints:

hello_you_crazy_______2579_people______die.gif

Ronan Boiteau
uglycoyote
  • 2
    I think that function might be better if repeated underscores were replaced with a single underscore: `re.sub('_{2,}', '_', 'hello_you_crazy_______2579_people______die___.gif')` gives `'hello_you_crazy_2579_people_die_.gif'` – Xevion Mar 18 '20 at 17:07
  • @Xevion But that increases the chance even more that different strings are mapped to the same filename. – BlackJack Feb 26 '21 at 16:40
7

More or less what has been mentioned here with regexp, but in reverse (replace any NOT listed):

>>> import re
>>> filename = u"ad\nbla'{-+\)(ç1?"
>>> re.sub(r'[^\w\d-]','_',filename)
u'ad_bla__-_____1_'
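The session above appears to be Python 2 (note the u'' in the output). As a sketch of how the same substitution behaves under Python 3, where \w is Unicode-aware by default, the following compares the default with the re.ASCII flag. The \d from the original pattern is dropped here because \w already covers digits.

```python
import re

filename = "ad\nbla'{-+\\)(ç1?"

# Default in Python 3: \w is Unicode-aware, so 'ç' survives.
unicode_result = re.sub(r"[^\w-]", "_", filename)

# With re.ASCII, only a-z, A-Z, 0-9, '_' and '-' survive.
ascii_result = re.sub(r"[^\w-]", "_", filename, flags=re.ASCII)

print(unicode_result)  # ad_bla__-____ç1_
print(ascii_result)    # ad_bla__-_____1_
```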
Filipe Pina
  • 1
    Just use `\W` which "Matches anything other than a letter, digit or underscore. Equivalent to `[^a-zA-Z0-9_]`". – Escape0707 Jan 25 '21 at 10:08
5

If you don't mind importing other packages, werkzeug has a function for sanitizing strings:

from werkzeug.utils import secure_filename

secure_filename("hello.exe")
'hello.exe'
secure_filename("/../../.ssh")
'ssh'
secure_filename("DROP TABLE")
'DROP_TABLE'

#fork bomb on Linux
secure_filename(": () {: |: &} ;:")
''

#delete all system files on Windows
secure_filename("del*.*")
'del'

https://pypi.org/project/Werkzeug/

Anders_K
4

No solutions here, only problems that you must consider:

  • what is the smallest maximum filename length you have to support? (e.g. DOS supports only 8.3 names, i.e. at most 11 characters; most OSes don't support names longer than about 255 characters)

  • what filenames are forbidden in some context? (Windows still doesn't support saving a file as CON.TXT -- see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)

  • remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.

  • is there a risk that filenames will collide -- either due to removal of characters or the same filename being used multiple times?

Alternatively, consider hashing the data and using the hex digest of the hash as the filename.
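A minimal sketch of the hashing idea, assuming SHA-256 and assuming that what gets hashed is the original name string rather than the file contents. The function name hashed_filename and its extension parameter are made up for this example.

```python
import hashlib


def hashed_filename(name, extension=""):
    """Return the SHA-256 hex digest of `name`, plus an optional extension.

    The digest is 64 characters from 0-9 and a-f, so it avoids the
    length, reserved-name and strange-character problems listed above.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return digest + extension


print(hashed_filename("hello you crazy $#^#& 2579 people!!!", ".gif"))
```

The trade-off is that the result is no longer readable, so the original name would have to be stored elsewhere if it still matters.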

Dragon
3

I admit there are two schools of thought regarding DIY vs. dependencies. But I come from the school that prefers not to reinvent wheels and likes to see canonical approaches to simple tasks like this. To wit, I am a fan of the pathvalidate library:

https://pypi.org/project/pathvalidate/

It includes a function, sanitize_filename(), which does what you're after.

I would prefer this to any of the numerous home-baked solutions. Ideally, I'd like to see a sanitizer in os.path that is sensitive to filesystem differences and does not do unnecessary sanitizing. I imagine pathvalidate takes the conservative approach and produces valid filenames that can span at least NTFS and ext4 comfortably, but it's hard to imagine it even bothers with old DOS constraints.

Bernd Wechner
1

The problem with many other answers is that they only deal with character substitutions; not other issues.

Here is a comprehensive universal solution. It handles all types of issues for you, including (but not limited to) character substitution. It should cover all the bases.

Works in Windows, *nix, and almost every other file system.

import re

def txt2filename(txt, chr_set='printable'):
    """Converts txt to a valid filename.

    Args:
        txt: The str to convert.
        chr_set:
            'printable':    Any printable character except those disallowed on Windows/*nix.
            'extended':     'printable' + extended ASCII character codes 128-255
            'universal':    For almost *any* file system. '-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    """

    FILLER = '-'
    MAX_LEN = 255  # Maximum length of filename is 255 bytes in Windows and some *nix flavors.

    # Step 1: Remove excluded characters.
    BLACK_LIST = set(chr(127) + r'<>:"/\|?*')                           # 127 is unprintable, the rest are illegal in Windows.
    white_lists = {
        # set(...) of the string gives a set of characters, not a set holding one long string.
        'universal': set('-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'),
        'printable': {chr(x) for x in range(32, 127)} - BLACK_LIST,     # 0-31, 127 are unprintable,
        'extended' : {chr(x) for x in range(32, 256)} - BLACK_LIST,
    }
    white_list = white_lists[chr_set]
    result = ''.join(x
                     if x in white_list else FILLER
                     for x in txt)

    # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
    # Split on ',' (a bare split() would yield one big string); compare case-insensitively.
    DEVICE_NAMES = set('CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,'
                       'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,'
                       'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,'
                       'CONIN$,CONOUT$,..,.'.split(','))
    if result.upper() in DEVICE_NAMES:
        result = f'-{result}-'

    # Step 3: Truncate long names while preserving the file extension.
    # Only touch names that are too long, so the extension is never appended twice.
    if len(result) > MAX_LEN:
        ext = result[result.rfind('.'):] if '.' in result else ''
        if len(ext) >= MAX_LEN:  # the "extension" alone is too long to keep
            ext = ''
        result = result[:MAX_LEN - len(ext)] + ext

    # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
    # A leading '.' is replaced as well.
    result = re.sub(r'^[. ]', FILLER, result)
    result = re.sub(r'[. ]$', FILLER, result)

    return result

It replaces non-printable characters even if they are technically valid in filenames, because they are not always simple to deal with.

No external libraries needed.

ChaimG
0

Here is what I came up with, inspired by uglycoyote:

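import time

def make_safe_filename(s):
    def safe_char(c):
        # Keep letters, digits and dots; everything else becomes "_".
        if c.isalnum() or c == '.':
            return c
        else:
            return "_"

    safe = ""
    last_replaced = False
    for c in s:
        # Cap the length; the millisecond timestamp makes collisions
        # between truncated names less likely.
        if len(safe) > 200:
            return safe + "_" + str(time.time_ns() // 1000000)

        safe_c = safe_char(c)
        curr_replaced = c != safe_c
        # Collapse a run of replaced characters into a single "_".
        if not last_replaced or not curr_replaced:
            safe += safe_c
        last_replaced = curr_replaced
    return safe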

And to test:

print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!" ) + ".gif")
Martin Kunc
0

Another approach is to specify a replacement for each unwanted symbol. This way the filename may stay more readable.

>>> substitute_chars = {'/':'-', ' ':''}
>>> filename = 'Cedric_Kelly_12/10/2020 7:56 am_317168.pdf'
>>> "".join(substitute_chars.get(c, c) for c in filename)
'Cedric_Kelly_12-10-20207:56am_317168.pdf'
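The same idea can also be sketched with str.maketrans() and str.translate(), which apply the whole mapping in one call. The mapping below is only an assumed example; it additionally replaces the colon, which Windows does not allow in filenames, and turns spaces into underscores instead of dropping them.

```python
# Hypothetical mapping: each unwanted character and its replacement.
substitute_chars = {"/": "-", ":": "-", " ": "_"}
table = str.maketrans(substitute_chars)

filename = "Cedric_Kelly_12/10/2020 7:56 am_317168.pdf"
print(filename.translate(table))
# Cedric_Kelly_12-10-2020_7-56_am_317168.pdf
```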
Dmitry
0

Python:

for c in r'[]/\;,><&*:%=+@!#^()|?^':
    filename = filename.replace(c,'')

(This is just an example of characters you may want to remove.) The r in front of the string makes sure the string is interpreted in its raw format, allowing you to remove the backslash \ as well.

Edit: regex solution in Python (the special characters inside the character class are escaped):

import re
filename = re.sub(r'[\[\]/\\;,><&*:%=+@!#^()|?]', '', filename)
Remi
  • 4
    There might be infinitely many characters which might be strange. It is not really a solution to add more and more to that list over time. – Albert Sep 13 '11 at 17:50
  • I see; are the ALLOWED characters known? – Remi Sep 13 '11 at 17:55
  • I don't really know how to define the allowed chars. Basically I mean all chars which can be displayed and don't have some strange behavior (in that they have negative width or add a newline or so). That is what I mean with 'sane'. That is basically the whole question, because otherwise, it would be trivial. – Albert Sep 13 '11 at 17:59
  • I think you rather want "][[]" to capture both "[" and "]". I'm not sure though – ealfonso Sep 23 '13 at 19:35
  • 1
    @Albert: Unicode is not infinite, and as a user if I'm going to input a file name I don't really want strange program logic to decide what I may or may not put in there. Removing just enough to ensure safety (such as directory separators and relative path markers like `.` and `..`) is fine, but removing more? I'm not sure. – Clément Dec 12 '15 at 01:27
  • 1
    Quite sure this regex is wrong. [ is a special char in regex. – Carson Ip Jan 03 '20 at 09:09
  • -1. The regex solution is clearly untested. As @CarsonIp points out, it uses regex-reserved characters, not only `[`, but also `]*+^?|`. Because of this, the regex fails to compile. Also, this approach just doesn't work well generally, because as the OP points out, a character blacklist simply doesn't scale well at all, so a whitelist is probably preferable. – Graham Mar 31 '20 at 20:36