73

Today I discovered a bug for Git on Mac OS X.

For example, I want to commit a file named Überschrift.txt, which begins with the German special character Ü. Running git status gives the following output.

Users-iMac: user$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   "U\314\210berschrift.txt"
nothing added to commit but untracked files present (use "git add" to track)

It seems that Git 1.7.2 has a problem with German special characters on Mac OS X. Is there a way to get Git to read the file names correctly?

Peter Mortensen
LuckyMalaka
  • See also [commit 3a59e59](https://github.com/git/git/commit/3a59e5954ef19ac94522219c2f29d49a187d31d8) (01 Jul 2015) by [Karsten Blees (`kblees`)](https://github.com/kblees). (Merged by [Junio C Hamano -- `gitster` --](https://github.com/gitster) in [commit 81bc521](https://github.com/git/git/commit/81bc521af22a6549e93d33e57de40d335e0ee65b), 03 Aug 2015) – VonC Aug 16 '15 at 19:08

7 Answers

89

Enable core.precomposeunicode on the Mac:

git config --global core.precomposeunicode true

For this to work, you need to have at least Git 1.8.2.

Mountain Lion ships with 1.7.5. To get a newer Git, use either git-osx-installer or Homebrew (requires Xcode).

That's it.
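
If you are not sure whether your Git is new enough, a version comparison along these lines may help. This is only a sketch: `version_at_least` is a hypothetical helper name, and `installed` is hard-coded to the 1.7.5 mentioned above so the snippet runs without Git. In practice you would take the value from `git --version`.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Compares up to three dot-separated numeric fields using POSIX sort.
version_at_least() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
  [ "$lowest" = "$2" ]
}

required="1.8.2"
installed="1.7.5"   # sample value; e.g. installed=$(git --version | awk '{print $3}')

if version_at_least "$installed" "$required"; then
  result="ok"
else
  result="too old"
fi
echo "Git $installed: $result (need at least $required)"
```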

mikemaccana
chicken
  • Is there a way to fix paths without re-cloning from the origin? I've cloned a rather large [git-annex](http://git-annex.branchable.com/) repo, with gigabytes of files. – Joel Purra Feb 02 '14 at 14:55
  • While the size of the repo might not be important for this problem and fix, this particular repo contains tens of thousands of files and hundreds of revisions - I'd rather perform an in-place fix. – Joel Purra Feb 02 '14 at 16:04
  • What's odd is that this /sort of/ works for me. I have a large git-annex repo, and when I turn this on, /most/ of the git status noise goes away. But not all of it. :-/ – Justin L. Apr 20 '14 at 20:02
  • It used to work perfectly when I wrote, but recently I too saw some bugs around this again. – chicken Apr 21 '14 at 20:49
  • 25
    Oddly, for me the _opposite_ worked (`git config --global core.precomposeunicode false`). I'm running OS X 10.9.2 and Git 1.8.5.2, with the files stored on a disk image with the HFS+ filesystem. Could it be that Apple changed their implementation? – Philipp Apr 26 '14 at 10:17
  • 1
    Kudos @Philipp — that change did the trick. This would make for an important update to the answer! – danyowdee Aug 26 '14 at 18:15
  • @Philipp's comment works for me, too. Hopefully the answer will be edited. Thanks people! – laurelnaiad Sep 08 '14 at 15:27
  • 2
    I had to set the configuration parameter to `false` on OS X 10.10 and Git 2.0.0. I didn't have to clone nor checkout again. It just worked. – J. B. Rainsberger Nov 01 '14 at 12:54
  • 2
    For me setting it to true (default on Git 2.2.0/Mac OS X 10.9.5) incorrectly shows 5 files with unusual names as untracked. 4 are shown surrounded by double quotes. If I set it to false, 4 of them are tracked but the one without double quotes remains untracked. The 4 probably have Korean characters, whilst the fifth has an umlaut. Any ideas? – Sam Brightman Dec 12 '14 at 11:27
  • 3
    This worked for me, although only after omitting `--global`. – Tim-Erwin Feb 15 '16 at 08:42
35

The cause is the different implementation of how the filesystem stores the file name.

In Unicode, Ü can be represented in two ways: as the single character Ü, or as U followed by a "combining umlaut character". A Unicode string can contain both forms, but since having both is confusing, the file system normalizes file names by converting every umlauted U to one form: either Ü, or U + "combining umlaut character".

Linux uses the former method, called Normal-Form-Composed (or NFC), and Mac OS X uses the latter method, called Normal-Form-Decomposed (NFD).

Apparently Git doesn't care about this point and simply uses the byte sequence of the filename, which leads to the problem you're having.
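
To see the difference at the byte level, here is a small sketch using only printf and od. The octal escapes are the UTF-8 encodings of the two forms; the variable names are just for illustration.

```shell
# Composed form (NFC): U+00DC, UTF-8 bytes C3 9C
nfc=$(printf '\303\234')
# Decomposed form (NFD): U+0055 followed by U+0308, UTF-8 bytes 55 CC 88
nfd=$(printf 'U\314\210')

# Dump each string as hex and strip od's spacing
nfc_hex=$(printf '%s' "$nfc" | od -An -tx1 | tr -d ' \n')
nfd_hex=$(printf '%s' "$nfd" | od -An -tx1 | tr -d ' \n')

echo "NFC bytes: $nfc_hex"
echo "NFD bytes: $nfd_hex"

# A plain byte comparison treats the two spellings as different names
if [ "$nfc" = "$nfd" ]; then
  echo "same byte sequence"
else
  echo "different byte sequences"
fi
```

Both strings render as Ü, but a byte-wise comparison sees two different names. The `\314\210` in the git status output from the question is the decomposed form.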

The mailing list thread "Git, Mac OS X and German special characters" contains a patch that makes Git compare file names after normalization.

Peter Mortensen
Yuji
  • 3
    Umlaut normalization is a huge mistake. A file system should not be built in ways so things running on top have to "care" about strange modifications happening. Ken Thompson would say this is not a feature, it's a symptom. It can break virtually any system--not only git. I've recently copied a web dump. Umlaut normalization broke it, because an html file referenced an image with an umlaut in its file name. I bet it's a security issue as well. – wnrph Nov 06 '12 at 13:19
  • 1
    Actually, Linux does not always use NFC. Linux (as in the kernel and filesystems) just does not care and treats filenames as byte arrays. Normalization is up to the C library and applications; most use NFC, but that's only a convention. – sleske Oct 17 '14 at 10:54
9

The following put in ~/.gitconfig works for me on 10.12.1 Sierra for UTF-8 names:

[core]
    precomposeunicode = true
    quotepath = false

The first option is needed so that git 'understands' UTF-8 and the second one so that it doesn't escape the characters.

el.nicko
5

To make git add file work with umlauts in file names on Mac OS X, you may convert file path strings from composed into canonically decomposed UTF-8 using iconv.

# test case

mkdir testproject
cd testproject

git --version    # git version 1.7.6.1
locale charmap   # UTF-8

git init
file=$'\303\234berschrift.txt'    # composed UTF-8 (Linux-compatible)
touch "$file"
echo 'Hello, world!' > "$file"

# convert composed into canonically decomposed UTF-8
# cf. http://codesnippets.joyent.com/posts/show/12251
# printf '%s' "$file" | iconv -f utf-8 -t utf-8-mac | LC_ALL=C vis -fotc 
#git add "$file"
git add "$(printf '%s' "$file" | iconv -f utf-8 -t utf-8-mac)"  

git commit -a -m 'This is my commit message!'
git show
git status
git ls-files '*'
git ls-files -z '*' | tr '\0' '\n'

touch $'caf\303\251 1' $'caf\303\251 2' $'caf\303\251 3'
git ls-files --other '*'
git ls-files -z --other '*' | tr '\0' '\n'
pete
3

Change the repository's OSX-specific core.precomposeunicode flag to true:

git config core.precomposeunicode true

To make sure new repositories get that flag, also run:

git config --global core.precomposeunicode true

Here is the relevant snippet from the manpage:

This option is only used by Mac OS implementation of Git. When core.precomposeunicode=true, Git reverts the unicode decomposition of filenames done by Mac OS. This is useful when sharing a repository between Mac OS and Linux or Windows. (Git for Windows 1.7.10 or higher is needed, or Git under cygwin 1.7). When false, file names are handled fully transparent by Git, which is backward compatible with older versions of Git.

user1338062
2

It is correct.

Your filename is in UTF-8, with Ü represented as LATIN CAPITAL LETTER U + COMBINING DIAERESIS (Unicode 0x0308, utf8 0xcc 0x88) instead of LATIN CAPITAL LETTER U WITH DIAERESIS (Unicode 0x00dc, utf8 0xc3 0x9c). The Mac OS X HFS file system decomposes Unicode in such a way. Git in turn shows the octal-escape form of the non-ASCII filename bytes.
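
The escaped name can be turned back into raw bytes to check this. A rough sketch, assuming a POSIX shell: `unquote_octal` is a made-up helper name, and it only handles the three-digit octal escapes, not the other escapes Git's path quoting can produce.

```shell
# Hypothetical helper: expand \ddd octal escapes as printed by git status.
unquote_octal() {
  # POSIX printf %b expects octal escapes as \0ddd, so add the leading zero first.
  printf '%b' "$(printf '%s' "$1" | sed 's/\\\([0-7]\{3\}\)/\\0\1/g')"
}

# Name as shown in the question, without the surrounding double quotes.
quoted='U\314\210berschrift.txt'

# Dump the decoded name as hex and keep the first three bytes separately.
hex=$(unquote_octal "$quoted" | od -An -tx1 | tr -d ' \n')
first=$(printf '%s' "$hex" | cut -c1-6)

echo "all bytes:   $hex"
echo "first three: $first"
```

The first three bytes should come out as 55 cc 88, that is, a plain U followed by the combining diaeresis.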

Note that Unicode filenames can make your repository non-portable. For example, msysgit has had problems dealing with Unicode filenames.

Peter Mortensen
laalto
0

I had a similar problem with my personal repository, so I wrote a helper script in Python 3. You can grab it here: https://github.com/sjtoik/umlaut-cleaner

The script needs a bit of manual labour, but not much.

crysaz