3

I have a dataframe in a Python script (using pandas) that needs to be sorted by multiple columns, but the case of the values currently messes up the sorting. For example a and A are not equally sorted. First, the upper-case letters are sorted and then the lower-case ones. Is there any easy way to sort them ignoring case. Currently I have something like this:

df = df.sort(['column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7'], ascending=[True, True, True, True, True, True, True])

It is important that the case needs to be ignored for all of the columns and the values mustn't change their case in the final sorted dataframe.

For example column 1 could be sorted like this (ignoring case):

Aaa
aaB
aaC
Bbb
bBc
bbD
CCc
ccd

Also, it would be awesome, if the functionality would work with x number of columns (no hard-coding).

E. Muuli
  • 2,620
  • 4
  • 16
  • 26
  • If you want to ensure all cases are lower, then you can use `.lower()`. However, if you want to ignore the case altogether it is a little more complicated. Suggest having a look here: https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison-in-python – Nick H Jun 07 '17 at 09:40

1 Answers1

2

if you just want to sort according to lower, you could use something like this:

def sort_naive_lowercase(df, columns, ascending=True):
    df_temp = pd.DataFrame(index = df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].str.lower()
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)

If you expect unicode problems, you might do something like this (borrowing from @nick-hale's comment):

def sort_by_caseless_columns(df, columns, ascending=True):
    # https://stackoverflow.com/a/29247821/1562285
    import unicodedata

    def normalize_caseless(text):
        return unicodedata.normalize("NFKD", text.casefold())
    df_temp = pd.DataFrame(index = df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].apply(normalize_caseless)
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)

If you have more possible arguments to pass to the sort_values, you can use **kwargs

If not all the columns are strings, but some are numerical, you might have to include an additional mask or set for the non-numerical columns

Maarten Fabré
  • 6,354
  • 1
  • 13
  • 32