52

I'm currently using pandas to read an Excel file and present its sheet names to the user, so he can select which sheet he would like to use. The problem is that the files are really big (70 columns x 65k rows), taking up to 14s to load on a notebook (the same data in a CSV file is taking 3s).

My code in panda goes like this:

xls = pandas.ExcelFile(path)
sheets = xls.sheet_names

I tried xlrd before, but obtained similar results. This was my code with xlrd:

xls = xlrd.open_workbook(path)
sheets = xls.sheet_names

So, can anybody suggest a faster way to retrieve the sheet names from an Excel file than reading the whole file?

Mogsdad
  • 40,814
  • 19
  • 140
  • 246
pcarvalho
  • 565
  • 1
  • 4
  • 7

6 Answers6

67

you can use the xlrd library and open the workbook with the "on_demand=True" flag, so that the sheets won't be loaded automaticaly.

Than you can retrieve the sheet names in a similar way to pandas:

import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
print xls.sheet_names() # <- remeber: xlrd sheet_names is a function, not a property
philshem
  • 22,161
  • 5
  • 54
  • 110
Colin O'Coal
  • 1,207
  • 10
  • 10
  • the xlrd documentation says this works only with BIFF >= 5.0. It does not work for me, and if I ask xls.biff_version it says "0" which seems strange (impossible?). Any idea what may be happening? – famargar Jan 20 '17 at 15:02
  • You could also pass the result from `xlrd.open_workbook` to `pandas.ExcelFile` and interact with the object as you used to: `xls = pandas.ExcelFile(xlrd.open_workbook(path, on_demand=True))` – hugovdberg May 10 '19 at 11:36
  • 1
    This is too slow just for sheet names. – MehmedB Jul 16 '19 at 09:39
  • it spent 11 seconds for 5MB file. It seems it loads whole file – rluts May 12 '20 at 13:04
  • @rluts I agree, this approach is still very slow. Is there a better solution? – Boosted_d16 Jun 09 '20 at 09:41
11

I have tried xlrd, pandas, openpyxl and other such libraries and all of them seem to take exponential time as the file size increase as it reads the entire file. The other solutions mentioned above where they used 'on_demand' did not work for me. The following function works for xlsx files.

def get_sheet_details(file_path):
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()

    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['sheetId'], # can be @sheetId for some versions
                'name': sheet['name'] # can be @name
            }
            sheets.append(sheet_details)

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.

Benchmarking: (On a 6mb xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

Dhwanil shah
  • 358
  • 2
  • 10
  • You need to configure a bunch of Django settings to use this. Can you tell me how to do this? – Corey Levinson May 28 '19 at 16:05
  • Skirting around the Django requirements, I am also met with this error: `File is not a zip file` My file is a Microsoft Excel 97-2003 Worksheet (.xls) Maybe it doesn't work for old versions of Excel – Corey Levinson May 28 '19 at 16:11
  • 1
    @CoreyLevinson, As mentioned in the post, this works for xlsx files only as .xlsx files use Office Open XML, which is a zipped XML format. .xls files use the BIFF format. I have not yet come across a requirement for xls file. But if you can find a workaround to unzip and read inner content, I think this approach will work for you as well. Also, no django settings have to be altered I only use media_cdn or BASE_DIR and create a temp dir somewhere relative to it for all this. – Dhwanil shah May 29 '19 at 06:43
  • 2
    Your code saved hours of efforts, these code has one catch that I had to fix which is basically if Excel has only one sheet it throws an error which I fixed using these one-liner `sheets1 = dictionary['workbook']['sheets']['sheet'] if not(isinstance(sheets1, list)): sheets1 =[sheets1]` – piyushmandovra Aug 24 '20 at 11:04
  • 1
    @piyushmandovra I did the same fix and came back here to add a comment about it only to find your comment here lol. – Diptangsu Goswami Feb 04 '21 at 15:19
4

From my research with the standard / popular libs this hasn't been implemented as of 2020 for xlsx / xls but you can do this for xlsb. Either way these solutions should give you vast performance improvements. for xls, xlsx, xlsb.

Below was benchmarked on a ~10Mb xlsx, xlsb file.

xlsx, xls

from openpyxl import load_workbook

def get_sheetnames_xlsx(filepath):
    wb = load_workbook(filepath, read_only=True, keep_links=False)
    return wb.sheetnames

Benchmarks: ~ 14x speed improvement

# get_sheetnames_xlsx vs pd.read_excel
225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

xlsb

from pyxlsb import open_workbook

def get_sheetnames_xlsb(filepath):
  with open_workbook(filepath) as wb:
     return wb.sheets

Benchmarks: ~ 56x speed improvement

# get_sheetnames_xlsb vs pd.read_excel
96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:

Glen Thompson
  • 5,326
  • 1
  • 37
  • 37
0

By combining @Dhwanil shah's answer with the answer here I wrote code that is also compatible with xlsx files that have only one sheet:

def get_sheet_ids(file_path):
sheet_names = []
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    xml = zip_ref.open(r'xl/workbook.xml').read()
    dictionary = xmltodict.parse(xml)

    if not isinstance(dictionary['workbook']['sheets']['sheet'], list):
        sheet_names.append(dictionary['workbook']['sheets']['sheet']['@name'])
    else:
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_names.append(sheet['@name'])
return sheet_names
JvdB
  • 43
  • 6
0

Python code adaptation with full pathlib path filename passed (e.g., ('c:\xml\file.xlsx')). From Dhwanil shah answer, without Django method used to create a temp dir.

import xmltodict
import shutil
import zipfile


def get_sheet_details(filename):
    sheets = []
    # Make a temporary directory with the file name
    directory_to_extract_to = (filename.with_suffix(''))
    directory_to_extract_to.mkdir(parents=True, exist_ok=True)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],  # can be sheetId for some versions
                'name': sheet['@name']  # can be name
            }
            sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets
-2

you can also use

data=pd.read_excel('demanddata.xlsx',sheet_name='oil&gas')
print(data)   

Here demanddata is the name of your file oil&gas is one of your sheet name.Let there may be n number of sheet in your worksheet.Just Give the Name of the sheet which you like to fetch at Sheet_name="Name of Your required sheet"

  • I'd recommend explaining why this method would be faster than his current approach, or showing that it is faster in certain cases. (Provide more evidence in general) This may be a great answer, but providing more basis as to why it is great will help readers evaluate it. – Xgongiveittoya Oct 08 '19 at 19:17
  • 1
    Thanks @Xgongiveittoya .Thats a good suggestion & I will do it. – HemanthReddy Oct 17 '19 at 18:31
  • This does not do what the OP requested, namely to fetch the names of the sheets from a workbook (initially unknown). Instead, this extracts the data from a sheet that is already known. – Diomedea Sep 07 '20 at 07:32