Excel to text conversion properly handle formula and empty cells

Question

I'm trying to convert an Excel file into a tab-separated text file via Apache POI. The workbook has some cells that contain formulas and some empty cells.

Here's a sample of original excel file:

Here's an extract of the final output:

'US'    'USORACLEAP'    SYSTEMREFERENCE SUPPLIERID  SUPPLIERNAME    CLASSIFICATION  VENDOR_SITE_CODE    SUPPLIERADDRESS1    SUPPLIERADDRESS2    STATE   ZIPCODE COUNTRY SOURCE  INVOICENUM  INVOICEDATE PAYMENTDATE LINE_DESC   GL_COMPANY  GL_CODE GL_DESCR    COSTCENTER  CC_DESCR    CURRENCY_CODE   CHECK_NUMBER    NUM_DOCS    SPEND   TERM    PAYMENT_METHOD  SYSTEM_APPROVED PO_DISTRIBUTION_ID  WALKER_COST_CENTER  RGL_LEDGER_ENTITY   
US  US Oracle AP        RANDBETWEEN(3000,100000)    "TEXT "&D2  VENDOR  "TEXT "&D3  "TEXT "&D3  "TEXT "&D3  ONTARIO RIGHT(D2,5) US  "TEXT "&D3  "TEXT "&D3  RANDBETWEEN(43831, 44150)   RANDBETWEEN(44105,44135)    "TEXT "&D3  RIGHT("000"&RANDBETWEEN(1,999),3)   RANDBETWEEN(55000, 60000)   "TEXT "&D3  "TEXT "&D3  "TEXT "&D3  USD RANDBETWEEN(2000000,2100000)    RANDBETWEEN(1,4)    RANDBETWEEN(1,100000)/100   IMMEDIATE   Check           "TEXT"&D2   X2  
US  US Oracle AP        31836   "TEXT "&D3  1099    "TEXT "&D4  "TEXT "&D4  "TEXT "&D4  NY  RIGHT(D3,5) US  "TEXT "&D4  "TEXT "&D4  RANDBETWEEN(43831,44150)    RANDBETWEEN(44105,44135)    "TEXT "&D4  RIGHT("000"&RANDBETWEEN(1,999),3)   RANDBETWEEN(55000,60000)    "TEXT "&D4  "TEXT "&D4  "TEXT "&D4  USD RANDBETWEEN(2000000,2100000)    RANDBETWEEN(1,4)    RANDBETWEEN(1,100000)/100   IMMEDIATE   Check           GSUEDCM03   AF2 
US  US Oracle AP        3504    "TEXT "&D4  VENDOR  "TEXT "&D5  "TEXT "&D5  "TEXT "&D5  NY  RIGHT(D4,5) US  "TEXT "&D5  "TEXT "&D5  RANDBETWEEN(43831,44150)    RANDBETWEEN(44105,44135)    "TEXT "&D5  RIGHT("000"&RANDBETWEEN(1,999),3)   RANDBETWEEN(55000,60000)    "TEXT "&D5  "TEXT "&D5  "TEXT "&D5  USD RANDBETWEEN(2000000,2100000)    RANDBETWEEN(1,4)    RANDBETWEEN(1,100000)/100   IMMEDIATE   ACH         GSUEIT001   AF3 
US  US Oracle AP        3504    "TEXT "&D5  VENDOR  "TEXT "&D6  "TEXT "&D6  "TEXT "&D6  NY  RIGHT(D5,5) US  "TEXT "&D6  "TEXT "&D6  RANDBETWEEN(43831,44150)    RANDBETWEEN(44105,44135)    "TEXT "&D6  RIGHT("000"&RANDBETWEEN(1,999),3)   RANDBETWEEN(55000,60000)    "TEXT "&D6  "TEXT "&D6  "TEXT "&D6  USD RANDBETWEEN(2000000,2100000)    RANDBETWEEN(1,4)    RANDBETWEEN(1,100000)/100   IMMEDIATE   ACH         GSUEIT001   AF4 
US  US Oracle AP        3504    "TEXT "&D6  VENDOR  "TEXT "&D7  "TEXT "&D7  "TEXT "&D7  NY  RIGHT(D6,5) US  "TEXT "&D7  "TEXT "&D7  RANDBETWEEN(43831,44150)    RANDBETWEEN(44105,44135)    "TEXT "&D7  RIGHT("000"&RANDBETWEEN(1,999),3)   RANDBETWEEN(55000,60000)    "TEXT "&D7  "TEXT "&D7  "TEXT "&D7  USD RANDBETWEEN(2000000,2100000)    RANDBETWEEN(1,4)    RANDBETWEEN(1,100000)/100   IMMEDIATE   ACH         GSUEIT001   AF5

As you can see, the 1st row holds the column headers. Some of the cells (D1) have been written out as the formula text instead of the value. The 3rd column doesn't have any values, so all the content after it shifted to the left in the text file.

Here's the code:

private void convertXlsToText(InputStream inputStream, String delimiter, File targetFile) throws IOException {
        StringBuilder sb = new StringBuilder();
        setMinInflateRatio(0);
        try (Workbook wb = create(inputStream)) {
            Sheet firstSheet = wb.getSheetAt(0);

            for (Row nextRow : firstSheet) {
                Iterator<Cell> cellIterator = nextRow.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();
                    switch (cell.getCellType()) {
                        case STRING:
                            sb.append(cell.getStringCellValue()).append(delimiter);
                            break;
                        case BOOLEAN:
                            sb.append(cell.getBooleanCellValue()).append(delimiter);
                            break;
                        case NUMERIC:
                            sb.append(cell.getNumericCellValue()).append(delimiter);
                            break;
                        case FORMULA:
                            sb.append(cell.getCellFormula()).append(delimiter);
                            break;
                        default:
                            sb.append(EMPTY).append(delimiter);
                    }
                }
                sb.append(DEFAULT_LINE_END);
            }
        }

        dumpStringBuilderToFile(sb, targetFile);
    }

Can someone please point out what changes I should make in my code to fix the alignment and the formula issue? PS: I'm using TAB (\t) as my delimiter.

UPDATE: Here's the updated code after suggestions.

    private void convertXlsToText(InputStream inputStream, String delimiter, File targetFile) throws IOException {
        StringBuilder sb = new StringBuilder();
        setMinInflateRatio(0);
        try (Workbook wb = create(inputStream)) {
            Sheet firstSheet = wb.getSheetAt(0);
            FormulaEvaluator evaluator = wb.getCreationHelper().createFormulaEvaluator();
            DataFormatter formatter = new DataFormatter();
            for (Row nextRow : firstSheet) {
                Iterator<Cell> cellIterator = nextRow.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();
                    if (cell != null) {
                        sb.append(format("%-20s", formatter.formatCellValue(cell, evaluator))).append(delimiter);
                    } else {
                        sb.append(format("%-20s", EMPTY)).append(delimiter);
                    }
                }
                sb.append(DEFAULT_LINE_END);
            }
        }

        dumpStringBuilderToFile(sb, targetFile);
    }

Renis1235 · Answer 1 · 2021-03-17T10:55:55.610


To get the value of a formula cell rather than the formula itself, see the implementation below:

    FormulaEvaluator evaluator = myWorkbook.getCreationHelper().createFormulaEvaluator();

    CellValue cellValue = evaluator.evaluate(cell); // where cell is your formula cell

    // Recent POI versions (as in the question's code) use the CellType enum
    // instead of the old Cell.CELL_TYPE_* int constants.
    switch (cellValue.getCellType()) {
        case BOOLEAN:
            System.out.println(cellValue.getBooleanValue());
            break;
        case NUMERIC:
            System.out.println(cellValue.getNumberValue());
            break;
        case STRING:
            System.out.println(cellValue.getStringValue());
            break;
        case BLANK:
            break;
        case ERROR:
            break;
        default:
            break;
    }

EDIT:

Regarding the alignment issue, check this: How can I pad a String in Java?
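As a minimal sketch of the padding approach from that linked question, using only the JDK: `String.format` with a `%-Ns` specifier left-justifies a value and pads it with spaces up to N characters. The class and method names here (`PadDemo`, `padRight`) are hypothetical and chosen only for illustration.

```java
class PadDemo {
    // Left-justify value and pad it with spaces to at least `width` characters.
    // width must be positive; values longer than width are returned unchanged.
    static String padRight(String value, int width) {
        return String.format("%-" + width + "s", value);
    }

    public static void main(String[] args) {
        // An empty cell becomes a run of spaces, so every column keeps its place.
        String line = padRight("US", 6) + "\t" + padRight("", 6) + "\t" + padRight("VENDOR", 6);
        System.out.println("[" + line + "]");
    }
}
```

Note that padding only affects how the file looks; the tab delimiter is still what separates the columns.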

edited Mar 17 '21 at 10:55

answered Mar 17 '21 at 10:25


Yes, I use `\t`. – saran3h Mar 17 '21 at 10:26
@saran3h you should check whether the value you get when reading a cell (in column D, for example) is an empty string, and in that case you can emit several `\t` to match the formatting – Renis1235 Mar 17 '21 at 10:28
@saran3h also check this, to correctly format your file : https://stackoverflow.com/questions/388461/how-can-i-pad-a-string-in-java/391978#391978 – Renis1235 Mar 17 '21 at 10:31

Thanks, the link helped a lot. I have fixed the formatting of all cells using a String formatter. – saran3h Mar 17 '21 at 10:46
@saran3h glad it helped, I edited my answer so that it can also help somebody else. – Renis1235 Mar 17 '21 at 10:56

Axel Richter · Accepted Answer · 2021-03-18T07:44:31.467

If the requirement is to write Excel data into a text file, then all cell values need to be retrieved as String. A convenient way to do so is the DataFormatter of Apache POI. With DataFormatter you get the cell values as they are shown in the Excel sheet, e.g. with number formats and date formats applied. And if you use DataFormatter together with a FormulaEvaluator, formulas get evaluated and the evaluated values are converted to String.

To avoid missing empty cells, one needs to get the cell count first, because the cell iterator skips empty cells. For example, the cell count of the header row can serve as the cell count for every further row.

So the whole code would be as simple as this:

import org.apache.poi.ss.usermodel.*;
import java.io.*;

class ExcelToText {
 static final String DEFAULT_LINE_END = System.getProperty("line.separator");

 static void convertXlsToText(InputStream inputStream, String delimiter, OutputStream outputStream) throws Exception {
  StringBuilder sb = new StringBuilder();
  Workbook workbook = WorkbookFactory.create(inputStream);
  DataFormatter dataFormatter = new DataFormatter(java.util.Locale.US);
  FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
  String cellValue = "";
  Sheet sheet = workbook.getSheetAt(0);
  Row headerRow = sheet.getRow(0);
  int cellCount = 0;
  if (headerRow != null) {
   cellCount = headerRow.getLastCellNum();
  }
  if (cellCount > 0) {
   for (Row row : sheet) {
    for (int c = 0; c < cellCount; c++) {
     Cell cell = row.getCell(c, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
     cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
     sb.append(cellValue);
     if (c < cellCount-1) sb.append(delimiter);
    }
    sb.append(DEFAULT_LINE_END);
   }
  }
  workbook.close();
  BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(outputStream, java.nio.charset.StandardCharsets.UTF_8));
  bw.append(sb);
  bw.flush();
  bw.close();
 }

 public static void main(String[] args) throws Exception {
  convertXlsToText(new FileInputStream("./Excel.xlsx"), "\t", new FileOutputStream("./Data.txt"));
 }
}

No CellType checking and no extra formula evaluation is needed.

To your other requirement: A delimited text file should only contain the real content, separated by the delimiter. The content should not be manipulated. So prepending spaces to the content, or filling it up with spaces to a fixed width, is not a good idea in my opinion. If you have the tab as the delimiter, for example, then only the tab stops set in the text viewer should affect the view. Additional spaces will only get in the way.
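A small JDK-only sketch of why padding changes the data: once a line is split on the delimiter, a padded field no longer equals the original value unless the consumer trims it. The names `FieldDemo` and `parseLine` are hypothetical. Note the `-1` limit passed to `String.split`, without which trailing empty fields would be dropped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FieldDemo {
    // Split one line on the delimiter and trim each field.
    // The -1 limit keeps trailing empty fields, so the column count is preserved.
    static List<String> parseLine(String line, String delimiter) {
        return Arrays.stream(line.split(Pattern.quote(delimiter), -1))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String clean = "US\t\tVENDOR\t";
        String padded = "US    \t      \tVENDOR\t      ";
        // Both lines yield the same four fields once each field is trimmed.
        System.out.println(parseLine(padded, "\t").equals(parseLine(clean, "\t")));
        // Without trimming, the raw padded field still carries its spaces.
        System.out.println(padded.split("\t", -1)[0].equals("US"));
    }
}
```

So a padded file remains parseable, but every consumer has to remember the trim step; an unpadded file needs no such convention.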

Makes sense. Check out the updated code. And about padding the content with spaces: the idea is to process the text file further and store all of its contents in database tables. The delimiter will always ensure that the right content goes into the right column regardless of spaces. All the spaces will be trimmed in the processing step. Maybe the spaces are completely unnecessary, but they make things readable to human eyes. – saran3h, Mar 18 '21 at 07:39
