Java Regex printing ? character

Question

I have the below code and I am trying to extract all the pieces of data from the file "file.txt". Currently this file has only one line:

id-123:value 123

package demo;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class s {

    public static void main(String[] args) throws Exception {
        final String regex = ":[^\\d].*";

        File file = new File("C:\\Users\\user\\Desktop\\file.txt");
        String text, id;
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            text = line.replaceAll("(^id-\\d*):+", "");
            id = line.replaceAll(":\\S.*", "");

            System.out.println(text);
            System.out.println(id);
        }
    }
}

I am able to read the file and get this line correctly, but when I print it to the console, I get the output below:

?id-123:value 123
?id-123

Where does the question mark come from? The text file is saved as a UTF-8 file, and I am also reading it as UTF-8. I am running it in Eclipse.

Also, when I run this line of code on a hard-coded string, I get the proper output, value 123:

System.out.println(string.replaceAll("(^id-\\d*):+", ""));

But when the same regex is applied to the same line read from the file, it prints:

?id-123:value 123

score 0 · Answer 1 · 2018-04-19T12:18:14.297


Where does the question mark come from?

It seems that your editor saved your file in "UTF-8 with BOM" encoding. For example, in the Notepad++ editor, you can specify an encoding "UTF-8 without BOM" and then question marks will not be shown.

For more details:

edited Apr 19 '18 at 12:18

answered Apr 19 '18 at 10:29

score 0 · Accepted Answer · answered Apr 19 '18 at 15:42

Where does the question mark come from? The text file is saved as UTF-8 file, and reading is also UTF-8. Trying to run it in eclipse.

This error took me hours to figure out the first time I ran into it, but I was lucky enough to track it down. As Aleksey mentioned in his answer, this happens because a BOM gets prepended to your UTF-8 encoded file.

What is a BOM, you ask? A BOM (byte order mark) is the character U+FEFF placed at the very beginning of a text stream; in a UTF-8 encoded file it is stored as the three bytes EF BB BF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for it. In UTF-8, the only purpose of the BOM is to signal "I am a Unicode encoded text stream" to parsers or whatever else consumes the file, or to indicate that the stream was converted from one that contained an optional BOM.
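To make this concrete, here is a small sketch of what likely happens with your file. It builds the bytes of a BOM-prefixed line in memory instead of reading file.txt, and the class and method names (BomBytesDemo, decodeUtf8) are made up for illustration. It also assumes your console uses a single-byte charset such as ISO-8859-1 or Cp1252, which cannot represent U+FEFF and so substitutes a question mark.

```java
import java.nio.charset.StandardCharsets;

public class BomBytesDemo {

    // Decodes raw bytes as UTF-8, much like InputStreamReader(..., "UTF-8") does.
    // Java's UTF-8 decoder keeps the BOM as an ordinary character.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulated file content: the UTF-8 BOM (EF BB BF) followed by the line.
        byte[] body = "id-123:value 123".getBytes(StandardCharsets.UTF_8);
        byte[] fileBytes = new byte[3 + body.length];
        fileBytes[0] = (byte) 0xEF;
        fileBytes[1] = (byte) 0xBB;
        fileBytes[2] = (byte) 0xBF;
        System.arraycopy(body, 0, fileBytes, 3, body.length);

        String line = decodeUtf8(fileBytes);

        // The three BOM bytes decode to one invisible char, U+FEFF.
        System.out.println(line.length());                        // 17, not 16
        System.out.println(Integer.toHexString(line.charAt(0)));  // feff

        // "^id-" no longer matches because the line starts with U+FEFF,
        // so replaceAll leaves the line unchanged.
        String text = line.replaceAll("(^id-\\d*):+", "");
        System.out.println(text.equals(line));                    // true

        // Encoding U+FEFF to a charset that cannot represent it yields '?'.
        byte[] encoded = "\uFEFF".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) encoded[0]);                    // ?
    }
}
```

This would explain both symptoms in the question: the regex anchored with ^ fails to match, and the leftover U+FEFF shows up as ? when printed.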

You can read more about BOM here. Also the linked questions in Aleksey's post are a good read.

In your case you can simply use a nifty trick to make the program work. It is not the best solution, but it's not the worst either.

Since the BOM only appears at the beginning of the file, you can simply check whether the line begins with the BOM character (written as &#65279;, \uFEFF, or 0xFEFF, depending on the notation):

if (line.startsWith("\uFEFF")) { 
    line = line.substring(1); 
}

This will remove the character from the line. Whether you ever see the BOM yourself depends on the editor you use to view the file; many editors simply hide it.
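Putting the check into the reading loop from the question could look roughly like the sketch below. It is only one way to do it: the class name BomSafeReader and the helpers stripBom and readLines are hypothetical, and the input comes from an in-memory byte array so the example runs without file.txt. In your program you would pass new FileInputStream(file) instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BomSafeReader {

    private static final String BOM = "\uFEFF";

    // Returns the line without a leading U+FEFF, or the line unchanged.
    static String stripBom(String line) {
        return line.startsWith(BOM) ? line.substring(1) : line;
    }

    // Reads all lines as UTF-8 and strips a BOM from the first line only.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                lines.add(first ? stripBom(line) : line);
                first = false;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for file.txt saved as "UTF-8 with BOM".
        byte[] content = "\uFEFFid-123:value 123\n".getBytes(StandardCharsets.UTF_8);

        for (String line : readLines(new ByteArrayInputStream(content))) {
            String text = line.replaceAll("(^id-\\d*):+", "");
            String id = line.replaceAll(":\\S.*", "");
            System.out.println(text); // value 123
            System.out.println(id);   // id-123
        }
    }
}
```

Stripping only on the first line keeps a U+FEFF that legitimately appears later in the file untouched, which the per-line check would also remove.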
