3

I want to be able to output both "==" and "=" as tokens.

For example, the input text file is:

biscuit==cookie apple=fruit+-()

The current output:

biscuit
=
=
cookie
apple
=
fruit
+
-
(
)

What I want the output to be:

biscuit
==
cookie
apple
=
fruit
+
-
(
)

Here is my code:

    Scanner s = null;
    try {
        s = new Scanner(new BufferedReader(new FileReader("input.txt")));
        s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");

        while (s.hasNext()) {

            String next = s.next();
            System.out.println(next);
        }
    } finally {
        if (s != null) {
            s.close();
        }
    }

Thank you.

Edit: I want to be able to keep the current regex.

Codemon

4 Answers

5

Just split the input string using the regex below.

String s = "biscuit==cookie apple=fruit"; 
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));

Output:

[biscuit, ==, cookie, apple, =, fruit]

Explanation:

  • \\s+ Matches one or more space characters.
  • | OR
  • \\b(?==+) Matches a word boundary only if it's followed by a = symbol.
  • | OR
  • (?<==) Lookbehind: the position must be preceded by a = symbol.
  • (?!=) Negative lookahead: the position must not be followed by a = symbol.

Update:

String s = "biscuit==cookie apple=fruit+-()"; 
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));

Output:

[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
Avinash Raj
  • you mean like this? `s.useDelimiter("\\s|(?=[\\s=]+)|(?<=\\p{Punct})|(?=\\p{Punct})");` ? I want to be able to keep what I did here. – Codemon Oct 19 '14 at 17:06
  • But I need to be able to output tokens such as separators (`; ] )` ) and operators ( `+ - %`) as separate from the string and if I add your regex to mine it doesn't work. – Codemon Oct 19 '14 at 17:13
  • post an example which reflect the above comment. – Avinash Raj Oct 19 '14 at 17:15
  • sorry but I found better answer. may be I should be more patient. I know you are an expert but please have explanation so I can learn from you. +1 for added explanation :) – Kick Buttowski Oct 19 '14 at 17:22
  • Yes it works! But I'm going to have to go with the last answer to this question as it is a shorter solution which fits because I have lots of separators and operators. Thank you sir. +1 – Codemon Oct 19 '14 at 18:03
2

In other words, you want to split on

  1. one or more whitespace characters
  2. a position which has = after it and a non-= before it (like foo|= where | represents this position)
  3. a position which has = before it and a non-= after it (like =|foo where | represents this position)

As a delimiter:

s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
//             ^^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases:         1)        2)        3)
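To see the three cases working together, here is a minimal sketch that feeds the delimiter to a Scanner over a String instead of a file. The class name `EqualsDelimiterSketch` and the helper `tokenize` are made up for this illustration; only the delimiter comes from the answer. Note that this delimiter handles whitespace and runs of = only, so other operators such as + or ( stay attached to their neighbours.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
class EqualsDelimiterSketch {
    // Case 1: whitespace; case 2: gap before a run of '='; case 3: gap after a run of '='.
    static final String DELIMITER = "\\s+|(?<!=)(?==)|(?<==)(?!=)";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the Scanner, replacing the manual finally block.
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(DELIMITER);
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line: biscuit, ==, cookie, apple, =, fruit
        for (String token : tokenize("biscuit==cookie apple=fruit")) {
            System.out.println(token);
        }
    }
}
```

Swapping `new Scanner(input)` for the asker's file-backed Scanner should leave the rest unchanged.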

Since it looks like you are building a parser, I would suggest using a tool that lets you write a proper grammar, such as http://www.antlr.org/. But if you must stick with regex, another improvement that makes the regex easier to build is to use Matcher#find instead of a Scanner delimiter. That way your regex and code could look like this:

    String data = "biscuit==cookie apple=fruit+-()";

    String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(data);

    while (m.find())
        System.out.println(m.group());

Output:

biscuit
==
cookie apple
=
fruit
+
-
(
)

You can make this regex more general by using

String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
//                       ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
//              ^^ ^^ ^^------------------------- special cases
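Note that `\\P{Punct}+` (like the negated class in the first version) also matches whitespace, which is why "cookie apple" comes out as a single token in the output above. A possible variation, sketched here under the assumption that whitespace should only separate tokens, is to exclude `\\s` from the catch-all alternative so that find() simply skips over it. The class name `FindTokenizerSketch` and the method `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FindTokenizerSketch {
    // Two-character operators first, then any single punctuation character,
    // then a run of characters that are neither punctuation nor whitespace.
    static final Pattern TOKEN =
            Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        // Whitespace matches no alternative, so find() steps past it.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("biscuit==cookie apple=fruit+-()")) {
            System.out.println(token);
        }
    }
}
```

The order of the alternatives matters: `==` has to be tried before `\\p{Punct}`, otherwise each = would be matched on its own.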

This approach also requires reading the file first and storing its contents in a single String, which you then parse. There are many ways to read text from a file; see for instance this question: Reading a plain text file in Java

So you can use something like

String data = new String(Files.readAllBytes(Paths.get("input.txt")));

You can specify the encoding the String should use when decoding the bytes from the file with the constructor String(bytes, encoding). So you can write it as new String(bytes, "UTF-8") or, to avoid typos in the encoding name, use one of the constants in the StandardCharsets class, like new String(bytes, StandardCharsets.UTF_8).

Pshemo
  • @Pshemo See my edited question. If I add your regex to mine it doesn't work. I need to be able to edit it such that all that I've done so far remains. – Codemon Oct 19 '14 at 17:23
  • @Codemon It looks like you are trying to create parser with regex. If so then regex is not the best approach. You should be using grammar tool like [ANTLR](http://www.antlr.org/) instead. Anyway I will try to update my answer but that will be the last one (if you will have any other requests about regex and parser then it is sign that you should definitely use grammar instead of regex). – Pshemo Oct 19 '14 at 17:29
  • @Pshemon This is part of my homework and I am requested to use regex – Codemon Oct 19 '14 at 17:31
  • @Codemon Is scanner obligatory also? I would say that using `Matcher.find` would be easier than setting delimiter. – Pshemo Oct 19 '14 at 17:33
  • Well I am reading from a file, I'm not quite sure how to read it otherwise – Codemon Oct 19 '14 at 17:38
  • while this is an excellent answer, I used \p{Punct} because I have a lot of separators and operators I need to take into consideration. I have found a shorter solution on the answer below yours. +1 for excellent explanation . – Codemon Oct 19 '14 at 17:59
2

You might be able to qualify the punctuation alternatives with some additional assertions.

 # "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"

   \s 
|  (?<= == )
|  (?<= \p{Punct} )
   (?!
        (?<= = )
        (?= = )
   )
|  (?= \p{Punct} )
   (?!
        (?<= = )
        (?= = )
   )
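As a rough check of the expression above, the following sketch plugs it into a Scanner as a Java string literal and runs it over the sample line from the question. The names `PunctDelimiterSketch` and `tokenize` are hypothetical; the delimiter is copied from the commented line at the top of this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
class PunctDelimiterSketch {
    // Split on whitespace, after "==", and before or after any punctuation
    // character, except at the position between two '=' characters.
    static final String DELIMITER =
            "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(DELIMITER);
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("biscuit==cookie apple=fruit+-()")) {
            System.out.println(token);
        }
    }
}
```

The only change relative to the original delimiter is the added `(?<===)` alternative and the `(?!(?<==)(?==))` guard on the two punctuation alternatives, so other separators and operators keep splitting as before.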

Info update

If some characters aren't covered in \p{Punct} just add them as a separate class within
the punctuation subexpressions.

For engines that don't do certain properties well inside classes, use this ->

 #  Raw:   \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))

    \s 
 |  (?<= == )
 |  (?<= \p{Punct} | [=+] )
    (?!
         (?<= = )
         (?= = )
    )
 |  (?= \p{Punct} | [=+] )
    (?!
         (?<= = )
         (?= = )
    )

For engines that handle properties well inside classes, this is a better one ->

 #  Raw:   \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))

    \s 
 |  (?<= == )
 |  (?<= [\p{Punct}=+] )
    (?!
         (?<= = )
         (?= = )
    )
 |  (?= [\p{Punct}=+] )
    (?!
         (?<= = )
         (?= = )
    )
  • accepted this answer because it's both correct and the shortest solution. thank you sir. – Codemon Oct 19 '14 at 17:56
  • @vks it works in my program.. with the exact same input – Codemon Oct 19 '14 at 18:08
  • @vks That is why I don't like answers on Java questions which are linked to examples in regex101 or other sites where you can't chose regex engine from Java explicitly. Only because something works on regex101 doesn't mean that it will also work in Java (negation is also true, if somethinng doesn't work in regex101 doesn't mean it will also not work in Java). – Pshemo Oct 19 '14 at 18:12
  • @vks Could it be that it doesn't recognize `\p{Punct}` ? If you try puttin in `\\s|(?<=\\p{Punct})` it doesn't do what it should – Codemon Oct 19 '14 at 18:12
  • @Pshemo but what could be the reason........all features are supported by java right???jut additional escaping ? – vks Oct 19 '14 at 18:19
  • @Codemon changed punct to `\p{P}` – vks Oct 19 '14 at 18:21
  • @vks same result. Pshemo is probably right. regex101 must have completely different syntax for some or none at all. Bottom line is that it works for my program and I'm happy :D woohoo! – Codemon Oct 19 '14 at 18:26
  • @vks It seems that `\p{Punct}` in Java represents `[!"#$%&'()*+,-.\/:;<=>?@[\]^_\`{|}~]`. If you use this class instead of `\p{P}` you will get correct results. I am not sure what `\p{P}` stands for in Java or regex101. – Pshemo Oct 19 '14 at 18:27
0
(?===)|(?<===)|\s|(?<!=)(?==)|(?<==)(?!=)|(?=\p{P})|(?<=\p{P})|(?=\+)

You can try this. See demo.

http://regex101.com/r/wQ1oW3/18
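On the question raised in the comments about `\p{P}` versus `\p{Punct}`: in java.util.regex, `\p{P}` is the Unicode general category "Punctuation", while `\p{Punct}` is by default the POSIX class of ASCII punctuation. Characters such as = and + are classified by Unicode as math symbols, so `\p{P}` does not match them, which may be why the expression above needs a separate `(?=\+)` alternative. A small sketch to compare the two classes (the class and method names are made up for this illustration):

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PunctClassSketch {
    // POSIX class: ASCII punctuation, including symbols such as = + < >
    static boolean isPosixPunct(char c) {
        return Pattern.matches("\\p{Punct}", String.valueOf(c));
    }

    // Unicode general category P: excludes math symbols such as = and +
    static boolean isUnicodePunct(char c) {
        return Pattern.matches("\\p{P}", String.valueOf(c));
    }

    public static void main(String[] args) {
        for (char c : "=+()-;".toCharArray()) {
            System.out.println(c + "  \\p{Punct}: " + isPosixPunct(c)
                    + "  \\p{P}: " + isUnicodePunct(c));
        }
    }
}
```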

vks