Extracting characters and words from a string

Question

I want to scan an input line character by character and produce Strings for the valid tokens, which are “true”, “false”, “^”, “&”, “!”, “(”, and “)”.

For example, if I were given a string such as String line = "true & ! (false ^ true)"

I would have to produce the tokens "true", "&", "!", "(", "false", "^", "true", ")".

I have been trying to use split() to divide the string into tokens and store them in an array, like this: String[] result = line.split(" "). I then use a bunch of if-statements inside a loop to check whether the token at each index matches one of the valid tokens, and return it if so. This is roughly what I have so far:

for (int i = 0; i < result.length; i++) {
    if (result[i].equals("true") || result[i].equals("false") || result[i].equals("^")
        || result[i].equals("&") || result[i].equals("!") || result[i].equals("(")
        || result[i].equals(")")) {
        nextToken = result[i];
    }
}

But obviously this won't extract valid tokens that are adjacent to one another, such as when the string contains something like (true or true^false. The latter should return three tokens: "true", "^", "false". Is there a way to divide a string that doesn't contain spaces or any special characters into the tokens I am interested in?

It sounds like you're trying to perform [lexical analysis](https://en.wikipedia.org/wiki/Lexical_analysis). You should make use of a lexing/parsing library unless you are really determined to write one yourself. — Taylor Hx, Feb 23 '16 at 03:58

score 1 · Accepted Answer · answered Feb 23 '16 at 04:02

So long as the input is accurate, the following will tokenize your input:

import java.util.ArrayList;
import java.util.List;

public class Tokenizer {

    public static void main(String[] args) {

        // Single-character tokens: ^, &, !, (, )
        String SYMBOLS = "^&!()";

        String line = "true&!(false^true)";
        List<String> tokens = new ArrayList<String>();

        char[] in = line.toCharArray();
        for (int i = 0; i < in.length; i++) {
            // Skip spaces between tokens
            if (in[i] == ' ')
                continue;
            if (SYMBOLS.indexOf(in[i]) >= 0) {
                tokens.add(String.valueOf(in[i]));
            } else if (in[i] == 't') {
                // Assumes a 't' always starts "true"; the remaining letters are not checked
                tokens.add("true");
                i += "true".length() - 1;
            } else if (in[i] == 'f') {
                // Assumes an 'f' always starts "false"; the remaining letters are not checked
                tokens.add("false");
                i += "false".length() - 1;
            }
        }

        for (String token : tokens)
            System.out.println(token);

    }
}

Producing output:

true
&
!
(
false
^
true
)

+1. This is a good answer, as long as the input is syntactically correct. Could be improved by checking the whole `true` and `false` tokens. — Taylor Hx, Feb 23 '16 at 04:06
Agreed Taylor. I'll leave it to the OP to add that sort of input conditioning. Thank you. — Ian Mc, Feb 23 '16 at 04:07
Thank you, this makes much more sense than the mess that I was about to create. — , Feb 23 '16 at 04:15
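
The improvement suggested in the comments (checking the whole `true` and `false` tokens rather than only their first letter) could look something like the sketch below. It is not part of the accepted answer: the class name `StrictTokenizer`, the `tokenize` helper, and the choice to throw `IllegalArgumentException` on unexpected input are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class StrictTokenizer {

    // The valid tokens listed in the question.
    private static final String[] TOKENS = {"true", "false", "^", "&", "!", "(", ")"};

    // Hypothetical helper: returns the tokens in order, or throws on invalid input.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            // Whitespace separates tokens but is not a token itself.
            if (Character.isWhitespace(line.charAt(i))) {
                i++;
                continue;
            }
            // Find a valid token that starts exactly at position i.
            String matched = null;
            for (String token : TOKENS) {
                if (line.startsWith(token, i)) {
                    matched = token;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + line.charAt(i) + "' at index " + i);
            }
            tokens.add(matched);
            i += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [true, &, !, (, false, ^, true, )]
        System.out.println(tokenize("true&!(false^true)"));
    }
}
```

With this version, an input such as `tru&false` is rejected at index 0 instead of being silently read as `true`.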

score 0 · Answer 2 · edited May 23 '17 at 11:59

Try using delimiters. They will separate strings based on whatever you set as the tokens. I would take a look at this question for more information: How do I use a delimiter in Java Scanner?

answered Feb 23 '16 at 03:45 · Michael

for the strings like 'true^false', I think this approach cannot be used as he/she needs all the three 'true','^','false' as tokens. – Priyanka.Patil Feb 23 '16 at 03:55
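
To make the limitation raised in that comment concrete, here is a small sketch using `java.util.Scanner` with a whitespace delimiter. The class name `DelimiterDemo` and the helper `splitOnWhitespace` are made up for this example; only `Scanner`, `useDelimiter`, `hasNext`, and `next` come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {

    // Hypothetical helper: collects the pieces a whitespace-delimited Scanner returns.
    static List<String> splitOnWhitespace(String line) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            scanner.useDelimiter("\\s+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tokens separated by spaces come out individually.
        System.out.println(splitOnWhitespace("true ^ false")); // prints [true, ^, false]
        // Adjacent tokens stay glued together, which is the problem in the question.
        System.out.println(splitOnWhitespace("true^false"));   // prints [true^false]
    }
}
```

A delimiter only helps when something actually separates the tokens, so for input like `true^false` a character-by-character scan or a regex match on the tokens themselves is probably the better fit.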

score 0 · Answer 3 · 2016-02-23T04:09:18.010

Edit: if you need the exact count in the exact order, you could do this:

import java.util.ArrayList;

public class TokenCount {
    public static void main(String[] args)
    {
        final String TOKENS = "true,false,!,),(";
        String[] splittedTokens = TOKENS.split(",");
        String data = "'true','^','false'";

        // Collect every known token that appears somewhere in the data.
        // contains() only reports presence, so each token is added at most once.
        ArrayList<String> existingTokens = new ArrayList<String>();
        for (int i = 0; i < splittedTokens.length; i++)
        {
            if (data.contains(splittedTokens[i]))
            {
                existingTokens.add(splittedTokens[i]);
            }
        }

        // Report a count per token; with the list above this is always 0 or 1.
        for (int i = 0; i < splittedTokens.length; i++)
        {
            int count = 0;
            for (int j = 0; j < existingTokens.size(); j++)
            {
                if (splittedTokens[i].equals(existingTokens.get(j)))
                {
                    count++;
                }
            }
            System.out.println("Number of " + splittedTokens[i] + " = " + count);
        }
    }
}

If you only need all the tokens that the string contains:

public class TokenContains {
    public static void main(String[] args)
    {
        final String TOKENS = "true,false,!,),(";
        String[] splittedTokens = TOKENS.split(",");
        String data = "true^false";

        // Print each known token that appears at least once in the data.
        for (int i = 0; i < splittedTokens.length; i++)
        {
            if (data.contains(splittedTokens[i]))
            {
                System.out.println("The String Contains " + splittedTokens[i]);
            }
        }
    }
}

I think this approach does not take multiple occurrences of a token into account. — Priyanka.Patil, Feb 23 '16 at 04:00
This solution works so long as token order and repetition of tokens is unimportant. — Taylor Hx, Feb 23 '16 at 04:00
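
If repeated tokens do matter, one possible adjustment is to count occurrences with `String.indexOf(String, int)` instead of `contains()`. The sketch below is an assumption about how that might look, not the answerer's code; `TokenCounter` and `countTokens` are hypothetical names, and it still does not recover the order in which tokens appear in the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TokenCounter {

    // Hypothetical helper: counts non-overlapping occurrences of each token in data.
    static Map<String, Integer> countTokens(String data, String[] tokens) {
        // LinkedHashMap keeps the tokens in the order they were supplied.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            int count = 0;
            int from = data.indexOf(token);
            while (from >= 0) {
                count++;
                // Continue searching just past the occurrence that was found.
                from = data.indexOf(token, from + token.length());
            }
            counts.put(token, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"true", "false", "^", "&", "!", "(", ")"};
        // prints {true=2, false=1, ^=1, &=1, !=1, (=1, )=1}
        System.out.println(countTokens("true&!(false^true)", tokens));
    }
}
```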

saka1029 · Answer 4 · 2016-02-23T04:16:42.883

Try this.

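// requires java.util.Arrays
String s = "String line=true&!(false^true)";
// Split on whitespace, after each operator character, or at a word boundary
String[] p = s.split("\\s+|(?<=[!()^&=])|\\b");
System.out.print(Arrays.toString(p));
// -> [String, , line, =, true, &, !, (, false, ^, true, )]

or

// requires java.util.regex.Matcher and java.util.regex.Pattern
String s = "String line=true&!(false^true)";
// Match either a run of word characters or a single operator character
Matcher m = Pattern.compile("\\w+|[()^&|!]").matcher(s);
while (m.find())
    System.out.println(m.group());

Output: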

String
line
true
&
!
(
false
^
true
)

answered Feb 23 '16 at 04:06 · edited Feb 23 '16 at 04:16

Thank you, I'm still new to Java so I'm not familiar with regular expressions, but it's amazing to see that the confusing mess that I was going to write could be written with just one statement. – Feb 23 '16 at 04:18

score 0 · Answer 5 · answered Feb 23 '16 at 05:18

I'd segment using a regular expression. You can set it up to return a list of strings containing only the valid values "true", "false", "^", "&", "!", "(", or ")", or a list of the valid values together with any invalid groupings (in case you want to report an error and indicate what is wrong).

Inside the matcher loop, simply do what you want with the returned string values. Review this code (note, I'm just outputting the values wrapped in curly-brackets, not adding to an array; you do what you want with them.):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class QuickTest {
  public static void main(String[] args) {
    String testIn = "(true^false)&aaa!asa (bbb& ccc)";
    Pattern p1 = Pattern.compile("(true|false|\\^|\\&|\\!|\\(|\\))", Pattern.CASE_INSENSITIVE);
    Matcher m1 = p1.matcher(testIn);
    System.out.println("Match and return only the valid values");
    while (m1.find()) {
      if (m1.group().trim().length() > 0) {
        System.out.println("Found {" + m1.group() + "}");
      }
    }
    Pattern p2 = Pattern.compile("((true|false|\\^|\\&|\\!|\\(|\\))|([^\\^|\\&|\\!|\\(|\\)|\\s*]*)?)", Pattern.CASE_INSENSITIVE);
    Matcher m2 = p2.matcher(testIn);
    System.out.println("Match and return valid and invalid values");
    while (m2.find()) {
      if (m2.group().trim().length() > 0) {
        System.out.println("Found {" + m2.group() + "}");
      }
    }
  }
}

Running this, you get the following output:

Match and return only the valid values
Found {(}
Found {true}
Found {^}
Found {false}
Found {)}
Found {&}
Found {!}
Found {(}
Found {&}
Found {)}
Match and return valid and invalid values
Found {(}
Found {true}
Found {^}
Found {false}
Found {)}
Found {&}
Found {aaa}
Found {!}
Found {asa}
Found {(}
Found {bbb}
Found {&}
Found {ccc}
Found {)}

This has the added benefit that you can actually build the regular expression from a list of valid values stored externally, making it a bit more dynamic.
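
As a rough sketch of that last idea, the pattern can be assembled from a list of tokens at runtime. The names `DynamicTokenPattern`, `buildPattern`, and `findTokens` are hypothetical, and the list is hard-coded here where a real program might load it from a file. `Pattern.quote` is used so that characters such as `^` and `(` are treated literally. Unlike the code above, this sketch matches case-sensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DynamicTokenPattern {

    // Hypothetical helper: joins the quoted tokens into one alternation, e.g. \Qtrue\E|\Qfalse\E|...
    static Pattern buildPattern(List<String> validTokens) {
        String alternation = validTokens.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Hypothetical helper: returns every match of the pattern, in input order.
    static List<String> findTokens(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed to come from an external source in a real program.
        List<String> valid = Arrays.asList("true", "false", "^", "&", "!", "(", ")");
        Pattern pattern = buildPattern(valid);
        // prints [(, true, ^, false, ), &, !, (, &, )]
        System.out.println(findTokens(pattern, "(true^false)&aaa!asa (bbb& ccc)"));
    }
}
```

Because regex alternation tries the options from left to right, a token list in which one token is a prefix of another would need the longer token listed first.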
