17

I would like to know how I could transform the given string into the specified array:

String

all ("hi there \(option\)", (this, that), other) another

Result wanted (Array)

[0] => all,
[1] => Array(
    [0] => "hi there \(option\)",
    [1] => Array(
        [0] => this,
        [1] => that
    ),
    [2] => other
),
[2] => another

This is for a kind of console that I'm building in PHP. I tried to use preg_match_all, but I don't know how to match parentheses nested inside other parentheses in order to "make arrays inside arrays".

EDIT

All other characters that are not specified in the example should be treated as part of a string.

EDIT 2

I forgot to mention that all parameters outside the parentheses should be separated by the space character.

Delimitry
Cristiano Santos
  • 6
    You are trying to build a syntax tree, or parse tree. I think regex is not a proper tool for that. – Sina Iravanian Feb 04 '13 at 10:29
  • Then, what should I do? – Cristiano Santos Feb 04 '13 at 10:31
  • 1
    @CristianoSantos Write your own parser. – Leri Feb 04 '13 at 10:32
  • 1
    @CristianoSantos you should loop through the input string, which adds words to an array until a close bracket is visited or input finishes. But upon visiting an open bracket this method must call itself (a recursive call) and use the returned array. – Sina Iravanian Feb 04 '13 at 10:34
  • why not simply split with `[\s,()]+` – Anirudha Feb 04 '13 at 10:35
  • @PLB But how? This is the first time that I'm trying to do that and so, I have no experience with "own parsers" – Cristiano Santos Feb 04 '13 at 10:35
  • @Some1.Kill.The.DJ The last time I tried something like that, my result with preg_match_all on the parentheses was this string: `("hi there", (this, that)`. I haven't tried your suggestion yet, but from what I see, I think I will get the same behavior. – Cristiano Santos Feb 04 '13 at 10:37
  • @CristianoSantos If you think it globally then there will be {},[],<>,special characters etc. then what type of priority you want to use? – Ripa Saha Feb 04 '13 at 10:38
  • @ripa All characters other than those shown above will be treated as `String`. – Cristiano Santos Feb 04 '13 at 10:39
  • @CristianoSantos yes I got it.but there is sub array. my question is on this basis. – Ripa Saha Feb 04 '13 at 10:41
  • @CristianoSantos If you're trying to build your own programming language, you can use already existing syntax generators. – Leri Feb 04 '13 at 10:43
  • @ripa All sub arrays should be detected the same way as the primary array. In other words, all arrays are delimited by "(" and ")" and, if I want to include a parenthesis in one of the parameters, I should escape it with a backslash in the string. Example: `(hi, (first, "second \(other\)"))` – Cristiano Santos Feb 04 '13 at 10:47
  • @PLB Could you give me a link of that? – Cristiano Santos Feb 04 '13 at 10:52

7 Answers

14

The 10,000ft overview

You need to do this with a small custom parser: code that takes input of this form and transforms it into the form you want.

In practice I find it useful to group parsing problems like this in one of three categories based on their complexity:

  1. Trivial: Problems that can be solved with a few loops and humane regular expressions. This category is seductive: if you are even a little unsure if the problem can be solved this way, a good rule of thumb is to decide that it cannot.
  2. Easy: Problems that require building a small parser yourself, but are still simple enough that it doesn't quite make sense to bring out the big guns. If you need to write more than ~100 lines of code then consider escalating to the next category.
  3. Involved: Problems for which it makes sense to go formal and use an already existing, proven parser generator¹.

I classify this particular problem as belonging to the second category, which means that you can approach it like this:

Writing a small parser

Defining the grammar

To do this, you must first define -- at least informally, with a few quick notes -- the grammar that you want to parse. Keep in mind that most grammars are defined recursively at some point. So let's say our grammar is:

  • The input is a sequence
  • A sequence is a series of zero or more tokens
  • A token is either a word, a string or an array
  • Tokens are separated by one or more whitespace characters
  • A word is a sequence of one or more alphabetic characters (a-z, in either case)
  • A string is an arbitrary sequence of characters enclosed within double quotes
  • An array is a series of one or more tokens separated by commas and enclosed in parentheses

You can see that we have recursion in one place: a sequence can contain arrays, and an array is also defined in terms of a sequence (so it can contain more arrays etc).

Treating the matter informally as above is easier as an introduction, but reasoning about grammars is easier if you do it formally.

Building a lexer

With the grammar in hand, you now need to break the input down into tokens so that it can be processed. The component that takes user input and converts it to individual pieces defined by the grammar is called a lexer. Lexers are dumb; they are only concerned with the "outside appearance" of the input and do not attempt to check that it actually makes sense.

Here's a simple lexer I wrote to parse the above grammar (don't use this for anything important; may contain bugs):

$input = 'all ("hi there", (this, that) , other) another';

$tokens = array();
$input = trim($input);
while ($input !== '') { // compare against '' so that an input of "0" is not treated as empty
    switch (substr($input, 0, 1)) {
        case '"':
            if (!preg_match('/^"([^"]*)"(.*)$/', $input, $matches)) {
                die; // TODO: error: unterminated string
            }

            $tokens[] = array('string', $matches[1]);
            $input = $matches[2];
            break;
        case '(':
            $tokens[] = array('open', null);
            $input = substr($input, 1);
            break;
        case ')':
            $tokens[] = array('close', null);
            $input = substr($input, 1);
            break;
        case ',':
            $tokens[] = array('comma', null);
            $input = substr($input, 1);
            break;
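        default:
            // A word is a run of letters; anything else cannot start a token.
            // (Splitting on a lookahead here would yield an empty word and
            // loop forever on input such as "123".)
            if (!preg_match('/^[a-zA-Z]+/', $input, $matches)) {
                die; // TODO: error: unexpected character
            }
            $tokens[] = array('word', $matches[0]);
            $input = substr($input, strlen($matches[0]));
            break;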
    }
    $input = trim($input);
}

print_r($tokens);

Building a parser

Having done this, the next step is to build a parser: a component that inspects the lexed input and converts it to the desired format. A parser is smart; in the process of converting the input it also makes sure that the input is well-formed by the grammar's rules.

Parsers are commonly implemented as state machines (also known as finite state machines or finite automata) and work like this:

  • The parser has a state; this is usually a number in an appropriate range, but each state is also described with a more human-friendly name.
  • There is a loop that reads lexed tokens one at a time. Based on the current state and the value of the token, the parser may decide to do one or more of the following:
    1. take some action that affects its output
    2. change its state to some other value
    3. decide that the input is badly formed and produce an error
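A minimal sketch of such a parser, assuming the token format produced by the lexer above: array(type, value) with the types word, string, open, close and comma. The function name parseTokens() and the error messages are illustrative and not part of the original answer. Because arrays can nest, the sketch keeps a stack of unfinished arrays in addition to its single state flag:

```php
<?php
// Hypothetical helper (not from the original answer): turns the lexer's
// token list into nested PHP arrays.
function parseTokens(array $tokens)
{
    $stack = array();     // arrays that were opened but not closed yet
    $current = array();   // array currently being filled
    $expectValue = true;  // state inside "(...)": true = value due next, false = separator due next

    foreach ($tokens as $token) {
        list($type, $value) = $token;
        $nested = count($stack) > 0;

        switch ($type) {
            case 'word':
            case 'string':
                if ($nested && !$expectValue) {
                    throw new InvalidArgumentException('Missing comma before "' . $value . '"');
                }
                $current[] = $value;
                $expectValue = false;
                break;
            case 'open':
                if ($nested && !$expectValue) {
                    throw new InvalidArgumentException('Missing comma before "("');
                }
                $stack[] = $current;   // remember the enclosing array
                $current = array();
                $expectValue = true;
                break;
            case 'comma':
                if (!$nested || $expectValue) {
                    throw new InvalidArgumentException('Unexpected comma');
                }
                $expectValue = true;
                break;
            case 'close':
                if (!$nested || $expectValue) {
                    throw new InvalidArgumentException('Unexpected ")"');
                }
                $parent = array_pop($stack);
                $parent[] = $current;  // attach the finished array to its parent
                $current = $parent;
                $expectValue = false;
                break;
            default:
                throw new InvalidArgumentException('Unknown token type: ' . $type);
        }
    }

    if (count($stack) > 0) {
        throw new InvalidArgumentException('Missing ")"');
    }
    return $current;
}

// Tokens as the lexer would emit them for:
// all ("hi there", (this, that), other) another
$tokens = array(
    array('word', 'all'), array('open', null), array('string', 'hi there'),
    array('comma', null), array('open', null), array('word', 'this'),
    array('comma', null), array('word', 'that'), array('close', null),
    array('comma', null), array('word', 'other'), array('close', null),
    array('word', 'another'),
);
print_r(parseTokens($tokens));
```

At the top level the sketch accepts tokens separated only by whitespace and rejects commas, which matches the informal grammar; inside parentheses it insists on exactly one comma between values. Both choices are assumptions that are easy to relax.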

¹ Parser generators are programs whose input is a formal grammar and whose output is a lexer and a parser you can "just add water" to: just extend the code to perform "take some action" depending on the type of token; everything else is already taken care of. A quick search on this subject led me to the question "PHP Lexer and Parser Generator?"

Community
Jon
  • respect @Jon. Can you link any nice articles on this topic? – d.raev Feb 04 '13 at 11:10
  • There is no need to define a language, since an extended regular expression can solve this problem. I bet in one expression. I think if the attendant has no clue of formal languages, this answer is even more misleading for him/her. P.S: Wow this solution gets even more complicated for a PHP application, we are not constructing a language here. :D – Dyin Feb 04 '13 at 11:13
  • 2
    @Dyin: It depends on how you define "need". If you want a parser that is maintainable then you most definitely **need** a grammar. If you want a regex that works for me (tm) but is totally incomprehensible, subject to breaking down at the slightest provocation and ultimately impossible to extend in the future then you don't necessarily **need** a grammar. If you disagree please try to prove me wrong by writing such a regex. – Jon Feb 04 '13 at 11:17
  • How is a regular expression impossible to extend? Truly a regular expression can't tell you where's the syntax error, but the answer should not to implement an LALR or SLR for this. :D Sadly I'm not an expert in extended, recursive regular expressions, but I believe, someone will implement a pattern for this, which solves the problem in 1 step, that is faster. This is a parentheses problem, why would a PHP developer write a lexical and syntactical analyzer for this? – Dyin Feb 04 '13 at 11:26
  • @d.raev: I added some Wikipedia links which are good as landing page. – Jon Feb 04 '13 at 11:27
  • 3
    @Dyin: I'm not an expert in regex either, but I know enough to understand that I would never, ever want to do this with regex because a) I consider it impossible to *prove* that a regex works correctly in all cases (while it is certainly possible to prove that a parser correctly processes a given formal grammar) and b) it is much easier to *reason* about how a FSM works, so it's much easier to extend and maintain. YMMV. – Jon Feb 04 '13 at 11:30
  • @Jon Your answer is really great. But in a scenario if I was asked to parse this kind of data and just save this information in database or somewhere else that would be easier to use, creating lexer would be overkill, IMO. – Leri Feb 04 '13 at 11:37
  • @PLB: Might be. But since it's not very clear if you would need to "cross the line" (either now or in the future) I'd prefer to err on the safe side. – Jon Feb 04 '13 at 11:50
4

There's no question that you should write a parser if you are building a syntax tree. But if you just need to parse this sample input, regex might still be a usable tool:

<?php
$str = 'all, ("hi there", (these, that) , other), another';

$str = preg_replace('/\, /', ',', $str); // get rid of extra spaces
/*
 * avoid "undefined constant" errors by surrounding bare words with quotes
 */
$str = preg_replace('/(\w+),/', '\'$1\',', $str);
$str = preg_replace('/(\w+)\)/', '\'$1\')', $str);
$str = preg_replace('/,(\w+)/', ',\'$1\'', $str);

$str = str_replace('(', 'array(', $str);

$str = 'array('.$str.');';

echo '<pre>';
eval('$res = '.$str); //eval is evil.
print_r($res); //print the result

Demo.

Note: If the input is malformed, the regex approach will definitely fail. I am posting this solution just in case you need a quick script. Writing a lexer and a parser is time-consuming work that requires a lot of research.

Leri
  • Thanks, I really just need this to work fast. In my case, there's no problem at all if the regex fails. I really just need to throw a general error and not a specific one. =) – Cristiano Santos Feb 04 '13 at 11:13
  • @CristianoSantos In that case I'd use this script and start reading more about syntax parsers for educational purposes. – Leri Feb 04 '13 at 11:15
  • It is so going to mess up this string `'all, ("hi, there, I am from SO", (these, that) , other), another'` – nhahtdh Feb 04 '13 at 11:16
  • @nhahtdh Yes, it will because of commas. I've noted that regex is a tool in a case strings will be formed as they are in sample. – Leri Feb 04 '13 at 11:20
  • 1
    @CristianoSantos Oh, there were not special characters in question when I was writing this answer. If there're you need to escape them and improve this script for better handling of malformed strings. For _dirty_ job it's ok, for future using purposes big NO. – Leri Feb 04 '13 at 11:22
  • @PLB Ah, sorry. I really forgot to mention them on the beginning. My bad =S – Cristiano Santos Feb 04 '13 at 11:28
  • Your answer and @palindrom's answer were the ones that helped me most. So, as I can't give both a "correct answer", I will accept yours because it helped me much more and add my final code as an answer. – Cristiano Santos Feb 04 '13 at 12:38
3

As far as I know, the balanced-parentheses language belongs to Chomsky class 2 (context-free), while regular expressions are equivalent to Chomsky class 3 (regular), so in theory there should be no regular expression that solves this problem.

But I read something not long ago:

This PCRE pattern solves the parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored): \( ( (?>[^()]+) | (?R) )* \)

With delimiters and without spaces: /\(((?>[^()]+)|(?R))*\)/.

This is from Recursive Patterns (PCRE) - PHP manual.

There is an example in that manual which solves nearly the same problem you specified! You or others might find it and proceed with this idea.

I think the best solution is to write a sick recursive pattern with preg_match_all. Sadly, I don't have the skill to pull off such madness!
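To see what the quoted pattern does in practice, here is a small sketch that applies it with preg_match. The helper name firstBalancedGroup() is made up for this illustration. Note that the pattern only locates a balanced "(...)" span; it does not build nested arrays, and it counts parentheses inside quoted strings as structure.

```php
<?php
// Hypothetical helper: returns the first balanced "(...)" group found in
// $input, or null when there is none.
function firstBalancedGroup($input)
{
    // Pattern quoted from the PHP manual: "(", then any mix of
    // non-parenthesis runs and nested groups via (?R), then ")".
    $pattern = '/\(((?>[^()]+)|(?R))*\)/';

    return preg_match($pattern, $input, $matches) === 1 ? $matches[0] : null;
}

// Yields the whole parenthesised part, including the nested group.
var_dump(firstBalancedGroup('all ("hi there", (this, that), other) another'));

// Yields NULL: no group is ever closed.
var_dump(firstBalancedGroup('all (unbalanced, (this, that another'));
```

Turning the matched span into arrays would still require recursing into it by hand, so this is closer to a building block than to a full solution.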

Dyin
  • Regex you see in modern languages are not strictly regular, so it can do thing beyond what theoretical regular expression can do. – nhahtdh Feb 04 '13 at 11:17
3

First, I want to thank everyone that helped me on this.

Unfortunately, I can't accept multiple answers. If I could, I would accept them all, because each answer is correct for a different variant of this problem.

In my case, I just needed something simple and dirty and, following @palindrom and @PLB answers, I've got the following working for me:

$str=transformEnd(transformStart($string));
$str = preg_replace('/([^\\\])\(/', '$1array(', $str);
$str = 'array('.$str.');';
eval('$res = '.$str);
print_r($res); //print the result

function transformStart($str){
    $match=preg_match('/(^\(|[^\\\]\()/', $str, $positions, PREG_OFFSET_CAPTURE);
    if (count($positions[0]))
        $first=($positions[0][1]+1);
    if ($first>1){
        $start=substr($str, 0,$first);
        preg_match_all("/(?:(?:\"(?:\\\\\"|[^\"])+\")|(?:'(?:\\\'|[^'])+')|(?:(?:[^\s^\,^\"^\']+)))/is",$start,$results);
        if (count($results[0])){
            $start=implode(",", $results[0]).",";
        } else {
            $start="";
        }
        $temp=substr($str, $first);
        $str=$start.$temp;
    }
    return $str;
}

function transformEnd($str){
    $match=preg_match('/(^\)|[^\\\]\))/', $str, $positions, PREG_OFFSET_CAPTURE);
    if (($total=count($positions)) && count($positions[$total-1]))
        $last=($positions[$total-1][1]+1);
    if ($last==null)
        $last=-1;
    if ($last<strlen($str)-1){
        $end=substr($str,$last+1);
        preg_match_all("/(?:(?:\"(?:\\\\\"|[^\"])+\")|(?:'(?:\\\'|[^'])+')|(?:(?:[^\s^\,^\"^\']+)))/is",$end,$results);
        if (count($results[0])){
            $end=",".implode(",", $results[0]);
        } else {
            $end="";
        }
        $temp=substr($str, 0,$last+1);
        $str=$temp.$end;
    }
    if ($last==-1){
        $str=substr($str, 1);
    }
    return $str;
}

The other answers are also helpful for anyone looking for a better way to do this.

Again, thank you all =D.

Cristiano Santos
2

I want to know if this works:

  1. replace ( with Array(
  2. Use regex to put comma after words or parentheses without comma

    preg_replace( '/[^,]\s+/', ',', $string )

  3. eval( "\$result = Array( $string )" )

palindrom
2

Here is the algorithm as pseudocode. Hopefully you can work out how to implement it in PHP:

function Parser([receives] input:string) returns Array

define Array returnValue;

for each integer i from 0 to length of input string do
    character = ith character from input string

    if character is '('
        returnValue.Add(Parser(substring of input after i)) // recursive call

    else if character is ')'
        return returnValue // end of the current (nested) array

    else if character is '"'
        returnValue.Add(substring of input from i to the next '"')

    else if character is whitespace or ','
        continue

    else
        returnValue.Add(substring of input from i to the next space, ',', ')' or end of input)

    increment i to the index actually consumed (including everything a recursive call consumed)


return returnValue
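One possible PHP rendering of this pseudocode, as a sketch only. The name parseList() and the by-reference $pos parameter are my own choices: the position is shared with recursive calls so the caller knows how much input a nested array consumed. It also assumes that ")" ends the current array and that commas act as separators, which the question's example implies.

```php
<?php
// Hypothetical recursive parser; $pos is passed by reference so a recursive
// call can tell its caller where the nested array ended.
function parseList($input, &$pos = 0, $nested = false)
{
    $result = array();
    $length = strlen($input);

    while ($pos < $length) {
        $char = $input[$pos];

        if ($char === '(') {
            $pos++;                                      // skip "("
            $result[] = parseList($input, $pos, true);   // recursive call
        } elseif ($char === ')') {
            if (!$nested) {
                throw new InvalidArgumentException("Unexpected ')' at offset $pos");
            }
            $pos++;                                      // skip ")"
            return $result;
        } elseif ($char === '"') {
            $end = strpos($input, '"', $pos + 1);
            if ($end === false) {
                throw new InvalidArgumentException("Unterminated string at offset $pos");
            }
            // Keep the surrounding quotes, as in the question's wanted result.
            $result[] = substr($input, $pos, $end - $pos + 1);
            $pos = $end + 1;
        } elseif (strpos(" \t\r\n,", $char) !== false) {
            $pos++;                                      // whitespace and commas only separate
        } else {
            // A bare word runs until the next separator, parenthesis or quote.
            $len = strcspn($input, " \t\r\n,()\"", $pos);
            $result[] = substr($input, $pos, $len);
            $pos += $len;
        }
    }

    if ($nested) {
        throw new InvalidArgumentException('Missing closing parenthesis');
    }
    return $result;
}

print_r(parseList('all ("hi there \(option\)", (this, that), other) another'));
```

Parentheses inside a quoted string are left alone because the string branch skips straight to the closing quote. An escaped quote (\") inside a string and an escaped parenthesis outside of one are not handled by this sketch.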
Sina Iravanian
1

If the string values are fixed, it can be done somehow like this:

$ar = explode('("', $st);

$ar[1] = explode('",', $ar[1]);

$ar[1][1] = explode(',', $ar[1][1]);

$ar[1][2] = explode(')',$ar[1][1][2]);

unset($ar[1][1][2]);

$ar[2] =$ar[1][2][1];

unset($ar[1][2][1]);
Leri
Abuzer Firdousi