I'm trying to use a SPARQL regex to select only those literals whose parentheses are balanced. So "( (1) ((2)) (((3))) 4)" should be returned, but "( (1) ((2)) (((3)) 4)", where I removed a closing parenthesis after the "3", should not be returned.

I've previously looked here for a suitable regex: Regular expression to match balanced parentheses

I have been trying to use the regex suggested by rogal111, which is as follows:

\(([^()]|(?R))*\)

This regex follows the PCRE syntax, which I understand is the W3C standard and should be followed by SPARQL. According to the linked example http://regex101.com/r/lF0fI1/1 this should work for the examples above.

I've tested this on both a Jena based triple store, and a Virtuoso based triple store.

Jena: when I try to implement it for SPARQL with the query below, it says that the (?R) inline modifier is unknown.

SELECT ?l
WHERE
{
  BIND("(test)" AS ?l)
  FILTER REGEX(?l, "\\(([^()]|(?R))*\\)").
}

The complete error message that is returned is below.

Regex pattern exception: java.util.regex.PatternSyntaxException: Unknown inline modifier near index 11 \(([^()]|(?R))*\)

Virtuoso: The Virtuoso based triple store (tested on: https://sparql.uniprot.org/sparql) does work, but also returns incorrect outputs, as exemplified with the query below:

SELECT ?l
WHERE
{
  BIND("((test)" AS ?l)
  FILTER REGEX(?l, "\\(([^()]|(?R))*\\)").
}

I'm not sure whether this is intentional, a bug, or whether I'm doing something wrong. Ultimately I want to get it to work on the Jena-based triple store. Can anyone help me with this?

Wytz
  • Most RDF tools (including Jena) are written in Java, but the Java regex library does not support regex recursion. Look at https://stackoverflow.com/questions/47162098, where a non-recursive solution is discussed, and apply it to your case. I also suggest using `REPLACE` instead of `REGEX` and applying a filter to check whether the result of the replacement is empty, which indicates that the whole literal matches. – Damyan Ognyanov Jun 23 '20 at 13:26
  • Java regex does not support recursion. – Wiktor Stribiżew Jun 23 '20 at 14:07
  • Thank you both for your responses! So if I understand your answers correctly, Java based triple stores cannot fully implement the https://www.w3.org/TR/sparql11-query/ recommendation? I'll look at an alternative solution then. – Wytz Jun 24 '20 at 07:00
  • @Wytz The SPARQL REGEX definition at https://www.w3.org/TR/sparql11-query/#func-regex points to the XPath and XQuery section at https://www.w3.org/TR/xpath-functions/#regex-syntax, which in turn leads to https://www.w3.org/TR/xmlschema-2/#regexs, and nowhere in these documents could I find a requirement to support recursion ... – Damyan Ognyanov Jun 24 '20 at 08:23
  • @DamyanOgnyanov You are correct, it appears that I misinterpreted the description on https://www.w3.org/TR/xpath-functions/#regex-syntax. I thought it meant that even though the syntax may sometimes differ from e.g. Perl (which they refer to, and which should be PCRE I believe), it would broadly offer the same validity checking functionalities. – Wytz Jun 24 '20 at 10:41
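
The non-recursive idea mentioned in the comments can be illustrated outside SPARQL. Below is a minimal Python sketch, assuming "balanced" refers to the whole literal and not to some substring of it: it repeatedly deletes innermost `(...)` pairs and then checks whether any parenthesis is left. The names `strip_balanced` and `is_balanced` are hypothetical helpers for illustration only. One thing worth keeping in mind: an unanchored pattern such as `\(([^()]|(?R))*\)` also matches the balanced substring `(test)` inside `((test)`, which may be why an engine that does support recursion still returns that literal.

```python
import re

# One innermost pair: an opening and a closing parenthesis
# with no other parenthesis between them.
INNERMOST = re.compile(r"\([^()]*\)")


def strip_balanced(text: str) -> str:
    """Delete innermost (...) groups repeatedly until nothing changes."""
    while True:
        reduced = INNERMOST.sub("", text)
        if reduced == text:
            return text
        text = reduced


def is_balanced(text: str) -> bool:
    """True if no parenthesis survives once all balanced groups are removed."""
    return re.search(r"[()]", strip_balanced(text)) is None


for literal in ["( (1) ((2)) (((3))) 4)", "( (1) ((2)) (((3)) 4)"]:
    print(repr(literal), is_balanced(literal))
```

SPARQL has no loop construct, so the closest equivalent there would probably be a fixed number of nested `REPLACE` calls with the same innermost-pair pattern; each call removes one nesting level, so that variant only handles literals up to a chosen depth.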

1 Answer

Just to clarify and augment my comment about the use of REPLACE, the following should work:

SELECT * 
{
    VALUES ?value { 
        "( (1) ((2)) (((3))) 4)" 
        "( (1) ((2)) (((3)) 4)"
        "before (test) after" 
        "before ((test) after"
    }
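    # Strip every balanced (...) group, then report true only if no
    # parenthesis is left in what remains.
    bind(!regex(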
            replace(?value, '(?=\\()(?:(?=.*?\\((?!.*?\\1)(.*\\)(?!.*\\2).*))(?=.*?\\)(?!.*?\\2)(.*)).)+?.*?(?=\\1)[^(]*(?=\\2$)', '') 
            , '[()]') as ?result)
}
Damyan Ognyanov
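
When trying a query like the one above, it can help to have an independent reference for what `?result` ought to be for each test value. The following is a small Python sketch, not part of the answer's query, assuming a literal counts as balanced when every `(` has a later matching `)` and no `)` appears before its `(`. The helper name `balanced_by_count` is made up for this example.

```python
def balanced_by_count(text: str) -> bool:
    """Reference check: track nesting depth with a simple counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opener.
                return False
    return depth == 0


# The four test values from the VALUES clause in the query above.
values = [
    "( (1) ((2)) (((3))) 4)",
    "( (1) ((2)) (((3)) 4)",
    "before (test) after",
    "before ((test) after",
]

for value in values:
    print(repr(value), balanced_by_count(value))
```

Under that assumption the first and third values are balanced and the second and fourth are not, which gives something to compare the store's output against.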