I have a database of texts that contain this kind of syntax in the middle of English sentences, and I need to turn each occurrence into an HTML link using PHP:

"text1(text1)":http://www.example.com/mypage

Notes:

  • text1 is always identical to the text in parentheses

  • The whole string always has the quotation marks, parentheses, and colon, so the syntax is the same every time.

  • Sometimes there is a space at the end of the string, but other times there is a question mark or comma or other punctuation mark.

  • I need to turn these into basic links, like:

<a href="http://www.example.com/mypage">text1</a>

How do I do this? Do I need explode or regex or both?

break68
  • Is there always a space or something else after the URL? Can text1 contain parentheses or escaped quotes? – Casimir et Hippolyte Aug 31 '14 at 15:25
  • text1 doesn't contain any punctuation marks. Sometimes there is a space at the end of the URL, but other times there is a question mark, comma, or other punctuation mark. – break68 Aug 31 '14 at 15:45
  • You say the syntax appears in the middle of English sentences, but where is an example sentence? A URL can't be parsed with a simple regex. Other than that, the delimiter looks like `"()":`; would this conflict with other parts of the sentence? –  Aug 31 '14 at 16:18

4 Answers

"(.*?)\(\1\)":(.*\/[a-zA-Z0-9]+)(?=\?|\,|\.|$)

You can use this.

See Demo.

http://regex101.com/r/zF6xM2/2

vks
  • +1 for using a simple solution (a backreference). However, regarding `but other times there is a question mark or comma or other punctuation mark`: this could do with `[\?\.\,]?` or similar placed outside the last capturing group; otherwise the trailing punctuation will end up in the URL. And since `text1 doesn't contain any punctuation mark`, the first group can be more restrictive. – AD7six Aug 31 '14 at 16:16

You can use this regex:

"(.*?)\(.*?:(.*)

Working demo


Federico Piazza

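An appropriate regular expression could be:

$str = '"text1(text1)":http://www.example.com/mypage';
// Group 1: text before "(", group 2: text inside "(...)", group 3: the URL.
// The ^ anchor means this only matches when the link syntax starts the string.
if (preg_match('#^"([^\(]+)' .
               '\(([^\)]+)\)[^"]*":(.+)#', $str, $m)) {
    print '<a href="'.$m[3].'">'.$m[2].'</a>' . PHP_EOL;
}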
giusc

You can use this replacement:

$pattern = '~"([^("]+)\(\1\)":(http://\S+?)(?=\pP*(?:\s|\z))~';

$replacement = '<a href="\2">\1</a>';

$result = preg_replace($pattern, $replacement, $text);

Pattern details:

([^("]+) captures text1 in group 1. Using a negated character class (one that excludes the double quote and the opening parenthesis) has two advantages:

  • it allows a greedy quantifier, which is faster than a lazy one
  • since the class excludes the opening parenthesis and is immediately followed by a parenthesis in the pattern, the regex engine does not backtrack when another part of the text has content between double quotes but no parenthesis inside; it skips that substring right away. (This is because the PCRE regex engine automatically converts [^a]+a into [^a]++a before processing the string.)

\S+? matches one or more non-whitespace characters, as few as possible. A greedy \S+ would run up to the next whitespace and pull trailing punctuation into the URL.

(?=\pP*(?:\s|\z)) is a lookahead assertion that checks that the URL is followed by optional punctuation characters (\pP) and then a whitespace or the end of the string, so punctuation attached to the end of the URL stays outside the link. Note that \pP includes the slash, so a final / is left outside as well.
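
To see how this kind of replacement behaves inside a full sentence, here is a minimal sketch. The helper name linkify and the sample sentence are hypothetical, chosen only for illustration; the pattern uses a lazy \S+? with a lookahead for optional punctuation followed by whitespace or the end of the string, on the assumption that punctuation attached to the URL belongs to the sentence and not to the link.

```php
<?php
// Hypothetical helper: converts every "label(label)":http://... occurrence in $text.
function linkify(string $text): string
{
    // Group 1: the label (no quotes or parentheses), repeated via the \1 backreference.
    // Group 2: the URL, matched lazily so trailing punctuation is left outside.
    $pattern = '~"([^("]+)\(\1\)":(http://\S+?)(?=\pP*(?:\s|\z))~';

    // preg_replace returns null on error; fall back to the unchanged text.
    return preg_replace($pattern, '<a href="\2">\1</a>', $text) ?? $text;
}

// Made-up sample sentence with a comma directly after the URL.
$text = 'Read "text1(text1)":http://www.example.com/mypage, then continue.';
echo linkify($text), PHP_EOL;
// Read <a href="http://www.example.com/mypage">text1</a>, then continue.
```

Text without the link syntax passes through unchanged, so the helper can be run over every row of the database without pre-filtering.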

Casimir et Hippolyte
  • Thank you for the answer and for the explanation. I tested it and it worked. – break68 Sep 01 '14 at 14:46
  • @break68: I'm glad it helped. However, be careful with the question mark, since it can legitimately appear at the end of a URL *(the GET separator with no GET values after it)*. In that case, the question mark at the end of the URL will be moved outside the link and "transformed" into a trailing question mark. I do not know which URLs you have to deal with, but it can be a trap. – Casimir et Hippolyte Sep 01 '14 at 19:15
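
One possible way around the question-mark trap, sketched under the assumption that URLs in the data may legitimately end in ? while sentence punctuation after a link is limited to a small known set. The function name linkifyKeepingQuery and the character list [.,;:!] are assumptions made for this example, not part of the answer above. The trade-off is that a real sentence-ending question mark right after a URL would now be kept inside the href.

```php
<?php
// Hypothetical variant: only . , ; : ! count as trailing sentence punctuation,
// so a ? (or /) at the end of the URL is kept inside the link.
function linkifyKeepingQuery(string $text): string
{
    $pattern = '~"([^("]+)\(\1\)":(http://\S+?)(?=[.,;:!]*(?:\s|\z))~';

    // preg_replace returns null on error; fall back to the unchanged text.
    return preg_replace($pattern, '<a href="\2">\1</a>', $text) ?? $text;
}

// Made-up sample sentences.
echo linkifyKeepingQuery('Try "search(search)":http://www.example.com/search? now.'), PHP_EOL;
// Try <a href="http://www.example.com/search?">search</a> now.
echo linkifyKeepingQuery('See "text1(text1)":http://www.example.com/mypage, then go.'), PHP_EOL;
// See <a href="http://www.example.com/mypage">text1</a>, then go.
```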