
I have a string

text 6ffdfd <a href="http://worldnews.com" target="_blank">toto</a> sdsdsd

I would like to find a regex that would

    1. add an opening span tag just after the opening a tag of the HTML link (to be precise, after the string `target="_blank">`)
    2. add a closing span tag just before the closing a tag

The desired end result would be:

 <a href="http://worldnews.com" target="_blank"><span>toto</span></a> sdsdsd

For the moment, I can't work out how to achieve 1, and I have only partially managed 2, because my current code wrongly adds whitespace that I don't want between </span> and the closing a tag.

Current code

orig_string = 'text 6ffdfd <a href="http://example.com" target="_blank">toto</a> sdsdsd'
end_result = orig_string.gsub(/<\/a>/, '</span> \\0')
print end_result

I have set up an online editable demo here: https://repl.it/repls/SecondCapitalPika

Mathieu
  • Do not use regex to parse HTML. Use a proper HTML parser and add the `<span>` tag to the DOM. – Stefan Dec 06 '17 at 14:14
  • Hi Stefan, I am doing it as an admin on Active Admin inputs. Why shouldn't I do it with a regexp? I'm a Ruby newbie, so I thought that could work. Are there security issues? I want to change the input that I enter so that the version with `<span>` tags is saved to the database (instead of the one without them). – Mathieu Dec 06 '17 at 14:15
  • See [The Stack Overflow Regular Expressions FAQ](https://stackoverflow.com/a/22944075/477037), there are several links explaining why you should not use regular expressions to parse HTML. – Stefan Dec 06 '17 at 14:20
  • Thanks, I will check it out. Learning something new every day :) – Mathieu Dec 06 '17 at 14:24
  • Thanks a lot. I went through the very "intense" debate between people for and against using regexp to parse HTML. In the end I'll go for using it, agreeing with this person: "If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job" (source: https://stackoverflow.com/a/1733489/1467802) – Mathieu Dec 06 '17 at 15:15
  • Indeed, it's not like I am parsing 10k pages; it's just me wanting to add a tag when I enter an input that I control 100% in my Active Admin Rails panel. – Mathieu Dec 06 '17 at 15:16
  • But the reading was very interesting and definitely deterred me from ever using regexp when the parsing involves volatile, uncontrollable, random, or mass HTML. – Mathieu Dec 06 '17 at 15:17

2 Answers

# `present?` comes from ActiveSupport (Rails); in plain Ruby use `unless $1.nil?`
orig_string =~ /(?<=>)([^<]*)(?=<\/a>)/
if $1.present?
  end_result = orig_string.gsub(/(?<=>)([^<]*)(?=<\/a>)/, '<span>\1</span>')
end

Break down

(?<=>)    # lookbehind: the match must be preceded by the character >
([^<]*)   # capture everything up to the next <, i.e. the text inside the a tag
(?=<\/a>) # lookahead: the match must be followed by </a>
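To see the three parts working together, here is a minimal sketch that writes the same pattern in free-spacing mode (the `x` flag) and prints what group 1 captures. The constant name `ANCHOR_TEXT` is made up for illustration, and `%r{...}` is used only so the `/` in the closing tag needs no escaping.

```ruby
# Same pattern as in the answer, spread over several lines with the x flag.
# ANCHOR_TEXT is a hypothetical name, not part of the original answer.
ANCHOR_TEXT = %r{
  (?<=>)     # lookbehind: preceded by ">"
  ([^<]*)    # group 1: a run of characters that are not "<"
  (?=</a>)   # lookahead: followed by the closing a tag
}x

orig_string = 'text 6ffdfd <a href="http://example.com" target="_blank">toto</a> sdsdsd'

match = orig_string.match(ANCHOR_TEXT)
captured = match && match[1]  # nil when nothing matched, no ActiveSupport needed
puts captured
```

Neither the lookbehind nor the lookahead consumes characters, so only the captured text is replaced when this pattern is passed to `gsub`.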

Will result in

print end_result
'text 6ffdfd <a href="http://example.com" target="_blank"><span>toto</span></a> sdsdsd'
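One caveat: because `[^<]*` can match an empty string, running the substitution above a second time on its own output finds the empty gap between `</span>` and `</a>` and inserts an extra `<span></span>` there. A possible way around this, sketched below under the assumption that the link text contains no nested tags, is to match the whole anchor and require at least one non-`<` character straight after the opening tag. The helper name `wrap_anchor_text` is hypothetical, and the code uses plain Ruby only.

```ruby
# Hypothetical helper (not from the answer): wraps the plain-text content of
# each a element in a span, and leaves anchors whose content starts with a tag
# (for example an already-inserted span) untouched.
def wrap_anchor_text(html)
  # \1 = opening a tag, \2 = non-empty text without any "<", \3 = closing a tag
  html.gsub(%r{(<a\b[^>]*>)([^<]+)(</a>)}, '\1<span>\2</span>\3')
end

orig_string = 'text 6ffdfd <a href="http://example.com" target="_blank">toto</a> sdsdsd'

once  = wrap_anchor_text(orig_string)
twice = wrap_anchor_text(once)  # second pass should change nothing

puts once
puts once == twice
```

Since `[^<]+` needs at least one character that is not `<`, an anchor whose content already begins with `<span>` no longer matches, so saving the same input repeatedly should not stack up spans.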
Nermin
  • Could you add just a little more info? This works the first time, but if I change the text/input in my admin panel, the script kicks in again and I end up with 4 spans instead of 2 :) Could I add a check such as "if the string contains no span yet, then do this"? – Mathieu Dec 06 '17 at 15:37
  • Awesome, let me check it. I must really try learning regexp, it seems powerful! – Mathieu Dec 06 '17 at 16:00
  • How does `$1.present?` work? I mean, don't you have to say where you check the presence of `$1`, like on orig_string? – Mathieu Dec 06 '17 at 16:01
  • I get undefined method `present?' for nil:NilClass :) – Mathieu Dec 06 '17 at 16:02
  • `nil` should return false for `present?`. Can you try `unless $1.blank?` – Nermin Dec 06 '17 at 16:21

If you don't necessarily need a regex, then you could use Nokogiri:

require 'nokogiri'

text = <<-TEXT
  text 6ffdfd <a href="http://worldnews.com" target="_blank">toto</a> sdsdsd
  6ffdfd text <a href="http://worldnews.com" target="_blank">tete</a> sdsdsd
  6ffdfd text <a href="http://worldnews.com">titi</a> sdsdsd
TEXT

doc = Nokogiri.HTML text
doc.css('a[target="_blank"]').each { |anchor| anchor.add_next_sibling '<span>span</span>' }
Sebastian Palma
  • Hi Sebastian, thanks for your help. Please refer to my comments under my question, where I explain why in the end I still opted for a regexp. On top of those reasons, I also did not want to make my Ruby on Rails 4 app heavier than it already is, for a very simple need, by requiring another library (nokogiri). – Mathieu Dec 06 '17 at 15:18