
Problem: I get an empty list when I run the following:

output_list = re.findall(r'<td colspan="4" class="yellow-shade border justify">[\r\n]+(.*?)[\r\n]+', INPUTTEXT)

When for example the INPUTTEXT argument is exactly as follows:

<tr>
            <td colspan="4" class="yellow-shade border justify">
            Online Learning Comment         
            <div class="report-action">

              <a class="add-new fb-link"  href="http://blah-blah-blah/write-report?rep[company]=768744&amp;rep[company_name]=Funky Group Services&amp;rep[responds]=1" > Services Report</a>

              <table style="float:right"><tr><td><a class="inappropriate" href="" onclick="window.open('http://blah-blah-blah/inappropriate-report?report=1379443','','toolbar=yes,location=yes,status=yes,menubar=yes,scrollbars=yes,resizable=yes,width=650,height=620'); return false">Inappropriate report?</a></td>

                  <td><a style=' margin-left:15px; float: right;' class="back" href="javascript:history.go(-1)">Back</a></td></tr></table>

            </div>

            </td>
        </tr>

Required Output:

output_list = ['Online Learning Comment']

What am I missing in my steps? I am new to regular expressions, but I thought the expression above would work. Any pointers are much appreciated.

HamZa
IoT
  • [It does work](http://regex101.com/r/eI3bF1). I would suggest using `\s`, since it also matches newlines, [see demo](http://regex101.com/r/kL7xX6). I would also suggest [reading the manual and trying `.match`](https://docs.python.org/2/library/re.html#match-objects) instead of `.findall`. Finally, you might add [this reference](http://stackoverflow.com/questions/22937618) to your favorites. – HamZa Apr 09 '14 at 10:30
  • Thanks for the pointer, HamZa. It now works with the demo approach plus an additional step based on the fragility pointed out by @Suor. – IoT Apr 09 '14 at 12:19

1 Answer


I tried your code and it returned [' Online Learning Comment'] for me. Your input probably contains other invisible whitespace characters besides \r and \n. Try this regex instead:

r'<td colspan="4" class="yellow-shade border justify">\s+(.*?)[\r\n]'
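For reference, here is a minimal sketch of that pattern run against a trimmed copy of the cell from the question. The exact indentation of the sample and the final `strip()` step are my assumptions: `\s+` swallows the leading whitespace, but the lazy group still captures any spaces before the line break, so they need trimming separately.

```python
import re

# Trimmed copy of the cell from the question; the exact indentation and
# trailing spaces are assumed for illustration.
INPUTTEXT = (
    '<tr>\n'
    '            <td colspan="4" class="yellow-shade border justify">\n'
    '            Online Learning Comment         \n'
    '            <div class="report-action">\n'
    '            </div>\n'
    '            </td>\n'
    '        </tr>\n'
)

pattern = r'<td colspan="4" class="yellow-shade border justify">\s+(.*?)[\r\n]'

# \s+ consumes the newline and the indentation after the opening tag.
# The lazy group then runs up to the first line break, so trailing
# spaces on that line are still part of the capture.
raw_matches = re.findall(pattern, INPUTTEXT)

# Extra step (an assumption, not part of the regex): trim the leftovers.
output_list = [m.strip() for m in raw_matches]
print(output_list)
```

If the text you want can span several lines, this pattern will only return the first one.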

P.S. This code is also very fragile. First, whitespace is insignificant in HTML, so it can change arbitrarily. Second, the classes and attributes you match on are not semantic and can easily change in the future.
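One way to reduce the whitespace fragility is to let an HTML parser find the cell instead of a regex. The following is only a sketch using Python 3's standard `html.parser`; the class name `FirstCellText` and the choice to key on the `yellow-shade` class are hypothetical, and it still depends on that class name staying in the markup.

```python
from html.parser import HTMLParser


class FirstCellText(HTMLParser):
    """Collect the first non-blank text that directly follows a matching <td>.

    Hypothetical helper for illustration; not from the original question.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.waiting = False  # True right after a matching <td> opens
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        # Any other tag opening before text arrives cancels the wait,
        # so text nested in child elements is not picked up.
        self.waiting = (tag == 'td' and self.css_class in classes)

    def handle_data(self, data):
        text = data.strip()
        if self.waiting and text:
            self.results.append(text)
            self.waiting = False


# Trimmed sample in the shape of the question's input (indentation assumed).
html_text = (
    '<tr>\n'
    '    <td colspan="4" class="yellow-shade border justify">\n'
    '    Online Learning Comment   \n'
    '    <div class="report-action"><a href="#">Services Report</a></div>\n'
    '    </td>\n'
    '</tr>\n'
)

parser = FirstCellText('yellow-shade')
parser.feed(html_text)
parser.close()
print(parser.results)
```

Because the parser tokenizes the markup, changes in indentation, line endings, or attribute order do not affect the result.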

Suor