I'm trying to extract system status messages from nasdaq website. Here is the part of page source:
</script>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<colgroup>
<col class="gtcol1"></col>
<col class="gtcol2"></col>
<col class="gtcol3"></col>
</colgroup>
<tr>
<th class="gtcol1" style="width: 10%">Time</th>
<th class="gtcol2" style="width: 25%">Market</th>
<th class="gtcol3">Status</th>
</tr>
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>
Want the output like this:
System Status Messages
11:56:46 Systems are operating normally
Here is what i do to extract the page content:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.find_all(["h2","tr"])
This gives a lot of unwanted content. What's the best way to clean it,expecially the lines that contains the actual system message? right now it's like this...
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
Thanks!