Python Web Scraping - html parsing

Question

I'm trying to extract the system status messages from the NASDAQ website. Here is the relevant part of the page source:

</script>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<colgroup>
<col class="gtcol1"></col>
<col class="gtcol2"></col>
<col class="gtcol3"></col>
</colgroup>
<tr>
<th class="gtcol1" style="width: 10%">Time</th>
<th class="gtcol2" style="width: 25%">Market</th>
<th class="gtcol3">Status</th>
</tr>
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>

I want the output to look like this:

System Status Messages
11:56:46 Systems are operating normally

Here is what I do to extract the page content:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.find_all(["h2","tr"])

This returns a lot of unwanted content. What's the best way to clean it up, especially the line that contains the actual system message? Right now it looks like this:

<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>

Thanks!

Possible duplicate of [How to find elements by class](https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class). — Boris, Jan 08 '19 at 16:50

score 1 · Answer 1 · answered Jan 08 '19 at 16:53

You can collect the text of every td tag, then keep the first value (the time) and the last value (the status message):

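from bs4 import BeautifulSoup as soup
# content is assumed to hold the HTML shown in the question
s = soup(content, 'html.parser')
# first cell is the time, last cell is the status message
_start, *_, _end = [i.text for i in s.find_all('td')]
results = f'{s.h2.text}\n{_start} {_end}'
print(results)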

Output:

System Status Messages
11:56:46 ET Systems are operating normally

If you do not want ET included in the output, you can use re.sub:

import re
...
results = f'{s.h2.text}\n{re.sub(" [A-Z]+", "", _start)} {_end}'

Output:

System Status Messages
11:56:46 Systems are operating normally
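
If the table ever holds more than one status row, unpacking all td tags at once would mix rows together. Below is a minimal sketch, not part of the original answer, that limits the search to div#divSSTAT and handles each row separately. The helper name status_lines and the trimmed SAMPLE_HTML are illustrative assumptions; the markup is a shortened copy of the snippet in the question.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the live page
# keeps this structure).
SAMPLE_HTML = """
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
"""


def status_lines(html):
    """Return one 'time message' string per data row inside div#divSSTAT."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="divSSTAT")
    if container is None:  # wrong page, or the layout changed
        return []
    lines = []
    for row in container.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:  # the header row only has <th> cells
            continue
        # Drop a trailing time-zone label such as " ET".
        time_text = re.sub(r"\s+[A-Z]+$", "", cells[0].get_text(strip=True))
        lines.append(f"{time_text} {cells[-1].get_text(strip=True)}")
    return lines


print("System Status Messages")
print("\n".join(status_lines(SAMPLE_HTML)))
```

Returning an empty list when the container is missing keeps the caller from crashing if the page layout changes.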

Ajax1234

This works nicely, thanks a lot! This should probably be a separate thread, but do you know if there is a package that can repeat the same task, for example every 10 seconds, instead of writing a loop to do so? – td17 Jan 08 '19 at 17:09
@user10144318 Glad to help! – Ajax1234 Jan 08 '19 at 17:09
To repeat it every so often you can use crontab – B.Adler Jan 08 '19 at 17:26
@user10144318 It somewhat depends. If you wish to run the program in the console, a simple `while` loop with a `time.sleep` will suffice. However, if you wish to execute the program on a specific schedule, see [here](https://stackoverflow.com/questions/373335/how-do-i-get-a-cron-like-scheduler-in-python). – Ajax1234 Jan 08 '19 at 17:31
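
For the console case mentioned in the last comment, a `while` loop with `time.sleep` could look roughly like the sketch below. The names run_every and check_status are hypothetical, and check_status is only a placeholder for the scraping code; the demo uses a short interval so it finishes quickly, whereas the comment above asks for 10 seconds.

```python
import time


def run_every(task, interval_seconds, max_runs=None):
    """Call task() repeatedly, sleeping interval_seconds between calls.

    With max_runs=None the loop runs until interrupted (e.g. Ctrl+C).
    Returns the number of completed calls.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        # Skip the final sleep once the last requested run has finished.
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs


def check_status():
    # Placeholder: fetch and parse the status page here.
    print("checking status...")


# Use interval_seconds=10 for the case asked about; 0.1 keeps the demo short.
run_every(check_status, interval_seconds=0.1, max_runs=3)
```

Note that the pause is measured from the end of one call to the start of the next, so the real period is the interval plus however long the task takes.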

QHarr · Answer 2 · 2019-01-08T19:28:14.927

The code below combines three CSS selectors into a single select call. You could instead split them into three individual select_one('individual selector here') calls; I have combined them here only for the sake of interest. Note that longer selectors, and those using positional pseudo-classes such as :nth-of-type, are slightly less performant in CSS terms.

import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus'
res = requests.get(url)
soup = bs(res.content,'lxml')
print(' '.join([item.text for item in soup.select('#content h2:nth-of-type(1), #divSSTAT .tddateWidth, #divSSTAT td:nth-of-type(3)')]))
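
The split into three separate select_one calls mentioned above might look like the following sketch. It runs against a trimmed copy of the markup from the question rather than the live page, and the wrapping element with id="content" is an assumption taken from the selector in this answer, since the question's snippet does not show it.

```python
from bs4 import BeautifulSoup

# Trimmed sample; the #content wrapper is assumed from the selector above.
SAMPLE_HTML = """
<div id="content">
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The same three selectors as in the combined version, one lookup each.
selectors = (
    "#content h2:nth-of-type(1)",
    "#divSSTAT .tddateWidth",
    "#divSSTAT td:nth-of-type(3)",
)

parts = []
for selector in selectors:
    node = soup.select_one(selector)  # None if nothing matches
    parts.append(node.get_text(strip=True) if node else "")

print(" ".join(parts))
```

Looking each selector up separately makes it easy to see which one stopped matching if the page changes.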
