I am trying to get the Deli title and then, under it, the two menu items, Made to Order Deli Core and Turkey Chipotle Petite Wrap. I'm using Beautiful Soup 4 to do this and it's not working. The same is true for the Entrée items.

<html>
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000047933_06835" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047933_06835');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Entrée</td>

            <td class="menuitem">
                <div class="menuitem"><input class="chk" id=
                "S1L0000030000044794_08943" onclick="rptlist(this);"
                onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
                <span class="ul" onclick="nf('0000044794_08943');" onmouseout=
                "pcls(this);" onmouseover="ws(this);">Steamed
                Corn</span><img alt="Vegan" class="icon" src=
                "images/g_062.gif"><img alt="Mindful Item" class="icon" src=
                "images/m_051.gif"></div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000040000033087_22244" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000033087_22244');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>
    </table>
</body>
</html>

Or, if possible, I would like to get it into an XML format like this:

<counter name="Deli">
    <dish>
        <name>Made to Order Deli Core</name>
    </dish>
    <dish>
        <name>Turkey Chipotle Petite Wrap</name>
    </dish>
</counter>

Thank you very much in advance, I really appreciate you taking the time to help me.

Victor Sigler

2 Answers

Actually, I used both Beautiful Soup (to parse the HTML) and ElementTree (to build the XML) to fetch the text of every <span> element:

# -*- coding: UTF-8 -*-

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

html='''<html>
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000047933_06835" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047933_06835');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Entrée</td>

            <td class="menuitem">
                <div class="menuitem"><input class="chk" id=
                "S1L0000030000044794_08943" onclick="rptlist(this);"
                onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
                <span class="ul" onclick="nf('0000044794_08943');" onmouseout=
                "pcls(this);" onmouseover="ws(this);">Steamed
                Corn</span><img alt="Vegan" class="icon" src=
                "images/g_062.gif"><img alt="Mindful Item" class="icon" src=
                "images/m_051.gif"></div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000040000033087_22244" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000033087_22244');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>
    </table>
</body>
</html> '''

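soup = BeautifulSoup(html, 'html.parser')

counter = ET.Element('counter')
counter.set("name", "Deli")

# Note: this collects every <span> on the page, not only the Deli items.
for i in soup.find_all('span'):
    dish = ET.SubElement(counter, 'dish')
    name = ET.SubElement(dish, 'name')
    # collapse line breaks and indentation inside the item name
    name.text = ' '.join(i.text.split())

# ET.dump() writes to stdout itself and returns None, so no print is needed
ET.dump(counter)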
sundar nataraj

You could do something like this:

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

# `html` is the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('td', class_='station').text.strip()

spans = soup.find_all('span', class_='ul')

# create the root of the XML file
root = ET.Element("counter")
root.set("name", title)

for item in spans:
    # retrieve the text inside the <td class="station"> of the same row
    text = item.find_parent('tr').find('td', class_='station').text.strip()
    if text == u'Entrée':
        break

    dish = ET.SubElement(root, 'dish')
    name = ET.SubElement(dish, 'name')
    name.text = item.text.rstrip()

tree = ET.ElementTree(root)
tree.write("filename.xml")

And this is the content of the resulting XML file (indented here for readability):

<counter name="Deli">
    <dish>
        <name>Made to Order Deli Core</name>
    </dish>
    <dish>
        <name>Turkey Chipotle Petite Wrap</name>
    </dish>
</counter>
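
Note that tree.write does not indent its output by default, so the file comes out on a single line. If the indented layout matters, here is a minimal sketch, assuming Python 3.9 or newer (where ET.indent is available). The dish names are hard-coded for illustration only and would normally come from the parsed page:

```python
import xml.etree.ElementTree as ET

# Build the same structure as above; the names are hard-coded for illustration.
root = ET.Element("counter", name="Deli")
for dish_name in ["Made to Order Deli Core", "Turkey Chipotle Petite Wrap"]:
    dish = ET.SubElement(root, "dish")
    ET.SubElement(dish, "name").text = dish_name

# ET.indent() rewrites whitespace in place (Python 3.9+).
ET.indent(root, space="    ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Calling ET.indent on the root before tree.write("filename.xml") should give the same layout in the file.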

It is very important to include the line # -*- coding: utf-8 -*- at the beginning of your file to avoid problems with the accented character in Entrée (this matters in Python 2, where source files are not assumed to be UTF-8). See "SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'" for more details.
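
If the number of items per station is unknown, another option is to avoid hard-coding the stop word and instead walk the rows, remembering the last non-empty station cell. The following is only a sketch under assumptions: SAMPLE is a trimmed-down, hypothetical copy of the table from the question, and group_by_station and to_xml are illustrative helper names, not part of Beautiful Soup or ElementTree.

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical sample with the same row layout as the question.
SAMPLE = """
<table class="dayinner">
  <tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
  <tr class="lun"><td class="station">&nbsp;Deli</td>
      <td class="menuitem"><span class="ul">Made to Order Deli Core</span></td></tr>
  <tr class="lun"><td class="station">&nbsp;</td>
      <td class="menuitem"><span class="ul">Turkey Chipotle Petite Wrap</span></td></tr>
  <tr class="lun"><td colspan="3"></td></tr>
  <tr class="lun"><td class="station">&nbsp;Entr&eacute;e</td>
      <td class="menuitem"><span class="ul">Steamed
          Corn</span></td></tr>
</table>
"""

def group_by_station(html):
    """Return {station name: [dish names]}, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    stations = {}
    current = None
    for row in soup.find_all("tr"):
        cell = row.find("td", class_="station")
        span = row.find("span", class_="ul")
        if cell is None or span is None:
            continue  # header or spacer row without a menu item
        label = cell.get_text().strip()  # str.strip() also removes &nbsp;
        if label:
            current = label  # a non-empty cell starts a new station
        if current is not None:
            # Collapse line breaks and runs of spaces inside the dish name.
            dish_name = " ".join(span.get_text().split())
            stations.setdefault(current, []).append(dish_name)
    return stations

def to_xml(name, dishes):
    """Build a <counter> element for one station."""
    counter = ET.Element("counter", name=name)
    for dish_name in dishes:
        dish = ET.SubElement(counter, "dish")
        ET.SubElement(dish, "name").text = dish_name
    return counter

menu = group_by_station(SAMPLE)
deli_xml = ET.tostring(to_xml("Deli", menu["Deli"]), encoding="unicode")
print(deli_xml)
```

Because rows with an empty station cell inherit the previous station, this works however many items each section has, and the same dictionary also yields the Entrée items.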

Victor Sigler
  • That's awesome, but what if I didn't know there were only 2 items? Is there a way to get all the items listed before the Entrée section? Thanks – user3349689 May 08 '14 at 19:53
  • @user3349689 This line limits the result to only two: `spans = soup.find_all('span', class_='ul')[:2]`. If you want all of them, try this: `spans = soup.find_all('span', class_='ul')` – Victor Sigler May 09 '14 at 14:05
  • Right, but my problem is that I want it to get only the items under the Deli section and stop before the 'Entrée' line. If I leave it as find_all, it also gets the Steamed Corn item, which isn't under the Deli section, and there can be a different number of items under the Deli section. Thanks for the help – user3349689 May 09 '14 at 17:49
  • @user3349689 See the updated answer. With the above code, when it encounters the text Entrée it stops building the XML, so you get everything between Deli and Entrée – Victor Sigler May 09 '14 at 19:03
  • Yeah, that's great, but what if I didn't know the number of deli items? Is there a way to do that? Because this only works if I know the number of items. – user3349689 May 13 '14 at 00:52
  • @user3349689 The stop condition is encountering the word `Entrée`; the number of deli items has no influence on the loop. – Victor Sigler May 13 '14 at 14:05
  • That doesn't seem to be working, because if I remove the [:2] on line 6, it gets every span (item) and doesn't stop at the word Entrée. – user3349689 May 14 '14 at 15:04
  • @user3349689 Please see the updated answer, and sorry, the [:2] was a mistake left over from the previous code – Victor Sigler May 14 '14 at 19:39