I have a problem reading XML. I want to get a 2D list out of it, but the data I read from the XML is a Unicode string. I tried converting it with list(), but the result is not what I want: I get a flat list of single characters. Is there another way to get a 2D list?

How can I get rid of the u prefixes, the \n and \t characters, and get the correct result? Thank you.

abc.xml

<text>
    <item id="1">
        [[2, 2, 1],
        [1, 0, 0],
        [1, 0, 0]]
    </item>  
</text>

PYTHON:

import xml.dom.minidom

dom = xml.dom.minidom.parse('abc.xml')

bb = dom.getElementsByTagName('item')
b = bb[0]

l = b.firstChild.data
print l

a = list(l)
print a

The OUTPUT:

[[2, 2, 1]
 [1, 0, 0] 
 [1, 0, 0]]

[u'\n', u' ', u' ', u' ', u' ', u'\t', u'\t', u'[', u'\n', u' ', u' ', u' ', u' ', u'\t', u'\t', u'\t', u'[', u'2', u',', u' ', u'2', u',', u' ', u'1', u']', u'\n', u' ', u' ', u' ', u' ', u'\t', u'\t', u'\t', u'[', u'1', u',', u' ', u'0', u',', u' ', u'0', u']', u' ', u'\n', u' ', u' ', u' ', u' ', u'\t', u'\t', u'\t', u'[', u'1', u',', u' ', u'0', u',', u' ', u'0', u']', u' ', u'\n', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'\n', u' ', u' ', u' ', u' ', u'\t', u'\t', u']', u'\n', u' ', u' ', u' ', u' ', u'\t']
[Finished in 0.1s]
Martin Tang

3 Answers

1

This question is very similar to an old one: Convert string representation of list to list in Python

In short, you want to turn a unicode string (u"[\n[1,2,3],\n...") into a Python list, which means doing the same thing the Python interpreter does when it reads a list literal in a program.

In your case, you can use the ast module for this:

import ast
a = ast.literal_eval(l.strip())  # strip() drops the leading newline and indentation of the XML text

Note that this function evaluates any Python literal, so if you simply put "1" in your XML, the result a will be the number 1.

See the documentation for ast.literal_eval for more explanation.
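
Put together with the XML from the question, this might look like the sketch below. It assumes the rows are separated by commas, as in the abc.xml shown above, and it parses the XML from a string (parseString) rather than from a file only so that the example is self-contained. The strip() call is a precaution, since leading indentation can make the parser complain.

```python
import ast
import xml.dom.minidom

# Same content as abc.xml in the question, inlined to keep the example self-contained
XML = """<text>
    <item id="1">
        [[2, 2, 1],
        [1, 0, 0],
        [1, 0, 0]]
    </item>
</text>"""

dom = xml.dom.minidom.parseString(XML)
item = dom.getElementsByTagName('item')[0]

# The text node still carries the surrounding newlines and indentation
raw = item.firstChild.data

# Remove the outer whitespace, then let Python parse the list literal
matrix = ast.literal_eval(raw.strip())

print(matrix)        # [[2, 2, 1], [1, 0, 0], [1, 0, 0]]
print(matrix[0][0])  # 2
```

If the rows were not separated by commas, literal_eval would reject the text, so the input would have to be repaired first.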

MartinStettner
1

A little bit hacky but works for your case:

import ast
from lxml import html

text = """<text>
    <item id="1">
        [
            [2, 2, 1]
            [1, 0, 0] 
            [1, 0, 0] 

        ]
    </item>  
</text>"""

tree = html.fromstring(text)
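raw = tree.xpath('//text/item[@id="1"]/text()')[0]
# Drop all whitespace, then add the comma that is missing after each closing bracket
compact = ''.join(raw.split()).replace(']', '],')
# compact is now "[[2,2,1],[1,0,0],[1,0,0],],", which evaluates to a one-element tuple
data = ast.literal_eval(compact)[0]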

print type(data)
print data

Output:

<type 'list'>
[[2, 2, 1], [1, 0, 0], [1, 0, 0]]
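
For the same comma-less input, a regular expression is another option. The following is only a sketch of a different technique (regex extraction instead of literal_eval), and it assumes every row is a bracketed, comma-separated run of integers with no further nesting. The name raw stands for the text already extracted from the item element.

```python
import re

# Text of the <item> element, rows not separated by commas
raw = """
        [
            [2, 2, 1]
            [1, 0, 0]
            [1, 0, 0]

        ]
"""

# Each innermost [...] group (no brackets inside) is one row
rows = re.findall(r'\[([^\[\]]*)\]', raw)

# int() tolerates the spaces left around each number
matrix = [[int(n) for n in row.split(',')] for row in rows]

print(matrix)  # [[2, 2, 1], [1, 0, 0], [1, 0, 0]]
```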
Andrés Pérez-Albela H.
0

You could just use map() to convert each unicode element to str:

new_list = map(str, old_list)
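
As a small illustration, assuming old_list is a character list like the one list(l) produced in the question (shortened here), the mapping changes the element type but leaves the list flat. Joining the characters back together and parsing the result is one way to get the nested list from there.

```python
import ast

# Shortened stand-in for the character list from the question
old_list = list(u'[[2, 2]]')

# list() around map() so the result is a list on Python 3 as well
new_list = list(map(str, old_list))
print(new_list)  # ['[', '[', '2', ',', ' ', '2', ']', ']']

# Still one character per element; rebuild the string and parse it
nested = ast.literal_eval(''.join(new_list))
print(nested)    # [[2, 2]]
```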