
RegEx Expression:

[Height|Length|Width|Depth]:\D*\s*(\d*\.*-*\d*)-*\D*\s*  [Height|Length|Width|Depth]:\D*\s*(\d*\.*-*\d*)-*\D*\s*[Height|Length|Width|Depth]:\D*\s*(\d*\.*-*\d)*-*\D*\s*

Input text (JSON):

{"Product Type":["Printer Cartridges"],"Product Name":["Xerox - Yellow - toner cartridge ( equivalent to: HP CB382A ) - for HP Color LaserJet CM6030 CM6040 CP6015"],"Brand":["XEROX"],"Product Long Description":["<!-- CNET Content -->Toner cartridges for HP printers from Xerox deliver brilliant image quality and excellent reliability at a low cost. Compared to the original HP toner cartridge youll get better or equal page yield pay around 25% less. Get more pay less without risk.<br><br><h3 id=detailspecs>Specifications</h3><span class=font_size3bold>General</span><br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Compatible Cartridge: &nbsp;HP CB382A<br><br><span class=font_size3bold>Consumable</span><br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Consumable Type: &nbsp;Toner cartridge<br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Printing Technology: &nbsp;Laser<br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Color: &nbsp;Yellow<br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Included Qty: &nbsp;1-pack<br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Duty Cycle: &nbsp;Up to 23500 pages at 5% coverage<br><br><span class=font_size3bold>Compatibility Information</span><br>&nbsp;<img align=absmiddle src=http://images.highspeedbackbone.net/main/gfx-blkbullet.jpg>&nbsp;&nbsp;Compatible with: &nbsp;HP Color LaserJet CM6030 MFP CM6030f MFP CM6040 MFP CM6040f MFP CP6015de CP6015dn CP6015n CP6015x CP6015xh<br><!-- END CNET Content -->"],"Item ID":["41057188"],"Product Segment":["Electronics"],"UPC":["095205855838"]}

Problem:

The regex should check whether the JSON text contains one of the words Height, Width, Length, or Depth, and if so, fetch the value that follows it.

Since the JSON text above contains none of these labels, the regex should find nothing, but it returns an unwanted match. I think I am missing something in the regex.

Edit:

For this input JSON, I should be able to extract Height, Length, Width, or Depth:

{"Brand":["Concord Fans"],"Energy Guide: Appliance Labeling Rule Required":["N"],"Country of Origin: Components":["USA and/or Imported"],"Product Short Description":["Height: 6.2."],"Actual Color":["Multicolor"],"Product Segment":["Clothing, Shoes & Accessories"],"Color":["Multicolor"],"Product Name":["Concord Fans RM-08 Remote & Wall Control Set"],"Product Type":["Televisions"],"Manufacturer Part Number":["RM-08"],"Manufacturer":["Concord Fans"],"Category":["TVs"],"Product Long Description":["Height: 6-2- Width: 8-8- Length: 8-8- Energy Star: No- Energy Saver: No- UL Classification: UL Certified- UL Application: Dry SKU: CNCD467"],"GTIN":["00014592213038"],"Number of Batteries":["0"],"E-Waste Recycling Compliance Required":["N"],"UPC":["014592213038"]}
user5431918

2 Answers


It is not a good idea, in general, to parse JSON data with regular expressions, but you definitely have something wrong in this part of the regular expression:

[Height|Length|Width|Depth]

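Square brackets define a character class, which matches any single character from the set inside them. This would, for instance, match a single "H":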

>>> re.search("[Height|Length|Width|Depth]", "H").group()
'H'
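
To see why the original pattern finds unwanted values in the first input, it may help to look at what the bracketed expression actually contains. The following is a minimal sketch; the sample string is a fragment taken from the first JSON input, and the variable names are illustrative only.

```python
import re

# Inside [...] the pipes are literal characters, so the class is just a set
# of single characters, not a choice between four words.
char_class = r"[Height|Length|Width|Depth]"
allowed = sorted(set("Height|Length|Width|Depth"))
print(allowed)

# Fragment of the first JSON input; it contains no dimension labels.
sample = "Consumable Type: Toner cartridge"

# The class followed by ':' matches the 'e' that ends "Type".
false_hits = re.findall(char_class + ":", sample)
print(false_hits)  # ['e:']
```

Any label that happens to end in one of those characters can start a match, which is probably where the undesirable values come from.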

It looks like you meant to use a non-capturing group here:

(?:Height|Length|Width|Depth)
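
As a rough illustration, the group can be combined with a value pattern and tried on text with and without dimension labels. This is a sketch under assumptions: the number format (`6`, `6.2`, or `6-2`) is guessed from the sample data in the question, and `DIMENSION_RE` and the sample strings are names made up for this example.

```python
import re

# Assumed pattern: a label from the non-capturing group, a colon, optional
# whitespace, then a number such as 6, 6.2 or 6-2 (captured).
DIMENSION_RE = re.compile(r"(?:Height|Length|Width|Depth):\s*(\d+(?:[.-]\d+)?)")

with_dims = "Height: 6-2- Width: 8-8- Length: 8-8- Energy Star: No"
without_dims = "Consumable Type: Toner cartridge Duty Cycle: Up to 23500 pages"

values = DIMENSION_RE.findall(with_dims)
nothing = DIMENSION_RE.findall(without_dims)
print(values)   # ['6-2', '8-8', '8-8']
print(nothing)  # []
```

Because the label group is non-capturing, `findall` returns only the values; making it a capturing group would return the label as well.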

alecxe

It looks like your data is valid JSON, so try the json module instead. After parsing it, you can access fields with regular dictionary keys, such as d['Product Long Description'], and from there you can extract the information in many ways. Here is one way to go:

import json
import re

s = """{"Brand":["Concord Fans"],"Energy Guide: Appliance Labeling Rule Required":["N"],"Country of Origin: Components":["USA and/or Imported"],"Product Short Description":["Height: 6.2."],"Actual Color":["Multicolor"],"Product Segment":["Clothing, Shoes & Accessories"],"Color":["Multicolor"],"Product Name":["Concord Fans RM-08 Remote & Wall Control Set"],"Product Type":["Televisions"],"Manufacturer Part Number":["RM-08"],"Manufacturer":["Concord Fans"],"Category":["TVs"],"Product Long Description":["Height: 6-2- Width: 8-8- Length: 8-8- Energy Star: No- Energy Saver: No- UL Classification: UL Certified- UL Application: Dry SKU: CNCD467"],"GTIN":["00014592213038"],"Number of Batteries":["0"],"E-Waste Recycling Compliance Required":["N"],"UPC":["014592213038"]}"""

# s is already JSON text, so a single json.loads call is enough.
d = json.loads(s)
print(d['Product Long Description'])
print(''.join(d['Product Long Description']).split(":")[0:4])
# Each findall result is a 3-tuple with one non-empty group; drop the empty strings.
print([tuple(filter(len, y)) for y in re.findall(r'Height:\s*([\d.]+-[\d.]+)|Width:\s*([\d.]+-[\d.]+)|Length:\s*([\d.]+-[\d.]+)', ''.join(d['Product Long Description']))])

Output:

['Height: 6-2- Width: 8-8- Length: 8-8- Energy Star: No- Energy Saver: No- UL Classification: UL Certified- UL Application: Dry SKU: CNCD467']
['Height', ' 6-2- Width', ' 8-8- Length', ' 8-8- Energy Star']
[('6-2',), ('8-8',), ('8-8',)]
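
The two steps in this answer (parse the JSON, then search the field values) could also be wrapped in one helper that returns nothing when no dimension labels are present. This is only a sketch: `extract_dimensions` and `LABELLED_RE` are hypothetical names, the number pattern is an assumption about how values are formatted, and the inputs are shortened versions of the two JSON documents in the question.

```python
import json
import re

# Capturing the label as well, so each match is a (label, value) pair.
LABELLED_RE = re.compile(r"(Height|Length|Width|Depth):\s*(\d+(?:[.-]\d+)?)")

def extract_dimensions(json_text):
    """Map each JSON field to the (label, value) pairs found in its values."""
    data = json.loads(json_text)
    found = {}
    for field, values in data.items():
        # Every field in the question's data holds a list of strings.
        pairs = LABELLED_RE.findall(" ".join(values))
        if pairs:
            found[field] = pairs
    return found

fan = '{"Product Short Description":["Height: 6.2."],"Product Long Description":["Height: 6-2- Width: 8-8- Length: 8-8- Energy Star: No"]}'
toner = '{"Product Type":["Printer Cartridges"],"Brand":["XEROX"]}'

print(extract_dimensions(fan))
print(extract_dimensions(toner))  # {}
```

Searching the parsed values rather than the raw JSON text keeps the pattern from matching across field boundaries or inside key names.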
SIslam
  • I didn't understand the last print statement. How did you extract the values? It would be helpful if you could explain briefly. – user5431918 Dec 14 '15 at 04:07
  • `r'\w+:'` matches a `:` preceded by one or more word characters (letters, digits, or underscore), so the string can be split on it. – SIslam Dec 14 '15 at 04:14