
I am trying to extract the physical dimensions of items in a dataset from an unformatted string description. There are quite a few different ways they are expressed in the string. Here are some examples:

2” (7cm) high, 3” (9cm) long and 2” (7cm) wide
7” (20cm) high, 5” (15cm) wide and 5” (13cm) deep
4” high, 7” wide and 5” deep
6 inches high, 17 inches wide, and 6 inches deep

I am trying to extract them in the most elegant way possible using just a single regular expression for each dimension, ideally, but I can't seem to wrap my head around how to do it and I don't even know where to start, really. I am using a pandas DataFrame and the extract() method, if that makes a difference. Here is what I have so far:

r'(?P<height_cm>\d+)cm\) high'
r'(?P<width_cm>\d+)cm\) wide'
r'(?P<length_cm>\d+)cm\) (?:deep|long)'

But this obviously only captures the cm numbers. How can I also capture the inches if present? And how can I use either the inch symbol or the word inches so that they both match?

Any help would be greatly appreciated.

Simon
  • You seem to want to extract cm values, but cm values are not available in all of the data, and I don't think you want to mix cm and inch values. So, what are you trying to extract? Can you give us an example of the expected output? – Julio Feb 05 '20 at 09:25
  • Do you mean you have several matches per line? – Wiktor Stribiżew Feb 05 '20 at 09:25
  • AFAIK `2”` =~ `5.08 cm` – Toto Feb 05 '20 at 09:56
  • Not all lines contain both inches and centimeters so I intend to extract both and then keep only the more frequently mentioned unit (and convert if necessary) later on in the processing pipeline. – Simon Feb 05 '20 at 12:49

2 Answers


Based on the examples given (and assuming that "deep" and "long" refer to the same dimension):

(?:(?:((?:(?P<height_inch>\d+)(?:”| inches))(?: \((?P<height_cm>\d+)(?:\s?cm)\))? high)|((?:(?P<deep_inch>\d+)(?:”| inches))(?: \((?P<deep_cm>\d+)(?:\s?cm)\))? (?:deep|long))|((?:(?P<wide_inch>\d+)(?:”| inches))(?: \((?P<wide_cm>\d+)(?:\s?cm)\))? wide)).*?)+

Edit: above regex updated to work with re.fullmatch and Series.str.extractall

This one might be simpler to work with:

((?:(?P<inch>\d+)(?:”| inches))(?: \((?P<cm>\d+)(?:\s?cm)\))? (?P<side>high|wide|deep|long))

Use this one with Series.str.extractall as well.
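
To show how the simpler pattern behaves with Series.str.extractall, here is a minimal sketch run on the four sample strings from the question. The variable names (`descriptions`, `matches`, `wide`) are my own, I dropped the outer capturing group so that only the named groups become columns, and folding "long" into "deep" is an assumption, not something the data confirms:

```python
import pandas as pd

# Sample descriptions copied from the question.
descriptions = pd.Series([
    "2” (7cm) high, 3” (9cm) long and 2” (7cm) wide",
    "7” (20cm) high, 5” (15cm) wide and 5” (13cm) deep",
    "4” high, 7” wide and 5” deep",
    "6 inches high, 17 inches wide, and 6 inches deep",
])

# Simpler pattern, without the outer capturing group.
pattern = (
    r"(?P<inch>\d+)(?:”| inches)"      # number followed by ” or " inches"
    r"(?: \((?P<cm>\d+)\s?cm\))?"      # optional " (Ncm)" part
    r" (?P<side>high|wide|deep|long)"  # which dimension the number describes
)

# One row per match, indexed by (original row, match number).
matches = descriptions.str.extractall(pattern)

# Assumption: "long" and "deep" describe the same dimension.
matches["side"] = matches["side"].replace({"long": "deep"})

# Reshape to one row per description and one column per (side, unit) pair.
wide = (
    matches.reset_index(level="match", drop=True)
    .set_index("side", append=True)
    .unstack("side")
)
wide.columns = [f"{side}_{unit}" for unit, side in wide.columns]
print(wide)
```

The extracted values are strings, and the cm columns are missing wherever the description gives inches only, so a conversion to numbers would still be needed before comparing or converting units.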

On regex101

VincentRG

Note that these regular expressions will also match strings of the format x inches (ycm). I am assuming that's not a problem.

r'(?P<height_inches>\d+)(?:”|\sinches)(?:\s\((?P<height_cm>\d+)\s?cm\))?\shigh'
r'(?P<width_inches>\d+)(?:”|\sinches)(?:\s\((?P<width_cm>\d+)\s?cm\))?\swide'
r'(?P<length_inches>\d+)(?:”|\sinches)(?:\s\((?P<length_cm>\d+)\s?cm\))?\s(?:deep|long)'
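
As a quick way to check one-pattern-per-dimension expressions like these against the sample strings, here is a small sketch using only the standard `re` module. The helper names `dimension_pattern` and `extract_dimensions` are hypothetical, and I am assuming a space before the parenthesised cm value and an optional space before "cm", since that is how the question's examples are written:

```python
import re

def dimension_pattern(name, words):
    """Build one pattern per dimension; `name` prefixes the group names."""
    return re.compile(
        rf"(?P<{name}_inches>\d+)(?:”|\sinches)"   # 4” or 4 inches
        rf"(?:\s\((?P<{name}_cm>\d+)\s?cm\))?"     # optional " (10cm)"
        rf"\s(?:{words})"                          # e.g. high, wide, deep|long
    )

PATTERNS = [
    dimension_pattern("height", "high"),
    dimension_pattern("width", "wide"),
    dimension_pattern("length", "deep|long"),  # assumes deep == long
]

def extract_dimensions(text):
    """Return a dict of all named groups found; missing cm values are None."""
    result = {}
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            result.update(match.groupdict())
    return result

print(extract_dimensions("7” (20cm) high, 5” (15cm) wide and 5” (13cm) deep"))
print(extract_dimensions("6 inches high, 17 inches wide, and 6 inches deep"))
```

Because every group other than the named ones is non-capturing, the same pattern strings (available as `pattern.pattern`) should also produce only the named columns if passed to Series.str.extract, though I have only checked them with `re` here.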
Arjun Kay