I am trying to extract the physical dimensions of items in a dataset from an unformatted string description. There are quite a few different ways they are expressed in the string. Here are some examples:
2” (7cm) high, 3” (9cm) long and 2” (7cm) wide
7” (20cm) high, 5” (15cm) wide and 5” (13cm) deep
4” high, 7” wide and 5” deep
6 inches high, 17 inches wide, and 6 inches deep
Ideally I would like to extract them with a single regular expression per dimension, but I can't work out how to do it and don't really know where to start. I am using a pandas DataFrame and the str.extract() method, if that makes a difference. Here is what I have so far:
r'(?P<height_cm>\d+)cm\) high'
r'(?P<width_cm>\d+)cm\) wide'
r'(?P<length_cm>\d+)cm\) (?:deep|long)'
But this obviously only captures the cm numbers. How can I also capture the inches if present? And how can I use either the inch symbol or the word inches so that they both match?
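To make the goal concrete, here is a rough sketch of the kind of pattern I imagine, written with plain re so it can be run without a DataFrame. The helper names dimension_pattern and extract_dimensions and the *_in group names are made up for illustration. It assumes the inch value always comes first, that the inch mark is either a curly or a straight double quote, and that the cm value, when present, follows in parentheses. It is only tried against the four examples above, so treat it as a starting point rather than a finished solution.

```python
import re

def dimension_pattern(name, words):
    """Build one pattern per dimension (hypothetical helper).

    name  -- prefix for the named groups, e.g. 'height'
    words -- keywords that mark the dimension, e.g. ['high']
    """
    keyword = "|".join(words)
    return re.compile(
        # inch value, then either an inch mark (curly or straight) or the word
        rf'(?P<{name}_in>\d+)\s*(?:[”"]|inch(?:es)?)'
        # optional "(7cm)" part; the group is None when it is absent
        rf'(?:\s*\((?P<{name}_cm>\d+)\s*cm\))?'
        # the keyword that says which dimension this is
        rf'\s*(?:{keyword})\b'
    )

# Same grouping as the patterns above: "deep" and "long" both count as length.
PATTERNS = {
    "height": dimension_pattern("height", ["high"]),
    "width": dimension_pattern("width", ["wide"]),
    "length": dimension_pattern("length", ["deep", "long"]),
}

def extract_dimensions(text):
    """Return a dict of captured values (as strings) or None where missing."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            result.update(match.groupdict())
        else:
            result.update({f"{name}_in": None, f"{name}_cm": None})
    return result

print(extract_dimensions("2” (7cm) high, 3” (9cm) long and 2” (7cm) wide"))
print(extract_dimensions("6 inches high, 17 inches wide, and 6 inches deep"))
```

Since Series.str.extract() turns named groups into columns, I would expect to be able to pass PATTERNS["height"].pattern (and likewise for the others) to it on the description column, but I have not verified that on the real data.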
Any help would be greatly appreciated.