
We're scrubbing a ridiculous amount of data and are finding many examples of cleaned data that are left with irrelevant punctuation at the beginning and end of the final string. Single and double quotes are fine, but leading/trailing dashes, commas, etc. need to be removed.

I've studied the answer at "How can I remove all leading and trailing punctuation?", but am unable to find a way to accomplish the same in PHP.

- some text.                dash and period should be removed
"Some Other Text".          period should be removed
it's a matter of opinion    apostrophe should be kept
/ some more text?           Slash should be removed and question mark kept

In short,

  • Certain punctuation occurring BEFORE the first AlphaNumeric character must be removed
  • Certain punctuation occurring AFTER the last AlphaNumeric character must be removed

How can I accomplish this with PHP? The few examples I've found surpass my RegEx/JS abilities.

GDP
  • Do you want to keep the space after the dash in `- some text.` and after the slash in `/ some more text?` or should it be removed as well? – Aran-Fey Oct 06 '14 at 14:03
  • ultimately, everything should be trimmed, no leading/trailing spaces, but our PHP routines do that before saving anyways. – GDP Oct 06 '14 at 14:05
  • `s|^[/\s\-]||`, `s|[.\s/\-]$||`, basically – Marc B Oct 06 '14 at 14:07

3 Answers

1

This is an answer without regex.

You can use the function trim (or a combination of ltrim/rtrim) to specify all the characters you want to remove. For your example:

$str = trim($str, " \t\n\r\0\x0B-.");

(Since I assume you also want to remove whitespace and newlines at the beginning/end, I kept the default mask.)

See also rtrim and ltrim if you don't want to remove the same charlist at the beginning and the end of your strings.
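
A minimal sketch of that ltrim/rtrim combination, assuming you want to strip a leading `?` but keep a trailing one. The helper name `trimPunctuation` and both character lists are illustrative assumptions, not something taken from the question, so adjust them to your data.

```php
<?php
// Hypothetical helper: the name and the two masks are assumptions.
function trimPunctuation(string $str): string
{
    // Stripped from the start: the default whitespace mask plus - . , / ?
    $leading  = " \t\n\r\0\x0B-.,/?";
    // Stripped from the end: the same mask without "?", so a final "?" survives
    $trailing = " \t\n\r\0\x0B-.,/";

    // Note: inside these masks ".." would denote a character range,
    // so avoid writing two consecutive dots.
    return rtrim(ltrim($str, $leading), $trailing);
}
```

With these masks, `trimPunctuation('/ some more text?')` returns `some more text?`, and quotes and apostrophes are left alone because they are not in either list.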

Asenar
  • That would trim trailing periods as well as leading periods though, wouldn't it? – GDP Oct 06 '14 at 14:10
  • Yes, but you can also apply the two functions separately: `ltrim` with the period, then `rtrim` without it, if that's what you need. This will be much quicker than functions involving regular expressions – Asenar Oct 06 '14 at 14:12
  • True...all these years with PHP, and I didn't even know you could specify what to trim! Regex is heavily in play anyway, so that's where my head has been. – GDP Oct 06 '14 at 14:14
  • I love to play with regex, but sometimes it's simply not worth the cost ;) – Asenar Oct 06 '14 at 14:17
  • 1
    One day I'll properly learn Regex, but I want to learn Mandarin first - it seems much easier, lol – GDP Oct 06 '14 at 14:20
  • I have to accept the answer from php_nub_qq since it actually answers the question asked, but yours is certainly the best solution to the problem I'm solving. – GDP Oct 06 '14 at 14:40
  • That's fine, if the problem really requires regex, then yes, that's his answer ;) . About it, notice you can simplify the regex by using a character other than `/` for the boundaries. – Asenar Oct 06 '14 at 14:45
0

You can modify the character classes in the two patterns to include whichever characters you want removed.

$array = array(
    '- some text.',
    '"Some Other Text".',
    'it\'s a matter of opinion',
    '/ some more text?'
);

foreach($array as $key => $string){
    $array[$key] = preg_replace(array(
        '/^[\.\-\/]*/',
        '/[\.\-\/]*$/'
    ), array('', ''), $string);
}

print_r($array);
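
If the two ends need different treatment (for example, strip a leading `?` but keep a trailing one), each pattern can carry its own character class. The following is only a sketch under that assumption: the helper name `stripEdgePunctuation` and the exact character sets are hypothetical, and `#` is used as the delimiter so that `/` needs no escaping.

```php
<?php
// Hypothetical helper; the name and the character sets are assumptions.
function stripEdgePunctuation(string $str): string
{
    $result = preg_replace(
        array(
            '#^[\s.,/?-]+#', // leading: whitespace . , / ? - ("-" last, so it is literal)
            '#[\s.,/-]+$#',  // trailing: the same set without "?"
        ),
        '',
        $str
    );

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $str;
}
```

For instance, `stripEdgePunctuation('? why not?')` gives `why not?`, while quotes and apostrophes are untouched because they appear in neither class.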
php_nub_qq
  • Trying to understand this...how would it be modified to allow a trailing "?", but not a leading "?". – GDP Oct 06 '14 at 14:22
  • 2
    @GDP The two lines that look like they're some subtitle censor of a terminator are the leading and trailing patterns. In between `[` and `]` you put characters you want to be removed. The characters you see are escaped with a backslash because they are special characters for regex. – php_nub_qq Oct 06 '14 at 14:24
  • 1
    preg_replace('#^[ ./?-]*|[ ./-]*$#', '', $string); – OIS Oct 06 '14 at 14:25
  • @OIS this pattern is not doing what the op is asking for http://regex101.com/r/iC6gJ5/1 please don't confuse them needlessly. – php_nub_qq Oct 06 '14 at 14:28
  • I don't think you need to escape the dot character inside character class. @GDP To replace trailing `?`, just add `?` inside the second character class (inside the brackets, just before `]` for example) – Asenar Oct 06 '14 at 14:28
  • @php_nub_qq I don't know about that site but it works in PHP. – OIS Oct 06 '14 at 14:33
  • Though I may implement the answer from @Asenar in this case, I LOVE the elegance of yours, and it does actually answer the question asked. If you're ever in Seattle, I'll buy Starbucks all day long if you share some of that Regex sorcery with me, lol. – GDP Oct 06 '14 at 14:39
  • 1
    @GDP Hehe, to be honest I'm not that good with regex either, I dislike it because it is very slow compared to simple bit operations ( `str_*` functions mainly ). How I build regex is I open one of these sites, most commonly `regex101` because of its nice design, and put a bunch of characters in the pattern until it matches. Of course you need to know at least what they do `:D` Glad I could help. – php_nub_qq Oct 06 '14 at 14:45
  • 1
    I agree with what @OIS added in the comments: with that pattern, there is no need to escape any character (neither `/`, because the boundaries are `#`, nor `-`, because it is placed at the end of the character class) :) . Of course, to handle leading and trailing characters differently it has to be used in an array, as written in the answer ;) – Asenar Oct 06 '14 at 14:48
0

If the punctuation to trim can be more than one character long, you could do this:

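function trimFormatting($str){ // trim leading/trailing <br>, commas and whitespace
    $osl = 0; // string length before the latest pass
    $pat = '(<br>|,|\s+)';
    // repeat until a pass no longer shortens the string
    while($osl !== strlen($str)){
        $osl = strlen($str);
        $str = preg_replace('/^'.$pat.'|'.$pat.'$/i', '', $str);
    }
    return $str;
}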
echo trimFormatting('<BR>,<BR>Hello<BR>World<BR>, <BR>'); 

// will give "Hello<BR>World"

The routine checks for "<BR>", "," and one or more whitespace characters ("\s+"). The "|" is the OR operator, which appears three times in the routine. It trims both at the start ("^") and at the end ("$") in the same pass, and keeps looping until nothing more is trimmed off (i.e. there is no further reduction in string length).
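
As a possible alternative to the loop (a different technique, not the routine above), the group itself can be repeated with `+` so that a run of consecutive tokens at either end is removed in a single `preg_replace` call. This is only a sketch: the function name `trimFormattingOnce` is hypothetical, and the token list is assumed to be the same as above.

```php
<?php
// Hypothetical loop-free variant; the name and token list are assumptions.
function trimFormattingOnce(string $str): string
{
    // One token: a literal <br>, a comma, or a single whitespace character.
    // A single \s (not \s+) is used because the whole group is repeated below.
    $token = '(?:<br>|,|\s)';

    // The "+" after the group consumes a whole run of tokens at each end.
    $pattern = '/^' . $token . '+|' . $token . '+$/i';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '', $str) ?? $str;
}
```

Called as `trimFormattingOnce('<BR>,<BR>Hello<BR>World<BR>, <BR>')`, this should return the same `Hello<BR>World` as the looping version, since tokens in the middle of the string are never anchored to `^` or `$`.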

will