0

I am currently working on some legacy code (a Java project), and a lot of variables (15k) have an underscore as their first character, e.g.:

_iAmAInt //should be iAmAInt

(all variables to be replaced start with _ followed by a lower case letter)

So I thought I would try to clean this up with a small script using sed and a regex. So far this is what I've got:

while IFS= read -r -d '' file; do
    if [[ $file == *.java ]]; then
        # quote "$file" so paths containing spaces survive word splitting
        sed -i -E 's/([_])([a-z])/\2/g' "$file"
    fi
done < <(find "$1" -type f -print0)

The thing is in some cases, I have some strings (for example queries) which have something like this: "select house_id from houses"

My current regex doesn't take this into account, but obviously I need to specify somehow that the _ characters between " " are not to be deleted.

From what I've read, I could use a negative lookahead (Regex: match everything but specific pattern)

But I'm not quite sure this would completely solve my issue, or even whether the whole process is a good idea.

Any hints or feedback on how to proceed, and on what to do or not to do, are welcome! Thanks

Edit: Yes, the code is Java, and SonarQube flags this as an issue (even though it's not really critical)

Edit 2: Thanks for all the answers and comments, I've learned a lot, I'll try them out and make sure to choose one as a valid answer !

Youri
  • What does this have to do with [tag:java]? – Jacob G. Mar 20 '19 at 13:02
  • 4
    If you have 15k variables that start with `_` the odds are very slim that you can blindly remove all of the underscores and not end up with the new names clashing with existing variable, function, etc. names. Also starting some types of variables with `_` (or double `_`) is a common naming convention in some languages - you may lose useful information if you strip them all. – Ed Morton Mar 20 '19 at 13:02
  • every company i have worked with uses `_` at the beginning of a variable to show that it is private. `_` is not a bad thing. It doesn't even violate Java naming conventions. – Dylan Mar 20 '19 at 13:04
  • @JacobG. well, it matters in that if it were C, the naming convention would be different, but indeed, this is not related to "programming" in Java, if this is what you meant – Youri Mar 20 '19 at 13:05
  • Do you tag [java] because the code in question is written in Java? Because in some other languages (e.g. C), certain language keywords and other reserved identifiers start with an underscore, and removing that would break the code. – John Bollinger Mar 20 '19 at 13:05
  • @JohnBollinger Yes sir ! I edited the question as it was not quite clear, thanks :) – Youri Mar 20 '19 at 13:07
  • To quote an old style guide: _"When modifying existing software your changes should follow the style of the original code. Do not introduce a new coding style in a modification and do not attempt to rewrite the old software just to make it match the new style."_ – jas Mar 20 '19 at 13:08
  • I do have a question. In your regex, you assume that the variable starts with `_` followed by a **lower case** alphabetic character. Are you certain this covers all of the variables you want to change? Because `_9` is a valid variable, also `_B` is valid. Are there any variables like that? – Dylan Mar 20 '19 at 13:09
  • @Dylan No, there should be no variables such as the ones you described (one or two might have slipped in though). However, the constants are usually defined like I_AM_A_CONSTANT = " "; , that's why I'm only using lowercase in my regex :) – Youri Mar 20 '19 at 13:12
  • 1
    Although I'm sure you can do this job with `sed`, the flavor of regular expressions it recognizes is not expressive enough to make the job straightforward. On the other hand, `awk` regexes recognize escape sequences matching zero-length word-start and word-end boundaries (`\<` and `\>`), and these, especially the first, would make the job easier. – John Bollinger Mar 20 '19 at 13:14
  • @JohnBollinger `\<` and `\>` are GNU awk word boundaries which are also supported by GNU sed (which the OP is apparently using given her use of `-i -E` args). – Ed Morton Mar 20 '19 at 13:20
  • Hmm, @EdMorton, the docs for my version of GNU `sed` (4.2.2) do not describe either the word boundary escapes or a `-E` option. Dunno whether the docs are deficient or whether those are newer features. I know v4.2.2 is several years old, but it is still what comes standard in the latest versions of RHEL & co.. – John Bollinger Mar 20 '19 at 13:36
  • @JohnBollinger `-E` is a somewhat newer feature and was implemented but undocumented for a while in GNU sed but I couldn't give you C&V on that or the word boundaries. You could try `echo 'foo.bar.abc' | sed -E 's/\<bar\>/xxx/'` and if it outputs `foo.xxx.abc` then your sed supports both. – Ed Morton Mar 20 '19 at 13:44
  • 1
    Thanks, @EdMorton. It looks like v4.2.2, as provided on EL7, does support both, even though they are not documented. – John Bollinger Mar 20 '19 at 13:50

3 Answers

2
> sed -E 's/("([^"\\]|\\.)*")|_([a-z0-9]+)|([a-z][a-z0-9_]+)/\1\3\4/g'
foo _bar foo_bar " \" _zoo \" "
foo bar foo_bar " \" _zoo \" "

The first group captures string literals. The third group captures what follows the leading underscore of an identifier (lowercase letters and digits only), so the underscore itself is dropped. The fourth group captures all other identifiers that start with a lowercase letter; it is needed to avoid removing underscores in the middle of identifiers.
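As a quick sanity check, here is a minimal sketch that wraps a variant of this command in a shell function (the name strip_leading_underscores is made up for illustration). Two things differ from the command above and are my assumptions, not part of the original answer: group 3 takes only a single lowercase letter, matching the rule stated in the question, and group 4 also accepts uppercase letters and digits, so that an underscore following a capital (as in maxB_size) counts as the middle of an identifier. It still knows nothing about comments or char literals, so treat it as a starting point only.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads Java source on stdin, writes the result to stdout.
# Group 1: a double-quoted string literal, copied through unchanged.
# Group 3: the lowercase letter after a leading underscore (underscore dropped).
# Group 4: any identifier that does not start with an underscore, copied whole,
#          so underscores in the middle of a name never reach group 3.
strip_leading_underscores() {
  sed -E 's/("([^"\\]|\\.)*")|_([a-z])|([A-Za-z0-9][A-Za-z0-9_]*)/\1\3\4/g'
}

# Small demo line: one variable to rename, one name and one string to keep.
printf '%s\n' 'int _iAmAInt = maxB_size; String q = "select house_id from _t";' |
  strip_leading_underscores
```

On the demo line only _iAmAInt should lose its underscore; maxB_size and the query string should come through untouched.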

Mikhail Vladimirov
1

Although I remarked in comments that sed's regular expressions are a bit lacking for this job, I realized that sed can still do it without too much muss. The trick would be to first protect the underscores you want to keep, then remove the others, then restore the protected ones. Kind of an organic chemistry approach to the problem, if you will.

For this purpose, you can rely on the fact that there is one character that will never be in sed's pattern space unless put there by a sed command: the newline. sed strips them on input and (normally) emits new ones on output, but if they do end up in the pattern space then they are not otherwise special. So consider this:

sed -i -E 's/([^ \t])_/\1\n/g; s/_([a-z])/\1/g; s/\n/_/g' "$file"

There are three substitutions performed:

  1. every underscore that immediately follows a character other than a space or tab is replaced by a newline;
  2. (a variation on your original regex:) every underscore followed by a lowercase Latin letter is removed; and
  3. every newline is replaced with an underscore.

Remember, again, that sed strips newlines on input and appends new ones on ordinary output, so the only newlines available for replacement in (3) are those that were introduced in (1) to hide underscores that you want to protect from the substitution in (2).
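To see the three steps on a concrete line, here is a small sketch. It assumes GNU sed (for \t inside the bracket expression and \n in the replacement) and uses -E because the capture groups are written in extended syntax; the function name protect_strip_restore is hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper around the three substitutions; assumes GNU sed.
#   1. protect: an underscore after any character other than space/tab -> newline
#   2. strip:   a remaining underscore followed by a lowercase letter is removed
#   3. restore: every newline placeholder becomes an underscore again
protect_strip_restore() {
  sed -E 's/([^ \t])_/\1\n/g; s/_([a-z])/\1/g; s/\n/_/g'
}

printf '%s\n' 'int _count = house_id + this._size;' | protect_strip_restore
```

With GNU sed this should print `int count = house_id + this._size;`. Note that this._size keeps its underscore, because the preceding `.` is neither a space nor a tab, so names reached through `this.`, an opening parenthesis, or an operator without surrounding whitespace would need a wider set of "allowed preceding characters" in step 1.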

John Bollinger
0

Note that you may have a variable like _return, where removing the _ produces a Java keyword.

This operation is easier with perl, because Perl regular expressions have more features than sed's.
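Before renaming anything, it may be worth listing the names that would turn into reserved words. The following is only a sketch: the function name list_keyword_clashes is hypothetical, it assumes GNU grep (for \b and -o), and it also picks up _names that happen to sit inside string literals or comments.

```bash
#!/usr/bin/env bash

# Java reserved words plus the literals true/false/null, none of which
# may be used as an identifier.
java_reserved=(
  abstract assert boolean break byte case catch char class const continue
  default do double else enum extends final finally float for goto if
  implements import instanceof int interface long native new package
  private protected public return short static strictfp super switch
  synchronized this throw throws transient try void volatile while
  true false null
)

# Hypothetical helper: reads Java source on stdin and prints every
# _name (underscore + lowercase letter) whose stripped form is reserved.
list_keyword_clashes() {
  grep -oE '\b_[a-z][A-Za-z0-9_]*' |
    sort -u |
    while IFS= read -r name; do
      case " ${java_reserved[*]} " in
        *" ${name#_} "*) printf '%s\n' "$name" ;;
      esac
    done
}

printf '%s\n' 'int _return = 1; int _count = house_id; boolean _new;' |
  list_keyword_clashes
```

The same idea extends to clashes with existing fields or methods: collect the stripped names and compare them against the identifiers already declared in each file.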

Examples

to preview, just display the matching lines:

# where ... are find options e.g. `-name '*.java'`
find "$1" -type f ... -exec perl -ne 'print "$ARGV:$_" if /"(?:\\.|[^"])*"(*SKIP)(?!)|\b_[a-z]/' {} +

to change files in place (-i.bak works like sed -i.bak: the original files are kept with a .bak suffix; a bare -i keeps no backup):

find "$1" -type f ... -exec perl -i.bak -pe 's/"(?:\\.|[^"])*"(*SKIP)(?!)|\b_(?=[a-z])//g' {} +

to revert: restore the originals from the .bak files

find "$1" -type f ... -name '*.bak' -exec bash -c 'for f; do mv "$f" "${f%.bak}"; done' bash {} +

to delete .bak files

find "$1" -type f ... -name '*.bak' -delete

How regex works

  • "(?:\\.|[^"])*" : matches a string literal "..." that may contain escaped characters such as \"
  • (*SKIP)(?!)| : backtracking control verbs that discard this match:
    • (*SKIP) makes the next match attempt start at the current position, so the string literal just consumed is never rescanned
    • (?!) always fails, which forces the match to fail
    • | lets the following alternative be tried
  • \b_(?=[a-z]) : matches a _ at a word boundary (that is, preceded by a non-word character or the start of the line) and followed by a lowercase letter ([a-z])
Nahuel Fouilleul