
I am trying to write a bash script for testing that takes a parameter and sends it through curl to a web site. I need to URL-encode the value to make sure that special characters are processed properly. What is the best way to do this?

Here is my basic script so far:

#!/bin/bash
host=${1:?'bad host'}
value=$2
shift 2
curl -v -d "param=${value}" "http://${host}/somepath" "$@"
Mateusz Piotrowski
Aaron
  • See also: [How to decode URL-encoded string in shell?](http://stackoverflow.com/q/6250698/55075) for non-curl solutions. – kenorb Mar 01 '15 at 17:30
  • See also: [How can I encode and decode percent-encoded strings on the command line?](https://askubuntu.com/questions/53770/how-can-i-encode-and-decode-percent-encoded-strings-on-the-command-line) – Anton Tarasenko May 22 '18 at 19:17

35 Answers


Use curl --data-urlencode; from man curl:

This posts data, similar to the other --data options with the exception that this performs URL-encoding. To be CGI-compliant, the <data> part should begin with a name followed by a separator and a content specification.

Example usage:

curl \
    --data-urlencode "paramName=value" \
    --data-urlencode "secondParam=value" \
    http://example.com

See the man page for more info.

This requires curl 7.18.0 or newer (released January 2008). Use curl -V to check which version you have.
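If a script needs to fail gracefully on older systems, the check can be automated. A minimal sketch — `version_ge` is a helper name invented here, the parsing assumes curl's usual `curl X.Y.Z ...` banner, and `sort -V` is a GNU coreutils extension:

```shell
# Return success if version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical guard before relying on --data-urlencode:
if version_ge "$(curl -V 2>/dev/null | awk 'NR==1{print $2}')" 7.18.0; then
  echo "curl supports --data-urlencode"
fi
```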

You can encode the query string as well:

curl -G \
    --data-urlencode "p1=value 1" \
    --data-urlencode "p2=value 2" \
    http://example.com
    # http://example.com?p1=value%201&p2=value%202
x-yuri
Jacob Rask
    Seems to only work for http POST. Documentation here: http://curl.haxx.se/docs/manpage.html#--data-urlencode – Stan James Apr 13 '12 at 06:47
    @StanJames If you use it like so curl can also do the encoding for a GET request. `curl -G --data-urlencode "blah=df ssdf sdf" --data-urlencode "blah2=dfsdf sdfsd " http://whatever.com/whatever` – kberg May 07 '12 at 20:52
    @kberg actually, this will only work for query data. curl will append a '?' followed by the urlencoded params. If you want to urlencode some url postfix (such as a CouchDB GET for some document id), then '--data-urlencode' won't work. – Bokeh Aug 28 '12 at 22:41
    Doesn't work for `curl --data-urlencode "description=![image]($url)" www.example.com`. Any idea why? ` – Khurshid Alam Jun 03 '16 at 20:37
    I want to URL encode the URL path (which is used as a parameter in a REST API endpoint). There is no query string parameters involved. How do I do this for a GET request? – Web User Mar 31 '17 at 21:08
  • It doesn't work for --data-urlencode "key=special "string"" What are we missing here? – Nathan B Mar 06 '18 at 15:54
    @NadavB Escaping the `"`‽ – BlackJack Apr 19 '18 at 09:33
  • Careful with `{}` and `[]` and curl. Use the `-g` option to turn OFF globbing. – Buzz Moschetti Nov 14 '20 at 23:17
  • `-G, --get : Put the post data in the URL and use GET` comment from @kberg works good – KingKongCoder Jan 17 '21 at 21:23

Here is the pure BASH answer.

rawurlencode() {
  local string="${1}"
  local strlen=${#string}
  local encoded=""
  local pos c o

  for (( pos=0 ; pos<strlen ; pos++ )); do
     c=${string:$pos:1}
     case "$c" in
        [-_.~a-zA-Z0-9] ) o="${c}" ;;
        * )               printf -v o '%%%02x' "'$c"
     esac
     encoded+="${o}"
  done
  echo "${encoded}"    # You can either set a return variable (FASTER) 
  REPLY="${encoded}"   #+or echo the result (EASIER)... or both... :p
}

You can use it in two ways:

easier:  echo http://url/q?=$( rawurlencode "$args" )
faster:  rawurlencode "$args"; echo http://url/q?${REPLY}

[edited]

Here's the matching rawurldecode() function, which - with all modesty - is awesome.

# Returns a string in which the sequences with percent (%) signs followed by
# two hex digits have been replaced with literal characters.
rawurldecode() {

  # This is perhaps a risky gambit, but since all escape characters must be
  # encoded, we can replace %NN with \xNN and pass the lot to printf '%b',
  # which will decode hex for us

  printf -v REPLY '%b' "${1//%/\\x}" # You can either set a return variable (FASTER)

  echo "${REPLY}"  #+or echo the result (EASIER)... or both... :p
}

With the matching set, we can now perform some simple tests:

$ diff rawurlencode.inc.sh \
        <( rawurldecode "$( rawurlencode "$( cat rawurlencode.inc.sh )" )" ) \
        && echo Matched

Output: Matched
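As a smaller self-contained round trip (both functions reproduced from above; the multi-byte caveats discussed in the comments still apply, so this sticks to ASCII):

```shell
# Pure-Bash percent-encoder: unreserved characters pass through,
# everything else becomes %XX (lowercase hex).
rawurlencode() {
  local string="${1}"
  local strlen=${#string}
  local encoded=""
  local pos c o

  for (( pos=0 ; pos<strlen ; pos++ )); do
     c=${string:$pos:1}
     case "$c" in
        [-_.~a-zA-Z0-9] ) o="${c}" ;;
        * )               printf -v o '%%%02x' "'$c"
     esac
     encoded+="${o}"
  done
  echo "${encoded}"
  REPLY="${encoded}"
}

# Decoder: turn %NN into \xNN and let printf '%b' expand it.
rawurldecode() {
  printf -v REPLY '%b' "${1//%/\\x}"
  echo "${REPLY}"
}

rawurlencode 'a b&c/d'   # prints a%20b%26c%2fd
rawurldecode "$REPLY"    # prints a b&c/d
```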

And if you really really feel that you need an external tool (well, it will go a lot faster, and might do binary files and such...) I found this on my OpenWRT router...

replace_value=$(echo $replace_value | sed -f /usr/lib/ddns/url_escape.sed)

Where url_escape.sed was a file that contained these rules:

# sed url escaping
s:%:%25:g
s: :%20:g
s:<:%3C:g
s:>:%3E:g
s:#:%23:g
s:{:%7B:g
s:}:%7D:g
s:|:%7C:g
s:\\:%5C:g
s:\^:%5E:g
s:~:%7E:g
s:\[:%5B:g
s:\]:%5D:g
s:`:%60:g
s:;:%3B:g
s:/:%2F:g
s:?:%3F:g
s^:^%3A^g
s:@:%40:g
s:=:%3D:g
s:&:%26:g
s:\$:%24:g
s:\!:%21:g
s:\*:%2A:g
Orwellophile
  • The [current version](https://dev.openwrt.org/browser/packages/net/ddns-scripts/files/usr/lib/ddns/url_escape.sed) does not suffer from that bug. – user123444555621 Sep 02 '12 at 22:32
  • +1 for the bash implementation of rawurldecode, didn't know printf did %x or -v – nhed Sep 11 '12 at 17:37
  • @Pumbaa80 sorry, my fault for still running Backfire 10.03 r20728. Thanks for editing, I thought I was going to actually have to use my brain for a second :p – Orwellophile Sep 24 '12 at 13:02
  • @Pumbaa80: just checked the first edits (rawurldecode), but all tha red and green hurt my thinker - but the version that is up now is 100%. so if i previously wrote anything different to that, i was high on glue. thanks for catching it :) – Orwellophile Sep 28 '12 at 05:52
    Unfortunately, this script fails on some characters, such as 'é' and '½', outputting 'e%FFFFFFFFFFFFFFCC' and '%FFFFFFFFFFFFFFC2', respectively (b/c of the per-character loop, I believe). – Matthemattics Mar 24 '14 at 17:13
  • @Lübnah: You must be using BASH 3.x, works great in BASH 4. I don't know an easy way to fix it, i.e. _printf '%%%d' "'é"_ – Orwellophile Apr 11 '14 at 15:19
  • @Lübnah: sorry, i tried to find an quick fix for BASH 3, but I couldn't. I'm afraid you'll just have to use a slower method from elsewhere on this page. Or post a new question, asking for my answer to be made to work with BASH 3 :) – Orwellophile Apr 25 '14 at 09:24
    It fails to work for me in Bash 4.3.11(1). The string `Jogging «à l'Hèze»` generates `Jogging%20%abà%20l%27Hèze%bb` that cannot be feed to JS `decodeURIComponent` :( – dmcontador Nov 19 '15 at 12:07
    In that first block of code what does the last parameter to printf mean? That is, why is it double-quote, single-quote, dollar-sign, letter-c, double-quote? Does does the single-quote do? – Colin Fraizer May 19 '16 at 14:31
    @dmcontador - it's only a humble bash script, it has no conception of multi-byte characters, or unicode. When it see's a character like ń (`\u0144`) it will naively output %144, ╡(`\u2561`) will be output as %2561. The correct *rawurlencoded* answers for these would be %C5%84%0A and %E2%95%A1 respectively. – Orwellophile Jun 08 '16 at 09:49
    @ColinFraizer the single quote serves to convert the following character into its numeric value. ref. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html#tag_20_94_13 – Sam Nov 22 '18 at 22:37
  • @Matthematics, @dmcontador, @Orwellophile: The issue with special characters lays in `[A-Za-z0-9]`. I have no idea why, but `[A-Za-z]` matches characters with diacritics and `[0-9]` matches number-like characters too (e.g. `½`). Easy fix is to list all the Latin chars and arabic numbers char by char. See modified version of the script [here](https://gitlab.com/tukusejssirs/lnx_scripts/blob/master/bash_functions/rawurlencode.sh). And I have no idea what the minimal Bash version is, but it works on `v5.0.7(1)`. It can process `Jogging «à l'Hèze»`, `é` and `½` or any other chars with diacritics. – tukusejssirs Oct 13 '19 at 17:48
    @Matthematics, @dmcontador, @Orwellophile: I was wrong in my [previous comment](https://stackoverflow.com/questions/296536/how-to-urlencode-data-for-curl-command/10660730#comment103082857_10660730). [Solution](https://stackoverflow.com/a/7506695/3408342) using `xxd` is beter and works in any case (for any character). I have updated [my script](https://gitlab.com/tukusejssirs/lnx_scripts/blob/master/bash_functions/rawurlencode.sh). Anyway, it looks like the `rawurldecode()` function works exceptionally well. :) – tukusejssirs Oct 13 '19 at 21:42
    @tukusejssirs the link to your script has been updated to: https://gitlab.com/tukusejssirs/lnx_scripts/-/blob/master/bash/functions/rawurlencode.sh Moreover you are missing the character 'r', sent a PR: https://gitlab.com/tukusejssirs/lnx_scripts/-/merge_requests/1 Thanks! – whizzzkid Apr 08 '20 at 07:16
  • With proper UTF-8 encoding support: `rawurlencode() { local LANG=C ; local IFS= ; while read -n1 -r -d "$(echo -n "\000")" c ; do case "$c" in [-_.~a-zA-Z0-9]) echo -n "$c" ;; *) printf '%%%02x' "'$c" ;; esac ; done }`. Then: `echo -n "Jogging «à l'Hèze»." | rawurlencode` produces `Jogging%20%c2%ab%c3%a0%20l%27H%c3%a8ze%c2%bb.` as expected. – vladr Mar 01 '21 at 05:22

Another option is to use jq:

$ printf %s 'encode this'|jq -sRr @uri
encode%20this
$ jq -rn --arg x 'encode this' '$x|@uri'
encode%20this

-r (--raw-output) outputs the raw contents of strings instead of JSON string literals. -n (--null-input) doesn't read input from STDIN.

-R (--raw-input) treats input lines as strings instead of parsing them as JSON, and -sR (--slurp --raw-input) reads the input into a single string. You can replace -sRr with -Rr if your input only contains a single line, or if you don't want to replace linefeeds with %0A:

$ printf %s\\n 'multiple lines' 'of text'|jq -Rr @uri
multiple%20lines
of%20text
$ printf %s\\n 'multiple lines' 'of text'|jq -sRr @uri
multiple%20lines%0Aof%20text%0A
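Wrapped as a small shell helper — `urlencode_jq` is a name made up here; it uses `-sR` so embedded newlines come out as `%0A`, and `printf %s` avoids adding a trailing one:

```shell
# Sketch: percent-encode the first argument with jq's @uri filter.
urlencode_jq() {
  printf %s "$1" | jq -sRr @uri
}

urlencode_jq 'encode this'   # prints encode%20this
```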

Or this percent-encodes all bytes:

xxd -p|tr -d \\n|sed 's/../%&/g'
nisetama
    <3 it ... should be top & accepted IMO (yeah if you can tell `curl` to encode that works and if bash has a builtin that would have been acceptable - but `jq` seems like a right fit tho i'm far from attaining comfort level with this tool) – nhed Nov 16 '17 at 16:16
    for anyone wondering the same thing as me: `@uri` is not some variable, but a literal jq filter used for formatting strings and escaping; see [jq manual](https://stedolan.github.io/jq/manual) for details (sorry, no direct link, need to search for `@uri` on the page...) – ssc Jul 13 '18 at 11:48
  • the xxd version is just the kind of thing I was looking for. Even if it's a little dirty, it's short and has no dependencies – Rian Sanderson Nov 21 '18 at 15:08
    A sample usage of jq to url-encode: `printf "http://localhost:8082/" | jq -sRr '@uri' ` – Ashutosh Jindal Aug 07 '19 at 21:57
  • I think the only reason this isn't the top answer is because OP specifically asked about `curl`. You wouldn't loop in a second tool, `jq`, if curl can do it alone. However, this is _awesome_ general utility to pipe output to, in a script or on the command line. – Jameson Dec 09 '20 at 19:16

Use Perl's URI::Escape module and uri_escape function in the second line of your bash script:

...

value="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$2")"
...

Edit: Fix quoting problems, as suggested by Chris Johnsen in the comments. Thanks!

dubek
    URI::Escape might not be installed, check my answer in that case. – blueyed Nov 10 '09 at 19:50
  • I fixed this (use `echo`, pipe and `<>`), and now it works even when $2 contains an apostrophe or double-quotes. Thanks! – dubek Jan 03 '10 at 09:35
    You do away with `echo`, too: `value="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$2")"` – Chris Johnsen Jan 03 '10 at 10:31
    Chris Johnsen's version is better. I had ${True} in my test expression and using this via echo tripped up uri_escape / Perl variable expansion. – mm2001 Jan 07 '10 at 16:35
  • @thecoshman are you suggesting that you want to see an answer written in vanilla bash without any outside utilities, e.g. bash4, perl, awk, cat, cut, xxd, etc.? (see other answers below). (Virtually) anything you write will have some external/version dependency since vanilla bash is not that powerful. Might was well learn to live with that. Or perhaps, with herculean effort you could achieve it. Would you really want to do that rather than write something much, much simpler in perl/python/awk/etc.? – jrw32982 Aug 25 '14 at 22:35
  • @jrw32982 maybe I badly expressed my self. Things like sed, awk, cat are tools you can 99.999% guarantee are installed, perl however may not always be on your machine. Yes perl is a fine solution here, but isn't much help in the (not that unlikely) situation you don't/can't have perl (up tight admins for instance). – thecoshman Aug 26 '14 at 07:53
  • @thecoshman My point was that even scripts that use only the set of tools that you claim are in the 99.999% range will be subject to version differences from machine to machine. Witness the bash4 vs bash3 comments for Orwellophile's answer. So, I disagree that this is a poor answer due to using perl. It's just an answer with pre-reqs, just like virtually every answer will be. FWIW, having perl installed is in the 99%+ range in my experience on Linux, Solaris, AIX, HP/UX. YMMV. – jrw32982 Aug 26 '14 at 15:56
    @jrw32982 yeah, looking back at it, having another language with which to accomplish this task is good. If I could, I'd take back my downvote, but alas it is currently locked in. – thecoshman Aug 26 '14 at 18:36

One of the variants; it may be ugly, but it's simple:

urlencode() {
    local data
    if [[ $# != 1 ]]; then
        echo "Usage: $0 string-to-urlencode"
        return 1
    fi
    data="$(curl -s -o /dev/null -w %{url_effective} --get --data-urlencode "$1" "")"
    if [[ $? != 3 ]]; then
        echo "Unexpected error" 1>&2
        return 2
    fi
    echo "${data##/?}"
    return 0
}

Here is the one-liner version for example (as suggested by Bruno):

date | curl -Gso /dev/null -w %{url_effective} --data-urlencode @- "" | cut -c 3-

# If you experience the trailing %0A, use
date | curl -Gso /dev/null -w %{url_effective} --data-urlencode @- "" | sed -E 's/..(.*).../\1/'
Bruno Bronosky
Sergey
    I think this is a very clever way to reuse cURL's URL encoding. – solidsnack Oct 24 '12 at 15:17
    This is absolutely brilliant! I really wish you had left it a one line so that people can see how simple it really is. To URL encode the result of the `date` command… `date | curl -Gso /dev/null -w %{url_effective} --data-urlencode @- "" | cut -c 3-` (You have to `cut` the first 2 chars off, because curl's output is a technically a relative URL with a query string.) – Bruno Bronosky Mar 02 '13 at 03:07
  • I love how clever this is, but it doesn't seem to work with my curl 7.33.0. Works in a debian wheezy box that has curl 7.26.0. – dequis Jan 24 '14 at 16:37
  • @dequis, hmm -- if there's a bug, it's one that was fixed since then (works fine with a macports build of curl 7.48.0 on OS X). Can you isolate how behavior has changed? – Charles Duffy May 18 '16 at 20:49
  • @CharlesDuffy dunno what the bug was, sorry. Just tested with both 7.38.0 and 7.47.1 and it works. – dequis May 22 '16 at 03:05
    @BrunoBronosky Your one-liner variant is good but seemingly adds a "%0A" to the end of the encoding. Users beware. The function version does not seem to have this issue. – levigroker Aug 10 '16 at 17:25
    To avoid `%0A` at the end, use `printf` instead of `echo`. – kenorb May 02 '18 at 00:11
    the one liner is fantastic – Stephen Blum Aug 30 '18 at 23:31
    This approach is great but it relies on a `curl` command that returns a non-zero code. This can be a problem if `errexit` and `pipefail` options are active. I solved this particular issue by adding `|| true` to allow the one liner to fail: `printf "%s" "${1:-}" | curl -Gso /dev/null -w "%{url_effective}" --data-urlencode @- "" | cut -c 3- || true` – Alan T. Apr 12 '20 at 01:19

For the sake of completeness: many solutions using sed or awk only translate a special set of characters, and are hence quite large by code size, and also don't translate other special characters that should be encoded.

A safe way to urlencode would be to just encode every single byte - even those that would've been allowed.

echo -ne 'some random\nbytes' | xxd -plain | tr -d '\n' | sed 's/\(..\)/%\1/g'

xxd takes care here that the input is handled as bytes and not characters.

edit:

xxd comes with the vim-common package in Debian, and I was just on a system where it was not installed and I didn't want to install it. The alternative is to use hexdump from the bsdmainutils package in Debian. According to the following graph, bsdmainutils and vim-common should have about an equal likelihood of being installed:

http://qa.debian.org/popcon-png.php?packages=vim-common%2Cbsdmainutils&show_installed=1&want_legend=1&want_ticks=1

But nevertheless, here is a version which uses hexdump instead of xxd and avoids the tr call:

echo -ne 'some random\nbytes' | hexdump -v -e '/1 "%02x"' | sed 's/\(..\)/%\1/g'
josch
    `xxd -plain` should happen AFTER `tr -d '\n'` ! – qdii Jul 08 '12 at 16:24
    @qdii why? that would not only make it impossible to urlencode newlines but it would also wrongly insert newlines created by xxd into the output. – josch Jul 14 '12 at 16:26
    @josch. This is just plain wrong. First, any `\n` characters will be translated by `xxd -plain` into `0a`. Don’t take my word for it, try it yourself: `echo -n -e '\n' | xxd -plain` This proves that your `tr -d '\n'` is useless here as there cannot be any `\n` after `xxd -plain` Second, `echo foobar` adds its own `\n` character in the end of the character string, so `xxd -plain` is not fed with `foobar` as expected but with `foobar\n`. then `xxd -plain` translates it into some character string that ends in `0a`, making it unsuitable for the user. You could add `-n` to `echo` to solve it. – qdii Jul 14 '12 at 22:49
    @qdii indeed -n was missing for echo but the `xxd` call belongs in front of the `tr -d` call. It belongs there so that any newline in `foobar` is translated by `xxd`. The `tr -d` after the `xxd` call is to remove the newlines that xxd produces. It seems you never have foobar long enough so that `xxd` produces newlines but for long inputs it will. So the `tr -d` is necessary. In contrast to your assumption the `tr -d` was NOT to remove newlines from the input but from the `xxd` output. I want to keep the newlines in the input. Your only valid point is, that echo adds an unnecessary newline. – josch Jul 20 '12 at 09:44
    @qdii and no offence taken - I just think that you are wrong, except for the `echo -n` which I was indeed missing – josch Jul 20 '12 at 09:53

I find it more readable in python:

encoded_value=$(python -c "import urllib; print urllib.quote('''$value''')")

The triple ' ensures that single quotes in value won't hurt. urllib is in the standard library. It works, for example, for this crazy (real-world) url:

"http://www.rai.it/dl/audio/" "1264165523944Ho servito il re d'Inghilterra - Puntata 7
sandro
    I had some trouble with quotes and special chars with the triplequoting, this seemed to work for basically everything: encoded_value="$( echo -n "${data}" | python -c "import urllib; import sys; sys.stdout.write(urllib.quote(sys.stdin.read()))" )"; – Stop Slandering Monica Cellio Nov 14 '11 at 14:33
    Python 3 version would be `encoded_value=$(python3 -c "import urllib.parse; print (urllib.parse.quote('''$value'''))")`. – Creshal Nov 10 '13 at 11:33
  • The `urllib.parse.quote` does not encode forward slashes '/'. `urlencode() { python3 -c 'import urllib.parse; import sys; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1" }` – Evgeniy Generalov Apr 13 '14 at 08:47
    `python -c 'import urllib, sys; sys.stdout.writelines(urllib.quote_plus(l, safe="/\n") for l in sys.stdin)'` has almost no quoting problems, and *should* be memory/speed efficient (haven't checked, save for squinting) – Alois Mahdal Nov 07 '15 at 05:19
    It would be much safer to refer to `sys.argv` rather than substituting `$value` into a string later parsed as code. What if `value` contained `''' + __import__("os").system("rm -rf ~") + '''`? – Charles Duffy May 18 '16 at 20:45
    `python -c "import urllib;print urllib.quote(raw_input())" <<< "$data"` – Rockallite Feb 09 '17 at 08:02
  • It's saying edit queue is full but it would be appropriate to use python3, sys.argv as others have noted, and remove the example because it doesn't work. – Tobias Feil Jan 29 '21 at 14:12
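Following the comments above (Python 3, and passing the value via sys.argv rather than splicing it into the program text), a safer sketch — `urlencode_py` is a name invented here:

```shell
# Sketch: let Python's standard library do the encoding. The value travels
# in sys.argv, so quotes in it cannot break out of the Python source, and
# safe="" makes '/' get encoded too.
urlencode_py() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

urlencode_py "it's 100% safe"   # prints it%27s%20100%25%20safe
```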

I've found the following snippet useful to stick it into a chain of program calls, where URI::Escape might not be installed:

perl -p -e 's/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg'

(source)

MDMower
blueyed
    worked for me. I changed it to perl -lpe ... (the letter ell). This removed the trailing newline, which I needed for my purposes. – JohnnyLambada Oct 17 '12 at 18:52
    FYI, to do the inverse of this, use `perl -pe 's/\%(\w\w)/chr hex $1/ge'` (source: http://unix.stackexchange.com/questions/159253/decoding-url-encoding-percent-encoding) – Sridhar Sarnobat Nov 10 '15 at 19:46
    Depending on specifically which characters you need to encode, you can simplify this to `perl -pe 's/(\W)/sprintf("%%%02X", ord($1))/ge'` which allows letters, numbers, and underscores, but encodes everything else. – robru Mar 04 '16 at 09:30
    Thanks for response above! Since the use case is for curl: That is: `:` and `/` does not need encoding, my final function in my bashrc/zshrc is: `perl -lpe 's/([^A-Za-z0-9.\/:])/sprintf("%%%02X", ord($1))/seg` – Pham Dec 16 '20 at 04:32
  • where do you put the string to be encoded? – Tobias Feil Jan 29 '21 at 13:22
    @TobiasFeil it comes from stdin. – blueyed May 25 '21 at 10:05

If you wish to run a GET request and use pure curl, just add --get to @Jacob's solution.

Here is an example:

curl -v --get --data-urlencode "access_token=$(cat .fb_access_token)" https://graph.facebook.com/me/feed
Piotr Czapla

This may be the best one:

after=$(echo -e "$before" | od -An -tx1 | tr ' ' % | xargs printf "%s")
chenzhiwei
  • This works for me with two additions: 1. replace the -e with -n to avoid adding a newline to the end of the argument and 2. add '%%' to the printf string to put a % in front of each pair of hex digits. – Rob Fagen May 03 '16 at 23:26
  • works after add $ ahead bracket `after=$(echo -e ...` – Roman Rhrn Nesterov Sep 01 '16 at 08:22
    Please explain how this works. The `od` command is not common. – Mark Stosberg Nov 19 '18 at 00:47
  • This does not work with OS X's `od` because it uses a different output format than GNU `od`. For example `printf aa|od -An -tx1 -v|tr \ -` prints `-----------61--61--------------------------------------------------------` with OS X's `od` and `-61-61` with GNU `od`. You could use `od -An -tx1 -v|sed 's/ */ /g;s/ *$//'|tr \ %|tr -d \\n` with either OS X's `od` or GNU `od`. `xxd -p|sed 's/../%&/g'|tr -d \\n` does the same thing, even though `xxd` is not in POSIX but `od` is. – nisetama Jan 08 '19 at 11:59
    Although this might work, it escapes every single character – Charlie Oct 14 '19 at 08:25
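Putting the comment fixes together (no trailing newline, and a % before each hex pair), a sketch built on POSIX od so it avoids the xxd/hexdump availability question — `urlencode_od` is an invented name, and like the original it escapes every single byte:

```shell
# Sketch: dump every byte as two hex digits, strip od's spacing,
# then prefix each pair with %.
urlencode_od() {
  printf %s "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g'
}

urlencode_od 'a b'   # prints %61%20%62
```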

Direct link to the awk version: http://www.shelldorado.com/scripts/cmds/urlencode
I have used it for years and it works like a charm.

:
##########################################################################
# Title      :  urlencode - encode URL data
# Author     :  Heiner Steven (heiner.steven@odn.de)
# Date       :  2000-03-15
# Requires   :  awk
# Categories :  File Conversion, WWW, CGI
# SCCS-Id.   :  @(#) urlencode  1.4 06/10/29
##########################################################################
# Description
#   Encode data according to
#       RFC 1738: "Uniform Resource Locators (URL)" and
#       RFC 1866: "Hypertext Markup Language - 2.0" (HTML)
#
#   This encoding is used i.e. for the MIME type
#   "application/x-www-form-urlencoded"
#
# Notes
#    o  The default behaviour is not to encode the line endings. This
#   may not be what was intended, because the result will be
#   multiple lines of output (which cannot be used in an URL or a
#   HTTP "POST" request). If the desired output should be one
#   line, use the "-l" option.
#
#    o  The "-l" option assumes, that the end-of-line is denoted by
#   the character LF (ASCII 10). This is not true for Windows or
#   Mac systems, where the end of a line is denoted by the two
#   characters CR LF (ASCII 13 10).
#   We use this for symmetry; data processed in the following way:
#       cat | urlencode -l | urldecode -l
#   should (and will) result in the original data
#
#    o  Large lines (or binary files) will break many AWK
#       implementations. If you get the message
#       awk: record `...' too long
#        record number xxx
#   consider using GNU AWK (gawk).
#
#    o  urlencode will always terminate its output with an EOL
#       character
#
# Thanks to Stefan Brozinski for pointing out a bug related to non-standard
# locales.
#
# See also
#   urldecode
##########################################################################

PN=`basename "$0"`          # Program name
VER='1.4'

: ${AWK=awk}

Usage () {
    echo >&2 "$PN - encode URL data, $VER
usage: $PN [-l] [file ...]
    -l:  encode line endings (result will be one line of output)

The default is to encode each input line on its own."
    exit 1
}

Msg () {
    for MsgLine
    do echo "$PN: $MsgLine" >&2
    done
}

Fatal () { Msg "$@"; exit 1; }

set -- `getopt hl "$@" 2>/dev/null` || Usage
[ $# -lt 1 ] && Usage           # "getopt" detected an error

EncodeEOL=no
while [ $# -gt 0 ]
do
    case "$1" in
        -l) EncodeEOL=yes;;
    --) shift; break;;
    -h) Usage;;
    -*) Usage;;
    *)  break;;         # First file name
    esac
    shift
done

LANG=C  export LANG
$AWK '
    BEGIN {
    # We assume an awk implementation that is just plain dumb.
    # We will convert an character to its ASCII value with the
    # table ord[], and produce two-digit hexadecimal output
    # without the printf("%02X") feature.

    EOL = "%0A"     # "end of line" string (encoded)
    split ("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
    hextab [0] = 0
    for ( i=1; i<=255; ++i ) ord [ sprintf ("%c", i) "" ] = i + 0
    if ("'"$EncodeEOL"'" == "yes") EncodeEOL = 1; else EncodeEOL = 0
    }
    {
    encoded = ""
    for ( i=1; i<=length ($0); ++i ) {
        c = substr ($0, i, 1)
        if ( c ~ /[a-zA-Z0-9.-]/ ) {
        encoded = encoded c     # safe character
        } else if ( c == " " ) {
        encoded = encoded "+"   # special handling
        } else {
        # unsafe character, encode it as a two-digit hex-number
        lo = ord [c] % 16
        hi = int (ord [c] / 16);
        encoded = encoded "%" hextab [hi] hextab [lo]
        }
    }
    if ( EncodeEOL ) {
        printf ("%s", encoded EOL)
    } else {
        print encoded
    }
    }
    END {
        #if ( EncodeEOL ) print ""
    }
' "$@"
Pokechu22
MatthieuP

Here's a Bash solution which doesn't invoke any external programs:

uriencode() {
  s="${1//'%'/%25}"
  s="${s//' '/%20}"
  s="${s//'"'/%22}"
  s="${s//'#'/%23}"
  s="${s//'$'/%24}"
  s="${s//'&'/%26}"
  s="${s//'+'/%2B}"
  s="${s//','/%2C}"
  s="${s//'/'/%2F}"
  s="${s//':'/%3A}"
  s="${s//';'/%3B}"
  s="${s//'='/%3D}"
  s="${s//'?'/%3F}"
  s="${s//'@'/%40}"
  s="${s//'['/%5B}"
  s="${s//']'/%5D}"
  printf %s "$s"
}
davidchambers
    This behaves differently between the bash versions. On RHEL 6.9 the bash is 4.1.2 and it includes the single quotes. While Debian 9 and bash 4.4.12 is fine with the single quotes. For me removing the single quotes made it work on both. s="${s//','/%2C}" – Anton Krug May 23 '18 at 15:28
    I updated the answer to reflect your finding, @muni764. – davidchambers May 23 '18 at 21:01
  • Just a warning... this won't encode things like the character `á` – diogovk Apr 27 '20 at 19:27
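Usage sketch (function reproduced from above). As the comments warn, it only covers a fixed list of characters — anything not listed, such as á, passes through untouched:

```shell
# Pure-Bash encoder using parameter expansion; % must be replaced first
# so the later substitutions don't re-encode it.
uriencode() {
  s="${1//'%'/%25}"
  s="${s//' '/%20}"
  s="${s//'"'/%22}"
  s="${s//'#'/%23}"
  s="${s//'$'/%24}"
  s="${s//'&'/%26}"
  s="${s//'+'/%2B}"
  s="${s//','/%2C}"
  s="${s//'/'/%2F}"
  s="${s//':'/%3A}"
  s="${s//';'/%3B}"
  s="${s//'='/%3D}"
  s="${s//'?'/%3F}"
  s="${s//'@'/%40}"
  s="${s//'['/%5B}"
  s="${s//']'/%5D}"
  printf %s "$s"
}

uriencode 'a b?c'   # prints a%20b%3Fc
uriencode '100%'    # prints 100%25
```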
url=$(echo "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/!/%21/g' -e 's/"/%22/g' -e 's/#/%23/g' -e 's/\$/%24/g' -e 's/\&/%26/g' -e 's/'\''/%27/g' -e 's/(/%28/g' -e 's/)/%29/g' -e 's/\*/%2a/g' -e 's/+/%2b/g' -e 's/,/%2c/g' -e 's/-/%2d/g' -e 's/\./%2e/g' -e 's/\//%2f/g' -e 's/:/%3a/g' -e 's/;/%3b/g' -e 's/</%3c/g' -e 's/>/%3e/g' -e 's/?/%3f/g' -e 's/@/%40/g' -e 's/\[/%5b/g' -e 's/\\/%5c/g' -e 's/\]/%5d/g' -e 's/\^/%5e/g' -e 's/_/%5f/g' -e 's/`/%60/g' -e 's/{/%7b/g' -e 's/|/%7c/g' -e 's/}/%7d/g' -e 's/~/%7e/g')

This will encode the string inside of $1 and output it in $url, although you don't have to put it in a variable if you don't want to. BTW, I didn't include the sed for tab because I thought it would turn it into spaces.

Cody Gray
manoflinux
    I get the feeling this is *not* the recommended way to do this. – Cody Gray Jan 11 '11 at 13:27
    explain your feeling please.... because I what I have stated works and I have used it in several scripts so I know it works for all the chars I listed. so please explain why someone would not use my code and use perl since the title of this is "URLEncode from a bash script" not a perl script. – manoflinux Feb 08 '11 at 02:55
  • sometimes no pearl solution is needed so this can come in handy – Yuval Rimar Oct 31 '11 at 11:31
    This is not the recommended way to do this because blacklist is bad practice, and this is unicode unfriendly anyway. – Ekevoo Dec 20 '11 at 14:16
  • This was the most friendly solution compatible with cat file.txt – mrwaim Jan 20 '18 at 19:51

Using php from a shell script:

value="http://www.google.com"
encoded=$(php -r "echo rawurlencode('$value');")
# encoded = "http%3A%2F%2Fwww.google.com"
echo $(php -r "echo rawurldecode('$encoded');")
# returns: "http://www.google.com"
  1. http://www.php.net/manual/en/function.rawurlencode.php
  2. http://www.php.net/manual/en/function.rawurldecode.php
Darren Weber

uni2ascii is very handy:

$ echo -ne '你好世界' | uni2ascii -aJ
%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
kev
    This doesn't work for characters *inside* the ASCII range, that need quoting, like `%` and space (that last can be remedied with the `-s` flag) – Boldewyn Feb 07 '13 at 14:59

You can emulate JavaScript's encodeURIComponent in Perl. Here's the command:

perl -pe 's/([^a-zA-Z0-9_.!~*()'\''-])/sprintf("%%%02X", ord($1))/ge'

You could set this as a bash alias in .bash_profile:

alias encodeURIComponent='perl -pe '\''s/([^a-zA-Z0-9_.!~*()'\''\'\'''\''-])/sprintf("%%%02X",ord($1))/ge'\'

Now you can pipe into encodeURIComponent:

$ echo -n 'hèllo wôrld!' | encodeURIComponent
h%C3%A8llo%20w%C3%B4rld!
Klaus

Simple PHP option:

echo 'part-that-needs-encoding' | php -R 'echo urlencode($argn);'
Marcus Müller
Ryan

If you don't want to depend on Perl you can also use sed. It's a bit messy, as each character has to be escaped individually. Make a file with the following contents and call it urlencode.sed

s/%/%25/g
s/ /%20/g
s/\t/%09/g
s/!/%21/g
s/"/%22/g
s/#/%23/g
s/\$/%24/g
s/\&/%26/g
s/'/%27/g
s/(/%28/g
s/)/%29/g
s/\*/%2a/g
s/+/%2b/g
s/,/%2c/g
s/-/%2d/g
s/\./%2e/g
s/\//%2f/g
s/:/%3a/g
s/;/%3b/g
s/</%3c/g
s/>/%3e/g
s/?/%3f/g
s/@/%40/g
s/\[/%5b/g
s/\\/%5c/g
s/\]/%5d/g
s/\^/%5e/g
s/_/%5f/g
s/`/%60/g
s/{/%7b/g
s/|/%7c/g
s/}/%7d/g
s/~/%7e/g

To use it do the following.

STR1=$(echo "https://www.example.com/change&$ ^this to?%checkthe@-functionality" | cut -d\? -f1)
STR2=$(echo "https://www.example.com/change&$ ^this to?%checkthe@-functionality" | cut -d\? -f2)
OUT2=$(echo "$STR2" | sed -f urlencode.sed)
echo "$STR1?$OUT2"

This will split the string into a part that needs encoding, and the part that is fine, encode the part that needs it, then stitches back together.
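If creating a separate file feels like overkill, a stripped-down version of the same idea can be inlined with `-e` expressions (a sketch covering only a few characters, and `urlencode_sed` is a made-up name; note `%` must be replaced first so it doesn't re-encode the escapes it produces):

```shell
# inline sed variant: replace % first, then the other specials
urlencode_sed() {
  printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/&/%26/g' -e 's/=/%3d/g' -e 's/?/%3f/g'
}

urlencode_sed 'a b&c=d?'   # -> a%20b%26c%3dd%3f
```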

You can put that into a sh script for convenience, maybe have it take a parameter to encode, put it on your path and then you can just call:

urlencode https://www.example.com?isThisFun=HellNo

source

thecoshman
  • 7,772
  • 5
  • 52
  • 77
Jay
  • 39,240
  • 14
  • 62
  • 82
7

For those of you looking for a solution that doesn't need perl, here is one that only needs hexdump and awk:

url_encode() {
 [ $# -lt 1 ] && { return; }

 encodedurl="$1";

 # make sure hexdump exists, if not, just give back the url
 [ ! -x "/usr/bin/hexdump" ] && { return; }

 encodedurl=`
    echo "$encodedurl" | hexdump -v -e '1/1 "%02x\t"' -e '1/1 "%_c\n"' |
   LANG=C awk '
     $1 == "20"                    { printf("%s",   "+"); next } # space becomes plus
     $1 ~  /0[adAD]/               {                      next } # strip newlines
     $2 ~  /^[a-zA-Z0-9.*()\/-]$/  { printf("%s",   $2);  next } # pass through what we can
                                   { printf("%%%s", $1)        } # take hex value of everything else
   '`
}

Stitched together from a couple of places across the net and some local trial and error. It works great!

ataylor
  • 60,501
  • 18
  • 147
  • 181
Louis Marascio
  • 2,505
  • 22
  • 28
6

The question is about doing this in bash, and there's no need for Python or Perl, as there is in fact a single command that does exactly what you want: "urlencode".

value=$(urlencode "${2}")

This is also more robust: the Perl answer above, for example, doesn't encode all characters correctly. Try it with the long dash you get from Word and you get the wrong encoding.

Note, you need "gridsite-clients" installed to provide this command.

Dylan
  • 406
  • 3
  • 7
5

Here's the node version:

uriencode() {
  node -p "encodeURIComponent('${1//\'/\\\'}')"
}
davidchambers
  • 20,922
  • 14
  • 68
  • 95
  • 1
    Won't this break if there are any other characters in the string that aren't valid between single quotes, like a single backslash, or newlines? – Stuart P. Bentley Dec 31 '16 at 19:09
  • Good point. If we're to go to the trouble of escaping all the problematic characters in Bash we might as well perform the replacements directly and avoid `node` altogether. I posted a Bash-only solution. :) – davidchambers Jan 01 '17 at 02:46
  • 2
    This variant found elsewhere on the page avoids the quoting issue by reading the value from STDIN: `node -p 'encodeURIComponent(require("fs").readFileSync(0))'` – Mark Stosberg Nov 19 '18 at 01:02
4

Python 3 based on @sandro's good answer from 2010:

echo "Test & /me" | python -c "import urllib.parse;print (urllib.parse.quote(input()))"

Test%20%26%20/me
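`input()` strips the trailing newline here, but it chokes on multi-line values; a variant (my own sketch, not from the original answer) passes the value as an argument and avoids the pipe entirely. `safe=""` additionally encodes `/`, which `quote` leaves alone by default:

```shell
# argv-based variant; safe="" makes quote() also encode the slash
python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' 'Test & /me'
# -> Test%20%26%20%2Fme
```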

Wolfgang Fahl
  • 12,097
  • 9
  • 75
  • 150
3

Ruby, for completeness

value="$(ruby -r cgi -e 'puts CGI.escape(ARGV[0])' "$2")"
k107
  • 13,813
  • 7
  • 55
  • 56
3

Another php approach:

echo "encode me" | php -r "echo urlencode(file_get_contents('php://stdin'));"
jan halfar
  • 57
  • 3
3

Here is a POSIX function to do that:

url_encode() {
   awk 'BEGIN {
      for (n = 0; n < 128; n++) {
         m[sprintf("%c", n)] = n
      }
      n = 1
      while (1) {
         s = substr(ARGV[1], n, 1)
         if (s == "") {
            break
         }
         t = s ~ /[[:alnum:]_.!~*\47()-]/ ? t s : t sprintf("%%%02X", m[s])
         n++
      }
      print t
   }' "$1"
}

Example:

value=$(url_encode "$2")
Steven Penny
  • 82,115
  • 47
  • 308
  • 348
3

Here is my version for the busybox ash shell on an embedded system; I originally adapted Orwellophile's variant:

urlencode()
{
    local S="${1}"
    local encoded=""
    local ch
    local o
    for i in $(seq 0 $((${#S} - 1)) )
    do
        ch=${S:$i:1}
        case "${ch}" in
            [-_.~a-zA-Z0-9]) 
                o="${ch}"
                ;;
            *) 
                o=$(printf '%%%02x' "'$ch")                
                ;;
        esac
        encoded="${encoded}${o}"
    done
    echo "${encoded}"
}

urldecode() 
{
    # urldecode <string>
    local url_encoded="${1//+/ }"
    printf '%b' "${url_encoded//%/\\x}"
}
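The decode half works by rewriting every `%` as `\x` and letting `printf '%b'` expand the resulting hex escapes, e.g.:

```shell
# the trick urldecode above relies on, spelled out step by step
s='foo%20bar+baz'
s="${s//+/ }"                  # '+' back to space, as in urldecode
printf '%b\n' "${s//%/\\x}"    # -> foo bar baz
```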
nulleight
  • 606
  • 5
  • 11
3

What would parse URLs better than javascript?

node -p "encodeURIComponent('$url')"
Nestor Urquiza
  • 2,461
  • 24
  • 17
  • Out of op question scope. Not bash, not curl. Even if I'm sure works very good if node is available. – Cyrille Pontvieux Jul 27 '17 at 07:32
  • Why down-voting this and not the python/perl answers? Furthermore how this does not respond the original question "How to urlencode data for curl command?". This can be used from a bash script and the result can be given to a curl command. – Nestor Urquiza Jul 31 '17 at 11:54
  • I down-voted the others too. The question was how to do this in a bash script. If another language is used like node/js, python or perl, there is no need to use curl directly then. – Cyrille Pontvieux Aug 03 '17 at 14:35
  • There is no need to use curl if you have another language at your disposal but it does not mean you cannot use it. From the bash perspective curl is an external command just as node is. The solution I propose is to use node and curl inside a bash script. Yes you need a dependency but it is still bash. I am not proposing to do the whole work with node. Therefore this is a valid solution to the question "How to urlencode data *for curl* command?". The answer to the question is "urlencode the data with a node one-liner". – Nestor Urquiza Aug 04 '17 at 15:03
  • 4
    While I didn't bother to downvote, the problem with this command is that it requires data to be properly escaped for use in javascript. Like try it with single quotes and some backslash madness. If you want to use node, you better read stuff from stdin like `node -p 'encodeURIComponent(require("fs").readFileSync(0))'` – Michael Krelin - hacker Jan 06 '18 at 18:01
  • This is my favorite solution because my automation is a mix of bash and `Node.js`, so a dependency on `Node.js` is not a problem, and the solution is simply and readable. – Mark Stosberg Nov 19 '18 at 00:49
  • 1
    Be careful with @MichaelKrelin-hacker's solution if you are piping data in from STDIN make sure not to include a trailing newline. For example, `echo | ...` is wrong, while `echo -n | ...` suppresses the newline. – Mark Stosberg Nov 19 '18 at 00:57
  • @MarkStosberg, good point. If newline is undesirable, it may also be trimmed in javascript. – Michael Krelin - hacker Nov 19 '18 at 08:08
2

Here's a one-line conversion using Lua, similar to blueyed's answer except with all the RFC 3986 Unreserved Characters left unencoded (like this answer):

url=$(echo 'print((arg[1]:gsub("([^%w%-%.%_%~])",function(c)return("%%%02X"):format(c:byte())end)))' | lua - "$1")

Additionally, you may need to ensure that newlines in your string are converted from LF to CRLF, in which case you can insert a gsub("\r?\n", "\r\n") in the chain before the percent-encoding.

Here's a variant that, in the non-standard style of application/x-www-form-urlencoded, does that newline normalization, as well as encoding spaces as '+' instead of '%20' (which could probably be added to the Perl snippet using a similar technique).

url=$(echo 'print((arg[1]:gsub("\r?\n", "\r\n"):gsub("([^%w%-%.%_%~ ])",function(c)return("%%%02X"):format(c:byte())end):gsub(" ","+")))' | lua - "$1")
Community
  • 1
  • 1
Stuart P. Bentley
  • 8,777
  • 7
  • 48
  • 77
1

Having php installed I use this way:

URL_ENCODED_DATA=`php -r "echo urlencode('$DATA');"`
ajaest
  • 525
  • 4
  • 13
1

This is the ksh version of Orwellophile's answer, containing the rawurlencode and rawurldecode functions (link: How to urlencode data for curl command?). I don't have enough rep to post a comment, hence the new post.

#!/bin/ksh93

function rawurlencode
{
    typeset string="${1}"
    typeset strlen=${#string}
    typeset encoded=""

    for (( pos=0 ; pos<strlen ; pos++ )); do
        c=${string:$pos:1}
        case "$c" in
            [-_.~a-zA-Z0-9] ) o="${c}" ;;
            * )               o=$(printf '%%%02x' "'$c")
        esac
        encoded+="${o}"
    done
    print "${encoded}"
}

function rawurldecode
{
    printf '%s' "$(printf '%b' "${1//%/\\x}")"
}

print $(rawurlencode "C++")     # --> C%2b%2b
print $(rawurldecode "C%2b%2b") # --> C++
Community
  • 1
  • 1
Ray Burgemeestre
  • 1,220
  • 8
  • 12
1

This nodejs-based answer will use encodeURIComponent on stdin:

uriencode_stdin() {
    node -p 'encodeURIComponent(require("fs").readFileSync(0))'
}

echo -n $'hello\nwörld' | uriencode_stdin
hello%0Aw%C3%B6rld

masterxilo
  • 1,949
  • 1
  • 24
  • 29
0

The following is based on Orwellophile's answer, but solves the multibyte bug mentioned in the comments by setting LC_ALL=C (a trick from vte.sh). I've written it in the form of a function suitable for PROMPT_COMMAND, because that's how I use it.

print_path_url() {
  local LC_ALL=C
  local string="$PWD"
  local strlen=${#string}
  local encoded=""
  local pos c o

  for (( pos=0 ; pos<strlen ; pos++ )); do
     c=${string:$pos:1}
     case "$c" in
        [-_.~a-zA-Z0-9/] ) o="${c}" ;;
        * )               printf -v o '%%%02x' "'$c"
     esac
     encoded+="${o}"
  done
  printf "\033]7;file://%s%s\007" "${HOSTNAME:-}" "${encoded}"
}
0

For one of my cases I found that the NodeJS url lib had the simplest solution. Of course YMMV

$ urlencode(){ node -e "console.log(require('url').parse(process.argv.slice(1).join('+')).href)" "$@"; }

$ urlencode "https://example.com?my_database_has=these 'nasty' query strings in it"
https://example.com/?my_database_has=these%20%27nasty%27%20query%20strings%20in%20it
Bruno Bronosky
  • 54,357
  • 9
  • 132
  • 120
  • 1
    why the downvote? The solution might be inefficient, but definitely correct and not hand-crafted like others... – masterxilo Feb 11 '21 at 16:24
0

There is an excellent answer from Orwellophile, which includes a pure bash option (the rawurlencode function), which I've used on my website (a shell-based CGI script handling a large number of URLs in response to search requests). The only drawback was high CPU load during peak times.

I've found a modified solution leveraging bash's "global replace" feature. With this solution, processing time for the URL encode is about 5X faster. The solution identifies the characters that need escaping and uses the "global replace" operator (${var//source/replacement}) to process all substitutions. The speedup clearly comes from using bash's internal loops instead of an explicit per-character loop.

Performance: On core i3-8100 3.60Ghz. Test case: 1000 URL from stack overflow, similar to this ticket: "https://stackoverflow.com/questions/296536/how-to-urlencode-data-for-curl-command".

  • Existing Solution: 0.807 sec
  • Optimized Solution: 0.162 sec (5X speedup)
url_encode()
{
    local key="${1}" varname="${2:-_rval}" prefix="${3:-_ENCKEY_}"
    local unsafe=${key//[-_.~a-zA-Z0-9 ]/} 
    local -i key_len=${#unsafe}
    local ch ch1 ch0

    while [ "$unsafe" ] ;do
        ch=${unsafe:0:1}
        ch0="\\$ch"
        printf -v ch1 '%%%02x' "'$ch'" 
        key=${key//$ch0/"$ch1"}
        unsafe=${unsafe//"$ch0"}
    done
    key=${key// /+} 

    REPLY="$key"
    # printf "%s" "$REPLY"
    return 0
}

As a minor extra, it uses '+' to encode spaces, yielding a slightly more compact URL.

Benchmark:

function t {
    local key
    for (( i=1 ; i<=$1 ; i++ )) do url_encode "$2" kkk2 ; done
    echo "K=$REPLY"
}

t 1000 "https://stackoverflow.com/questions/296536/how-to-urlencode-data-for-curl-command"

dash-o
  • 11,968
  • 1
  • 4
  • 29
0

Note

  • These functions are NOT meant to encode a URL's data, but entire URLs.
  • Put the URLs in a file, one per line.
#!/bin/dash

replaceUnicodes () { # $1=input/output file
    if ! mv -f "$1" "$1".tmp 2>/dev/null; then return 1; fi
    output="$1" awk '
    function hexValue(chr) {
        if(chr=="0") return 0; if(chr=="1") return 1; if(chr=="2") return 2; if(chr=="3") return 3; if(chr=="4") return 4; if(chr=="5") return 5;
        if(chr=="6") return 6; if(chr=="7") return 7; if(chr=="8") return 8; if(chr=="9") return 9; if(chr=="A") return 10;
        if(chr=="B") return 11; if(chr=="C") return 12; if(chr=="D") return 13; if(chr=="E") return 14; return 15 }
    function hexToDecimal(str,  value,i,inc) {
        str=toupper(str); value=and(hexValue(substr(str,length(str),1)),15); inc=1;
        for(i=length(str)-1;i>0;i--) {
            value+=lshift(hexValue(substr(str,i,1)),4*inc++)
        } return value }
    function toDecimal(str, value,i) {
        for(i=1;i<=length(str);i++) {
            value=(value*10)+substr(str,i,1)
        } return value }
    function to32BE(high,low) {
        # return 0x10000+((high-0xD800)*0x400)+(low-0xDC00) }
        return lshift((high-0xD800),10)+(low-0xDC00)+0x10000 }
    function toUTF8(value) {
        if(value<0x80) { 
            return sprintf("%%%02X",value)
        } else if(value>0xFFFF) {
            return sprintf("%%%02X%%%02X%%%02X%%%02X",or(0xF0,and(rshift(value,18),0x07)),or(0x80,and(rshift(value,12),0x3F)),or(0x80,and(rshift(value,6),0x3F)),or(0x80,and(rshift(value,0),0x3F)))
        } else if(value>0x07FF) {
            return sprintf("%%%02X%%%02X%%%02X",or(0xE0,and(rshift(value,12),0x0F)),or(0x80,and(rshift(value,6),0x3F)),or(0x80,and(rshift(value,0),0x3F)))
        } else { return sprintf("%%%02X%%%02X",or(0xC0,and(rshift(value,6),0x1F)),or(0x80,and(rshift(value,0),0x3F))) }
    }
    function trap(str) { sub(/^\\+/,"\\",str); return str }
    function esc(str) { gsub(/\\/,"\\\\",str); return str }
    BEGIN { output=ENVIRON["output"] }
    {
        finalStr=""; while(match($0,/[\\]+u[0-9a-fA-F]{4}/)) {
            p=substr($0,RSTART,RLENGTH); num=hexToDecimal(substr(p,RLENGTH-3,4));
            bfrStr=substr($0,1,RSTART-1); $0=substr($0,RSTART+RLENGTH,length($0)-(RSTART+RLENGTH-1));
            if(surrogate) {
                surrogate=0;
                if(RSTART!=1 || num<0xD800 || (num>0xDBFF && num<0xDC00) || num>0xDFFF) {
                    finalStr=sprintf("%s%s%s%s",finalStr,trap(highP),bfrStr,toUTF8(num))
                } else if(num>0xD7FF && num<0xDC00) {
                    surrogate=1; high=num; finalStr=sprintf("%s%s",finalStr,trap(highP))
                } else { finalStr=sprintf("%s%s",finalStr,toUTF8(to32BE(high,num))) }
            } else if(num>0xD7FF && num<0xDC00) {
                surrogate=1; highP=p; high=num; finalStr=sprintf("%s%s",finalStr,bfrStr)
            } else { finalStr=sprintf("%s%s%s",finalStr,bfrStr,toUTF8(num)) }
        } finalStr=sprintf("%s%s",finalStr,$0); $0=finalStr

        while(match($0,/[\\]+U[0-9a-fA-F]{8}/)) {
            str=substr($0,RSTART,RLENGTH); gsub(esc(str),toUTF8(hexToDecimal(substr(str,RLENGTH-7,8))),$0)
        }
        while(match($0,/[\\]*&#[xX][0-9a-fA-F]{1,8};/)) {
            str=substr($0,RSTART,RLENGTH); idx=index(str,"#");
            gsub(esc(str),toUTF8(hexToDecimal(substr(str,idx+2,RLENGTH-idx-2))),$0)
        }
        while(match($0,/[\\]*&#[0-9]{1,10};/)) {
            str=substr($0,RSTART,RLENGTH); idx=index(str,"#");
            gsub(esc(str),toUTF8(toDecimal(substr(str,idx+1,RLENGTH-idx-1))),$0)
        }
        printf("%s\n",$0) > output
    }' "$1".tmp
    rm -f "$1".tmp
}

replaceHtmlEntities () { # $1=input/output file
    if ! mv -f "$1" "$1".tmp 2>/dev/null; then return 1; fi
    sed 's/%3[aA]/:/g; s/%2[fF]/\//g; s/&quot;/%22/g; s/&lt;/%3C/g; s/&gt;/%3E/g; s/&nbsp;/%A0/g; s/&cent;/%A2/g; s/&pound;/%A3/g; s/&yen;/%A5/g; s/&copy;/%A9/g; s/&reg;/%AE/g; s/&amp;/\&/g; s/\\*\//\//g' "$1".tmp > "$1"
    rm -f "$1".tmp
}


# "od -v -A n -t u1 -w99999999"
# "hexdump -v -e \47/1 \42%d \42\47"
# Reminder :: Do not encode (, ), [, and ].
toUTF8Encoded () { # $1=input/output file
    if ! mv -f "$1" "$1".tmp 2>/dev/null; then return 1; fi
    if [ -s "$1".tmp ]; then
        # od -A n -t u1 -w99999999 "$1".tmp | \
        hexdump -v -e '/1 "%d "' "$1".tmp | \
        output="$1" awk 'function hexDigit(chr) { if((chr>47 && chr<58) || (chr>64 && chr<71) || (chr>96 && chr<103)) return 1; return 0 }
        BEGIN { output=ENVIRON["output"] }
        {   for(i=1;i<=NF;i++) {
                flushed=0; c=$(i);
                if(c==13) { if($(i+1)==10) i++; printf("%s\n",url) > output; url=""; flushed=1
                } else if(c==10) { printf("%s\n",url) > output; url=""; flushed=1
                } else if(c==37) {
                    if(hexDigit($(i+1)) && hexDigit($(i+2))) {
                        url=sprintf("%s%%%c%c",url,$(i+1),$(i+2)); i+=2
                    } else { url=sprintf("%s%%25",url) }
                } else if(c>32 && c<127 && c!=34 && c!=39 && c!=96 && c!=60 && c!=62) {
                    url=sprintf("%s%c",url,c)
                } else { url=sprintf("%s%%%02X",url,c) }
            } if(!flushed) printf("%s\n",url) > output
        }'
    fi
    rm -f "$1".tmp
}

Call replaceUnicodes() --> replaceHtmlEntities() --> toUTF8Encoded()

Darkman
  • 763
  • 3
  • 8