
I have some image data stored in a PostgreSQL database table column as bytea. I also have metadata for interpreting it, the relevant pieces being the image dimensions and the numeric class. Classes include int16 and uint16. I cannot find any information on interpreting signed/unsigned integers correctly in R.

I am using RPostgreSQL to pull the data into R and I want to view the image in R.

MWE:

# fakeDataQuery <- dbGetQuery(conn, 
#     'select byteArray, ImageSize, ImageClass from table where id = 1')

# Example 1 (no negative numbers)
# the actual byte array shown in octal sequences in pgadmin (1.22.2) Query Output is: 
# "\001\000\002\000\003\000\004\000\005\000\006\000\007\000\010\000\011\000"

# but RPostgreSQL returns the hex-encoded version:
byteArray <- "\\x010002000300040005000600070008000900"
ImageSize <- c(3, 3, 1)
ImageClass <- 'int16'

# expected result 
> array(c(1,2,3,4,5,6,7,8,9), dim=c(3,3,1))
#   , , 1
#
#        [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9

# Example 2 (with negative numbers)
byteArray <- "\\xffff00000100020003000400050006000700080009000a00"
ImageSize <- c(3, 4, 1)
ImageClass <- 'int16'
# expectedResult 
> array(c(-1,0,1,2,3,4,5,6,7,8,9,10), dim=c(3,4,1))
#, , 1
#
#     [,1] [,2] [,3] [,4]
#[1,]   -1    2    5    8
#[2,]    0    3    6    9
#[3,]    1    4    7   10

What I've tried:

The bytea data from PostgreSQL is a long character string of digits encoded as "hex", which you can tell by the \\x prepended to it (I believe the extra \ escapes the existing one): https://www.postgresql.org/docs/9.1/static/datatype-binary.html (see section 8.4.1, 'bytea Hex Format')

Decode 'hex' back to the original type ('int16' based on ImageClass)

Per the same URL above, hex encoding uses '2 hexadecimal digits per byte', so I need to split the encoded byteArray into two-character substrings, see: this link

# remove the \\x hex encoding indicator(s) added by PostgreSQL
byteArray <- gsub("\\x", "", x = byteArray, fixed=T)

l <- 2  # hex digits per byte (substring length)
byteArray <- strsplit(trimws(gsub(pattern = paste0("(.{",l,"})"), 
                                  replacement = "\\1 ", 
                                  x = byteArray)), 
                      " ")[[1]]

# the bytes appear in the opposite order than expected (little-endian byte order)
# Ex: 1 is stored as '0100' rather than '0001'
# so swap each pair of bytes (int16 specific)
byteArray <- paste0(byteArray[c(F,T)],byteArray[c(T,F)])

# strtoi() converts a vector of hex strings to integers, given base = 16L
byteArray <- strtoi(byteArray, 16L)

# now make it into an n x m x s array,
# e.g., 512 x 512 x (# slices)
V = array(byteArray, dim = ImageSize)

There are two problems with this solution:

  1. It does not work with signed types, so negative integer values are interpreted as unsigned values (e.g., 'ffff' is -1 as int16 but 65535 as uint16, and strtoi() will always return 65535); see the sketch after this list.
  2. It is currently coded for int16 only, and would need a little extra code to work with other types (e.g., int32, int64).
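
For what it's worth, a manual two's-complement adjustment would fix problem 1 for int16 specifically (a rough sketch continuing the code above), but it feels hacky and still leaves problem 2:

# hypothetical continuation: the input is the integer vector produced by strtoi() above
toSigned16 <- function(x) ifelse(x > 32767L, x - 65536L, x)
toSigned16(c(65535L, 0L, 1L))
# -1  0  1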

Anyone have a solution that would work with signed types?

Brian D
1 Answer


You can start with a conversion function that splits the long hex string into byte pairs (here using a fast C++ helper in place of a slower strsplit()), turns them into a raw vector, and then use readBin() on the result:

byteArray <- "\\xffff00000100020003000400050006000700080009000a00"

## Split a long string into a vector of character pairs
Rcpp::cppFunction( code = '
CharacterVector strsplit2(const std::string& hex) {
  unsigned int length = hex.length()/2;
  CharacterVector res(length);
  for (unsigned int i = 0; i < length; ++i) {
    res(i) = hex.substr(2*i, 2);
  }
  return res;
}')

## A function to convert one string to an array of raw
f <- function(x)  {
  ## Split a long string into a vector of character pairs
  x <- strsplit2(x)
  ## Remove the first element, "\\x"
  x <- x[-1]
  ## Complete the conversion
  as.raw(as.hexmode(x))
}
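
If you want to avoid the Rcpp dependency, a plain base-R split does the same job (a rough variation on the fixed-width splitting approaches discussed in the comments below; strsplit2() above is just faster):

## Base-R alternative to strsplit2(): chop the string into 2-character pieces
strsplit2_base <- function(hex) {
  starts <- seq(1, nchar(hex) - 1, by = 2)
  substring(hex, starts, starts + 1)
}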

raw <- f(byteArray)
# int16
readBin(con = raw,
        what = "integer",
        n = length(raw) / 2,
        size = 2,
        signed = TRUE,
        endian = "little")
# -1  0  1  2  3  4  5  6  7  8  9 10
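
From here, reshaping into the array from the question is just a matter of applying the ImageSize metadata, e.g. for Example 2:

V <- array(readBin(con = raw, what = "integer", n = length(raw) / 2,
                   size = 2, signed = TRUE, endian = "little"),
           dim = c(3, 4, 1))  # ImageSize from Example 2
V[, , 1]
#      [,1] [,2] [,3] [,4]
# [1,]   -1    2    5    8
# [2,]    0    3    6    9
# [3,]    1    4    7   10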

# uint16
readBin(con = raw,
        what = "integer",
        n = length(raw) / 2,
        size = 2,
        signed = FALSE,
        endian = "little")
# 65535     0     1     2     3     4     5     6     7     8     9    10

# int32
readBin(con = raw,
        what = "integer",
        n = length(raw) / 4,
        size = 4,
        signed = TRUE,
        endian = "little")
# 65535 131073 262147 393221 524295 655369

This won't work for uint32 and (u)int64, though, since R uses 32-bit integers internally. However, R can also use numerics (doubles) to store integers exactly up to about 2^52. So we can use this:

# uint32
byteArray <- "\\xffffffff0100020003000400050006000700080009000a00"
raw <- f(byteArray)
int32 <- readBin(con = raw,
                 what = "integer",
                 n = length(raw) / 4,
                 size = 4,
                 signed = TRUE,
                 endian = "little")

ifelse(int32 < 0, int32 + 2^32, int32)
# 4294967295     131073     262147     393221     524295     655369
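
Putting the pieces together, here is a rough wrapper that picks the readBin() arguments from the ImageClass metadata in the question (the function name and the set of supported classes are just a suggestion):

decodeBytea <- function(byteArray, ImageClass) {
  r <- f(byteArray)  # hex string -> raw vector, as defined above
  switch(ImageClass,
         int16  = readBin(r, "integer", n = length(r) / 2, size = 2,
                          signed = TRUE,  endian = "little"),
         uint16 = readBin(r, "integer", n = length(r) / 2, size = 2,
                          signed = FALSE, endian = "little"),
         int32  = readBin(r, "integer", n = length(r) / 4, size = 4,
                          signed = TRUE,  endian = "little"),
         uint32 = {
           x <- readBin(r, "integer", n = length(r) / 4, size = 4,
                        signed = TRUE, endian = "little")
           ifelse(x < 0, x + 2^32, x)  # stored as numeric, see note above
         },
         stop("unsupported ImageClass: ", ImageClass))
}

# e.g. V <- array(decodeBytea(byteArray, ImageClass), dim = ImageSize)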

And for gzip compressed data:

# gzip
byteArray <- "\\x1f8b080000000000000005c1870100200800209a56faffbd41d30dd3b285e37a52f9d033018818000000"
con <- gzcon(rawConnection(f(byteArray)))
readBin(con = con,
        what = "integer",
        n = length(raw) / 2,
        size = 2,
        signed = TRUE,
        endian = "little")
close(con = con)

Since this is a real connection, we have to make sure to close it.
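
If you call this repeatedly (e.g. when benchmarking, as in the comments below), it may be worth wrapping the connection handling in a small helper so the connection is always closed, even on error (a sketch using on.exit(); the n value is just an upper bound):

readGzBytea <- function(byteArray, size = 2, signed = TRUE) {
  con <- gzcon(rawConnection(f(byteArray)))
  on.exit(close(con), add = TRUE)  # guarantee the connection is released
  # n is only an upper bound; readBin() stops at the end of the stream
  readBin(con, what = "integer", n = 1e6,
          size = size, signed = signed, endian = "little")
}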

Ralf Stubner
  • What if the byteArray is gzip compressed? e.g. `\\x1f8b080000000000000005c1870100200800209a56faffbd41d30dd3b285e37a52f9d033018818000000` – Brian D Jun 25 '18 at 17:18
  • @BrianD Interesting question. You can use `readBin(con = gzcon(rawConnection(f(byteArray))), ...)`. – Ralf Stubner Jun 25 '18 at 17:30
  • I was trying to use `gzfile(f(byteArray))` (per this thread: https://stackoverflow.com/questions/5764499/decompress-gz-file-using-r), and attempted using `memDecompress()` and was wondering why none of them worked. `gzcon` and `rawConnection` were the missing links. Thanks! – Brian D Jun 25 '18 at 18:20
  • so it turns out that the `strsplit()` in `f()` breaks with a very large vector (e.g., 512x512). It pegs the CPU, never returns, and I have to kill the process or restart R. However, if I use the version of the `strsplit()` in my question it works perfectly fine. – Brian D Jun 26 '18 at 17:36
  • I was doing some microbenchmarking of compressed vs. uncompressed byteArrays and found that those connections persist... eventually bumping up against R's limit of 128 connections, and causing an error. So I split that out into `c – Brian D Jun 26 '18 at 20:03
  • @BrianD Thanks, I have incorporated your suggestions. – Ralf Stubner Jun 26 '18 at 21:15
  • Apparently there's an even faster version of the string split, see @GSee answer benchmarking at: https://stackoverflow.com/questions/2247045/chopping-a-string-into-a-vector-of-fixed-width-character-elements – Brian D Jun 28 '18 at 20:50
  • @BrianD That version is so crazy I had to include it! – Ralf Stubner Jun 28 '18 at 20:59
  • I like it. But I wonder if there is a way to do this without `f()`, just process the string without splitting it into an array of two-character-length hexes. This workflow is more inefficient than I was expecting with very large strings. For a single string with 518k characters it takes 90ms (10 strsplit, 40 paste0, 10 subset, 10 as.hexmode, 10 readBin). But if there was a way to just go straight to `readBin()`... – Brian D Jun 28 '18 at 22:10
  • @BrianD One can go to C++. See the update. Here I only go the first step, i.e. implement the string splitting. One could probably do more of the conversion in C++, but I am not sure it is worth it. – Ralf Stubner Jun 29 '18 at 09:35
  • Nice! That little C++ function cut down the processing time by half. – Brian D Jul 05 '18 at 17:43