88

Consider the following number.

$1631310734315390891207403279946696528907777175176794464896666909137684785971138$ $2649033004075188224$

This is a $98$-digit decimal number, yet it can be written as $424^{37}$, using just $5$ digits.
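One can check this claim with a couple of lines of Python, for example (a minimal sketch, nothing more):

```python
# Verify that 424**37 is a 98-digit number and compare its leading digits
# with the number quoted above.
n = 424 ** 37
print(len(str(n)))    # 98
print(str(n)[:20])    # should match the leading digits quoted above
```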

Or consider this number:

$1690735149233357049107817709433863585132662626950821430178417760728661153414605$ $2484795771275896190661372675631981127803129649521785142469703500691286617000071$ $8058938908895318046488014239482587502405094704563355293891175819575253800433524$ $5277559791129790156439596789351751130805731546751249418933225268643093524912185$ $5914917866181252548011072665616976069886958296149475308550144566145651839224313$ $3318400757678300223742779393224526956540729201436933362390428757552466287676706$ $382965998179063631507434507871764226500558776264$

This $200$-digit decimal number can be expressed simply as $\log_e 56$: discard the first $6$ digits of the decimal expansion of $\log_e 56$ and take the $200$ digits that follow.
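Likewise, assuming Python with the mpmath library, one can reproduce this construction (a small sketch for illustration):

```python
# Compute ln(56) to high precision, drop the leading "402535", and keep the
# digits that follow; they should reproduce the number quoted above.
from mpmath import mp, log, nstr

mp.dps = 550                                   # plenty of working precision
digits = nstr(log(56), 540).replace(".", "")   # "4025351690735149..."
print(digits[:6])                              # 402535  (ln 56 = 4.02535...)
print(digits[6:206])                           # the next 200 digits
```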

Now the question is: is it theoretically possible to represent any and every huge random number using only very few characters?

Also, is there any standard way to reduce a number like this mathematically?

Harsh Kumar
endrendum

5 Answers

164

No. The problem is very simple: there are more huge numbers than abbreviated forms. Suppose that I allow you to use the numerals 0-9, the English alphabet, and spaces to describe whatever numbers you want; that's still only $37$ characters. There are only $37^{140}$ expressions you can write down using these characters that are $140$ characters or shorter (the length of a tweet!), so there are only $37^{140}$ numbers that you could conceivably tweet to someone else. In particular, since $\log_{10} 37^{140} \approx 219.5$, this implies that there is at least one number with $220$ digits which is impossible to tweet to someone else without using something other than numerals and English.

All right, I'll be even more generous: I'll let you use all $127$ ASCII characters. That means there are $127^{140}$ possible tweets, some of which mathematically describe some kind of number, and since $\log_{10} 127^{140} \approx 294.5$, there is at least one number with $295$ digits which is impossible to tweet to someone else (again, assuming that Twitter uses ASCII). I could give you the entire Unicode alphabet and you still wouldn't be able to tweet, for example, the kind of prime numbers that public-key cryptosystems use.
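Here is a minimal sketch of this counting argument (assuming Python; it just redoes the arithmetic above):

```python
import math

# Compare the number of possible 140-character tweets over an alphabet with
# the number of integers having a given digit count.
for alphabet_size in (37, 127):
    tweets = alphabet_size ** 140                                   # strings of length 140
    digit_count = math.floor(140 * math.log10(alphabet_size)) + 1   # 220 and 295
    numbers_with_that_many_digits = 9 * 10 ** (digit_count - 1)
    print(alphabet_size, digit_count, numbers_with_that_many_digits > tweets)  # ... True
```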

These kinds of questions are studied in computer science and information theory; you will find the term Kolmogorov complexity tossed around, as well as the term entropy.

badp
Qiaochu Yuan
  • 34
    In fact, one gets a result potentially more frightening: if I keep increasing the number of digits I run into more and more numbers, far fewer of which could possibly be tweeted. For example, more than 99% of all numbers with 297 digits cannot possibly be tweeted. This leads to a very basic and inescapable conclusion: the average number is hard to describe. People often say instead that the average string has high Kolmogorov complexity. – Qiaochu Yuan Aug 26 '10 at 21:56
  • 35
    +1, not just for using Twitter as a universal standard of "very few characters", but also for the surprising feat of never mentioning the word "pigeonhole" at all. –  Aug 26 '10 at 22:02
  • 2
Aside, entirely unrelated to mathematics: Twitter lets you use all Unicode characters, not just ASCII. Take a glance at [Mona Tweeta](http://www.flickr.com/photos/quasimondo/3518306770/page2/)/ [others](http://stackoverflow.com/questions/891643/twitter-image-encoding-challenge). – ShreevatsaR Aug 27 '10 at 04:08
  • 6
    @ShreevatsaR: all right! There are apparently 95221 characters in Unicode 3.2. I leave it as an exercise to the reader to repeat the above computation with this number. – Qiaochu Yuan Aug 27 '10 at 04:45
  • 5
    @Qiaochu, this is a really great answer. I would rank it in the top 5 I have seen on this site. It is elegant, well presented, and of course, accurate. In short: +1 – BBischof Aug 28 '10 at 08:35
  • Twitter allows all unicode codepoints, by the way :) It doesn't matter because no finite amount of bits per character are going to be enough, obviously. – badp Oct 26 '10 at 10:45
  • 3
    ASCII has 128 characters (0-127), BTW. –  Oct 30 '10 at 05:53
  • 1
You can, and should, extend this to Unicode characters. While it would result in a rather larger number of numbers that can be tweeted, as the [Frivolous Theorem of Arithmetic](http://mathworld.wolfram.com/FrivolousTheoremofArithmetic.html) tells us, this is still a mighty small number of numbers. – Asaf Karagila Nov 03 '11 at 23:43
  • See also my comment about pointers and the reasonable assumption that the character space is pragmatically unbounded. These are the kinds of problems that Internet engineers think about regularly and routinely route around in their designs. – MrGomez Nov 04 '11 at 18:44
  • 7
Another interesting fact: your computer's display can show only a finite number of different pictures. Indeed, the number of pixels is finite (for example $1920\cdot 1080$) and the number of different colors is finite (for example $2^{32}$). So your computer can show just $(2^{32})^{1920\cdot 1080}$ different pictures. – Norbert Nov 19 '11 at 08:57
  • 1
@Norbert: This raises further questions. What is the probability that a given array of pixels is an understandable image to human beings? From that percentage, remove the 50% pornography and LOLcats. When will we finish using the image space left, and no more new HD data will be created? :) – Asaf Karagila Sep 12 '13 at 06:36
  • @AsafKaragila retina display is already announced, so this process will never end – Norbert Sep 12 '13 at 06:54
  • @Norbert: One can ask an information theory sort of question: if the amount of "immediately observable" information in a 4K and an HD image is the same, or at least the increase of the former gets smaller as we step into larger [flat] photos, when will be the point we don't have any new images to show? :-) – Asaf Karagila Sep 12 '13 at 07:23
  • 1
    @Asaf your estimate of only 50% pornography and LOLcats is [very, very optimistic](http://xkcd.com/468/). – This site has become a dump. Sep 15 '13 at 20:59
  • @QiaochuYuan Is this only for lack of operators? For example you are only using exponentiation here. If allowed to use offsets (+, -) and various other operators, would it not become easier to represent more large numbers in a smaller data size? – Asad Saeeduddin Mar 15 '17 at 09:40
  • @Asad: I'm using exponentiation to count numbers of possible expressions, which is orthogonal to what operations I'm allowing you to describe with those expressions. Any fixed language in which each expression has a unique meaningful value succumbs to the counting argument above; it is extremely general. I can let you use Ackermann functions or whatever. – Qiaochu Yuan Mar 15 '17 at 18:47
  • Thanks, makes sense now. I wasn't thinking of it in terms of the pigeonhole principle before but it seems fairly obvious now. – Asad Saeeduddin Mar 16 '17 at 13:07
28

It follows from the pigeonhole principle that any lossless compression algorithm that decreases the size of at least one number must increase the size of another - else the compression wouldn't be invertible. For deeper results consult any textbook on algorithmic information theory (Kolmogorov complexity), e.g. Li and Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications.
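For a concrete version of the counting behind this, here is a minimal sketch (assuming Python; the bit-string framing is just for illustration):

```python
# There are 2**n bit strings of length n but only 2**n - 1 strings of length
# strictly less than n, so no invertible (lossless) code can shorten them all.
def strings_shorter_than(n):
    return sum(2 ** k for k in range(n))   # 2**0 + 2**1 + ... + 2**(n-1) = 2**n - 1

for n in (8, 64, 256):
    print(n, 2 ** n > strings_shorter_than(n))   # always True
```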

Bill Dubuque
5

As in other answers, at the very least a pigeon-hole principle shows that "most" numbers cannot be "described" in fewer characters than their decimal (or whatever-you-pick) expression...

To my mind, the relevant developed body of ideas is "Solomonoff-Chaitin-Kolmogorov" complexity, which is about descriptional (or program-length) complexity, rather than run-time "complexity".

This does remind one of the "smallest boring number" pseudo-paradox, which argues that the first non-interesting number has some interest because it is the first...

The bottom line is that "size" of numbers is not reliably comparable to "descriptional complexity", in general, although most large-ish numbers are also descriptionally complex, by pigeon-hole.
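For instance (a small illustration, assuming Python), a seven-character expression can name a number whose decimal expansion no one could ever write out, while a typical 100-digit number has no description much shorter than its 100 digits:

```python
import math

# Digit count of 9**9**9 (i.e. 9**(9**9)), computed without building the number itself.
digits = math.floor(9 ** 9 * math.log10(9)) + 1
print(digits)   # 369693100 -- over 369 million digits from a 7-character expression
```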

There is a book by Li and Vitanyi which is not only authoritative, but fascinatingly readable...

paul garrett
1

The way I understand this question, the answer is "yes." Simply write the number in, say, hexadecimal notation, and it will always be shorter than in decimal (if it is $\geq 100000$). If you are looking for the shortest representation, however, that is well defined only if you specify exactly what counts as a valid representation of a number. After all, every number you come up with can technically be represented in just one digit if you choose a sufficiently large base (at least one higher than the number itself).
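For what it's worth, here is a minimal check of that threshold (assuming Python):

```python
# Hexadecimal never needs more digits than decimal, and is strictly shorter
# once the number reaches 100000.
for n in (99999, 100000, 424 ** 37):
    print(len(str(n)), len(format(n, "x")))
# 99999   -> 5 decimal digits, 5 hex digits
# 100000  -> 6 decimal digits, 5 hex digits
# 424**37 -> 98 decimal digits, 81 hex digits
```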

About the standard way to reduce it: I'd say the decimal system is already the standard way of reducing numbers from just writing the appropriate number of dashes. :-) In the sense pointed out by Bill Dubuque, there is no representation that is more space-efficient in principle, that is, if you transfer the problem to a more abstract domain. If, instead, you are looking for the shortest representation using a given set of mathematical symbols, then no, there is no standard way to find it. I also suspect no algorithm can do better than brute-force search (that is, first check all numbers representable using 1 character, then those representable in 2 characters, and so on). In any case, nobody would write a number as "$\log_e 56$ when we discard the first 6 digits and then consider the first 200 digits," so no conceivable set of standard symbols will account for that anyway.
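To illustrate what such a brute-force search could look like, here is a minimal sketch (assuming Python; the symbol set and the use of eval are choices made purely for illustration):

```python
# Try all expressions of length 1, then length 2, and so on, until one
# evaluates exactly to the target integer.
from itertools import product

SYMBOLS = "0123456789+*^()"

def shortest_expression(target, max_len=4):
    """Return the shortest expression over SYMBOLS equal to target, if any."""
    for length in range(1, max_len + 1):
        for chars in product(SYMBOLS, repeat=length):
            expr = "".join(chars)
            try:
                value = eval(expr.replace("^", "**"), {"__builtins__": {}})
            except Exception:
                continue            # skip ill-formed expressions
            if value == target:
                return expr
    return None                     # nothing short enough found

print(shortest_expression(1024))    # "4^5": three characters instead of four
```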

  • 2
    The stipulation was not "less characters," but "very few," and regardless of the size of the alphabet this is not possible. Your last comment also misses the point: that sentence can be turned into a perfectly valid description of a small Turing machine which prints a large number, but the point is that for most large numbers one cannot find such small Turing machines. – Qiaochu Yuan Aug 27 '10 at 17:00
  • 5
    To drive Qiaochu's point home: the OP asks for "every huge number" to have a compact representation; on the other hand, you say yourself that using a different, larger base has a ceiling after which representation becomes impractical. As for "every number you come up with can technically be represented in just one digit if you choose a sufficiently large base", you have merely shifted the information content from the number to the base. How does one represent the base then, afterwards? – J. M. ain't a mathematician Aug 27 '10 at 23:32
  • @Qiaochu Yuan: Are you sure "very few" can only be understood as "a finite number"? And the Turing machine you mention will surely be larger (by any decent measure) than the decimal representation of the number itself! My comment just expresses the belief that the same is true for any standardized formalization of that sentence. @J. Mangaldan: My point was just that there is no well-defined notion of "shortest representation." Aren't you essentially making the same claim? – Sebastian Reichelt Aug 28 '10 at 00:16
  • the Turing machine description is compact in the sense that one can easily replace 200 by any positive integer. The resulting, say, 400,000-digit number can be described by a Turing machine much shorter than 400,000 characters long, but the point is that this isn't always possible. As for your objection to J. Mangaldan's answer, the more important point here is that we can make a choice of encoding relative to which a shortest representation exists, and regardless of this choice of encoding the arguments given in the answers continue to hold. – Qiaochu Yuan Aug 28 '10 at 01:35
1

It follows that, for any singular or batched compression scheme, the answers above are correct.

However, it equally follows that, agnostic to efficiency, for any arbitrary number that can be reduced to some relation in the infinite space of all possible relations, the answer is a solid yes. For example, we understand your notation to mean the number $\log_e 56$ units away from $0$ in a discrete or continuous sequence of real numbers, and we might exploit this property to build a similar relational partitioning scheme expressed in fewer bytes of local data.

The difference is, in the case of infinite heuristic space, you're hiding rules of greater complexity than your input data behind the scenes. This can work very well in limited computational contexts (for example, in situations where it's more expensive to send a message over the wire than to perform a computation locally), but in the domain of pure expression, it's definably less efficient.

Putting this concept into practice, you can always get away with elegant, tweet-length expressions of your data provided the reference is not contained within the text body. The right approach is to provide your simplification to establish context and to provide a pointer (i.e., in the form of a link) to the reference. As long as it can be reasonably assumed that the space for such pointers is unbounded (i.e., by providing an ever-widening character space), you'll always be able to provide a necessarily terse expression.

MrGomez
  • 1
    This is something that was definitely on my mind. The key is "hiding rules of greater complexity than your input data behind the scenes. " For example I can tweet, "Distance between where I am standing to statue of liberty." The answer varies based on my current location. – endrendum Nov 05 '11 at 20:51