Encryption Tutorial using JavaScript

(An examination of basic techniques.)

And an example of the RASCAL system using described techniques.

Features of RASCAL coding examples:
  • It does not require servers or compilers, and you have the full source that executes on your browser.
  • It is NOT a stream cipher - it is much stronger and has none of the problems.
  • Designed for finite messages such as E-mail (about 100,000 bytes max.)
  • Encryption key-length up to 55,000 bits using normal keyboard characters.
  • Cryptographically Secure (CS) encryption without a CS PRNG. I give real alternatives, with examples.
  • Maximum confusion and diffusion (stronger than Feistel.)
    • Substitution generator for keyboard data expanding its dynamic range to full 256-bit.
    • Random Shuffle of bit and byte data across entire message, and not just within small blocks.
  • Maximum entropy is achieved because the user input key determines what happens - not formulaic code.
  • May be easily adapted to block methodologies
Contest
Prime Number
Factorial calc
RASCAL0 (Overhead)
RASCALK (CS keys)
RASCAL4 (Ultimate)
randlib.js
Hash0 test page
Hash2 test page
SubGen test page
Shuffle test page
Case for larger blocks
Local CSS
Rand1 test
Rand2ss test
Rand3 test
Rand7 test
Rand8 test
Rand9 test
Rand11 test
Rand14 test
Rand15 test

Revision History
*01 July 2008 - 3.4.2 release. Corrected error in bit shuffle code.
*01 June 2008 - 3.4.1 release. Tweak of the code.
*01 April 2008 - 3.4 release. Final version of offered code.
*01 February 2008 - 3.3 release. Major tweak of Code.
*02 January 2008 - 3.2 release. Code and examples tightened.
*10 November 2007 - Maintenance release. Main addition was a bit-level shuffler.
*22 October 2007 - 3.1 release. Normal maintenance release.
*01 October 2007 - 3rd release. Strengthened in almost every way. More examples, and more discussion.
*10 August 2007 - Rand1 turned into near-CS condition.
*23 July 2007 - 2nd release version with improved code, and stronger encryption. And basic RC4 example.
*08 July 2007 - Final prep, and improved RNGs.
*13 June 2007 - 1st release version.
*05 May 2007 - Final cleanup, getting ready for release.
*10 March 2007 - basic offering...

Why JavaScript?

I present ideas and examples, and to see them in action you do not need any compilers or access to a server - all code works within your browser. I have paid attention to porting the code by not making extensive use of JavaScript-specific features, but there are things that are unique to JavaScript (unfortunately.) But you should be able to see the example I am making, and then make it work in your language of choice.

This approach is for finite messages like E-mail, and is NOT a stream cipher approach. It is interesting that the JavaScript examples actually may be used to encrypt discreet text messages very effectively, but everyone wants to use their special language, or port the code to a server (which actually decreases security.) Have fun with some new ideas - ideas that are NOT stream ciphers, but are actually more effective for discreet text files.

The "stream" cipher is used for on-the-fly encryption for things like secure digital voice networks. That is where the money is, and where everyone spends their time. For discreet messages like E-mail, stream ciphers are really not appropriate simply because of the key-types (weird numbers), and key lengths. But people cannot get their minds away from it, and represent some rather narrow thinking.

The approach I use is good for the encryption of single messages, or extremely large files, and can be adapted to block-encryption methods. And the "keys" come from the keyboard and your key-string can be of any length, and is normal characters. A key does not have to be some silly number, but can be something like "test it, dude" without the quotes. Have fun...

What is encryption?

Basically it is altering the text of a message in such a way as to make it more difficult to read. We would hope that it will keep others from ever reading our messages, but that may be difficult to achieve in practice. Anything that can be done may be undone. In WWII the Allies were more successful because they broke the "unbreakable" German Enigma code, and the Japanese JN25 code. We read their "secret" messages. We figured out what they did.

To encrypt a character we must somehow operate on it to change its value. We must be aware that whatever we do to change the character, we must be able to undo - this is the general weakness of ALL machine-generated encryption algorithms: if I can discover what you did to change the original character, I can then undo what you did and read your message (given enough time). What a good encryption algorithm does is to change the original text in such a way that it is terribly difficult to figure out, and, therefore, to undo. Some approaches are so difficult (even if you have the source code), that you cannot solve them in your lifetime even with the most modern computers. In that case, it would be a good encryption scheme because you do not have the time to break it.

What can we do to alter a character of text? The list of things is actually rather long, but most of them are not very good, as we shall discover. They fall into four basic categories - substitution, mathematical, logical (Boolean), and bit/byte manipulation...


Data Preparation (obfuscation)

Cryptographically Secure (CS) encryption may be realized in different ways. In stream encryption it requires a CS Random Number Generator (CSG) to encrypt the text of the message. But this is not true if we use some other (non-stream) approach. There is a much ignored area of Data Obfuscation (DO) that is important to realizing CS encryption without the need for a CSG. This is heresy to many people, but that is because their mind is locked into "stream-think"! See Appendix 2 for an analysis...

When dealing with keyboard data to be encrypted, it quickly becomes obvious that there are problems with that data, such as only using 96 different values with concentrations of such things as space characters. The "range" of the data is limited, and its distribution is very irregular. It would be nice if we could smooth this out a bit before encryption - "prepare" the data, as it were. You might Google data obfuscation, and data entropy to learn more. Also see links at the bottom of the page.

If the obfuscation is good enough, then we can use it to achieve Cryptographically Secure (CS) encryption without using a Cryptographically Secure Pseudo Random Number Generator (CSPRNG, or CSG.) This can give rise to a flexibility that cannot be realized with a CSG. More on this later.

It would also be nice if we could shuffle the data before encryption so the order of the data is hidden - OR, shuffle it before and after encryption to hide which key values were applied to which text values. This requires the ability to manipulate the data, and then to un-do it to un-mix the data.

After a little work, I figured out how to apply, and then un-apply techniques such as these to the data. Apply a rotating substitution array of characters to hide things like spaces (distribution of the data), and mix bits and bytes by shuffling the data according to a random key. Following are descriptions of these functions. And each of them can be un-done. See the description of RASCAL0 for a working example of these features. With these functions I am both obfuscating the data, and increasing its entropy - before encryption. And if I am successful, we can use a simple Lagged Fibonacci Generator (LFG) to "encrypt" the data, and it shall still be CS!

Substitution Arrays (SubGen)

Again, normal keyboard text data is 7-bits. This presents the problem of exposing the high-order key bit in the encrypted output, which is not good. (If you XOR 7-bit data with an 8-bit key, the most significant bit of the result is raw key.) We address this situation by passing the text (with expanded UTF bytes) through a one-level, rotating substitution array, which substitutes full 8-bit data for all input values. What we get is full 8-bit data for the encryption process even though most, if not all, of the input data is 7-bit.

This implementation of the substitution array triples the dynamic range (the number of different values within a data string) of data from a keyboard. It is surprising that such a simple idea can do so much to mask the type and distribution of keyboard input data. I will even go so far as to suggest that any keyboard data headed for any sort of encryption process should be put through substitution-array processing of the type used here to expand its dynamic range, and the entropy of the data!

A simple analysis is that even a valid breakout (removal of all encryption except the substitution array) produces a string of numbers where all values from 0-255 are equally probable. That will not be considered a breakout (even though it really is), because breakouts are detected when they only have 70-80 different values within the text. If the text has all values from 0-255, then how can it be keyboard data? Seems substitution has some value, after all - but only to expand the dynamic range before REAL encryption begins. But it is a real value.

The substitution array is based on input data such as a key value, and actually has some encryption potential (but not very much). Each time the substitution array is accessed it is rotated by a random number of positions, so that 3 e's input will not produce 3 of the same character on output, and repeated text patterns will not have repeated substitution values. And an infinite amount of text will cause a space character, for example, to assume all values from 0-255 (decimal) with equal probability. Don't tell anyone, but it is a hidden encryptor function within the Encrypt and Decrypt functions that is present for all RASCAL versions. These functions are in the JavaScript library, (randlib.js).

  SubGen (ary, key); // replace ary with substitution array data

    Where
      ary = numerical array in.
      key = numerical array to prime 0-255 substitution gen

  UnSubGen (ary, key); // undo substitution array data

Shuffle Routines

There are times when it is beneficial to shuffle the data in a random way - perhaps even in the middle of encryption. The data is rearranged into random order from its original order. If done properly, it is extremely difficult to determine the original order of the data. The only trick is to be able to "unshuffle" the data back into its original order, and at the proper time.

We have a system where the first seed value is not used to produce the first key value, and the first key value is not used to encrypt the first data value. And we substitute 0-255 value data in place of 30-127 value data, and apply keys nonsequentially. NOW add the ability to shuffle either the data or key arrays at any time in the process.

We can apply one level of encryption, then shuffle the result before applying a 2nd level of encryption. We could even shuffle it again before giving the encrypted output to the user. You can even shuffle it two or more times in a row, but you can only mix it up so much - you cannot make the order "more" random. But you can shuffle it as often as you want - just be certain to unshuffle it for each shuffle.

It is interesting to note that if you shuffle data, then encrypt it, and then unshuffle the encrypted data before presenting it, that you have gone about as far as you can in applying keys non-sequentially to that data. If a person does not have your keys (some form of them used as shuffle keys) then there is no alternative but a brute-force attack on the encrypted data - and they must have this code to be able to do that. Also recall that the application of a substitution alphabet to the data makes it more difficult to detect when a brute-force attack has succeeded.

Most people could not begin to break a 100 character, or longer, message that just had a substitution array applied to it, and then was shuffled. And recall that this is normally done prior to applying real encryption to the data...

  Shuffle (ary, key); // shuffle array bytes based on key

    Where
      ary = numerical array in.
      key = numerical array to prime RNG
    (The return is the shuffled array)

  UnShuffle (ary, key); // undo effects of Shuffle
    (Return the original order of the aray)

The Shuffle function is the best, and is truly random (for all intents), but there is a somewhat faster routine called Shuffle1 which uses the key directly rather than an RNG. But Shuffle1 is not as good as Shuffle, and has some holes - but for a general case it is OK if followed by other obfuscators, AND the key is hashed.

  BitShuffle (ary, key); // shuffle bits of a block based on key

    Where
      ary = numerical array in.
      key = numerical array to prime RNG
    (The return is the shuffled array)

  BitUnShuffle (ary, key); // undo effects of BitShuffle
    (Return the original order of the aray)

Once the data is prepared (obfuscated), the idea is then to somehow combine a "key" with that data in such a way as to produce an encrypted value, and to do it in a way that is difficult for anyone else to figure out. The task now becomes how to generate a list of key values that are unique for any given keyboard values.

How do we generate a list of "key" values to use to apply against the original text stream? What we want is a key-stream that is difficult to detect, or figure out. The more difficult it is to figure out, the more secure our original text message is. Here we are going to start with some really simple techniques, and then get more complicated. I'll give you examples of them all. By the time we finish with this, you are going to be very near state-of-the-art, and be dealing with 100,000-bit encryption. You will be able to author routines that nobody can bust. That cannot be said of 256-bit encryption techniques!

Pseudo Random Number Generators (PRNGs)

In the world of computers, there are things called Random Number Generators (RNGs). Actually they are Pseudo-Random Number Generators (PRNGs), because it is not mathematically possible to programmatically generate real random numbers - anything deterministic (generated by code) cannot be considered "random". That is just a fact you cannot escape! (I confuse the terms, but understand that I am always talking about PRNGs!)

However, we can get really, REALLY, close - so close, in fact, that people cannot figure out what we have done to generate them. And that is a fact you cannot escape, either!

It is a constant battle between the "generators", and the "breakers", and the tide of the war is ever changing. Recall that the Allies "broke" the unbreakable German Enigma code during WWII! Never was a horse that couldn't be rode, and never a rider that couldn't be throwed! Let's get started.....

Glossary

Our "program" to generate a key list must have a seed to start it running at a particular point. Since we need to generate the same list of key values to encrypt, and then to decrypt, we must have a way to start the generator at the same point on demand.

The key-depth and cycle-depth of generators must be known before you can tell their abilities - generally, the larger each is, the better. It is also important to know something about the distribution of the keys that are output, which should be evenly distributed over all possible values. See testx14.htm for an example of basic statistical analysis on the Rand14 generator.

Even Distribution of Encryption Keys

Here I am going to talk about the Dist() and Even() functions (in randlib), and it is rather difficult for me. What it does is to apply an even distribution of encryption keys (or bits) to the entire message. Why this is desirable is the difficult part.

What an even distribution means is that over the entire message there is an equal probability of any value from 0 to 255 (decimal) being applied as an encryption key to the data, if the message length is Mod 256. If the message length is Mod 256, then there are equal applications of 133, or 201 (or any other value) applied over the length of the message.

Here's the hard part... The way to get this even distribution of encryption keys separates the actual user-input keys (from the keyboard) from the keys used to encrypt the message. In other words, no matter what the user enters as a key value, an even distribution of encryption keys is forced. It is very important to understand this idea. And it is not easy to explain... Simply, the more distance you can place between the user input keys, and the actual keys used for encryption, the better - IFF you maintain some relationship between the two. In other words, different user input MUST produce a different series of uniform keys! And the same user input will produce the same uniform keys.

The Dist () function is a simple example of how to do this. It generates an array of values going from 0 to 255, and repeating this for the entire length of the message to be encrypted. If the message is 512 bytes long, then it generates 0-255 two times within an array. You can see that there are two zero's, two one's, etc - an even distribution of values. Then the "user input key" is used to prime a PRNG which produces a shuffle key, and the array is shuffled. There are still two zero's, and two one's, but they are not in the positions of the array where they were initially placed.

It's a Kevin Bacon thing but with four degrees of separation - the user input key is used to generate random numbers that are used to shuffle a uniform array of data that is then used as encryption keys for the message. Any connection between the user input key, and the encryption keys is vague, and you are forced into a brute-force attack. In other words, the user input key is not recoverable from the output of this process because the values are only used to produce a PRNG string of data, and the "values" of the PRNG data are not used directly, only their relative magnitudes (compared one to the other) as sort keys. It is the results of the sort process that are used to provide encryption keys for the message.

This same scheme of using user keyboard data to mix up an even distribution of 0-255 values is also used in Rand14 (RC4 type generator.) The key data is several steps away from the actual data entered on the keyboard. But the Dist function is the first pure application of this approach. Dist is very secure because of the substantial unlinking of keyboard data from the keys used for encryption...

How many different ways can a 256-byte element array be arranged? 256 factorial. 256! = 10^450 (about).

The Dist() function shuffles bytes, but the Even() function works by shuffling bits. In Even(), we create an array of bytes each with an equal number of 1's and 0's (01010101, binary). Then we shuffle the bits of the array. There may not be an equal number of all possible byte values, but there shall be an equal number of 1's and 0's in the key values used to encrypt the data. A different sort of even distribution concept - Dist() produces an even distribution of bytes, and Even() produces an even distribution of bits. If the shuffle is random, then the keys produced will be random from either function.

Logical Congruential Generators (LCGs)

These simple generators are good for things like shuffling cards, but that is about it. The reason is that they have but a single sequence of output values. Different seeds just start the cycle at different points within the circle of possible output values. And we are only talking about 10^9 different starting points for most code, and that isn't very many, believe it or not.

A surprisingly useful function within encryption is as an obfuscator, in that some "sophisticated" generators can be difficult to initialize and may produce an initial burst of rather predictable values, especially when alphanumeric keyboard input is used for a seed. These simple generators can hide that sort of behavior for a little while, until the better generator settles down. Some of the RASCAL examples make use of LCGs just to keep things honest, and to "load" LFGs with initial data before the user keyboard key is applied.

Special Hybrid Generators.

These are of my design, and intended to make the seed very important in the encryption of short messages (< 100,000 bytes). If they work like I think they do, then they significantly contribute to the complexity of encryption - if not, then they just serve as a background noise. Always use them with "proven" generators until I verify their credibility.

They are based on the idea of cascaded LCGs, where each initial value is based heavily on a specific seed value, and uses a "carry" from one LCG in the cascade to the next. Each new value is then based on the current value, plus a value carried into the current cell by the previous cell, and then some math operation. There are effectively 512+ LCGs within the cascade, and which LCG is the first one to contribute a key value is determined by an RNG that is primed with a value derived from the seed.

This distribution of values is pretty good, but based rather heavily on the value of the seed. The first few values after initialization may not be good values and may betray something about the seed, but I attempted to hide that a little. Perhaps the first cycle of values should be discarded. Rand1 (in randlib.js) is the hybrid generator.

RC4-type generators.

These generators are interesting because of their simplicity, and the ease with which we can increase their complexity. They can be used by themselves, but often serve a higher use by being used with other generators - because they divorce the actual seed from the internal key, they can be looked on as pseudo-hash algorithms, and move the complexity of other generators used with them closer to the classification of "Cryptographically Secure Generators" (CSG). Understand the base RC4 idea, and it quickly becomes apparent how to change it to make it more secure.

The versions of RC4 I use is in Rand14 in the JS library. Rand14 is (mostly) a true representation described in the literature, and wants a 256-byte key which can be produced by using the Hash2 function to generate it independently of the actual keyboard seed size. This means Rand14 can take any keyboard input up to about 700 characters and produce unique encryption keys.

Hashing of a long input seed is a simple way to have at least one Cryptographically Secure algorithm in your stable. It is simple, fast and sound. Use Rand14, or Dist or Even as generators in every algorithm you develop, and you force a brute-force attack.

Lagged Fibonacci Generators. (LFGs)

LCGs work with the same circle of values - that is not true of LFGs, which produce a different circle of values for every seed they are given (the way I do it). This makes them vastly superior to the LCG.

The cycle depth is many orders of magnitude above that of the LCGs, but more importantly the LFGs maintain (in memory) the last few state versions, and the number of state conditions they remember may be primed with seed values from the keyboard. This is what makes it possible to have 4,096+ seed values per LFG (like I have), and, therefore, 10^27,500-bit seed depth per LFG function used.

The LFG is sufficient to produce an effective one-time-pad-LIKE encryption system, but what it really does is to give access to the seed-mechanism because it "remembers" the last n states of the generator. Part of the seed initialization process is to put seed characters (from the keyboard) into all of those array positions that are remembered, so that the seed characters can produce a key based on that seed as output from the generator. Of course, we must divorce the key element application from the message element (explained in detail, later) - neither the message (M[i]) element, nor the key (K[i])element must come together (M[i] XOR K[i], nor can M[i] XOR [K[k] be followed by M[i+1] XOR K[k+1]. Ensure that, and you may encrypt with an LFG and be CS! (see Appendix)

If you use my approach to encrypt small messages (less than 100,000 characters each), with different seeds, and use seeds that are more than 100 characters long (perhaps cut and paste from an on-line dictionary) there is no way this can be broken. A single 100 character seed amounts to 650-bit encryption! And that means it is an effective one-time-pad LIKE system that uses deterministic keys.

Rand7 Rand8 and Rand11 are my LFGs. Rand11 is my workhorse.

Hashing

Hashing is basically an irreversible encryption technique. You can not work backwards from a hashed value and discover the original value by any means other than the brute force method of trying a value to see if it produces the same hash. It is an attempt to condense or summarize a character string while hiding the string that produced it. It is mainly used for things like password verification without having to know anything about the real password, but does find its way into ciphers as a method to further hide the keys from an observer. An example is that the Mersenne Twister is not considered CS (see link at end) unless it has a secure hash algorithm incorporated with it to mask the output from it.

How a hash algorithm would be used to hide the keys would be to take a key and hash it, and then use the hashed output as the seed into the PRNG. That would mean that you could not get any hints about the real key that was used with crypto-analysis techniques because the output of the generator would depend on the hashed version of the key, and not the real key. This really means that you are almost reduced to a brute-force attempt at breaking the code because the real key value has been hidden from you. If you obfuscate the keys with a random shuffle then you have FORCED a brute force attack.

Any RNG that can accept keyboard keys may be vastly improved by taking that keyboard input and hashing it, then use the hash output as the seed into the PRNG. If your hashing algorithm is any good, then you have just made your RNG much closer to "cryptographically secure." Divorce which key was applied to which message element and you have a CS system. In my Hash2 function the hashed value has no order-relationship with the input to the hashed. More in the appendix...

Again, I warn you that a single character key, hashed or not, will fall to a brute-force attack in seconds (you must assume your source code is compromised.) You MUST use keys that are as long as possible, or you will be broken in seconds. Minimum total keyboard character key length to be sound is 200 characters (1,200-bit encryption). Anything less and you are fooling yourself. That statement is true in 2007, but in 2010 the key length must be over 1,000 characters long (6,000-bit). Just a warning that time marches on...

I offer a rather simple hash algorithm (Hash2) just to demonstrate the technique - it is NOT secure (it is very close, though), but does serve to hide the original text. It was designed to feed into the SubGen and RC4-type generators, and does a pretty good job. It allows Rand14 to accept seeds, and produce a key stream that is only indirectly related to the seed. When the key stream from Rand14 is combined with another generator the result is much more secure because Rand14 now hides the input key a little more.

Putting a hash generator with Rand14 makes it much harder to bust. You can also use Hash2 to generate an obfuscated seed into any other non-LCG RNG I offer. Just do not forget that Hash2 is not secure, but it acquaints you with the idea of the hash function. (Just between you and me, my simple version is much better than MD5! Hash2 takes longer to execute, but produces much better output and is much simpler!) Also, the power of my little example (two loops and two lines of code at the heart of it) happens because of the power of the Rand11 RNG function. At the heart of every good hash function is a good RNG - the better the RNG, the more simple the hash function is to code. Near-CS is achieved by the partial shuffle of the data, which cannot be undone.

Another important function of hashing is to hide an original password, and use the hash of that password for password verification. In this case a person enters his password, and that password is hashed, and the result is checked with the hash on file for that user. This works because the original password is not on the system anywhere (cannot be hacked), but its hash (a one-way encryption process) can be used to verify the password. You cannot start from a hash and work backwards to get the password that produced that hash. A hash is not reversible (but can have collisions that help defeat it.) The better the hashing algorithm, the fewer collisions. A perfect hash would be able to take about 300% more data than the hash output character length without collisions because about 30% of the 7-bit data from a keyboard are not printable, and the hash function expands that input to 8-bit. An interesting possibility is to "ZIP" a file, then hash it to get fewer collisions.

It is rather difficult to analyze all that is going on within Hash2, so I suggest you try to look at the source. It does not translate well into other languages - that is why I provided links to other analysis of hash functions. You don't really need a hash function with the RASCAL system, but it DOES make things much harder to bust!

Every test I put Hash2 to beats MD5. That doesn't mean my simple example is better, it just means I have not been able to find where MD5 is better, and it handles all of the known problems of MD5. Hash2 ain't no slouch! You get an idea about flexibility by looking at the call structure...

The JS call is...

  Hash2(a, len, flg);

  Where
    a    = the ordinal array to hash (any length).
    len  = the length of the hash output string (10-4096).
    flg  = output return format
           0 = 0-256 numeric format (ordinal).
           1 = hex notation of the data (doubles display length)

A good hash algorithm has just two properties - it is not programmatically reversible and requires a brute-force attack to determine the input, and there are as few collisions as possible (different texts giving the same hash output.) My Hash2 algorithm meets both of these requirements.

I also provide a hashing template that you can use to build a hash function within the randlib.js code, called Hash0. It is something to play with, and I give a brief description of it here. See Hash0 test page for something that you can execute to see what your changes to Hash0 actually do to the data.

Hash0 doesn't really do anything except force the output to a length that is a multiple of "len" and expand the input into that length. Try typing in "test" without the quotes and you will see what I mean. The hex output will be a repeating pattern of the input. Its use is as a template so you can see what happens when you start to change things. It is interesting that just changing lines 1 and 2 can create unrecognizable output - especially changes in line 2.

It is difficult to put algorithms like MD5 into this template because they are more interested in the bit-lengths of different types of integers within a language (like all block cipher systems are), and I operate in character mode which allows key lengths of thousands of keyboard characters. If you are interested in the bit length of integer types you are going to have real problems handling a 1,000 character key, or dealing efficiently with keyboard input.

UTF Data Problems

Current standards include some special formats with what you may think of as "text" data. Us old-timers think of "text" data as a subset of 7-bit ASCII, but that is not the case today. UTF data may include full 32-bit values within the text to indicate special characters that are not included in the 7-bit ASCII character set. These values must be handled (if they occur), or we will get into problems like addressing array elements beyond the array boundaries. This is what the FixUTF and RstUTF functions do for us - they convert values greater than 255 into several bytes of 8-bit data for encryption. This happens internally within the Encrypt and Decrypt functions of RASCAL examples, but should be mentioned.

The approach is to take seldom-used values (from 1-5, decimal) and use them as markers for special bytes to follow. Any values above 255 are converted into 8-bit values with a prefix byte telling how many bytes follow. For example, 257 would be converted to three bytes with hex values of 02, 01, 02. The first byte indicates the number of following bytes. This happens before the data is encrypted.

After the data is decrypted, it is scanned for marker bytes, and 02, 01, 02 would be converted back to 257, decimal. The UTF functions are located in the JavaScript library (randlib.js).

  FixUTF (strn);  // break down UTF into 8-bit bytes

    Where
      strn = string data.
        (a numerical (ordinal) array is the output.)

  RstUTF (ary);  // restore UTF codes.  Array in, string out

Sorting Data

There are some internal sorts in the code to handle the sorting of numeric data. Most languages have internal sort routines, but they are not very compatible with the JavaScript sort. I include my own sorts so you can see what is going on. Of main interest is that they are numeric and not character sorts, but I am not sure I can fault character sorts in this application, except that in JavaScript performing a character sort on numeric data changes that data into character data, and I want numeric data - extra conversions are not a good thing. The sorts are also in the JavaScript library (randlib.js).

  QuickSort1 (aray, left, right);

    Where an array of numbers between the left and right elements
    is sorted.  QuickSort is fastest, but uses recursion which is
    not allowed in some languages.

Encryption Drivers

An encryption driver is something that accepts a text message, a seed for the encryption of that message and encrypts it - message in, encrypted text out. It initializes the key generators, and uses some algorithm to apply key data to the text data. This is the heart of any encryption system, and where all the games are played - like encrypting the data from back to front, rather than front to back, before returning the encrypted text. The reason we can do this is that we have the entire text, AND SEED, available to us to work with, and not just one letter of it at a time. This gives us great power and versatility over the limited thinking of stream-cipher ideas...

To be illustrative, see RASCAL0 JavaScript.

In my RASCAL4 example, I use several mathematically unrelated generators in combination to produce keys, and apply those keys in different directions to the text to be encrypted. If you use long seeds, and never repeat them, then I think this is unbreakable except by brute force. I KNOW that stream analysis techniques just won't work for RASCAL4,

Selecting Seeds

Several times, above, it was mentioned that you should select seeds that are longer than 100 keyboard characters, perhaps by using online dictionaries. The question becomes how can we do this in a way that is simple, and transmittable to the receiver.

One idea is a codebook, which is a list of key sources that both you and the receiver of your messages have somewhere. It doesn't have to be something physical that you both have in your possession - it could be something like a list of words that appear on a computer somewhere, and what you provide with your message is an index into that list which you use to look up a word in an online dictionary. What you both agree on beforehand is just how much of that definition you will use - something like just the first 19 lines (with the resolution of both your screens set to the same value.) Your actual message would then look something like this...

19.5,7
 31772 51d2b 064f5 a5d66 08583 b0a2c 0a751 6631b 58204 34a21 09592
 f642f 38745 86604 120f3 1076d 25637 33b25 3f142 7546b 20797 60f62
 15562 11e76 3a282 b6a54 326b5 31150 2a0e6 90577 

Where 19.5,7 indicates the key you used to encrypt the message. It could be a Bible verse, or a verse from the Koran. Or a chapter, sentence, word out of Moby Dick that you look up a definition for...

But all of that stuff is rather easily determined because people can get access to the sites you visit by hacking your ISP. This is why you must also have a "private" seed that you just remember. Something like your mother's name plus the month you send a message. Notice that the RASCAL4 example has a two-seed system, as coded - here you would put the indicated Bible verse into seed 1, and your "private" seed (plus the month) into seed 2.

So, seed 1 might be "In the beginning...", and your 2nd seed could be "Martha - July, 2006.". See how it works? To really be secure you need two seeds. One that you indicate that always changes (like 19.5,7), and one based on a "private" value that has an ever-changing value within it like the month the message was sent. This is a simple example of something that cannot be broken. BUT, every few months you must change your "private" seed value! (In cryptography, words like "never" are NEVER true!)

For single seed systems, a "private" value that has the current date appended to it is really a good seed! If your private value is, "This is my private value.", then a really good seed becomes...

  This is my private value. - 6 July, 2006.

That 41-character seed is the equivalent of 256-bit encryption. And you will notice that changing the day from 6 to 7 will generate an entirely different key-set for the data. This is not the case for most encryption systems. In my system, change a single character in the seed, and you get an entirely different key-stream. But more secure is to change two things, perhaps your private value, the date the message was sent, and the headline of the London Daily Mirror for the previous day (because you may send an early morning message.)

  This is my private value. - 6 July, 2006 - IRA BOMBS PARLIMENT

Note that caps make a difference - they are different characters from lower case characters. That last seed is 441-bit encryption, by the way.

While on the topic of seeds, what does it mean to have 128-bit encryption? It is a reference to the seed depth, but what does it mean to something like the complexity of a brute-force attack? How many seeds would you have to try before you got a bust with 128-bit encryption?

128-bit encryption means there are 2^128 possible seeds available to use. Then a brute force attack would have to input (on average) half that many seeds before they got plaintext. So, how big is 2^128? My machine is a 3 gigahertz machine, which means that its clock runs at 3 billion cycles per second - that is 2^30 cycles per second. Let's say we had a monster machine (like NSA has) that can process 2^40 seed values per second, how long would it take to bust 128-bit encryption by brute force? It would take 2^128 / 2^40 = 2^88 seconds. And there are 2^25 seconds in a year, so 2^88 / 2^25 = 2^63 years. That's quite a few years.

The way 128-bit encryption is broken is not by brute force, but by flaws in the algorithm that exist because it is deterministic, or by doing something like creating "dictionaries" of key streams of popular algorithms beforehand. The NSA just completed a project where they used thousands of computers to develop these dictionaries. This allows shortcuts, or the possibility of reverse engineering. We see from the RC4 article (link at bottom) that you can predict keys after 1 gigabyte of received data (using the same seed value). So the problem to bust RC4 is how long does it take to get 1 gigabyte of data. That can be done, and has been done to bust RC4.

If you use an approach to encryption that effectively hides the details, then you force a brute force approach. And that means that the real question for encryption becomes how do you hide the details in a way that forces people to the brute force approach. Or, equally important, is it possible to hide your algorithm if you publish your source code. I view these questions as the most important questions in the field of cryptography today. One rather obvious answer is to use an algorithm that not everyone else is using, or add something to the one you are using like obfuscation.

Some partial answers are based on the limitations forced on us by stream-cipher thinking. If we can get away from that model, then the doors are opened. I understand that there are valid uses for stream ciphers. In the military you just broadcast key-data until a message becomes available to send (you are hiding message count and message length), and you just send the message when one becomes available. Or somebody typing live into a machine (or talking live into a machine). I understand those needs. But I suggest that many needs can be better met by moving away from the ideas of stream ciphers.

The exchange of sensitive corporate data does not require the stream cipher approach, and neither does sensitive government information like most diplomatic traffic. I would suggest that the general approach for low-volume traffic is best served by something other than stream cipher thinking - especially those with a message length of only several tens of thousands of characters.

Again, is it possible to hide the actual details used if you publish your source code? I give you my source code, and an encrypted message - do you have a good way to attack it other than brute force? This is the question that needs an answer. The Mersenne Twister is a prime example, but it has problems, I would suggest. The biggest is that it is a stream cipher with limited starting points - it should have more starting points, and get away from the idea that the first key encrypts the first data byte, and the last problem is that it is a stream cipher!

Rand14 with hashed seed, and shuffled message should give us CS credentials, and the other stuff would mess up the message so much you could not begin to figure it out. You are forced into brute-force analysis for more than one reason, I suggest. (Even brute-force methods will not work unless you have the algorithm - but you must assume they have your source code.)

By getting away from the limits of the stream-cipher idea we can ignore most of the methods used to attack them and be secure. By applying the key data in other than front-to-back, sequential manner, most of the stream cipher problems just go away - it becomes EXTREMELY difficult to tell what specific keys (multilevel encryption like RASCAL4) were applied to a particular message character. Or even what specific seed values were used to generate any specific key. You cannot figure it out even with the source code because you do not have the seed, and the seed is used to prime an RNG that changes the initialization of all core functions - different seeds produce different initialization parameters. Another way of saying this, is that the "intelligence" of the system is almost entirely contained within the seed value, and not the code. So look at my code... Without the seed you don't have a starting point for analysis. Perhaps someone can see an approach to bust it, but I cannot - I would love to be educated if you do.

Imagine applying keys to data from back-to-front with one generator, then to all even locations with another, then all odd locations with yet another - changing directions, maybe starting in the middle of the text and working to each end. Maybe generate a key stream and then "shuffle" it before applying it (just be sure you can undo the shuffle). The ideas are endless if we get away from the ideas of stream ciphers, and work on messages that can be contained in memory. I just stimulate your imagination.

Stream cipher analysis does not work on RASCAL4, You may be able to bust RC4 if it is used in stream mode (front to back encryption), but what happens to that analysis if we use RC4 to encrypt from back to front, and use other algorithms in different directions on top of it? All, I repeat ALL, analysis techniques of modern systems (except brute force, and substitution analysis) are based on stream cipher analysis! A L L ! What happens if we do not use the stream cipher approach? That is a very interesting question. That is why I keep asking it!

Well, I lie just a little - there are approaches that take your source and apply their own text and seeds and see what happens to see if they can figure out a black-box approach to get some information about what you are doing. "Black box" because this level of analysis is not really interested in your algorithms, they are looking for patterns and holes. But that approach is once again foiled by using extremely large seed values, never repeating them, and encrypting only relatively small messages (< 100,000 characters). The black-box approach requires large amounts of data to analyze. If you do not provide those large amounts of data for any given seed value, then there is no worry. ALL current analysis is based on the interception of large amounts of encrypted data operating under a single seed value. Small amounts of data for any given seed make all of this analysis worthless.


The RASCAL encryption system

This tutorial covers some basic information about the encryption of data. It uses JavaScript so you do not have to have any compilers to see how it works. Most of the examples are easily transferred to C++ or whatever language you want to use - I am talking about general concepts with some simple JavaScript examples. The links at the top of this page offer some executable HTML and JS examples of the RASCAL system from very simple, to medium complexity.

I offer these examples in the face of what I consider to be some rather serious weaknesses in what I call "stream-cipher-think." There are places for stream ciphers in the world - I do not dispute that. I just dispute all the attention given to stream ciphers when there are more flexible alternatives for many applications. It seems that every cipher is based on the "stream-cipher" concept.

What I have tried to do, here, is create an encryption system based on keyboard/file input for both data and keys that is keyboard/language oriented rather then program oriented. I also offer key lengths up to 4,096 characters per key - and it is all from a keyboard using normal language and keystrokes. I do not use long, comma-separated numbers, or any 96-character ASCII junk. And my stuff is state of the art - in fact, it is beyond state-of-the-art because I do not trap myself in the "stream-think" environment.

Common Encryption Problems

There are some common problems with most encryption systems that need to be understood and dealt with. These problems can be called "holes" in the algorithms that allow the Cryptographer to bust the messages of people using systems with these problems. The RASCAL system deals with all of them.

Cryptographic Overview

Methodology Overview


Here is an example of the system at work. Enter a message to encrypt, and one or more keys, Then hit the "encrypt" button to see the encrypted output. You can also decrypt the message.

Input text to encrypt -


Input your 1st key -

Input your 2nd key -
   

Encrypted text -




Original text back -


This example shows what happens if you receive an encrypted message...
See if you can figure out the message before you enter the keys of "11293", and "test it" (which would have been sent to you in a separate email, or even by snailmail.) There is no way to figure out what it means by reading what you get by right-clicking on the page. Also try changing the keys a little and see what happens - you have to have the fully correct keys, or you get no hints about what the message is. A partially correct key does not give a partial breakout because the individual values of the keys are used along with a "positional-sum" of the key values as an internal key-stream-primer! "11112" produces a totally different key stream from "11113" in every generator. There is no such thing as similar keys producing similar key-streams in this system.



1st key -

2nd key -
   

Your special message -



As a little reassurance I have encrypted messages as large as 2,000,000 characters and seen no problems. Much beyond that, and most email systems will not accept them. Plus it begins to take a really long time. But time may be a penalty you must pay for unbreakable encipherment of your data. RASCAL4 takes about 1 second per 10,000 characters of text to encrypt/decrypt on modern machines. (It keeps getting slower as I add things to it.) But it cannot be busted by any means that I am aware of as long as you use seeds longer than 100 characters (650-bit encryption), and never repeat them.

It is interesting, but the "time" it takes to decrypt something actually works in your favor - the longer it takes to decrypt, the longer it takes to bust.

Tuning RASCAL to make it unique to you.

If you have the expertise to modify the code, then there are things you can do to make the system "yours". It is very easy to modify a very few variables so that the same message encrypted with the same keys produces completely different encrypted text.

As an example, I created a prefix to "randlib.js" within which you can modify variables (global constants) to the system. Changing any of the variables within that prefix area will totally change the encrypted output - even if only changed by a single character.

Appendix

Why seeds longer than the message work.

I may need to explain why a seed longer than the message is useful and works (RC4-based generators use a hash function.) The reason is that the first character of the seed is not used to produce the first key character, and the first key character is not used to encrypt the first text value. If you will notice in the initialization code of Rand11 and Rand7, and Rand14, there is a value calculated based on the seed supplied. This value will change if just a single character in the seed changes. The value is used as a seed to a local RNG that produces values to indicate which seed value is the first to produce a key, and which is the first key value to be applied to the data.

If you use a seed longer than the message to encrypt, then you cannot be certain just what portion of the seed will be used. If the seed is longer than the message, then there will be some portion of the seed used that is determined by the seed itself. A RNG is primed, and it determines many parts of the initialization process - and one of those parts is which is the first seed character to be used. So, seeds longer than the message to encrypt actually work but you just don't know what portion of the seed is going to be used, and (more importantly) the seed determines the depth of encryption. Use long seeds (> 100 characters) no matter the length of the message to encrypt! And REMEMBER that the encrypt and decrypt keys must be exactly the same!


Brute Force Breaking

It has been mentioned that one way to break an encryption is by "Brute Force" - It might be a good idea to explain that.

If you have the code used to encrypt something (a requirement), then you can start feeding it keys to see if any of them produce a break. You start with "a", then go to "b", ... , then go to "aa", then to "ab" ... etc, going through all the possible keys there are.

This can take a rather long time, as you can see, but will "bust" an encryption based on a really short key. To defeat it you have but two alternatives: Hide your source code, which is impractical in today's world, OR, make it so difficult that even WITH your source code you force the breaker to use the brute force method.

If you force the breaker into brute force, then the time it will take him is based on the time it will take to go through all the possible keys until he gets a break.

As a simple example, say I tell you that I used only keyboard numbers for my key, and that the maximum sized number I used was 4 digits (10^4). If you enter one key per second, how long will it take you to find the number I entered for my key? The analysis is pretty simple - for 10,000 numbers, one number per second, how long will it take you to find my number? At most, it will take 10,000 seconds - how long is that? Well, there are 3,600 seconds in an hour, so 10,000 seconds is about 2.8 hours. You can find my number in less than 2.8 hours! The "average" would be 1/2 that. That is a simple example of brute-force breaking.

A little more complex example is if I use 10, lower-case letters as a key - how long would that take if you could enter 1 letter per second to get a bust? How many different combinations of letters could I enter with just 10 different lower-case letters? The answer is 26^10, which is a pretty big number... The first letter may be any one of 26, and the 2nd letter may be any one of 26, so with just two, lower-case letters we are talking about 26*26, or 26^2 (aa ab ac ad,,, ba bb bc, etc). (Here's where you need some probability and statistics mathematics - it's called permutations.)

How big is 26^10? It is 10^14 (This is exponents and logarithms.) And how long is 10^14 seconds? Well, there are 3,600 seconds in an hour (60*60), and 86,400 seconds in a day (3,600*24), and 32*10^6 seconds in a year (86,400*365.25), so it is about 10^14/10^7 years = 10^7 years. That's 10 million years.

Now, one character per second is not realistic, but you can see with the above examples that increasing the key-complexity can have a very great effect on the time a brute-force attack would take. A brute-force attack is the slowest there is, and if you can force any attack to the brute-force level, then if your keys are long enough there is no way it can be broken. You want to force your attacker into the brute-force mode, and that takes key depth, and algorithm complexity.


Primitive Tap Points

For LFGs you take the first number, and ONE other number as tap-points. You must have at least a number of values in the storage array equal to the first number. I only have the values for a few degrees. (Sometimes it is a good idea to take even and odd numbers.)

Degree 521, 32, 489, 48, 473, 158, 363, 168, 353.

Degree 1279, 216, 1063, 418, 861.

Degree 2281, 715, 915, 1029, 1252, 1366, 1566.

Degree 3217, 67, 3150, 576, 2641.

Degree 4423, 271, 4152, 369, 4054, 370, 4053, 649, 3774, 1393, 3030, 1419, 3004, 2098, 2325.

Degree 9689, 84, 9605, 471, 9218, 1836, 7853, 2444, 7245, 4187, 5502.

Degree 19937, 881, 19056, 7083, 12854, 9842, 10095.


Summary

There were some ideas that can get your mind to working. I suggest that there are many other ideas than CSGs that should be considered. Have fun thinking about them...

So, I can demonstrate that it is possible to deal with things other than CSGs to make a message Cryptographically Secure (CS). And the code is easy to program in any language - I use JavaScript just to demo the concepts. We can mess with the message, the keys or the encrypted results to obfuscate things to the point of making it CS.

My preference is to obfuscate the message as the text comes in - either as a single block, or buffered blocks of about 1,024 bytes. 1,000 factorial for a random shuffle of a 1,000 character message is about 10^3,000 different arrangements we can put that message in. Does that interest you? If I shuffle a 10,000-character block of text, do you have any idea what 10,000 factorial is?

We must break away from the limiting constraints of 256 bits in stream, block, and discreet messages pretty soon - why not now? Jump to 1,024 BYTE blocks, or a block-size that is the message length for discreet messages, right now and get it over and done with. Then we have a wealth of new ideas to apply to the encryption problem. I suggest that 256-bit "stream-think" just doesn't cut it anymore.

With a 3ghz processor, can't you design a plug-in crypto card the size of your palm using 10,000 byte seeds, a 16,384-byte message buffer that dumps 8,192 bytes at a time for encryption AFTER the 8,192 byte block has been shuffled, and had a substitution alphabet applied, then hit with a hashed 8,192-bit key from an LFG working on a cycle far beyond 19,137 bytes? I could do that in my sleep! I've shown you how to do that right here... 256-bits? Pffft.


LINKS
Basic Cryptanalysis.
Stream Cipher Attacks.
Cracking simple encryption schemes.
Side-channel attack on symmetric ciphers.
Definition of Random numbers.
Normal distribution.
Chi-squared tables.
Logical Congruential Generators.
Lagged Fibonacci Generators.
Logical Feedback Shift Registers.
Alternating Step Generator (an LFSR design.)
Shrinking Generator (an LFSR design.)
Cryptographically Secure Pseudorandom Number Generators.
Cryptographic hash functions.
RC4 Cipher.
MD5 hash function.
SHA hash function.
Feistel shifting nerworks.
Rijndael substitution-permutation nerworks.
Confusion and diffusion.
Obfuscation.
Factorials.
Knuth Shuffle Algorithm.


Contact me concerning this article at rascalcode@aol.com. Mention "Encryption Tutorial" in your note.


eXTReMe Tracker