Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave. Mastrave project technical report

2014-01-21T13:57:28Z (GMT) by Daniele de Rigo


de Rigo, D. (2012). Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave. Mastrave project technical report. FigShare Digital Science. doi: 10.6084/m9.figshare.94593


Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave


Mastrave project technical report


Daniele de Rigo

Abstract: Computationally intensive numerical tasks such as those involving statistical resampling, evolutionary techniques or Monte Carlo based applications are known to require robust algorithms for generating large sequences of pseudo-random numbers (PRN). While several languages, libraries and computing environments offer suitable PRN generators, the underlying algorithms and parametrization widely differ. Therefore, easily replicating a certain PRN sequence generally implies forcing researchers to use a very specific language or computing environment, also paying attention to its version, possible critical dependencies or even operating system and computer architecture.

Despite the awareness of the benefits of reproducible research is rapidly growing, the definition itself of “reproducibility” for PRN based applications may lead to diverging interpretations and expectations. Where the cardinality of PRN sequences needed for data to be processed is relatively moderate, the paradigm of reproducible research is in principle suitable to be applied not only to algorithms, free software, data and metadata (classic reproducible research), but also to the involved pseudo-random sequences themselves (deep reproducible research). This would allow not only the “typical” scientific results to be reproducible “except for PRN-related statistical fluctuations”, but also the exact results published by a research team to be independently reproduced by other scientists - without of course preventing sensitivity analysis with different PRN sequences, as even classic reproducible research should easily allow.

However, finding reference sequences of pseudo random numbers suitable to enable such a deep reproducibility may be surprisingly difficult. Here, sequences eligible to be used as reference dataset of uniformly distributed pseudo-random numbers are presented. The dataset of sequences has been generated using Mersenne Twister with a period of 2^19937-1, as implemented in GNU Octave (version 3.6.1) with the Mastrave modelling library. The sequences are available in plain text format and also in the format MATLAB version 7, which is portable in both GNU Octave and MATLAB computing environments. The plain text format uses a fixed number of characters per each PRN so allowing random access to sparse PRNs to be easily done in constant time without needing a whole file to be loaded. This straightforward solution is language neutral, with the advantage of enabling wide and immediate portability for the presented reference PRN dataset, irrespective of the language, libraries, computing environment of choice for the users.

Naming conventions:

Each file


contains a sequence of N pseudo-random numbers, uniformly distributed (generated using a Mersenne Twister with a period of 2^19937-1, as implemented in GNU Octave version 3.6.1).

The extension may be “txt” for the pure text sequence of PRN (35 characters – including the endline one – and one PRN per each line) or “mat” for the corresponding format MATLAB version 7 (containing a structure with two fields: the filed “values” contains N numerical PRN in double precision; the field “string” contains a matrix of characters witn N rows and 24 columns – the endline character being omitted – the last one fulfilling the constraint to only contain digits whose value is “0” ).

Download (permanent URLs aside from the ones provided at http://dx.doi.org/10.6084/m9.figshare.94593):

http://mastrave.org/doc/refdata/fs94593/10.txt      pseudorand_seq_10.txt
http://mastrave.org/doc/refdata/fs94593/100.txt     pseudorand_seq_100.txt
http://mastrave.org/doc/refdata/fs94593/1000.txt    pseudorand_seq_1000.txt
http://mastrave.org/doc/refdata/fs94593/10000.txt   pseudorand_seq_10000.txt
http://mastrave.org/doc/refdata/fs94593/100000.txt  pseudorand_seq_100000.txt
http://mastrave.org/doc/refdata/fs94593/1000000.txt pseudorand_seq_1000000.txt

http://mastrave.org/doc/refdata/fs94593/10.mat      pseudorand_seq_10.mat
http://mastrave.org/doc/refdata/fs94593/100.mat     pseudorand_seq_100.mat
http://mastrave.org/doc/refdata/fs94593/1000.mat    pseudorand_seq_1000.mat
http://mastrave.org/doc/refdata/fs94593/10000.mat   pseudorand_seq_10000.mat
http://mastrave.org/doc/refdata/fs94593/100000.mat  pseudorand_seq_100000.mat
http://mastrave.org/doc/refdata/fs94593/1000000.mat pseudorand_seq_1000000.mat


MD5 checksums ( http://mastrave.org/doc/refdata/fs94593/md5 ):

d25fd2747eea3c1ab0aa81ef64aa7769        pseudorand_seq_10.txt
a26711dfc2fafa7fac3dc0d1cc6472cd        pseudorand_seq_100.txt
eb54b41e8dd7c946a9799342c27f4c90        pseudorand_seq_1000.txt
bdf6b4afd237ccf0fe4a97bf5c847f1d        pseudorand_seq_10000.txt
20b8398775a93e59533e9f14ba402caa        pseudorand_seq_100000.txt
00b1d615be143d5dd093ba6bd066a833        pseudorand_seq_1000000.txt


SHA1 checksums ( http://mastrave.org/doc/refdata/fs94593/sha1 ):

837e0f2793d1b0767f6eb03868a7e081ea530073    pseudorand_seq_10.txt
2bb59e7cc5369eb896df86061e720baaa4da1d96    pseudorand_seq_100.txt
79ae593da0b052bf11713155cd6e4f7d3906baff    pseudorand_seq_1000.txt
e23bd2e157f5447096f36e704a84aa77224ab9ae    pseudorand_seq_10000.txt
3b4a088752cebee4158a8cebfdd88193c4a03872    pseudorand_seq_100000.txt
aa9fb09d59c9b331fbba2360c345e47ede032b0e     pseudorand_seq_1000000.txt

The Mastrave modelling library offers the module rprand (acronym for  reproducible pseudo-random generator) which allows all the PRN reference sequences to be exactly reproduced in a variety of versions of GNU Octave and MATLAB computing environments, by silently downloading and importing the corresponding published reference files.

The PRN sequences can also be easily imported and used in a variety of languages, operating systems and computer architectures (see Appendix B for a further discussion). This is straightforward to do by directly importing the plain text files listing the PRN reference sequences in csv format.

Therefore, users and applications of the published PRN reference sequences may not be directly interested in their exact numerical reproducibility. The exact numerical reproducibility by re-generation of the PRN sequences (deep reproducible research) is the subject of the Appendix A.

Appendix A: reproducing the PRN reference sequences within GNU Octave

The proposed sequences can be numerically reproduced within the GNU Octave computing environment. The GNU Octave version with which the sequences have been generated is the version 3.6.1.
Alternative strategies for generating reference PRN sequences could have implied dedicated source code to be released as an autonomous free software package.

This possible strategy was discarded due to the need for the PRN sequences to be as reliable as possible so to offer long-term general reusability. Extensive testing of both PRN generators' algorithms and implementations is of obvious importance. While systematic, exhaustive testing of all aspects of nontrivial code is almost impossible, linking this phase to already established free software numerical packages with broad diffusion and endorsement among computational scientists is perhaps the safest way for mitigating the risk of generating unexpectedly biased PRN sequences.

GNU Octave is a well-established and widely used environment for computational science applications. It is also free software and part of the GNU project, which is one of the most relevant free software projects. GNU Octave belongs to the GNU list of high priority projects, where its development is supported as high-level language for numerical computations. This further corroborates the selection of GNU Octave as a suitable free software environment for ensuring the long-term availability to run the PRN sequences' source code.

The following codelet generates the double precision floating-point PRN reference sequences (GNU Octave language):

   assert( exist(  'OCTAVE_VERSION'         ) )
   assert( isequal( OCTAVE_VERSION, '3.6.1' ) )
   % Initializating the Mersenne Twister PRN generator
   rand( 'seed', pi )
   rand( 1000, 1000 );
   % Generating the reference sequences
   pseudorand_seq_10      = rand( 10 , 1 );
   pseudorand_seq_100     = rand( 1e2, 1 );
   pseudorand_seq_1000    = rand( 1e3, 1 );
   pseudorand_seq_10000   = rand( 1e4, 1 );
   pseudorand_seq_100000  = rand( 1e5, 1 );
   pseudorand_seq_1000000 = rand( 1e6, 1 );

Compliancy checks (GNU Octave language: the Mastrave modelling library is also required):
   assert( rprand( 10 , 10 , 0 ) == pseudorand_seq_10      )
   assert( rprand( 1e2, 1e2, 0 ) == pseudorand_seq_100     )
   assert( rprand( 1e3, 1e3, 0 ) == pseudorand_seq_1000    )
   assert( rprand( 1e4, 1e4, 0 ) == pseudorand_seq_10000   )
   assert( rprand( 1e5, 1e5, 0 ) == pseudorand_seq_100000  )
   assert( rprand( 1e6, 1e6, 0 ) == pseudorand_seq_1000000 )

Appendix B: low-level memory representation of the PRN seed 

The seed of the PRN generator is the IEEE754 double precision floating-point value approximating the mathematical constant π.

The following codelets are expressed in GNU Octave language (the assertions check necessary but not sufficient conditions):

   assert( pi == 3.1415926535897932 )

   pi_bytecode =          double( typecast( pi, 'uint8' ) )  ;
   pi_hexcode  = dec2hex( double( typecast( pi, 'uint8' ) ) );

where pi_bytecode value is (unsigned integers):


and pi_hexcode value is (hexadecimal values):


which can be reverted to the original double format by means of the codelet:

   typecast( uint8( hex2dec( pi_hexcode ) ), 'double' )

so to satisfy the constraint:

   assert( typecast( uint8( hex2dec( pi_hexcode ) ), 'double' ) == pi )

The described byte- and hexadecimal-code transformations can also be implemented by means of standard IO low-level functions. In GNU Octave/MATLAB languages the corresponding IO low-level codelet would be:

   tmp         = tempname;
   fid         = fopen( tmp , 'wb'    );
   fwrite( fid, pi, 'double', 0       , 'ieee-le' );
   fclose( fid );
   fid         = fopen( tmp , 'rb'    );
   pi_bytecode = fread( fid , 'uint8' );
   fclose( fid );
   delete( tmp );

This codelet explicitly set as endianness the little-endian convention. Provided attention is paid to endianness, the codelet is straightforward to be translated in other languages (C, C++, ...) for which at least the scalar version of the standard functions fopen, fwrite, fread, fclose is available. This way, the exact binary equivalence of the floating point seed (pi) within a given computational environment can be checked irrespective of the language, operating system and computer architecture of choice.

A successful compliancy check of the seed's low-level representation should be recommended as a necessary (but not sufficient) condition because the pseudo-random reference sequences are of double-precision floating point values and discrepancies (especially loss of precision) in the way values are loaded in the memory may lead to irreproducible results (e.g. when dealing with PRN-influenced chaotic systems).