Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave. Mastrave project technical report

2014-01-21T13:57:28Z (GMT) by Daniele de Rigo
<p> </p> <p>de Rigo, D. (2012). Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave. Mastrave project technical report. FigShare Digital Science. doi: <a href="http://dx.doi.org/10.6084/m9.figshare.94593">10.6084/m9.figshare.94593</a></p> <p> </p> <p>Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave</p> <p> </p> <p>Mastrave project technical report</p> <p> </p> <p>Daniele de Rigo</p> <p><strong><br></strong></p> <p><strong>Abstract:</strong> Computationally intensive numerical tasks such as those involving statistical resampling, evolutionary techniques or Monte Carlo based applications are known to require robust algorithms for generating large sequences of pseudo-random numbers (PRN). While several languages, libraries and computing environments offer suitable PRN generators, the underlying algorithms and parametrization widely differ. Therefore, easily replicating a certain PRN sequence generally implies forcing researchers to use a very specific language or computing environment, also paying attention to its version, possible critical dependencies or even operating system and computer architecture.</p> <p>Despite the awareness of the benefits of reproducible research is rapidly growing, the definition itself of “reproducibility” for PRN based applications may lead to diverging interpretations and expectations. Where the cardinality of PRN sequences needed for data to be processed is relatively moderate, the paradigm of reproducible research is in principle suitable to be applied not only to algorithms, free software, data and metadata (<em>classic reproducible research</em>), but also to the involved pseudo-random sequences themselves (<em>deep reproducible researc</em>h). This would allow not only the “typical” scientific results to be reproducible “except for PRN-related statistical fluctuations”, but also the exact results published by a research team to be independently reproduced by other scientists - without of course preventing sensitivity analysis with different PRN sequences, as even classic reproducible research should easily allow.</p> <p>However, finding reference sequences of pseudo random numbers suitable to enable such a deep reproducibility may be surprisingly difficult. Here, sequences eligible to be used as reference dataset of uniformly distributed pseudo-random numbers are presented. The dataset of sequences has been generated using Mersenne Twister with a period of 2<sup>^</sup><sup>19937</sup>-1, as implemented in GNU Octave (version 3.6.1) with the Mastrave modelling library. The sequences are available in plain text format and also in the format MATLAB version 7, which is portable in both GNU Octave and MATLAB computing environments. The plain text format uses a fixed number of characters per each PRN so allowing random access to sparse PRNs to be easily done in constant time without needing a whole file to be loaded. This straightforward solution is language neutral, with the advantage of enabling wide and immediate portability for the presented reference PRN dataset, irrespective of the language, libraries, computing environment of choice for the users.</p> <p><strong><br></strong></p> <p><strong>Naming conventions:</strong></p> <p>Each file</p> <p><em>    pseudorand_seq_[N].[ext]</em></p> <p>contains a sequence of N pseudo-random numbers, uniformly distributed (generated using a Mersenne Twister with a period of 2<sup>^</sup><sup>19937</sup>-1, as implemented in GNU Octave version 3.6.1). </p> <p>The extension may be “txt” for the pure text sequence of PRN (35 characters – including the endline one – and one PRN per each line) or “mat” for the corresponding format MATLAB version 7 (containing a structure with two fields: the filed “values” contains N numerical PRN in double precision; the field “string” contains a matrix of characters witn N rows and 24 columns – the endline character being omitted – the last one fulfilling the constraint to only contain digits whose value is “0” ). </p> <p><br></p> <p><strong>Download (permanent URLs aside from the ones provided at </strong><a href="http://dx.doi.org/10.6084/m9.figshare.94593">http://dx.doi.org/10.6084/m9.figshare.94593</a><strong>):</strong></p> <p><br><a href="http://mastrave.org/doc/refdata/fs94593/10.txt">http://mastrave.org/doc/refdata/fs94593/10.txt</a>      pseudorand_seq_10.txt<br><a href="http://mastrave.org/doc/refdata/fs94593/100.txt">http://mastrave.org/doc/refdata/fs94593/100.txt</a>     pseudorand_seq_100.txt<br><a href="http://mastrave.org/doc/refdata/fs94593/1000.txt">http://mastrave.org/doc/refdata/fs94593/1000.txt</a>    pseudorand_seq_1000.txt<br><a href="http://mastrave.org/doc/refdata/fs94593/10000.txt">http://mastrave.org/doc/refdata/fs94593/10000.txt</a>   pseudorand_seq_10000.txt<br><a href="http://mastrave.org/doc/refdata/fs94593/100000.txt">http://mastrave.org/doc/refdata/fs94593/100000.txt</a>  pseudorand_seq_100000.txt<br><a href="http://mastrave.org/doc/refdata/fs94593/1000000.txt">http://mastrave.org/doc/refdata/fs94593/1000000.txt</a> pseudorand_seq_1000000.txt<br></p> <p><br><a href="http://mastrave.org/doc/refdata/fs94593/10.mat">http://mastrave.org/doc/refdata/fs94593/10.mat</a>      pseudorand_seq_10.mat<br><a href="http://mastrave.org/doc/refdata/fs94593/100.mat">http://mastrave.org/doc/refdata/fs94593/100.mat</a>     pseudorand_seq_100.mat<br><a href="http://mastrave.org/doc/refdata/fs94593/1000.mat">http://mastrave.org/doc/refdata/fs94593/1000.mat</a>    pseudorand_seq_1000.mat<br><a href="http://mastrave.org/doc/refdata/fs94593/10000.mat">http://mastrave.org/doc/refdata/fs94593/10000.mat</a>   pseudorand_seq_10000.mat<br><a href="http://mastrave.org/doc/refdata/fs94593/100000.mat">http://mastrave.org/doc/refdata/fs94593/100000.mat</a>  pseudorand_seq_100000.mat<br><a href="http://mastrave.org/doc/refdata/fs94593/1000000.mat">http://mastrave.org/doc/refdata/fs94593/1000000.mat</a> pseudorand_seq_1000000.mat<br></p> <p><br> </p> <p><strong>MD5 checksums</strong> ( <a href="http://mastrave.org/doc/refdata/fs94593/md5">http://mastrave.org/doc/refdata/fs94593/md5</a> ):</p> <p>d25fd2747eea3c1ab0aa81ef64aa7769        pseudorand_seq_10.txt<br>a26711dfc2fafa7fac3dc0d1cc6472cd        pseudorand_seq_100.txt<br>eb54b41e8dd7c946a9799342c27f4c90        pseudorand_seq_1000.txt<br>bdf6b4afd237ccf0fe4a97bf5c847f1d        pseudorand_seq_10000.txt<br>20b8398775a93e59533e9f14ba402caa        pseudorand_seq_100000.txt<br>00b1d615be143d5dd093ba6bd066a833        pseudorand_seq_1000000.txt </p> <p> </p> <p><strong>SHA1 checksums</strong> ( <a href="http://mastrave.org/doc/refdata/fs94593/sha1">http://mastrave.org/doc/refdata/fs94593/sha1</a> ):</p> <p>837e0f2793d1b0767f6eb03868a7e081ea530073    pseudorand_seq_10.txt<br>2bb59e7cc5369eb896df86061e720baaa4da1d96    pseudorand_seq_100.txt<br>79ae593da0b052bf11713155cd6e4f7d3906baff    pseudorand_seq_1000.txt<br>e23bd2e157f5447096f36e704a84aa77224ab9ae    pseudorand_seq_10000.txt<br>3b4a088752cebee4158a8cebfdd88193c4a03872    pseudorand_seq_100000.txt<br>aa9fb09d59c9b331fbba2360c345e47ede032b0e     pseudorand_seq_1000000.txt</p> <p><br><br>The Mastrave modelling library offers the module<strong> rprand</strong> (acronym for  <em><strong>r</strong>eproducible <strong>p</strong>seudo-<strong>rand</strong>om generator</em>) which allows all the PRN reference sequences to be exactly reproduced in a variety of versions of GNU Octave and MATLAB computing environments, by silently downloading and importing the corresponding published reference files. <br><br>The PRN sequences can also be easily imported and used in a variety of languages, operating systems and computer architectures (see Appendix B for a further discussion). This is straightforward to do by directly importing the plain text files listing the PRN reference sequences in <em>csv</em> format.<br><br>Therefore, users and applications of the published PRN reference sequences may not be directly interested in their exact numerical reproducibility. The exact numerical reproducibility by re-generation of the PRN sequences (<em>deep reproducible research</em>) is the subject of the Appendix A.<br></p> <p><br><br><strong>Appendix A: reproducing the PRN reference sequences within GNU Octave</strong> <br><br><br>The proposed sequences can be numerically reproduced within the GNU Octave computing environment. The GNU Octave version with which the sequences have been generated is the version 3.6.1. <br>Alternative strategies for generating reference PRN sequences could have implied dedicated source code to be released as an autonomous free software package.<br><br>This possible strategy was discarded due to the need for the PRN sequences to be as reliable as possible so to offer long-term general reusability. Extensive testing of both PRN generators' algorithms and implementations is of obvious importance. While systematic, exhaustive testing of all aspects of nontrivial code is almost impossible, linking this phase to already established free software numerical packages with broad diffusion and endorsement among computational scientists is perhaps the safest way for mitigating the risk of generating unexpectedly biased PRN sequences.<br><br>GNU Octave is a well-established and widely used environment for computational science applications. It is also free software and part of the GNU project, which is one of the most relevant free software projects. GNU Octave belongs to the GNU list of high priority projects, where its development is supported as high-level language for numerical computations. This further corroborates the selection of GNU Octave as a suitable free software environment for ensuring the long-term availability to run the PRN sequences' source code.<br><br>The following codelet generates the double precision floating-point PRN reference sequences (GNU Octave language):<br><br>   assert( exist(  'OCTAVE_VERSION'         ) )<br>   assert( isequal( OCTAVE_VERSION, '3.6.1' ) )<br>   <br>   <em>% Initializating the Mersenne Twister PRN generator</em><br>   rand( 'seed', pi )<br>   rand( 1000, 1000 );<br>   <br>   <em>% Generating the reference sequences</em><br>   pseudorand_seq_10      = rand( 10 , 1 );<br>   pseudorand_seq_100     = rand( 1e2, 1 );<br>   pseudorand_seq_1000    = rand( 1e3, 1 );<br>   pseudorand_seq_10000   = rand( 1e4, 1 );<br>   pseudorand_seq_100000  = rand( 1e5, 1 );<br>   pseudorand_seq_1000000 = rand( 1e6, 1 );<br><br><br>Compliancy checks (GNU Octave language: the Mastrave modelling library is also required):<br>   <br>   assert( rprand( 10 , 10 , 0 ) == pseudorand_seq_10      )<br>   assert( rprand( 1e2, 1e2, 0 ) == pseudorand_seq_100     )<br>   assert( rprand( 1e3, 1e3, 0 ) == pseudorand_seq_1000    )<br>   assert( rprand( 1e4, 1e4, 0 ) == pseudorand_seq_10000   )<br>   assert( rprand( 1e5, 1e5, 0 ) == pseudorand_seq_100000  )<br>   assert( rprand( 1e6, 1e6, 0 ) == pseudorand_seq_1000000 )<br><br><br><br><strong>Appendix B: low-level memory representation of the PRN seed</strong>  <br><br></p> <p>The seed of the PRN generator is the IEEE<strong></strong>754 double precision floating-point value approximating the mathematical constant <strong>π</strong>.</p> <p>The following codelets are expressed in GNU Octave language (the assertions check necessary but not sufficient conditions):</p> <p><br>   assert( pi == 3.1415926535897932 )<br><br>   <strong>pi_bytecode</strong> =          double( typecast( pi, 'uint8' ) )  ;<br>   <strong>pi_hexcode<em> </em></strong> = dec2hex( double( typecast( pi, 'uint8' ) ) );<br><br><br>where <strong>pi_bytecod</strong><strong>e</strong> value is (unsigned integers):<br><br>    24<br>    45<br>    68<br>    84<br>   251<br>    33<br>     9<br>    64<br><br><br>and <strong>pi_hexcode</strong> value is (hexadecimal values):<br><br>    18<br>    2D<br>    44<br>    54<br>    FB<br>    21<br>    09<br>    40<br><br><br>which can be reverted to the original double format by means of the codelet:<br><br>   typecast( uint8( hex2dec( pi_hexcode ) ), 'double' )<br><br>so to satisfy the constraint:<br><br>   assert( typecast( uint8( hex2dec( pi_hexcode ) ), 'double' ) == pi )<br>   <br><br>The described byte- and hexadecimal-code transformations can also be implemented by means of standard IO low-level functions. In GNU Octave/MATLAB languages the corresponding IO low-level codelet would be:<br><br>   tmp         = tempname;<br>   fid         = fopen( tmp , 'wb'    ); <br>   fwrite( fid, pi, 'double', 0       , 'ieee-le' ); <br>   fclose( fid );<br>   fid         = fopen( tmp , 'rb'    );<br>   pi_bytecode = fread( fid , 'uint8' );<br>   fclose( fid );<br>   delete( tmp );<br><br>This codelet explicitly set as endianness the little-endian convention. Provided attention is paid to endianness, the codelet is straightforward to be translated in other languages (C, C++, ...) for which at least the scalar version of the standard functions fopen, fwrite, fread, fclose is available. This way, the exact binary equivalence of the floating point seed (pi) within a given computational environment can be checked irrespective of the language, operating system and computer architecture of choice.<br><br>A successful compliancy check of the seed's low-level representation should be recommended as a necessary (but not sufficient) condition because the pseudo-random reference sequences are of double-precision floating point values and discrepancies (especially loss of precision) in the way values are loaded in the memory may lead to irreproducible results (e.g. when dealing with PRN-influenced chaotic systems).<br><br><br></p>