Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave. Mastrave project technical report
de Rigo, D. (2012). Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave. Mastrave project technical report. FigShare Digital Science. doi: 10.6084/m9.figshare.94593
Portable pseudo-random reference sequences with Mersenne Twister using GNU Octave
Mastrave project technical report
Daniele de Rigo
Abstract: Computationally intensive numerical tasks such as those involving statistical resampling, evolutionary techniques or Monte Carlo based applications are known to require robust algorithms for generating large sequences of pseudo-random numbers (PRN). While several languages, libraries and computing environments offer suitable PRN generators, the underlying algorithms and parametrization widely differ. Therefore, easily replicating a certain PRN sequence generally implies forcing researchers to use a very specific language or computing environment, also paying attention to its version, possible critical dependencies or even operating system and computer architecture.
Despite the awareness of the benefits of reproducible research is rapidly growing, the definition itself of “reproducibility” for PRN based applications may lead to diverging interpretations and expectations. Where the cardinality of PRN sequences needed for data to be processed is relatively moderate, the paradigm of reproducible research is in principle suitable to be applied not only to algorithms, free software, data and metadata (classic reproducible research), but also to the involved pseudo-random sequences themselves (deep reproducible research). This would allow not only the “typical” scientific results to be reproducible “except for PRN-related statistical fluctuations”, but also the exact results published by a research team to be independently reproduced by other scientists - without of course preventing sensitivity analysis with different PRN sequences, as even classic reproducible research should easily allow.
However, finding reference sequences of pseudo random numbers suitable to enable such a deep reproducibility may be surprisingly difficult. Here, sequences eligible to be used as reference dataset of uniformly distributed pseudo-random numbers are presented. The dataset of sequences has been generated using Mersenne Twister with a period of 2^19937-1, as implemented in GNU Octave (version 3.6.1) with the Mastrave modelling library. The sequences are available in plain text format and also in the format MATLAB version 7, which is portable in both GNU Octave and MATLAB computing environments. The plain text format uses a fixed number of characters per each PRN so allowing random access to sparse PRNs to be easily done in constant time without needing a whole file to be loaded. This straightforward solution is language neutral, with the advantage of enabling wide and immediate portability for the presented reference PRN dataset, irrespective of the language, libraries, computing environment of choice for the users.
contains a sequence of N pseudo-random numbers, uniformly distributed (generated using a Mersenne Twister with a period of 2^19937-1, as implemented in GNU Octave version 3.6.1).
The extension may be “txt” for the pure text sequence of PRN (35 characters – including the endline one – and one PRN per each line) or “mat” for the corresponding format MATLAB version 7 (containing a structure with two fields: the filed “values” contains N numerical PRN in double precision; the field “string” contains a matrix of characters witn N rows and 24 columns – the endline character being omitted – the last one fulfilling the constraint to only contain digits whose value is “0” ).
Download (permanent URLs aside from the ones provided at http://dx.doi.org/10.6084/m9.figshare.94593):
MD5 checksums ( http://mastrave.org/doc/refdata/fs94593/md5 ):
SHA1 checksums ( http://mastrave.org/doc/refdata/fs94593/sha1 ):
The Mastrave modelling library offers the module rprand (acronym for reproducible pseudo-random generator) which allows all the PRN reference sequences to be exactly reproduced in a variety of versions of GNU Octave and MATLAB computing environments, by silently downloading and importing the corresponding published reference files.
The PRN sequences can also be easily imported and used in a variety of languages, operating systems and computer architectures (see Appendix B for a further discussion). This is straightforward to do by directly importing the plain text files listing the PRN reference sequences in csv format.
Therefore, users and applications of the published PRN reference sequences may not be directly interested in their exact numerical reproducibility. The exact numerical reproducibility by re-generation of the PRN sequences (deep reproducible research) is the subject of the Appendix A.
Appendix A: reproducing the PRN reference sequences within GNU Octave
The proposed sequences can be numerically reproduced within the GNU Octave computing environment. The GNU Octave version with which the sequences have been generated is the version 3.6.1.
Alternative strategies for generating reference PRN sequences could have implied dedicated source code to be released as an autonomous free software package.
This possible strategy was discarded due to the need for the PRN sequences to be as reliable as possible so to offer long-term general reusability. Extensive testing of both PRN generators' algorithms and implementations is of obvious importance. While systematic, exhaustive testing of all aspects of nontrivial code is almost impossible, linking this phase to already established free software numerical packages with broad diffusion and endorsement among computational scientists is perhaps the safest way for mitigating the risk of generating unexpectedly biased PRN sequences.
GNU Octave is a well-established and widely used environment for computational science applications. It is also free software and part of the GNU project, which is one of the most relevant free software projects. GNU Octave belongs to the GNU list of high priority projects, where its development is supported as high-level language for numerical computations. This further corroborates the selection of GNU Octave as a suitable free software environment for ensuring the long-term availability to run the PRN sequences' source code.
The following codelet generates the double precision floating-point PRN reference sequences (GNU Octave language):
assert( exist( 'OCTAVE_VERSION' ) )
assert( isequal( OCTAVE_VERSION, '3.6.1' ) )
% Initializating the Mersenne Twister PRN generator
rand( 'seed', pi )
rand( 1000, 1000 );
% Generating the reference sequences
pseudorand_seq_10 = rand( 10 , 1 );
pseudorand_seq_100 = rand( 1e2, 1 );
pseudorand_seq_1000 = rand( 1e3, 1 );
pseudorand_seq_10000 = rand( 1e4, 1 );
pseudorand_seq_100000 = rand( 1e5, 1 );
pseudorand_seq_1000000 = rand( 1e6, 1 );
Compliancy checks (GNU Octave language: the Mastrave modelling library is also required):
assert( rprand( 10 , 10 , 0 ) == pseudorand_seq_10 )
assert( rprand( 1e2, 1e2, 0 ) == pseudorand_seq_100 )
assert( rprand( 1e3, 1e3, 0 ) == pseudorand_seq_1000 )
assert( rprand( 1e4, 1e4, 0 ) == pseudorand_seq_10000 )
assert( rprand( 1e5, 1e5, 0 ) == pseudorand_seq_100000 )
assert( rprand( 1e6, 1e6, 0 ) == pseudorand_seq_1000000 )
Appendix B: low-level memory representation of the PRN seed
The seed of the PRN generator is the IEEE754 double precision floating-point value approximating the mathematical constant π.
The following codelets are expressed in GNU Octave language (the assertions check necessary but not sufficient conditions):
assert( pi == 3.1415926535897932 )
pi_bytecode = double( typecast( pi, 'uint8' ) ) ;
pi_hexcode = dec2hex( double( typecast( pi, 'uint8' ) ) );
where pi_bytecode value is (unsigned integers):
and pi_hexcode value is (hexadecimal values):
which can be reverted to the original double format by means of the codelet:
typecast( uint8( hex2dec( pi_hexcode ) ), 'double' )
so to satisfy the constraint:
assert( typecast( uint8( hex2dec( pi_hexcode ) ), 'double' ) == pi )
The described byte- and hexadecimal-code transformations can also be implemented by means of standard IO low-level functions. In GNU Octave/MATLAB languages the corresponding IO low-level codelet would be:
tmp = tempname;
fid = fopen( tmp , 'wb' );
fwrite( fid, pi, 'double', 0 , 'ieee-le' );
fclose( fid );
fid = fopen( tmp , 'rb' );
pi_bytecode = fread( fid , 'uint8' );
fclose( fid );
delete( tmp );
This codelet explicitly set as endianness the little-endian convention. Provided attention is paid to endianness, the codelet is straightforward to be translated in other languages (C, C++, ...) for which at least the scalar version of the standard functions fopen, fwrite, fread, fclose is available. This way, the exact binary equivalence of the floating point seed (pi) within a given computational environment can be checked irrespective of the language, operating system and computer architecture of choice.
A successful compliancy check of the seed's low-level representation should be recommended as a necessary (but not sufficient) condition because the pseudo-random reference sequences are of double-precision floating point values and discrepancies (especially loss of precision) in the way values are loaded in the memory may lead to irreproducible results (e.g. when dealing with PRN-influenced chaotic systems).