
PhDat

Version 2 2025-06-05, 09:56
Version 1 2025-06-05, 09:55
dataset
Posted on 2025-06-05, 09:56. Authored by Felix Rummel, Patrick B. Warren, David J. Bray, Zeyneb Sumer, Jonathan Booth, Ardita Shkurti, Richard L. Anderson

PhDat is a dataset describing the phase behaviour of liquids. It is currently (June 2025) limited to data from non-ionic surfactant / water binary mixtures.


The data is provided as a JSON file, and once loaded, data for any of the given surfactants can be retrieved using the record index or the SMILES string.


The data set is created from a grid of sample points extracted from a phase diagram image. The grid is defined by the composition and temperature ranges of the phase diagram and a specified grid resolution. PhDat currently uses a common resolution of 1 °C and 1 wt %. Each sample point is assigned a probability of being in a particular phase state i according to P(i) = exp(−d(i)/2), where d(i) is the minimum distance from the sample point to phase i.


If a sample point lies inside the phase being sampled, its distance to that phase is zero and hence P(i) = 1. A sample point that lies on a phase boundary is equally likely to be in each of the adjacent phases. Finally, for each sample point, all probabilities below the threshold of 10⁻³ were set to zero for simplicity and the resulting probabilities were normalised to sum to one. This results in a final output matrix, and therefore JSON file, in which each temperature and composition point is associated with a vector of probabilities of being in each phase state found in the phase diagram.
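The assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build PhDat: the function name `phase_probabilities`, its parameters, and the input format (a mapping from phase label to minimum distance) are hypothetical, and the distances are taken to be in whatever units the original procedure uses.

```python
import math


def phase_probabilities(distances, threshold=1e-3):
    """Turn {phase label: minimum distance d(i)} into normalised probabilities.

    Hypothetical helper: the name and input format are assumptions made
    for illustration only.
    """
    # P(i) = exp(-d(i)/2); a point inside phase i has d(i) = 0, so P(i) = 1.
    raw = {phase: math.exp(-d / 2.0) for phase, d in distances.items()}
    # Probabilities below the threshold are set to zero.
    kept = {phase: (p if p >= threshold else 0.0) for phase, p in raw.items()}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("every phase fell below the threshold")
    # Normalise so that the probabilities at this sample point sum to one.
    return {phase: p / total for phase, p in kept.items()}


# A point on the boundary between L1 and H1, far away from La.
boundary_point = phase_probabilities({"L1": 0.0, "H1": 0.0, "La": 20.0})
```

For the boundary point in this example the two adjacent phases share the probability equally, while the distant phase falls below the threshold and is dropped before normalisation.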


Phase States


A total of 60 unique phase states (both one- and two-phase regions) have been identified, each defining the phase found in a region of the phase diagram. In single-phase regions these are entries such as Isotropic (L1), Hexagonal (H1) and Lamellar (La). For two-phase regions, a phase state such as W+L1 describes a combined region where water and the L1 phase coexist in a phase-separated state.


Single-phase region label descriptions

  • W, E - Water or sub-micellar solution, and ice
  • L1, L2 - Isotropic micellar solutions (normal and reversed)
  • H1, H2 - Hexagonal phases (normal and reversed)
  • I1, I2 - Cubic micellar phases
  • V1, V2, V2i, V2p - Bicontinuous cubic phases
  • La, Lb - Lamellar phases, Lα and Lβ (liquid and gel)
  • X, X1, X2 - Solid surfactant phases (differing by hydration state)
  • L3 - Sponge phase
  • N1 - Nematic liquid of rod- or worm-like micelles
  • U - Unmeasured, unknown and/or unclear region or phase state

Two-phase region labels are composed of combinations of the single-phase labels. Two-phase coexistence regions are indicated by a ‘+’ between the corresponding phases, with the phase labels ordered alphabetically for convenience (e.g. W+X1 rather than X1+W). Not all combinations are encountered, since the two-phase regions must occur between single-phase regions, and in any given phase diagram the phase sequence is strictly ordered.
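When matching labels from other sources against this convention, it can help to put a two-phase label into a canonical order first. The following is a minimal sketch: `canonical_label` is a hypothetical helper, and it assumes that "alphabetically" can be approximated by Python's default string ordering, which may not match the ordering used in every PhDat label.

```python
def canonical_label(label):
    """Return a phase label with its '+'-separated parts sorted.

    Hypothetical helper; sorting uses Python's default string order,
    which is an assumption about the dataset's convention.
    """
    parts = [part.strip() for part in label.split("+")]
    return "+".join(sorted(parts))


# Single-phase labels pass through unchanged.
single = canonical_label("L1")
# The example from the text: X1+W is rewritten as W+X1.
two_phase = canonical_label("X1+W")
```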


Data set structure


The JSON file is structured as a list of records, indexed by a data record entry number. Each record contains data from one unique source, organized as a dictionary comprising:

  • the SMILES string,
  • the state of the diagram (either complete or incomplete if some areas are unknown),
  • the name of the chemical compound,
  • the source (e.g. the citation reference to the paper) and its figure location in the source (e.g. the figure number or page number),
  • the purity of the chemical (if given),
  • the measurement methodology (if given),
  • the keys for the data (header names),
  • the values (phase state probabilities) as a list for all data keys.


Here the composition is always given as wt % (weight percent) of the molecule, such that 0 wt % is pure solvent (water in the current PhDat release) and 100 wt % is the pure molecule of interest (surfactants in the current PhDat release). Hence reading each column entry of the list of the set of data keys provides complete information on each discretized point of the diagram, e.g. its composition, temperature and the probability value (as a fraction) for each phase state. Note that this format allows the same compound to have multiple records if there is more than one source for its phase diagram. One should therefore not assume that the SMILES strings are unique, although this is the case for the present dataset.
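As an illustration of how such records might be used, the sketch below looks up records by SMILES string and walks through the discretized points of one record. The field names (`"smiles"`, `"keys"`, `"values"`), the column-per-key layout of the values, and the toy numbers are all assumptions made for this example; they are not taken from PhDat, so check the actual field names in the JSON file before relying on them.

```python
import json


def load_records(path):
    """Load the list of records from a JSON file (path chosen by the user)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_by_smiles(records, smiles):
    """Return every record with this SMILES string; there may be several."""
    return [record for record in records if record["smiles"] == smiles]


def iter_points(record):
    """Yield one {key: value} dictionary per discretized diagram point.

    Assumes record["values"] holds one list per entry of record["keys"].
    """
    for column_entry in zip(*record["values"]):
        yield dict(zip(record["keys"], column_entry))


# Toy record with made-up numbers and hypothetical field names.
toy_records = [
    {
        "smiles": "CCCCCCCCCCCCOCCO",
        "keys": ["composition", "temperature", "L1", "L1+W"],
        "values": [[1.0, 2.0], [25.0, 25.0], [1.0, 0.5], [0.0, 0.5]],
    }
]

matches = find_by_smiles(toy_records, "CCCCCCCCCCCCOCCO")
points = list(iter_points(matches[0]))
```

Returning a list from `find_by_smiles` rather than a single record reflects the point made above that SMILES strings need not be unique.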



Further details


PhDat has been developed by the STFC Hartree Centre and is made available under the CC-BY 4.0 license. The phase diagrams were digitised using CurveClaw, a bespoke program for the semi-automated extraction of phase diagram data into digital (numerical) form. CurveClaw is available under the BSD 2-clause license from GitHub.


Further details of the process of data collection can be found in the associated publication contained within this project.


The authors of PhDat are happy to receive feedback on the data set, as well as additional data, which will be worked into the main data set once the supplied data has been evaluated. Any feedback or additional data can be sent to felix.rummel@stfc.ac.uk and richard.anderson@stfc.ac.uk.
