In‐place query driven big data platform: Applications to post processing of environmental monitoring

Summary
This paper describes the use of an experimental big data platform for environmental monitoring applications associated with the visualization of global climate forecast data and with air quality model simulation and response. Environmental monitoring in general requires both model simulation for forecasting and data processing for visualization and analyses. The in-place query driven big data platform, based on the concepts of Query Driven Visualization and the shared-nothing distributed database, is developed to meet this need. The system architecture of this experimental big data platform comprises one master data node and 17 slave data nodes, and the system links to the National Center for High-performance Computing supercomputer, the Advanced Large-scale Parallel Supercluster, and its storage pool. For the software implementation, the openSUSE operating system and the MariaDB database are installed on all nodes. The master data node is responsible for metadata management and information integration, and the 17 slave data nodes for the distributed database and for parallel model simulation, data visualization, and analyses. The application to global climate data visualization (Outgoing Longwave Radiation or OLR, temperature, rainfall, etc.) first partitions the Network Common Data Form (netCDF) file data into shared-nothing distributed databases for partial visualization on the slave data nodes; the partial results are then integrated into the whole visualization on the master node through Message Passing Interface communication.
For the application to air quality management, we first access Taiwan Environmental Protection Administration (EPA) observed data on the master node. The EPA observed data are replicated to the distributed databases on the slave nodes, and the air pollution model, a Gaussian plume trajectory model, is replicated on all slave nodes for model simulation, which produces output data and associated image files in the local file system. The master node collects all image files through the remote shared file system to display the results. The data I/O access approach differs between the 2 applications; owing to its individual problem features, each application is unique. Benchmark cases reveal strong performance in accelerating computing speed and reducing I/O operational time. The platform is found to accelerate climate data visualization processes, help research scientists gain deep insights into the data, and explore potential phenomena and features, such as the formation of typhoon eddies. In air quality management applications, the platform is used to run the Gaussian plume trajectory air pollution model.
Backward trajectory simulation of PM2.5 concentrations is used to identify the contributions of more than 30 point sources at 73 EPA monitoring stations (receptors) in Taiwan. A user-friendly, web-service-based big data presentation uses the heterogeneous observed and forecast pollutant data in space and time. The results support air quality decision-making and emergency response. The limitation on data size for applications of the platform, the current users and future development of the platform, and the linkage to PRAGMA collaboration are also described in the paper.

| INTRODUCTION
The advances in high-performance computing allow scientists using numerical modeling to enhance their understanding of natural phenomena and to pinpoint engineering design problems. Owing to the rapid progress of computing power and storage capacity, scientists are able to predict physical phenomena and gain deep insights into the unknowns. However, as computing scale and physical complexity grow, the data sizes produced by numerical simulation have increased explosively. As a consequence, high-performance computing accompanied by large datasets produced by model simulation has created new challenges in data processing and analysis. Under these circumstances, building a system that meets requirements in both high-performance computing (HPC) and data management and analyses becomes crucial. An HPC-based big data platform is one solution to this type of problem. 1 In general, numerical simulation on HPC uses domain decomposition methods to partition the computational domain, with the data in each subdomain being sent to separate computing nodes for computation.
The results of the computation are then sent back and stored in the data storage. This is the traditional approach, in which data are bundled with the computing node; one example is the data server and processing procedure used in the parallel visualization software ParaView 2 and VisIt. 3 Big data management and analysis differ in that the data are distributed to local data nodes for the necessary processing, because it is more effective to process data on the data nodes than to migrate large datasets to the computing nodes. Parallel and distributed visualization is an important tool for exploring the unknowns in the large, diverse data produced by climate and environmental forecasts. Accordingly, an HPC-based big data platform will be helpful for data processing in climate and environmental applications.
Regarding climate and environmental applications, the output data produced by model simulations have a grid-field structure, 4 meaning that the spatial coordinates of the data coincide with those of the computational grids. The grid-field structure is thus a structured data type. Considering the programming adaptability of a numerical model, Hadoop streaming 5 can be used to adapt an existing numerical program, such as one written in Fortran, the language heavily used in engineering, to Java-based Hadoop. However, it is time-consuming to reformulate a large existing numerical code as Map-Reduce programs. Consequently, Hadoop may not be ready for immediate application to climate data visualization and air quality model simulation problems. Basically, these are structured data problems, so a structured database (ie, a relational Structured Query Language database) is sufficient for the data processing in model simulation and visualization. Parallel processing of environmental big data for visualization is a challenging task: data sizes can easily exceed hard memory limits. Useful methods such as in situ visualization 6 or Query Driven Visualization (QDV) 7-9 can be applied in such cases. In situ visualization implements a visualization function inside the simulation model; the simulation produces a visualization, which can be saved with the output data of the model. The disadvantage of this approach is that it requires rerunning the model simulation if there is even a slight change in the visualization plan. The QDV method, in contrast, can be used to analyze existing data sets, reading only the relevant data through database queries.
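To illustrate the QDV idea in code, the sketch below (in Python, with hypothetical table and column names such as climate_blocks and time_step, and placeholder credentials) reads only the records that satisfy an analyst's constraint, so the visualization never loads the full data set:

```python
# Minimal query-driven selection sketch; schema and credentials are
# assumptions for illustration, not the platform's actual setup.
import mysql.connector  # this connector works with MariaDB as well

def qdv_select(var_name, t0, t1):
    conn = mysql.connector.connect(user="bigdata", password="***",
                                   database="climate")
    cur = conn.cursor()
    # The WHERE clause expresses the visualization demand; the filtering
    # happens inside the database, so irrelevant data are never read.
    cur.execute("SELECT time_step, data FROM climate_blocks "
                "WHERE var_name = %s AND time_step BETWEEN %s AND %s",
                (var_name, t0, t1))
    return cur.fetchall()
```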
Global climate forecasting produces large amounts of temporal and spatial data. A long-duration forecast, such as a 45-day climate forecast, can capture climate phenomena beyond the 7- to 10-day forecasts that weather agencies usually issue, so that people have more time to respond to changes in the weather. Visualizing climate data and overlaying it onto an earth geographical information system makes the information more transparent and intuitive and provides greater context. Also, for air quality management and response in an area, it is crucial to assess the impact of air pollution caused by geographically varied pollutant sources on the locations of concern, ie, Environmental Protection Administration (EPA) monitoring stations. What makes climate data visualization and air quality model simulation at monitoring stations distinctive is that they can be conducted by parallel processing without data communication among computing nodes. Therefore, the proposed big data platform adopts the concepts of QDV and the shared-nothing distributed database, 10,11 deploying local distributed databases and distributing the overall data to local data nodes for parallel post-processing, model simulations, and analyses.
The system architecture of the experimental big data platform is organized in accordance with the above operation. Performance benchmarks associated with computation and data processing, together with the applications to climate data visualization and to air quality simulation and response, illustrate the capabilities of the platform. The platform's users, its future development, and the associated PRAGMA collaborations are highlighted as well.

| Software and methodology
The openSUSE operating system 13 and the MariaDB database 14 are installed on the master node and the 17 slave nodes. The core concept of this big data analysis platform is in-place computing, for which data placement is the key: application programs are executed at the location where the data are stored. Data are distributed and stored on the hard drives of the slave nodes through the distributed file system, and the application program is then dispatched to each data node to read local data and perform the analyses locally. Each node has its own database installed; when the data analysis program runs on a data node, it simply refers to the local database where the data reside. Before these processes are performed, however, the whole data set must be partitioned into blocks across the databases of the data nodes. Currently, the data are partitioned into blocks according to their timestamp attributes.
We use round robin as the policy for distributing blocks to slave nodes.
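A minimal sketch of this timestamp-based partitioning with round-robin placement is given below; the table name (climate_blocks), node addresses, credentials, and the metadata handling are assumptions for illustration, not the platform's actual code:

```python
# Sketch: partition a netCDF file by time step and distribute the blocks
# round robin to the 17 slave-node databases. Names are hypothetical.
import netCDF4
import mysql.connector

SLAVE_NODES = [f"slave{i:02d}.example.org" for i in range(1, 18)]  # 17 slaves

def distribute_netcdf(path, var_name="olr"):
    ds = netCDF4.Dataset(path)
    var = ds.variables[var_name]     # e.g. 248 time steps of a 1536 x 768 grid
    placement = {}                   # time step -> node (the metadata)
    for t in range(var.shape[0]):
        host = SLAVE_NODES[t % len(SLAVE_NODES)]   # round-robin policy
        conn = mysql.connector.connect(host=host, user="bigdata",
                                       password="***", database="climate")
        cur = conn.cursor()
        cur.execute("INSERT INTO climate_blocks (var_name, time_step, data) "
                    "VALUES (%s, %s, %s)",
                    (var_name, t, var[t, :, :].tobytes()))  # one block per step
        conn.commit()
        conn.close()
        placement[t] = host
    ds.close()
    return placement   # stored on the master node as partition metadata
```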

| Application planning
Two applications of parallel post-processing, visualization of global climate simulation data and air quality simulation and response, are selected as case studies for performance benchmark tests.

The first case study of the platform application uses parallel post-processing for the visualization of global climate forecast data. 15,16 Here, we have selected only a limited number of cases for demonstration. The output data of the climate model are in netCDF 17 format, which is widely used in atmospheric science. The grid size of the output data is 1536 × 768 in space, with 248 steps in time. The attributes of the simulation output include OLR, precipitation, temperature, wind field, and many others. The whole netCDF dataset can be partitioned onto the local nodes by either spatial or temporal partitioning; in this case study, temporal partitioning has been adopted. For example, the climate model simulation produces 248 time series of data sets in netCDF format, which can be treated as 248 blocks. These blocks are in turn allocated to the local databases on each data node, as shown in Figure 3, while the database on the master server node stores the metadata, which contains the information on the data partition status (such as the slave node on which each data block is stored, attribute names, dimensions, etc.), and integrates the results of the analyses accomplished on each slave data node.

Figure 4 shows the detailed processing workflow of the post-processing for the visualization task. A client-server framework is implemented, allowing the user to communicate with the master server node. The user client proposes an analysis demand to the server. Based on the demand, the master server node retrieves the metadata from its database and distributes the information to the slave nodes. Each slave data node is thereby able to determine which data are required, to retrieve the data by querying its local database, and then to perform the necessary analysis/computation. Because the data are partitioned by time step, each dataset is independent of the others, and the only communication needed is between the master and slave nodes.

The visualization of global climate forecast data starts with the OLR data. For practical visualization of the OLR data produced by the HiRAM model simulation, 15 the data are stored on the slave nodes. Each temporal OLR data set is analyzed on a slave node to compute contour lines, which are then collected and integrated on the master node. The master server node can therefore display each temporal image.

Table 1 shows the performance benchmark of 2 different cases. It takes ~670 seconds to compute the OLR simulation data set sequentially, but only ~58 seconds on the present distributed parallel system. For a precipitation simulation data set, sequential computation takes 13.2 seconds and the distributed approach only 2.3 seconds. The time for the precipitation data set is lower because, owing to its distributed nature, each precipitation data set is much smaller than the OLR data set.
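To make the Figure 4 workflow concrete, here is a minimal master-slave sketch using mpi4py (matching the Message Passing Interface communication mentioned in the Summary); the database schema, credentials, data layout, and contour settings are illustrative assumptions, not the platform's actual implementation:

```python
# Hedged master-slave sketch of the metadata-driven visualization workflow;
# schema, credentials, and contour settings are assumptions.
import numpy as np
import mysql.connector
from mpi4py import MPI
import matplotlib
matplotlib.use("Agg")              # headless rendering on the slave nodes
import matplotlib.pyplot as plt

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def local_contours(time_steps):
    """Query the local database for the assigned blocks, return contours."""
    conn = mysql.connector.connect(user="bigdata", password="***",
                                   database="climate")
    cur = conn.cursor()
    results = []
    for t in time_steps:
        cur.execute("SELECT data FROM climate_blocks "
                    "WHERE var_name = %s AND time_step = %s", ("olr", t))
        field = np.frombuffer(cur.fetchone()[0],
                              dtype=np.float32).reshape(768, 1536)
        cs = plt.contour(field, levels=10)   # contour lines for this step
        results.append((t, cs.allsegs))      # vertex lists, level by level
        plt.clf()
    conn.close()
    return results

if rank == 0:
    # Master: the metadata tells which time steps live on which slave
    # (round robin over the 248 blocks).
    n_slaves = comm.Get_size() - 1
    for r in range(1, comm.Get_size()):
        comm.send(list(range(r - 1, 248, n_slaves)), dest=r)
    # Collect and integrate the partial results into the whole image.
    contours = [c for r in range(1, comm.Get_size())
                for c in comm.recv(source=r)]
else:
    comm.send(local_contours(comm.recv(source=0)), dest=0)
```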
To evaluate the I/O operation in the big data platform, we also performed a case study in which the distributed system queries a single shared database (instead of distributed databases) to simulate a shared file system. OLR visualization takes about 256 seconds in this case, whereas the distributed system with distributed local databases takes only 58 seconds. The reduction in total processing time from 256 seconds (shared database) to 58 seconds (distributed databases) is largely due to reduced I/O operations. This is the primary advantage of "in-place computation" in the big data platform.
It is worth noting that, for the purposes of this experiment, the current network interconnection is just 1 Gigabit Ethernet between the master node and the slave nodes; the performance could be even better if the interconnection were upgraded. The selected visualization results are presented in 2 forms: one displays the results in a flat 2-D view (left in Figure 7); the other overlays them on a 3-D earth (right in Figure 7).

| Visualization for tracking typhoon eddies
Using global climate data visualization, research scientists are able to visualize weather phenomena based on large-scale forecast data. For example, the proposed big data platform can track the eddy formation of a potential typhoon based on global climate OLR data.

| Air quality simulation model
The air quality simulation model used in the big data platform is developed based on the Gaussian plume trajectory model (GTx), which has been improved by Tsuang et al. 18,19 The advantage of GTx is that the source-receptor relationship can be established directly, so that the contribution of each point source at each receptor can be quantified.

| Workflow of air quality model simulation
The workflow of the air quality model simulation is shown in Figure 9. The first step is a program that reads and parses the EPA open data through the web service and then writes the EPA data to the master database. Once the master database node is updated, all slave database nodes are updated automatically by the database's replication mechanism. The database provides incremental replication to all data nodes, meaning that only newly added records are replicated to the slave nodes.
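The ingestion step can be sketched as follows; the open-data URL, the record field names, and the aqi_hourly schema are hypothetical assumptions, and MariaDB's built-in master-slave replication then propagates the newly inserted rows to the 17 slave nodes:

```python
# Sketch of EPA open-data ingestion into the master database; endpoint,
# fields, and schema are illustrative assumptions.
import requests
import mysql.connector

EPA_URL = "https://data.epa.gov.tw/api/aqx_p_432"   # hypothetical endpoint

def ingest_epa_data():
    records = requests.get(EPA_URL, timeout=30).json()["records"]
    conn = mysql.connector.connect(host="master.example.org", user="bigdata",
                                   password="***", database="airquality")
    cur = conn.cursor()
    for rec in records:
        # INSERT IGNORE (with a unique key on station + obs_time) keeps the
        # load incremental; replication forwards only the new rows to slaves.
        cur.execute("INSERT IGNORE INTO aqi_hourly "
                    "(station, obs_time, pm25, wind_dir, wind_speed) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    (rec["sitename"], rec["publishtime"], rec["pm2.5"],
                     rec["wind_direc"], rec["wind_speed"]))
    conn.commit()
    conn.close()
```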
The GTx simulation model, as introduced above, is installed on each slave node, and the EPA monitoring data are also replicated to each slave node. The GTx simulation model can therefore read local data and execute the simulation without reading input data from other nodes. The advantage of this in-place computation is that it saves the time of moving huge data sets. All model simulations and analyses are conducted on the slave nodes, while the web service for data presentation is executed on the master node, which collects the resultant data from the slave nodes.

Figure 10 shows the detailed processing of the GTx simulation model and the performance benchmark. The simulation model uses observed data from the past 2 days to predict the air quality for the following day. 19 The air quality data observed at the EPA monitoring stations are stored in MariaDB. The "traj" subroutine reads the EPA monitoring station data as input to set up the initial condition; its output is stored on a local hard disk. The GTx simulation is then launched, reading traj's output as its input for computation. GTx's output is also stored on the local hard disk for the "grads" subroutine to draw an output graph.
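A hedged sketch of this per-node pipeline is shown below; the executable names, file-naming conventions, and working directory are assumptions chosen for illustration (the paper names the traj and grads steps but not their interfaces):

```python
# Sketch of the traj -> GTx -> grads pipeline on one slave node, driven
# entirely by local data; file conventions are hypothetical.
import subprocess
import mysql.connector

def run_station(station_id, workdir="/local/gtx"):
    # 1. Export the past 2 days of observed data from the local DB replica.
    conn = mysql.connector.connect(user="bigdata", password="***",
                                   database="airquality")
    cur = conn.cursor()
    cur.execute("SELECT obs_time, pm25, wind_dir, wind_speed "
                "FROM aqi_hourly WHERE station = %s "
                "AND obs_time >= NOW() - INTERVAL 2 DAY", (station_id,))
    with open(f"{workdir}/obs_{station_id}.txt", "w") as f:
        for row in cur.fetchall():
            f.write(",".join(map(str, row)) + "\n")
    conn.close()

    # 2. traj sets up the initial condition from the observations.
    subprocess.run(["./traj", f"obs_{station_id}.txt"], cwd=workdir, check=True)
    # 3. GTx reads traj's output and writes gtx_<station>.out locally.
    subprocess.run(["./gtx", f"traj_{station_id}.out"], cwd=workdir, check=True)
    # 4. grads draws the output graph from GTx's local output.
    subprocess.run(["grads", "-blc", f"run plot_{station_id}.gs"],
                   cwd=workdir, check=True)
```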

| Performance benchmark
Here, only 73 EPA monitoring stations are taken into consideration for assessment; data from the other 3 stations are not sufficient.
The 2007 wind field data observed by the EPA are complete for the whole year; therefore, the EPA 2007 observed wind field data are adopted for assessment in this study. In the benchmark, the first step is to repeat the sequential process on the master node 73 times for the 73 EPA monitoring stations; this takes 540 minutes. Second, the 17 available slave nodes (each node has 4 cores and 8 threads) solve the same problem concurrently with a shared file system mounted at the master node; this takes 50 minutes. Finally, we use the 17 slave nodes to solve the same problem concurrently with local databases and file systems; this takes only 10 minutes. The total processing time is therefore reduced from 50 minutes (shared file system: remote hard disk drive) to 10 minutes (in-place I/O: local hard disk drive) due to reduced I/O operations. That is the advantage of using in-place computation in the currently developed big data platform.

FIGURE 9 Workflow of the air quality model simulation. EPA indicates Environmental Protection Administration
FIGURE 10 Performance benchmark of the sequential process, the distributed process with shared file system, and the distributed process with local file system (in-place I/O). Local HDD indicates local hard disk drive; Remote HDD, remote hard disk drive
FIGURE 11 EPA monitoring stations: geolocation, IP camera, and hourly air quality observed data. EPA indicates Environmental Protection Administration; IP, Internet Protocol
Each slave node's computation result is a text file stored on its local hard disk. A simple script parses the output text file, extracts each point source's contribution value, and writes it back to the master node database, which is used for the web-based data visualization introduced in the next section.

MariaDB is set up on the slave cluster nodes, which are responsible for data storage and distributed parallel computing, and on the master data node, which manages the metadata and further processes the information obtained from the cluster nodes. The hardware and software system of the platform is easy to install. The platform is currently designed for large-data and high-speed computation experiments; the system's total memory capacity is around 500 GB.
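As a concrete illustration of the collection script described above, the following sketch parses each node's output file and writes the contributions back to the master database; the file naming (gtx_*.out), the line format, and the contributions table are hypothetical:

```python
# Sketch: parse GTx output text files and write each point source's
# contribution back to the master database for the web front end.
import glob
import os
import re
import mysql.connector

LINE = re.compile(r"^(?P<source>\S+)\s+(?P<contrib>[-+0-9.eE]+)")

def collect(workdir="/local/gtx"):
    conn = mysql.connector.connect(host="master.example.org", user="bigdata",
                                   password="***", database="airquality")
    cur = conn.cursor()
    for path in glob.glob(f"{workdir}/gtx_*.out"):
        station = os.path.basename(path)[len("gtx_"):-len(".out")]
        with open(path) as f:
            for line in f:
                m = LINE.match(line)
                if m:   # one line per point source's contribution value
                    cur.execute("INSERT INTO contributions "
                                "(station, source, pm25_contrib) "
                                "VALUES (%s, %s, %s)",
                                (station, m["source"], float(m["contrib"])))
    conn.commit()
    conn.close()
```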

| A user-friendly web-based data presentation
System scale-up performance is nearly linear due to the shared-nothing feature among data nodes. Therefore, upgrading the capacity of the hardware is straightforward when a production run is required in the future.

| Lessons learned
The in-place query driven big data platform illustrated in this paper currently serves users in Taiwan; when the automated processing system of climate data visualization is deployed, it will allow more users in Taiwan, as well as the PRAGMA community, to be served.
The current big data platform has been used for air quality forecast and response analyses in Taichung City and Yun-Lin County, Taiwan. Another application of the platform is to identify countermeasures, through a series of simulations and analyses, for minimizing the air pollution induced by straw burning in areas of southern Taiwan.
Air quality management is an important ingredient in smart city development. In addition to the air quality simulation and response topic described here, future applications are planned, such as air pollution forecasting.