A novel Python module for statistical analysis of turbulence (P-SAT) in geophysical flows

We present Python Statistical Analysis of Turbulence (P-SAT), a lightweight Python framework that automates the parsing and filtering of velocity data and the computation of turbulent statistics and spectra for steady flows. P-SAT can operate on single files as well as on batch inputs. The framework filters raw velocity data using several methods, namely velocity correlation, signal-to-noise ratio (SNR), and acceleration thresholding, in order to de-spike the velocity signal of steady flows. For each of these methods it provides default threshold values and also gives the end user the option to supply a custom value. At the end of execution, the framework generates a .csv file containing the turbulent parameters mentioned above. P-SAT can handle velocity time series of both steady and unsteady flows; for unsteady flows, it obtains mean velocities from instantaneous velocities using a Fourier-component based averaging method. Since P-SAT is developed in Python, it can be deployed and executed across all widely used operating systems. The GitHub link for the P-SAT framework is: https://github.com/mayank265/flume.git.


Definitions of various statistical parameters of turbulence for steady flow.
Since the tool calculates several statistical parameters from the raw data, it is better if these parameters are defined first. All the notation relating to the statistical parameters is given in Table 1. The time-averaged streamwise (U), lateral (V), and vertical (W) velocities are calculated as

$U = \frac{1}{n}\sum_{i=1}^{n} U_i, \qquad V = \frac{1}{n}\sum_{i=1}^{n} V_i, \qquad W = \frac{1}{n}\sum_{i=1}^{n} W_i$

where $U_i$, $V_i$, and $W_i$ are the instantaneous velocities in the streamwise, lateral, and vertical directions, respectively, and n is the number of samples taken. The velocity variances of the three velocity components are given by

$\sigma_u^2 = \frac{1}{n}\sum_{i=1}^{n} (u_i')^2, \qquad \sigma_v^2 = \frac{1}{n}\sum_{i=1}^{n} (v_i')^2, \qquad \sigma_w^2 = \frac{1}{n}\sum_{i=1}^{n} (w_i')^2$

where $u' = U_i - U$, $v' = V_i - V$, and $w' = W_i - W$ are the fluctuating components of the velocities in the streamwise, lateral, and vertical directions, respectively. The square root of the velocity variance is the standard deviation. The shape of the probability density function is characterized by its skewness, which also reflects the relative contribution of positive and negative velocity fluctuations to the formation of the velocity pattern. The skewness of the velocities in the three directions is given by

$S_u = \frac{\frac{1}{n}\sum_{i=1}^{n}(u_i')^3}{\sigma_u^3}, \qquad S_v = \frac{\frac{1}{n}\sum_{i=1}^{n}(v_i')^3}{\sigma_v^3}, \qquad S_w = \frac{\frac{1}{n}\sum_{i=1}^{n}(w_i')^3}{\sigma_w^3}$

Kurtosis is a statistical measure of how heavily the tails of a distribution differ from those of a normal distribution; in other words, kurtosis identifies whether the tails of a given distribution contain extreme values. The kurtosis of the probability distribution of a real-valued random variable (the 3D velocity in the present case) is given by

$K_u = \frac{\frac{1}{n}\sum_{i=1}^{n}(u_i')^4}{\sigma_u^4}$

and analogously for $K_v$ and $K_w$. Reynolds stresses furnish very important information about the transfer of momentum in turbulent flows. They are the components of a symmetric second-order tensor whose diagonal components are called the Reynolds normal stresses (RNS) and whose off-diagonal components are called the Reynolds shear stresses (RSS) 17. The Reynolds shear stresses ($\tau_{uw}$, $\tau_{uv}$, and $\tau_{vw}$) are calculated as

$\tau_{uw} = -\rho_{water}\,\overline{u'w'}, \qquad \tau_{uv} = -\rho_{water}\,\overline{u'v'}, \qquad \tau_{vw} = -\rho_{water}\,\overline{v'w'}$

where $\rho_{water}$ is the density of water.
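The moment definitions above can be sketched in a few lines of Python. This is an illustrative sketch using NumPy and SciPy, not the actual P-SAT API; the function name and returned dictionary layout are ours:

```python
import numpy as np
from scipy import stats

def basic_turbulence_stats(u, v, w, rho_water=1000.0):
    """Illustrative computation of the moment definitions above for one
    measurement point (not the exact P-SAT interface)."""
    u, v, w = (np.asarray(x, dtype=float) for x in (u, v, w))
    U, V, W = u.mean(), v.mean(), w.mean()          # time-averaged velocities
    up, vp, wp = u - U, v - V, w - W                # fluctuating components
    return {
        "U": U, "V": V, "W": W,
        "var_u": np.mean(up**2),                    # velocity variance
        "var_w": np.mean(wp**2),
        "skew_u": stats.skew(u),                    # third standardized moment
        "kurt_u": stats.kurtosis(u, fisher=False),  # fourth standardized moment
        "tau_uw": -rho_water * np.mean(up * wp),    # Reynolds shear stress
    }
```

The same pattern extends directly to the remaining components and cross-correlations.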
The degree of flow anisotropy is measured by the ratio $\sigma_w/\sigma_u$, where $\sigma_w$ and $\sigma_u$ are the standard deviations of the vertical and streamwise velocities, respectively. The third statistical moments, or skewness, indicate non-symmetric distributions: zero skewness indicates a distribution symmetric about the mean (the Gaussian distribution), while negative and positive skewness indicate distributions skewed to the left and right of the mean, respectively. Following Raupach 18, the third moments of the velocity fluctuations can be expressed as

$M_{jk} = \overline{\hat{u}^{\,j}\hat{w}^{\,k}}, \qquad j + k = 3$

where $\hat{u} = u'/\overline{(u'^2)}^{0.5}$ and $\hat{w} = w'/\overline{(w'^2)}^{0.5}$ are the normalized velocity fluctuations. At any point in the flow field, the contribution to the total Reynolds shear stress through the different modes of momentum transfer can be calculated as

$\langle u'w' \rangle_{i,H} = \lim_{T\to\infty} \frac{1}{T}\int_0^T u'(t)\,w'(t)\,\delta_{i,H}[u'(t), w'(t)]\,dt$

where the angle brackets denote conditional averaging, T is the sampling time, and $\delta_{i,H}$ is the indicator function, defined as

$\delta_{i,H} = 1$ if $(u', w')$ lies in quadrant $i$ and $|u'w'| \ge H\,|\overline{u'w'}|$, and $\delta_{i,H} = 0$ otherwise,

where H is the parameter defining the hyperbolic hole region 20, which allows the investigation of the larger contributions to the total Reynolds shear stress from the various quadrants. The fractional contribution to the total Reynolds shear stress from quadrant $i$ is defined as

$S_{i,H} = \frac{\langle u'w' \rangle_{i,H}}{\overline{u'w'}}$

$S_{i,H}$ is negative for outward and inward interactions (i = 1, 3) and positive for ejections and sweeps (i = 2, 4). For H = 0, when the hole disappears, the fractional contributions from all four quadrants sum to unity ($\sum_i S_{i,0} = 1$). Octant analysis is advantageous when the user wishes to analyze turbulent structures with a strong three-dimensionality; it is carried out by considering all three components of the velocity fluctuations (Table 2).
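The quadrant decomposition above can be illustrated with a short NumPy sketch. The function name and quadrant bookkeeping here are ours (an assumption, not the P-SAT implementation), but the indicator logic follows the definitions just given:

```python
import numpy as np

def quadrant_contributions(u_prime, w_prime, hole_size=0.0):
    """Fractional Reynolds shear stress contributions S_{i,H} per quadrant.
    Quadrants: 1 outward interaction (u'>0, w'>0), 2 ejection (u'<0, w'>0),
    3 inward interaction (u'<0, w'<0), 4 sweep (u'>0, w'<0)."""
    up = np.asarray(u_prime, dtype=float)
    wp = np.asarray(w_prime, dtype=float)
    uw = up * wp
    mean_uw = uw.mean()
    # indicator delta_{i,H}: keep only events outside the hyperbolic hole
    hole = np.abs(uw) >= hole_size * np.abs(mean_uw)
    quads = {1: (up > 0) & (wp > 0), 2: (up < 0) & (wp > 0),
             3: (up < 0) & (wp < 0), 4: (up > 0) & (wp < 0)}
    # S_{i,H} = <u'w'>_{i,H} / mean(u'w')
    return {i: uw[m & hole].sum() / (len(uw) * mean_uw)
            for i, m in quads.items()}
```

For H = 0 the four fractions sum to unity, matching the property stated above.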
The 2D fluxes of the turbulent kinetic energy in the streamwise ($2Df_{TKEu}$) and vertical ($2Df_{TKEw}$) directions 18 can be calculated as

$2Df_{TKEu} = 0.75\left(\overline{u'^3} + \overline{u'w'^2}\right), \qquad 2Df_{TKEw} = 0.75\left(\overline{u'^2 w'} + \overline{w'^3}\right)$

Further, these 2D turbulent kinetic energy fluxes are made non-dimensional by dividing them by the cube of the shear velocity ($U_*^3$). The 3D fluxes of the turbulent kinetic energy in the streamwise ($3Df_{TKEu}$) and vertical ($3Df_{TKEw}$) directions can be calculated as

$3Df_{TKEu} = 0.5\,\overline{u'\left(u'^2 + v'^2 + w'^2\right)}, \qquad 3Df_{TKEw} = 0.5\,\overline{w'\left(u'^2 + v'^2 + w'^2\right)}$

The determination of octants is given in Table 2. The turbulent kinetic energy (TKE) is defined as half the sum of the variances of the velocity components:

$TKE = 0.5\left(\overline{u'^2} + \overline{v'^2} + \overline{w'^2}\right)$
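The TKE and flux definitions above translate directly into NumPy. This is a sketch under the stated definitions (the 0.75 factor reflects the common approximation $\overline{v'^2} \approx 0.5(\overline{u'^2}+\overline{w'^2})$); the function and key names are illustrative, not the P-SAT API:

```python
import numpy as np

def tke_and_fluxes(up, vp, wp, u_star):
    """TKE and 2D/3D TKE fluxes, non-dimensionalized by u_*^3."""
    up, vp, wp = (np.asarray(x, dtype=float) for x in (up, vp, wp))
    tke = 0.5 * (np.mean(up**2) + np.mean(vp**2) + np.mean(wp**2))
    f2d_u = 0.75 * (np.mean(up**3) + np.mean(up * wp**2))   # 2D streamwise flux
    f2d_w = 0.75 * (np.mean(up**2 * wp) + np.mean(wp**3))   # 2D vertical flux
    f3d_u = 0.5 * np.mean(up * (up**2 + vp**2 + wp**2))     # 3D streamwise flux
    f3d_w = 0.5 * np.mean(wp * (up**2 + vp**2 + wp**2))     # 3D vertical flux
    return {"tke": tke,
            "F2D_u": f2d_u / u_star**3, "F2D_w": f2d_w / u_star**3,
            "F3D_u": f3d_u / u_star**3, "F3D_w": f3d_w / u_star**3}
```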

The dissipation of the turbulent kinetic energy (e) and its non-dimensional form (ED) are also computed by the framework. The P-SAT framework further allows the user to compute the power spectra of the filtered/unfiltered three-dimensional velocities by converting the time-domain signal into the frequency domain using a discrete fast Fourier transform (FFT). The spectrum of a filtered signal represents the mean-square amplitude of that signal; in other words, the spectrum shows the energy of a signal at any given frequency 21. Energy in turbulence is received at large scales, and its dissipation occurs at small scales. Spectra therefore allow us to examine how energy is exchanged among eddies of different sizes.
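A power spectral density of one velocity component can be sketched with SciPy. P-SAT's exact FFT settings may differ; Welch averaging is used here only as one reasonable choice of estimator, and the function name is ours:

```python
import numpy as np
from scipy import signal

def velocity_spectrum(u, fs=100.0):
    """Power spectral density of a velocity component sampled at fs Hz.
    The mean is removed so the spectrum reflects the fluctuations."""
    u = np.asarray(u, dtype=float)
    freqs, psd = signal.welch(u - u.mean(), fs=fs,
                              nperseg=min(1024, len(u)))
    return freqs, psd
```

Plotting `psd` against `freqs` on log-log axes reveals the inertial subrange behaviour discussed above.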
Calculation of the mean velocity component from the instantaneous velocities for highly unsteady flow. In order to analyze the structure of unsteady flows, the determination of the mean velocity component ($\overline{U}_{uf}$) from the instantaneous velocities ($U_{uf}$) is essential. By definition, the instantaneous velocities of an unsteady flow can be decomposed into mean and fluctuating components as

$U_{uf i} = \overline{U}_{uf i} + u'_{uf i}$

There are several methods available to obtain the mean velocity component from the instantaneous velocities 22; however, the Fourier-component method has been found most suitable for the determination of the mean velocity component in unsteady flows 23. In this method, the time-dependent instantaneous velocities $U_{uf i}$ (where i = 1, 2, ..., n) are transformed into the frequency domain using a discrete Fourier transform, and only the frequency components lower than a cutoff frequency ($f_{cutoff}$) are taken as representative of the mean velocities ($\overline{U}_{uf i}$):

$\overline{U}_{uf i} = \frac{a_0}{2} + \sum_{j=1}^{k} \left[a_j \cos\left(\frac{2\pi j t_i}{T}\right) + b_j \sin\left(\frac{2\pi j t_i}{T}\right)\right]$

with the Fourier coefficients

$a_j = \frac{2}{n}\sum_{i=1}^{n} U_{uf i}\cos\left(\frac{2\pi j t_i}{T}\right), \qquad b_j = \frac{2}{n}\sum_{i=1}^{n} U_{uf i}\sin\left(\frac{2\pi j t_i}{T}\right)$

for j = 0, 1, 2, ..., k. Here, n is the number of samples collected in the time period (T) of a measurement, and k is the number of retained Fourier components. Nezu et al. 22 suggested adopting a cutoff frequency ($f_{cutoff}$) smaller than the burst frequency of turbulence, and they selected the number of Fourier components (k) as seven.
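The truncation step of the Fourier-component method can be sketched compactly with NumPy's real FFT. This is a minimal illustration of the idea of keeping only the lowest k components (plus the DC term), not P-SAT's actual implementation:

```python
import numpy as np

def fourier_mean_velocity(u_inst, k=7):
    """Low-frequency reconstruction of the mean velocity of an unsteady
    flow: transform to the frequency domain, zero all components above
    the k-th, and invert. k = 7 follows the suggestion of Nezu et al."""
    u_inst = np.asarray(u_inst, dtype=float)
    coeffs = np.fft.rfft(u_inst)
    coeffs[k + 1:] = 0.0                 # discard components above the cutoff
    return np.fft.irfft(coeffs, n=len(u_inst))
```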

Experimental setup and measurements
The dataset used in the present study to test the P-SAT framework has been obtained from the experimental study carried out by Deshpande and Kumar 24. Experiments were performed in a glass-walled tilting flume of dimensions 20 m × 1 m × 0.72 m (length × width × depth). A schematic diagram of the flume is provided in Fig. 1. An upstream collection tank of dimensions 2.8 m × 1.5 m × 1.5 m (length × width × depth) was fitted with a pair of wooden baffles to quieten the flow before it entered the channel. Uniform river sand of median diameter d50 = 1.1 mm was used as the bed material in the experiments. Table 3 provides information on the physical characteristics of the bed material and the experimental parameters.
Velocity measurements. Instantaneous velocity readings along the vertical plane were taken using a four-beam, down-looking, Vectrino+ acoustic Doppler velocimeter (ADV) probe manufactured by Nortek. The instrument collects data in a cylindrical remote sampling volume located 5 cm below the central transmitter. The height of the sampling volume was set at 1 mm when measurements were taken very near the bed, so that the sampling volume did not touch the particles on the bed surface, and at 4 mm when measurements were taken away from the bed. Data were collected at the centerline of the channel cross-section at a distance of 8 m from the downstream end of the flume to minimize the effects of flow entrance and exit conditions on the measurement location. Measurements were carried out at multiple heights, and at each height 30,000 samples were collected over 5 min (i.e., at a sampling rate of 100 Hz).

P-SAT configuration, usage, features, performance and limitations
Flow diagram of the P-SAT framework. The P-SAT framework has been designed for working with steady as well as unsteady flows. Figure 2 shows the flow diagram of the P-SAT framework. The complete flow can be divided into the following main components.
• Steady flow analysis

1. Spike detection and removal methods. The P-SAT framework implements three spike detection and removal methods: (a) the velocity correlation filter, (b) the signal-to-noise ratio (SNR) filter, and (c) the acceleration thresholding method 16. The framework de-spikes the contaminated data points obtained from the ADV and replaces them by interpolation between the ends of each spike. Each filtering method takes an input threshold from the user; a default value is stored in the program in case the user does not wish to specify one or is unsure (for example, the default correlation filter threshold is 70). Once the threshold is set, the framework scans for the .dat files and processes them one by one: for each filtering method, it detects the spikes and replaces them by interpolation between the ends of the spikes. The algorithms for the three filtering methods, along with the replacement algorithm, are explained in Sect. 4.4. After each filtering method, the relevant output files are saved in .csv format.

2. Turbulent statistics. The filtering methods mentioned above de-spike the noisy raw signal and convert it into a clean signal. The filtered signal is then used to calculate various turbulent statistical parameters: time-averaged 3D velocities, velocity variances, skewness, kurtosis, Reynolds shear stresses (RSS), third-order correlations, 2D and 3D fluxes of the turbulent kinetic energy, turbulent kinetic energy dissipation, conditional statistics of quadrants, and the computation of octants 25 with their corresponding probabilities of occurrence. These results are saved for each input file, and on every execution the parameters are appended to the file so that previous computations remain available to the user.

3. Computation of the spectral density functions of velocities. This component computes the spectral density functions for the given input velocities. Since the ADV provides 3D velocities, the P-SAT framework computes the spectra of the streamwise, lateral, and vertical velocities. A final plot combining all three spectra is also produced for user convenience. The framework saves both the data and the visualizations in individual files for future reference.

4. File organization. The P-SAT framework creates 14 files for every input .dat file it reads (for steady flow data). To avoid cluttering the user's workspace, it creates a folder, organizes all the saved files there, and appends a timestamp to every filename. This is convenient from the user's point of view, as further computations can be performed on the saved files. The framework also deletes any temporary files created during execution.
• Unsteady flows

1. Determination of the mean velocity component. As discussed in Sect. 2.2, several methods are available for calculating the mean velocity component from the instantaneous velocities in unsteady flow environments 22. The P-SAT framework computes the mean velocities using the Fourier-component based method, which has been found most suitable for the determination of the mean velocity component in unsteady flows 23. The framework can work with multiple unsteady flow files specified in "input_ensemble_files.txt", which lists all files that contain unsteady flow data. For each file, the framework asks for the number of Fourier components and computes the mean velocity via the Fourier-component based averaging method. The user has to provide a .csv file with four columns, t, u, v, and w, representing time and the 3D velocities.
Importing data from raw files.
• Input files (*.dat): The P-SAT framework requires the input to be in .dat format with comma-separated values. This restriction has been kept because it keeps the large code base easy to manage; various tools are available to convert space- or tab-separated files into comma-separated format. Table 4 shows a sample .dat file obtained from the Vectrino+ software. When raw velocity data are collected in an environmental flow experiment using a Nortek-made Vectrino+ ADV and the Vectrino+ software, a .vno file is generated. This .vno file is further processed by the Vectrino+ software, which creates five more files with .adv, .dat, .hdr, .pck, and .ssl extensions. The .dat files are standard ASCII files that contain the velocity time series data obtained from the Vectrino+ ADV; typically they contain the data shown in Table 4. The columns are explained below:

- Time: Contains the time information, depending on the sampling frequency. In this case the sampling frequency was 100 Hz, so the time advances by 1/100 s per sample.
- SL: A routine serial counter from 1 to n.
- Counter: A value provided by the Vectrino software. It is not required by the P-SAT framework.
- Velocity columns: These represent the three-dimensional velocity data collected by the ADV. W1_i is a redundant component of W and is not required for the current study. These velocities form the raw signal (containing spikes) that the filtering methods must de-spike. For successful execution, the columns of the input .dat files must be in exactly this order.
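Reading the comma-separated columns described above is straightforward with NumPy. The column indices and sample values below are illustrative only and are not guaranteed to match the exact Vectrino+ layout:

```python
import io
import numpy as np

# A two-row stand-in for a comma-separated .dat file:
# time, SL, counter, U, V, W (hypothetical column positions)
sample = io.StringIO(
    "0.01,1,3,0.35,0.02,-0.01\n"
    "0.02,2,3,0.36,0.01,-0.02\n"
)

# usecols selects only the columns of interest; unpack splits them
# into separate 1D arrays.
time, u, v, w = np.loadtxt(sample, delimiter=",",
                           usecols=(0, 3, 4, 5), unpack=True)
```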

Dataset processing.
A key component of the P-SAT framework is its use of NumPy arrays, which load the .dat files efficiently into memory. The NumPy module makes it easy to import specific columns from the .dat files.

De-spiking and replacement algorithms. The P-SAT framework implements three de-spiking methods: (a) the correlation filter (shown in Algorithm 1), (b) the SNR filter (shown in Algorithm 2), and (c) acceleration thresholding filtering (shown in Algorithm 3).
We explain the spike-detection algorithm for the correlation method; the other filtering methods work similarly. The inputs to the correlation method are the 3D velocities measured by the ADV, where $U_i$, $V_i$, and $W_i$ are the instantaneous velocities in the streamwise, lateral, and vertical directions, respectively, together with their corresponding correlation values (Corr-$U_i$, Corr-$V_i$, Corr-$W_i$). For every row read by the P-SAT framework (Line 1, Algorithm 1; a sample row is shown in Table 4), the framework checks whether any of the correlation values for $U_i$, $V_i$, and $W_i$ is less than the correlation threshold (Line 2, Algorithm 1); if so, the corresponding velocity point is marked as a spike (Line 3, Algorithm 1). The framework provides flexibility here: it supplies a default correlation threshold and also allows the end user to enter a custom value. The spike is replaced by interpolation between the ends of the spike (Line 4, Algorithm 1); the replacement algorithm is depicted in Algorithm 4. If the correlations of all the velocities are above the threshold, the velocities are left untouched (as they are not spikes) and the next row is processed (Line 6, Algorithm 1).
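The detection step of the correlation filter amounts to a simple vectorized comparison. The sketch below is ours (not Algorithm 1 verbatim) but implements the same criterion, with the default threshold of 70 mentioned above:

```python
import numpy as np

def correlation_spike_mask(corr_u, corr_v, corr_w, threshold=70.0):
    """Flag a sample as a spike when ANY of its three correlation
    values falls below the threshold."""
    corr = np.vstack([corr_u, corr_v, corr_w])
    return np.any(corr < threshold, axis=0)   # True where the row is a spike
```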
The other two filtering methods are described below.
SNR filter. Similar to the correlation filter, but using the SNR (signal-to-noise ratio) values SNR-$U_i$, SNR-$V_i$, and SNR-$W_i$, as explained earlier.

Acceleration thresholding method. We have used the method proposed by Goring and Nikora 16. The acceleration thresholding method detects and replaces spikes in two phases: the first phase handles negative accelerations (Lines 1-4, Algorithm 3), while the second handles positive accelerations (Lines 6-8, Algorithm 3). In each phase, multiple passes are made through the raw data until all data points of the given sample conform to the acceleration criterion. The acceleration is computed as $a_i = (U_i - U_{i-1})/\Delta t$. In the negative phase, all points where $a_i < -a g$ are marked as spikes and replaced by linear interpolation between the ends of the spike; the process is repeated until no negative spikes remain. The same procedure is then repeated for the positive accelerations, with the check $a_i > a g$ and the same replacement strategy (linear interpolation between the ends of the spikes). The threshold is $a g$, where a is a user-defined value and g is the gravitational acceleration. Goring and Nikora 16 recommend keeping a between 1 and 1.5.
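A single detection pass of the acceleration criterion can be sketched as follows. The full method iterates such passes (negative then positive phases, with replacement between passes); this one-pass sketch, with names of our choosing, only illustrates the threshold check:

```python
import numpy as np

def acceleration_spike_mask(u, dt, a=1.5, g=9.81):
    """Flag points whose finite-difference acceleration magnitude
    exceeds a*g (one pass; the complete method repeats until no
    spikes remain)."""
    u = np.asarray(u, dtype=float)
    acc = np.empty_like(u)
    acc[0] = 0.0                       # no backward difference for the first point
    acc[1:] = np.diff(u) / dt          # a_i = (U_i - U_{i-1}) / dt
    return np.abs(acc) > a * g         # spike where |a_i| > a*g
```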
Finally, spike replacement uses linear interpolation between the ends of the spike. Given a spike data point, the algorithm takes the last good point before the spike (i.e., the last point not identified as a spike) and the next good point after it, and replaces the spike by linear interpolation between these two points. Figure 3 shows the raw velocity signal captured by the ADV and the signals filtered by the P-SAT framework in a stepwise manner: the raw velocity signal at a particular measurement location is depicted in Fig. 3A, while the raw signal filtered using the SNR method, the velocity correlation method, the acceleration thresholding method, and the combination of all the above methods is shown in Fig. 3B-E, respectively.
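The replacement step described above can be expressed in a few lines using NumPy's interpolation routine. This sketch assumes a boolean spike mask such as those produced by the detection methods; the function name is ours:

```python
import numpy as np

def replace_spikes(u, spike_mask):
    """Replace flagged samples by linear interpolation between the last
    good point before each spike and the next good point after it."""
    u = np.asarray(u, dtype=float)
    good = ~np.asarray(spike_mask, dtype=bool)
    idx = np.arange(len(u))
    # np.interp draws a straight line between the surrounding good samples
    return np.interp(idx, idx[good], u[good])
```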

Sample code execution.
For testing the P-SAT framework, we took the raw .dat files generated by the Vectrino+ software; a sample row from a raw file is displayed in Table 4. For user convenience, the sample code is ready for execution: after downloading the entire GitHub repository (https://github.com/mayank265/flume.git), the user just needs to extract the zip archive (password: PoTs_Turbulence, without quotes) and execute python3 pots_module.py. We have taken two sample .dat files and shown the execution of the P-SAT framework on them. All .dat files to be processed must be in the same folder as pots_module.py. The input_files.txt and input_files_corresponding_depths.txt files should also be in the same folder and should be correctly mapped (see Fig. 4). The P-SAT framework creates a Logs folder containing all errors recorded during execution, which helps with debugging.

External libraries. The P-SAT framework requires the following libraries for successful execution.
• csv 26: Used for writing files in .csv format; the final analysis is written to a .csv file.
• datetime: Generates timestamps in various formats. To prevent conflicting filenames, a timestamp is appended to every file that is created.
• glob: Organizes files and folders using shell-style wildcard patterns.
• numpy 27: The core library for manipulating the .dat files and applying various operations on them. NumPy processes the .dat files efficiently by converting them to NumPy arrays.
• scipy: Provides routines for statistical parameters such as kurtosis and skewness.
• os All the filtered files are stored in a "Filtered_Timestamp" folder which is generated by the os module.
• timeit For computing the code execution time.
The user can run pip install package_name to install each library. The authors recommend installing the Anaconda 29 distribution, which is cross-platform and ships with all the above packages via its built-in installer.

Performance. The P-SAT framework is an open-source module written in Python 3+ and has been tested on both Linux and Windows platforms. We have used a few common libraries that are easily available; Python packages are easy to install using the standard Python package installer pip.
We tested the performance on the following system: OS Name: Microsoft Windows 10 Pro, OS Version: 10.0.17763 Build 17763, System Type: x64-based PC, Processor: Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz, RAM: 12 GB DDR4. We executed the P-SAT framework on a set of 33 .dat files with their respective depths. The statistical parameters of turbulence obtained via P-SAT for these 33 files are shown in Tables 5, 6, 7, 8, 9 and 10. The table headers are kept self-explanatory for easier understanding. Due to space constraints, the statistical parameters are split across multiple tables; however, the P-SAT framework stores all of them in a single file.
The experimental conditions are as follows. The velocities were measured in an open channel flow experiment at a sampling rate of 100 Hz using the Nortek-made Vectrino+ ADV, which was suspended at a position in the flow to obtain the 3D velocities. $U_i$, $V_i$, and $W_i$ represent the instantaneous velocities in the streamwise, lateral, and vertical directions, respectively. Each experiment was run for a period of 300 s, so a given run yields 100 × 300 = 30,000 sample velocities; for 33 measurements, the total number of sample readings obtained is 30,000 × 33 = 990,000. The P-SAT framework takes each .dat file as input, performs the pre-processing, applies the three filters to de-noise the velocity time series data, and then calculates the various turbulent parameters. During this process, the framework generates output files for all input files, the files obtained after each filtering step, and a final analysis file (Parameters.csv) containing the turbulent characteristics of all the input files.
The execution time for the code was 12 min 22 s, and memory usage varied between 30 MB and 160 MB during a single execution. Considering the many parameters the P-SAT framework calculates and the conversions it performs, the time taken is justified. Run time may increase or decrease between runs depending on the quality of the raw file: as seen in the acceleration thresholding method, passes are repeated until all spikes are removed, so a noisy velocity time series with many spikes takes longer, while a sample with few spikes executes quickly. We can therefore conclude that the P-SAT framework does not hog system resources and can run even on a basic desktop machine.
Tables and visualizations from the P-SAT framework. As mentioned already, the P-SAT framework generates a rich set of turbulent statistics: mean velocities, variances, skewness, kurtosis, Reynolds stresses, third-order correlations, 2D and 3D fluxes of turbulent kinetic energy, turbulent kinetic energy dissipation, conditional statistics of quadrants, and the computation of octants with their corresponding probabilities. Based on these statistics, we plotted a few visualizations, shown below.
Tables 5, 6, 7, 8, 9 and 10 show all the turbulent statistics generated by the P-SAT framework. We have split the results across multiple tables because they cannot all be displayed in a single table.
The visualization of the turbulent statistics shown in Tables 5, 6, 7, 8, 9 and 10 is depicted in Figs. 5a, 6, 7 and 8b. Time-averaged velocities in all three directions (U, V, and W) are plotted against flow depth in Fig. 5a. Vertical distributions of three components of velocity variances are shown in Fig. 5b. Fig. 5c, d depict the vertical distribution of velocity skewness, and velocity kurtosis, respectively. Reynolds shear stresses (off diagonal components of a symmetric second order tensor) are plotted against the flow depth in Fig. 5e. Vertical distribution of the flow anisotropy is shown in Fig. 5f. Figure 6a shows the distribution of the third statistical moments ( M 30 , M 03 , M 12 , and M 21 ) plotted against the flow depth. Vertical distributions of the turbulent kinetic energy dissipation (e) and its non-dimensional form (ED) are shown in Fig. 6b. 2D as well as 3D fluxes of the turbulent kinetic energy are plotted in Fig. 6c and d, respectively. Fractional contribution to the total Reynolds shear stress from different quadrants (quadrant analyses) for hole sizes (H) = 0, and 2 are presented in Fig. 6e and f, respectively. The 3D view of octants is shown in Fig. 7a, and vertical distributions of the octant probabilities are presented in Fig. 7b. Power spectral density functions for all three components of velocities are depicted in Fig. 8b.
The P-SAT framework is also able to compute the mean velocities ($\overline{U}_{uf}$) of unsteady flows from the instantaneous velocities ($U_{uf}$) using the Fourier-component method described in the theoretical development section. Figure 8a shows an example of the time series of the instantaneous velocities $U_{uf}(t)$ and the mean velocity component $\overline{U}_{uf}(t)$. The P-SAT framework allows the user to select any number of Fourier components, and it can be observed from Fig. 8a that the framework provides for the computation of unsteady flows with a user-chosen value of k (the number of components).

Limitations and precautions to be taken while using the P-SAT module. The following points need to be taken care of by a user executing the P-SAT framework.

1. The P-SAT framework requires a strict file naming convention for the input .dat files. Input filenames should not contain spaces or any special character except "_".
2. Ensure all source files are strictly comma-separated, i.e., all values are separated by ','.
3. Ensure that input_files.txt and input_files_corresponding_depths.txt are correctly mapped (see Fig. 4).
4. The libraries mentioned in Sect. 4.6 must be installed before running the P-SAT framework.
5. Ensure that no file or filename is changed while the P-SAT framework is executing, as this may result in erroneous output.

Previously, carrying out all these analysis steps manually took a lot of time, and there was always room for error. We wanted to provide the community with a free and open-source module that performs all these tasks hassle-free. The P-SAT framework is completely open source, which allows developers and the scientific community to extend it to meet their needs and purposes.

Conclusion and future work
We provide the end user with an open-source P-SAT framework that enables filtering of the raw velocity time series data obtained from the Nortek Vectrino+ ADV and computation of various turbulent parameters. We believe that the P-SAT framework is the first of its kind to completely automate the parsing, filtering, and computation of turbulent statistics and spectra for steady flows. For unsteady flows, the P-SAT framework obtains mean velocities from instantaneous velocities using the Fourier-component based averaging method. The framework also saves all processed data files so that the end user can reuse them later without re-running P-SAT on the same set of files.
The authors have put the source code on GitHub and welcome suggestions and improvements. We believe that the P-SAT framework will help the scientific community working in environmental hydraulics by providing a means to easily filter velocity data and compute the associated parameters. As the core code is developed in Python, it runs smoothly on Windows, Linux, and macOS operating systems.
Right now the P-SAT framework works in a command-line environment; in the future, we would like to build a GUI module for it. A GUI would allow for a richer user experience, but for this paper we restrict P-SAT to the command-line environment.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.