Overview
FasTensor, formerly known as ArrayUDF, is a generic parallel programming model for big data analysis with arbitrary user-defined functions (UDFs). These functions can express data analysis operations ranging from those of traditional database (DB) systems to advanced machine learning pipelines. FasTensor exploits the structural locality of multidimensional arrays to automate file operations, data partitioning, communication, parallel execution, and common data management operations.
Like MapReduce and Apache Spark, FasTensor aims to reduce programming effort. However, FasTensor is orders of magnitude faster than both because it is defined and executed directly on multidimensional arrays. Even compared with highly optimized TensorFlow code, FasTensor can achieve up to a 13X speedup when executing expensive steps in CNNs. FasTensor scales efficiently to over 10,000 CPU cores on supercomputers.
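The idea of running a user-defined function directly on array cells can be illustrated in plain, serial C++. The sketch below is not FasTensor code: `Grid`, `CellView`, `transform`, and `moving_average` are hypothetical names made up for this illustration, and clamping out-of-range neighbors to the nearest edge is an assumption of the sketch, not a statement about how FasTensor treats boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper types for this sketch; they are NOT FasTensor classes.
struct Grid {
    std::size_t rows, cols;
    std::vector<float> data;  // row-major storage
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Read-only view of one cell and its neighbors, addressed by relative offsets.
class CellView {
public:
    CellView(const Grid &g, std::size_t r, std::size_t c) : g_(g), r_(r), c_(c) {}

    // Offsets that fall outside the grid are clamped to the nearest edge
    // (an assumption made for this sketch).
    float operator()(int dr, int dc) const {
        const std::size_t r = clamp(static_cast<long>(r_) + dr, g_.rows);
        const std::size_t c = clamp(static_cast<long>(c_) + dc, g_.cols);
        return g_.at(r, c);
    }

private:
    static std::size_t clamp(long i, std::size_t n) {
        if (i < 0) return 0;
        if (i >= static_cast<long>(n)) return n - 1;
        return static_cast<std::size_t>(i);
    }
    const Grid &g_;
    std::size_t r_, c_;
};

// Apply a user-defined function to every cell and collect the results.
Grid transform(const Grid &in, const std::function<float(const CellView &)> &udf) {
    assert(in.rows > 0 && in.cols > 0);
    Grid out{in.rows, in.cols, std::vector<float>(in.data.size())};
    for (std::size_t r = 0; r < in.rows; ++r)
        for (std::size_t c = 0; c < in.cols; ++c)
            out.data[r * in.cols + c] = udf(CellView(in, r, c));
    return out;
}

// Example UDF: three-point moving average along the second dimension.
float moving_average(const CellView &s) {
    return (s(0, -1) + s(0, 0) + s(0, 1)) / 3.0f;
}
```

Calling `transform(grid, moving_average)` on a `Grid` returns a new `Grid` of the same shape. The user writes only the per-cell logic; the loop over cells, and in a real system the partitioning and I/O, lives in the framework.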
Installation Guide
FasTensor can be installed on platforms ranging from a single laptop to a high-performance computing system. The example below installs FasTensor on a Linux/Mac system from scratch, including its two major dependencies, MPICH and HDF5.

    # Install MPICH at MPICH_ROOT/build.
    # Please download it from https://www.mpich.org/downloads/
    # This example uses mpich-3.3.2.tar.gz
    $ tar zxvf mpich-3.3.2.tar.gz
    $ cd mpich-3.3.2
    $ export MPICH_ROOT=$PWD/build
    $ ./configure --prefix=$MPICH_ROOT
    $ make install
    $ cd ..

    # Install HDF5 at HDF5_ROOT/build.
    # Please get the source code from
    # https://www.hdfgroup.org/downloads/hdf5/source-code/
    # This example uses hdf5-1.12.0.tar.gz
    $ tar zxvf hdf5-1.12.0.tar.gz
    $ cd hdf5-1.12.0
    $ export HDF5_ROOT=$PWD/build
    $ ./configure --enable-parallel --prefix=$HDF5_ROOT CC=$MPICH_ROOT/bin/mpicc
    $ make install
    $ cd ..

    # Install FasTensor at FT_ROOT/build.
    $ git clone https://bitbucket.org/berkeleylab/fastensor.git
    $ cd fastensor
    $ export FT_ROOT=$PWD/build/
    $ ./autogen.sh
    $ ./configure --prefix=$FT_ROOT --with-hdf5=$HDF5_ROOT CXX=$MPICH_ROOT/bin/mpicxx
    $ make install
Quick Start Example
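    #include "ft.h"

    using namespace std;
    using namespace FT;

    // UDF: three-point moving average along the second dimension.
    inline Stencil<float> udf_ma(const Stencil<float> &iStencil)
    {
        Stencil<float> oStencil;
        oStencil = (iStencil(0, -1) + iStencil(0, 0) + iStencil(0, 1)) / 3.0;
        return oStencil;
    }

    int main(int argc, char *argv[])
    {
        FT_Init(argc, argv);

        // Process the input in 4 x 4 chunks with an overlap of 1 along the
        // second dimension, so the UDF can read (0, -1) and (0, 1).
        std::vector<int> chunk_size = {4, 4};
        std::vector<int> overlap_size = {0, 1};

        // Input: dataset /dat in tutorial.h5. Output: dataset /dat in tutorial_ma_new.h5.
        Array<float> A("EP_HDF5:tutorial.h5:/dat", chunk_size, overlap_size);
        Array<float> B("EP_HDF5:tutorial_ma_new.h5:/dat");

        // Apply udf_ma to A and store the result in B.
        A.Transform(udf_ma, B);

        FT_Finalize();
        return 0;
    }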
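Why the example sets an overlap of 1 along the dimension the UDF reaches across can be seen in a small, serial C++ sketch. It does not use FasTensor: `moving_average` and `chunked_moving_average` are hypothetical helpers written for this illustration, the array is reduced to a single row, and clamping at the edges is an assumption of the sketch rather than a description of FasTensor's boundary handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: three-point moving average over a whole row.
// Neighbors outside the row are clamped to the nearest edge (sketch assumption).
std::vector<float> moving_average(const std::vector<float> &row) {
    const std::size_t n = row.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float left = row[i == 0 ? 0 : i - 1];
        const float right = row[i + 1 == n ? i : i + 1];
        out[i] = (left + row[i] + right) / 3.0f;
    }
    return out;
}

// Same computation, but each chunk sees only its own cells plus `overlap`
// extra cells on each side, as an independent worker would.
std::vector<float> chunked_moving_average(const std::vector<float> &row,
                                          std::size_t chunk,
                                          std::size_t overlap) {
    assert(chunk > 0);
    const std::size_t n = row.size();
    std::vector<float> out(n);
    for (std::size_t start = 0; start < n; start += chunk) {
        const std::size_t end = std::min(start + chunk, n);
        // Extend the chunk by the overlap, without leaving the row.
        const std::size_t lo = start >= overlap ? start - overlap : 0;
        const std::size_t hi = std::min(end + overlap, n);
        // Local copy: the only data this chunk is allowed to read.
        const std::vector<float> local(
            row.begin() + static_cast<std::ptrdiff_t>(lo),
            row.begin() + static_cast<std::ptrdiff_t>(hi));
        const std::vector<float> avg = moving_average(local);
        // Keep results only for the cells the chunk owns.
        for (std::size_t i = start; i < end; ++i)
            out[i] = avg[i - lo];
    }
    return out;
}
```

With an overlap of 1, `chunked_moving_average(row, 4, 1)` matches `moving_average(row)`, because every owned cell can read both neighbors from its local copy. With an overlap of 0, cells on chunk boundaries read a clamped value instead of their true neighbor, so the results differ there.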
Publications
- Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang, "SLOPE: Structural Locality-Aware Programming Model for Composing Array Data Analysis", International Conference on High Performance Computing, January 1, 2019, 61--80
- Bin Dong, Kesheng Wu, Surendra Byna, Jialin Liu, Weijie Zhao, Florin Rusu, "ArrayUDF: User-defined scientific data analysis on arrays", Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, January 1, 2017, 53--64
- Xin Xing, Bin Dong, Jonathan Ajo-Franklin, Kesheng Wu, "Automated Parallel Data Processing Engine with Application to Large-Scale Feature Extraction", 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC), January 1, 2018, 37--46
- Bin Dong, Patrick Kilian, Xiaocan Li, Fan Guo, Suren Byna, Kesheng Wu, "Terabyte-scale Particle Data Analysis: An ArrayUDF Case Study", Proceedings of the 31st International Conference on Scientific and Statistical Database Management, January 1, 2019, 202--205
- Bin Dong, Veronica Rodriguez Tribaldos, Xin Xing, Suren Byna, Jonathan Ajo-Franklin, Kesheng Wu, "DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection", 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), July 14, 2020, 254--263
Recent News
The book User-Defined Tensor Data Analysis is now available online.
A book of over 100 pages introducing FasTensor is coming soon. Stay tuned!
Verónica presented "Distributed Acoustic Sensing (DAS) at the Plot to Basin Scale: Connecting Near-Surface Sensing and Seismology with a Common Observational Tool" at AGU 2020. FasTensor supports the data analysis stack used to process large-scale DAS data.
ArrayUDF Points Astronomers to Colliding Neutron Stars: “By allowing users to focus on the logic of their applications instead of cumbersome data management tasks, we hope to significantly improve scientific productivity.” On August 17, 2017, the LIGO collaboration detected gravitational waves, literal ripples in the fabric of space-time.
Contact Us
Please use the mailing list ([email protected]) for general questions. For specific questions, you can contact the following people directly:
Bin Dong ([email protected]), Suren Byna ([email protected]), and Kesheng Wu ([email protected]).