Overview

FasTensor, formerly known as ArrayUDF, is a generic parallel programming model for big data analysis with arbitrary user-defined functions (UDFs). These functions can express data analysis operations ranging from traditional database (DB) operations to advanced machine learning pipelines. FasTensor exploits the structural locality of multidimensional arrays to automate file operations, data partitioning, communication, parallel execution, and common data management operations.

Like MapReduce and Apache Spark, FasTensor aims to reduce programming effort. However, FasTensor is orders of magnitude faster than both because it is defined and executed directly on multidimensional arrays, as shown below. Even compared with highly optimized TensorFlow code, FasTensor achieves up to a 13X speedup on expensive steps in CNNs. FasTensor also scales efficiently to more than 10,000 CPU cores on supercomputers.

[Figures: FasTensor (FT) vs. Spark; FasTensor (FT) vs. TensorFlow (TF)]

Installation Guide

FasTensor can be installed easily on platforms ranging from a single laptop to a high-performance computing system. Below is an example of installing FasTensor from scratch on a Linux or Mac system. It also includes the steps to install the two major dependencies, MPICH and HDF5.

#############################################################
# Install MPICH at MPICH_ROOT.                              #
# Please download it from https://www.mpich.org/downloads/  #
# This example uses mpich-3.3.2.tar.gz                      #
#############################################################
$ tar zxvf mpich-3.3.2.tar.gz
$ cd mpich-3.3.2
$ export MPICH_ROOT=$PWD/build
$ ./configure --prefix=$MPICH_ROOT
$ make install
$ cd ..
##############################################################
# Install HDF5 at HDF5_ROOT                                  #
# Please get the source code from                            #
#   https://www.hdfgroup.org/downloads/hdf5/source-code/     #
# This example uses hdf5-1.12.0.tar.gz                       #
##############################################################
$ tar zxvf hdf5-1.12.0.tar.gz
$ cd hdf5-1.12.0
$ export HDF5_ROOT=$PWD/build
$ ./configure --enable-parallel --prefix=$HDF5_ROOT CC=$MPICH_ROOT/bin/mpicc
$ make install
$ cd ..
######################################
# Install FasTensor at FT_ROOT       #
######################################
$ git clone https://bitbucket.org/berkeleylab/fastensor.git
$ cd fastensor
$ export FT_ROOT=$PWD/build/
$ ./autogen.sh
$ ./configure --prefix=$FT_ROOT --with-hdf5=$HDF5_ROOT CXX=$MPICH_ROOT/bin/mpicxx
$ make install

Quick Start Example

Below is an example that uses FasTensor to perform a three-point moving average on a 2D array stored in an HDF5 file. The moving average is expressed as a user-defined function, udf_ma. The C++ main function sets up the input and output arrays and then transforms the input into the output by applying udf_ma. Note that the example code can run either sequentially on a single CPU or in parallel on multiple CPUs across multiple computing nodes without any modification.
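#include "ft.h"

using namespace std;
using namespace FT;

// UDF: average each cell with its left and right neighbors along the second dimension.
// iStencil(i, j) reads the cell at relative offset (i, j) from the current cell.
inline Stencil<float> udf_ma(const Stencil<float> &iStencil)
{
	Stencil<float> oStencil;
	oStencil = (iStencil(0, -1) + iStencil(0, 0) + iStencil(0, 1)) / 3.0;
	return oStencil;
}

int main(int argc, char *argv[])
{
	FT_Init(argc, argv);
	// Process the input in 4x4 chunks; overlap chunks by one cell along the
	// second dimension so the stencil can reach neighbors across chunk borders.
	std::vector<int> chunk_size = {4, 4};
	std::vector<int> overlap_size = {0, 1};
	// Input: dataset /dat in tutorial.h5. Output: dataset /dat in tutorial_ma_new.h5.
	Array<float> A("EP_HDF5:tutorial.h5:/dat", chunk_size, overlap_size);
	Array<float> B("EP_HDF5:tutorial_ma_new.h5:/dat");
	// Apply udf_ma to the cells of A and store the results in B.
	A.Transform(udf_ma, B);
	FT_Finalize();
	return 0;
}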

Publications

Recent News

January 1, 2022

The book User-Defined Tensor Data Analysis is now available online.

May 28, 2020
Author: Bin Dong, Kesheng Wu, Suren Byna

A book of more than 100 pages introducing FasTensor is coming soon. Stay tuned!

December 15, 2020
Author: Verónica Rodríguez Tribaldos, Avinash Nayak, Nathaniel J Lindsey, Feng Cheng, Benxin Chi, Bin Dong, Kesheng Wu, Inder Monga

Verónica presented "Distributed Acoustic Sensing (DAS) at the Plot to Basin Scale: Connecting Near-Surface Sensing and Seismology with a Common Observational Tool" at AGU 2020. FasTensor supports the data analysis stack used to process large-scale DAS data.


January 2, 2018
Tags: ArrayUDF, LIGO, Data analysis
Author: Linda Vu

ArrayUDF Points Astronomers to Colliding Neutron Stars: “By allowing users to focus on the logic of their applications instead of cumbersome data management tasks, we hope to significantly improve scientific productivity.” On August 17, 2017, the LIGO collaboration detected gravitational waves, literal ripples in the fabric of space-time.

Contact Us

Please use the mailing list ([email protected]) for general questions.
For specific questions, you can contact the following people directly at their own email addresses:
Bin Dong ([email protected]), Suren Byna ([email protected]), and Kesheng Wu ([email protected]).