Scientific and large scale computing day

Scientific and large scale computing day, May 13th 2015

Organizers: Eric Darve and Laura Grigori

Meeting room 242, CITRIS-SDH Building, UC Berkeley
Organized in the context of BIS Workshop.
To help with the organization, please register here.

This day will gather Inria and Bay Area researchers working on different aspects of data science and high performance computing. Speakers will present novel results covering aspects from numerical and randomized linear algebra, parallel and heterogeneous computing, big data analysis, and uncertainty quantification.

Program

Fast solvers and randomized linear algebra
9 AM - 12 PM, with cofee break: 10:30 AM - 11 AM

Ming Gu, UC Berkeley, abstract
Spectrum-revealing Matrix Factorizations

M. Mahoney, UC Berkeley
Randomized Numerical Linear Algebra, abstract

Eric Darve, Stanford
O(N) linear solvers for general H^2 matrices, abstract

Jack Poulson, Stanford
Revisiting distributed sparse-direct solvers for Interior Point Methods and generalized Least Squares problems, abstract

Laura Grigori, Inria
Communication avoiding algorithms for computing low rank approximations, abstract

Lunch break : 12 PM - 1:30 PM

Accessing and computing with big data
1:30 PM - 3 PM

Shivaram Venkataraman, UC Berkeley
High performance linear algebra on Spark, abstract

Tristan Allard, Inria
PINED-RQ: A Differentially Private Index on Encrypted Databases for supporting Range Queries, abstract
and Maximilien Servajean, Inria
Increasing Coverage in Distributed Search and Recommendation with Profile Diversity, abstract

Lavanya Ramakrishnan, LBNL
Tigres: User-level Workflow Library for DALHIS, abstract

Coffee break: 3 PM - 3:30 PM

Uncertainty quantification 3:30 PM - 5 PM

Pietro Congedo, Inria
Some recent studies on uncertainty quantification and robust optimization in the AQUARIUS Team, abstract

Gianluca Geraci, Stanford University
UQ in particle-laden turbulent flows - Stanford PSAAP II project, abstract

Akshay Mittal, Stanford University/Inria
UQ Algorithms for complex multiphysics applications, abstract

Abstracts

Fast solvers and randomized linear algebra
9 AM - 12 PM, with cofee break: 10:30 AM - 11 AM

Ming Gu, UC Berkeley
Spectrum-revealing Matrix Factorizations

Abstract: In this talk, we discuss our current work on spectrum-revealing LU, QR and Cholesky factorizations. We show that with the right column and row permutations LU, QR and Cholesky factorizations are as reliable as SVD in low-rank matrix approximations. Additionally, we discuss our current work to design and implement fast matrix approximation algorithms based on LU, QR and Cholesky factorizations.

M. Mahoney, UC Berkeley
Randomized Numerical Linear Algebra

Randomization in recent years has proven to be a powerful computational resource for the development of improved matrix algorithms for both scientific computing and large-scale data analysis applications. Successes include the best algorithm in worst-case asymptotic theory for the over constrained least squares regression problem, high-quality numerical implementations based on those ideas that have been shown to beat highly-optimized numerical libraries, and implementations of least-squares, low-rank matrix approximation, and other related problems to low, medium, and high precision on up to a terabyte of data. Here, we will provide an overview of the basic ideas in the area and how these ideas are used to yield such successes.

Eric Darve, Stanford
O(N) linear solvers for general H^2 matrices

We will present recent developments resulting from the Stanford-INRIA Bordeaux collaboration on fast linear algebra. Working with Drs. Pierre Ramet, Luc Giraud and Jean Roman, we are working on the development of an O(N) sparse solver built on top of Pastix. Pastix is a sparse direct solver optimized for multicore and heterogeneous clusters. Sparse direct solvers are currently limited by their memory requirements and computational cost. They are competitive for small matrices but are often less efficient than iterative methods for large matrices. We are currently accelerating the dense algebra components of Pastix using hierarchical matrices algebra. This will reduce the total cost from O(N^2) to O(N^(4/3)). In the long run, this will be further reduced to O(N). Such improvements will make these solvers competitive against algebraic multigrid methods and other efficient iterative approaches.

Jack Poulson, Stanford
Revisiting distributed sparse-direct solvers for Interior Point Methods and generalized Least Squares problems

Despite the current surge in academic and commercial interest in first-order methods for optimization, research on sparse-direct techniques for distributed second-order methods and (generalized) least squares solvers has received comparatively little attention. This talk will discuss the high-level details associated with the efficient, scalable solution of sparse, indefinite KKT systems, such as the choice of regularization and how to refine the inexact solution of the regularized system towards the true solution of the original system. An example of a small, sparse matrix with a condition number of less than 15 which breaks many state-of-the-art codes will be presented and an argument will be made for preconditioning a Flexible GMRES(k) solver for the original quasi-semidefinite KKT system with the (quad-precision) iteratively-refined solution of the a-priori regularized system's Cholesky-like sparse-direct factorization. Furthermore, the recent open-source implementation of this approach within Elemental will be discussed, as well as its immediate extension towards Equality-constrained Least Squares, General Linear Models, and general (nonsymmetric) sparse-direct solvers. Plans for future algorithmic and architectural generalizations will also be discussed.

Laura Grigori, Inria
Communication avoiding algorithms for computing low rank approximations

In this talk we discuss novel algorithms for computing low rank approximations of sparse matrices. We present two different algorithms based on rank revealing QR and rank revealing LU factorizations. Both algorithms are based on tournament pivoting, which allows to minimize communication in rank revealing factorizations based on column pivoting. Preliminary experiments on several difficult problems show that both algorithms are effective in approximating the singular values of the input matrix, and thus can be used to either compute a low rank approximation or reveal the rank of a matrix. We also compare the fill-in of a rank revealing QR factorization with respect to a rank revealing LU factorization. This research is part of the COALA associated team between the Alpines group at Inria and J. Demmel's group at UC Berkeley. Given that there is an exponentially increasing gap between the time required to compute floating point operations and the time required to move data between different levels of memory hierarchy or between different processors, the overall goal of this team is to design novel numerical linear algebra algorithms that reduce communication, or even minimize it whenever possible.

Lunch break : 12 PM - 1:30 PM

Accessing and computing with big data
1:30 PM - 3 PM

Shivaram Venkataraman, UC Berkeley
High performance linear algebra on Spark

With the adoption of cloud computing and growth in amount of data being processed, fault tolerant distributed computation engines like Spark are used to implement many large machine learning algorithms. In this talk, we discuss how large such algorithms can be expressed in terms of distributed linear algebra operations and describe some of the challenges in implementing communication avoiding QR algorithms (TSQR and CAQR) on a data-flow system like Spark. We also present results from profiling our implementation on Amazon EC2 and identify some of the strengths and weakness of using Spark for high performance linear algebra.

Tristan Allard, Inria

PINED-RQ: A Differentially Private Index on Encrypted Databases for supporting Range Queries
Despite the benefits of Database-as-a-Service cloud services (DBaaS), legitimate privacy concerns continue hindering their adoption when data is personal and sensitive. To tackle this problem we propose PINED-RQ, a privacy-preserving auxiliary data structure allowing the DBaaS provider to perform range queries efficiently over an encrypted database. PINED-RQ provides sound privacy guarantees thanks to the joint use of differential privacy and of a semantically-secure encryption scheme. Its careful design together with dedicated query evaluation strategies enable accurate answers to range queries. PINED-RQ is an ongoing work of the BigDatanet associate team between the Inria Zenith team and the UCSB DSL team.

and Maximilien Servajean, Inria
Increasing Coverage in Distributed Search and Recommendation with Profile Diversity

In the context of Web 2.0, the users become massive producers of diverse data that can be stored in a large variety of systems. The fact that the users’ data spaces are distributed in many different systems makes data sharing difficult. In this context of large scale distribution of users and data, a general solution to data sharing is offered by distributed search and recommendation. In particular, gossip-based approaches provide scalability, dynamicity, autonomy and decentralized control. Generally, in gossip-based search and recommendation, each user constructs a cluster of “relevant” users that will be employed in the processing of queries. However, considering only relevance introduces a significant amount of redundancy among users. As a result, when a query is submitted, as the user profiles in each user’s cluster are quite similar, the probability of retrieving the same set of relevant items increases, and recall results are limited. In this paper, we propose a gossip-based search and recommendation approach using diversity-based clustering scores, and we present the corresponding new gossip-based clustering algorithms. We validate our proposal with an experimental evaluation using four datasets based on MovieLens-small, MovieLens, LastFM and Delicious. Compared with state of the art solutions, we show that taking into account diversity-based clustering score enables to obtain major gain in terms of recall while reducing the number of users involved during query processing.

Lavanya Ramakrishnan, LBNL
Tigres: User-level Workflow Library for DALHIS

Scientific facilities are increasingly generating large data sets. Most of these communities face a number of challenges in data management, workflow management and data transfer while managing the workflow. Next-generation scientific productivity relies on user-friendly tools and efficient and effective execution of these data workflows. The goal of the DALHIS associated team is to create a user-friendly, scalable, energy-efficient and fault tolerant software ecosystem to facilitate seamless data analysis across desktops, HPC and cloud environments. In this talk, I will talk about user-centered design processes that are necessary for addressing the needs of next-generation data-intensive workflows. I will discuss Tigres, a user-level workflow library that addresses the challenge of enabling collaborative analysis through a new concept of reusable "templates". Templates enable scientists to easily compose, run, and manage collaborative computational tasks. I will will also outline results to date on the integration of Tigres with High Order Chemical Language (HOCL) for DALHIS.

Coffee break: 3 PM - 3:30 PM

Uncertainty quantification 3:30 PM - 5 PM

Pietro Congedo, Inria
Some recent studies on uncertainty quantification and robust optimization in the AQUARIUS Team

In this talk, we will provide an overview about the most recent studies conducted in the Aquarius team. First, an alternative formulation for performing design under uncertainty will be introduced. The second part of the talk will be devoted to a more applicative study concerning the investigation about uncertain metastable conditions in cavitating flows.

Gianluca Geraci, Stanford University
UQ in particle-laden turbulent flows - Stanford PSAAP II project

In the context of the PSAAP II project, we are interested in developing tools and strategies to perform predictive simulations of particle-laden turbulence in a radiation environment. One of the foundations of predictive science is represented by the Uncertainty Quantification activities which allow to characterize first, and then quantify errors and uncertainties in the simulations. Therefore, it is possible to determine the degree of confidence which can be placed in the numerical results. In this talk an overview of the current UQ activities is presented. In particular, two main areas are emphasized. The first activity is devoted to improve non-intrusive algorithms for the forward propagation of uncertainty in presence of high-dimensional parameter space exploiting the so-called Multi-level Monte Carlo approach. Preliminary results on a model problem are discussed to enlighten the potential advantages of this approach. Our second focus is the characterization and propagation of the polydisperse particle size distribution though a Direct Numerical Simulation flow solver. One of the relevant problem addressed is the treatment of the tail of the particles size distribution which is subject to a lack of information due to experimental limitations

Akshay Mittal, Stanford University/Inria
UQ Algorithms for complex multiphysics applications

Multiphysics systems governed by coupled partial dierential equations (PDEs) are naturally suited for modular (partitioned) numerical solution strategies. Although widely used in tackling deterministic models, several challenges arise in extending the benets of modularization to uncertainty propagation. Monte Carlo (MC) based sampling methods ignore the potentially exploitable structures within the multiphysics model, and are generally unreliable because the cost of each PDE solve is signicantly high. On the other hand, spectral methods, for instance, generalized polynomial chaos (gPC) based methods, would succumb to the curse of dimensionality, if implemented in their standard (traditional) form, as the coupled nature of the model dictates that each module should handle the combined parameter space for uncertainty propagation. Here we present a practical module-based framework and ecient spectral methods for uncertainty propagation in multiphysics systems with uncertain parameters. Our proposed framework facilitates complete module-based modeling independence, wherein each module only handles its local uncertain parameters, employing the best suited method. Moreover, the proposed reduced non-intrusive (NISP) and reduced intrusive spectral projection (ISP) methods mitigate the curse of dimensionality by constructing reduced dimensional (and order) approximations of the data communicated between modules and iterations. We demonstrate implementations of our proposed methods on several benchmark test problems, and illustrate their superior performance over standard monolithic methods.

> Top of the page

INRIA Paris - Rocquencourt
Laboratoire J. L. Lions
Universite Pierre et Marie Curie