Taiwan Computational Quantum Matter Software Foundry

In this integrated project, a team of experts in computational quantum many-body physics will work together to develop new algorithms and methods for quantum many-body systems, optimized for different high-performance computing models and platforms such as GPU clusters. The codes developed in this project will be used in numerical simulations of strongly correlated models, with which one can explore exotic quantum phases and quantum entanglement in many-body systems.

World-class libraries will be built and opened to the public to increase the use of the software and the international visibility of our team. Furthermore, with the massive computational power of GPUs, the developed packages and tools are expected to produce breakthroughs in the fields of quantum mechanics, quantum information, and quantum computation. Future international collaborations could also be built on the constructed platform.


Development of Quantitative System for Risk Analysis in Finance

This project has two research units: the "Volatility Information Platform" and the "Nvidia-NTHU Joint Lab on Computational Finance." These two stand-alone units are responsible for conducting academic research, developing new technologies, engaging potential industry partners through joint industry-academia activities, and forming a cooperative alliance to solve real problems from industry.

The "Volatility Information Platform" is equipped with an interactive interface and built on cloud computing. Currently it has four alliance members: database companies such as Infotimes and Cmoney, and trading companies such as Yunta Futures and UliAssets. The "Nvidia-NTHU Joint Lab on Computational Finance" has three foreign alliance members: the hardware company Nvidia (USA), the software company NAG (UK), and the financial service and consulting company BRODA (UK). The first two are internationally well known.

Although the business natures of these alliance members differ, this diversification increases the likelihood of successful cooperation. It is our deep desire to create robust opportunities for mutual growth between academia and industry, from fundamental discussions with each business entity to sparking new cross-industry business ideas through our initiative.


Solving Large-scale Numerical Problems on GPU

This is a joint project to study how GPUs can be used to accelerate computations in large-scale linear systems, optimization problems, and medical imaging. Several research results have been obtained, as shown in the publication list of the 2012 achievement summary. First, new CPU/GPU hybrid multifrontal direct sparse solvers for unsymmetric and symmetric positive definite linear systems were proposed, and performance models for computation and communication were built and analyzed. Second, a novel variable-block-size auto-tuning scheme on hybrid CPU-GPU systems was proposed to improve the computational efficiency of MAGMA QR factorization. Third, GPU-parallelized particle swarm optimization (PSO) for box-constrained optimization problems was developed to find high-dimensional Latin hypercube designs on regular domains and uniform designs on irregular domains. Last, iterative medical image reconstruction was accelerated on GPUs to facilitate patient dose reduction, yielding an efficient reconstruction process for a recently developed compact, high-sensitivity positron emission tomography system consisting of two large-area panel detector heads.
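As an illustrative sketch of the PSO parallelization idea (hypothetical code, not the project's implementation): each particle's velocity and position update is independent of the others, so the per-particle work maps directly onto GPU threads. A vectorized NumPy version of box-constrained PSO makes this structure explicit:

```python
import numpy as np

def pso(objective, lower, upper, n_particles=64, iters=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over the box [lower, upper]^d.
    `objective` maps an (n_particles, d) array to n_particles values.
    Every per-particle update below is independent -- the work that
    would be assigned to one GPU thread each."""
    rng = np.random.default_rng(seed)
    d = len(lower)
    x = rng.uniform(lower, upper, size=(n_particles, d))  # positions
    v = np.zeros((n_particles, d))                        # velocities
    pbest, pbest_val = x.copy(), objective(x)             # personal bests
    g = pbest[np.argmin(pbest_val)]                       # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lower, upper)                  # enforce the box
        val = objective(x)
        better = val < pbest_val
        pbest[better], pbest_val[better] = x[better], val[better]
        g = pbest[np.argmin(pbest_val)]
    return g, pbest_val.min()
```

On a GPU, the per-particle updates run in parallel threads and only the global-best reduction requires synchronization, which is why PSO accelerates so well.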

Based on those results, the project will continue to investigate new high-performance numerical algorithms on GPUs. Specific directions include, but are not limited to, mixed-precision methods, multi-GPU optimization, fast numerical algorithms, randomized algorithms, and communication-avoiding methods.
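One of these directions, mixed-precision methods, can be illustrated by classic iterative refinement: perform the expensive solve in low precision, then correct it with residuals computed in high precision. A minimal NumPy sketch of the idea (illustrative only; in practice a single low-precision LU factorization would be reused across iterations, and the solves would run on the GPU):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b for a well-conditioned A: the solves run in float32
    (fast on GPUs), while residuals are accumulated in float64 so the
    final answer recovers double-precision accuracy."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # float64 residual
        d = np.linalg.solve(A32, r.astype(np.float32))   # float32 correction
        x += d.astype(np.float64)                        # refine the solution
    return x
```

The low-precision arithmetic does the bulk of the flops, while the cheap high-precision residual restores accuracy, the trade-off that makes these methods attractive on GPUs.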


Virtualization Techniques for ARM Processors

Virtualization technology is widely used in cloud computing, but today it is deployed mostly on x86-based platforms. For embedded systems, virtualization and its applications are still in their infancy. Gartner reports that virtualization on smartphones and other mobile devices has a bright future, and that by 2014 about half of all smartphones will ship with virtualization support.

This project has two goals. The first is to establish a "Virtualization Core Technology" research center at NTHU, which aims at connecting the related people and technology as well as training more experts in this field. The second is to develop ARM server virtualization technology. In the previous project, we developed a KVM-based hypervisor running on the ARM v6 and v7 architectures, called ARMvisor. ARMvisor is a pure software hypervisor that does not rely on hardware virtualization support. However, the new ARM Cortex-A15 processor provides hardware virtualization support. Therefore, in this project we will improve the performance of ARMvisor by developing core virtualization technology for the CPU, memory, and I/O, as well as fault-tolerance mechanisms for virtual machines.


A Mixed OpenMP/MPI Programming Framework for Hybrid CPU/GPU Cluster Computing

This project proposes a compound OpenMP/MPI programming framework called OMPICUDA for hybrid CPU/GPU computing architectures. With it, users can develop applications on hybrid CPU/GPU clusters using OpenMP and/or MPI, and select different resource configurations, such as CPU-only, GPU-only, or hybrid CPU/GPU, in different parallel regions of a program according to the properties of those regions. The proposed framework reduces the programming complexity of hybrid CPU/GPU architectures, since users do not need to learn GPU programming, and gives users a chance to further optimize the execution performance of their programs through proper resource selection. In addition, OMPICUDA supports resource sharing, load balancing, and resource reallocation: it can redirect the CUDA functions issued from different user applications to the same remote GPU for execution, and can automatically reallocate CPUs or GPUs to user programs according to resource availability by cooperating with the cluster's resource manager. Moreover, it offers a set of runtime functions for users to achieve load balancing on a hybrid CPU/GPU cluster.


Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs

In this research, we propose a Parallel Failureless Aho-Corasick (PFAC) algorithm on NVIDIA GPUs. To efficiently utilize the power of GPU acceleration, we introduce several throughput-oriented techniques, including reducing global memory transactions, reducing the latency of transition table lookups, eliminating output table accesses, avoiding shared-memory bank conflicts, coalescing writes to global memory, and enhancing communication between the CPU and GPU. Performance and memory usage are measured using attack patterns from Snort V2.8 and input streams from DEFCON. The experimental results show that PFAC achieves significant speedup over the AC algorithm and equivalent implementations on multicore CPUs. For matching Snort patterns over a 256 MB DEFCON packet trace, PFAC running on an NVIDIA GTX580 GPU achieves up to 143.16 Gbps throughput, 14.74 times faster than the Aho-Corasick algorithm running on an Intel Core i7-950 with OpenMP, and more than four times faster than state-of-the-art GPU approaches. The library of the PFAC algorithm is available on Google Code.
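The core idea of PFAC can be sketched in a few lines: the failure transitions of Aho-Corasick are removed, and one thread is assigned to each position of the input, matching only patterns that begin exactly there and terminating at the first missing transition. A simplified Python model of this logic (the CUDA kernel additionally applies the memory optimizations listed above):

```python
def build_trie(patterns):
    """Goto table only -- PFAC drops Aho-Corasick's failure transitions."""
    trie, out = [{}], {}          # node -> {char: next}, node -> pattern
    for p in patterns:
        node = 0
        for ch in p:
            nxt = trie[node].get(ch)
            if nxt is None:
                nxt = len(trie)
                trie[node][ch] = nxt
                trie.append({})
            node = nxt
        out[node] = p             # pattern ends at this node
    return trie, out

def pfac_match(text, trie, out):
    """One 'thread' per starting position: each walks the trie from the
    root and stops at the first missing transition (no failure links),
    reporting every pattern end it passes through."""
    matches = []
    for start in range(len(text)):      # on a GPU, each start is a thread
        node = 0
        for i in range(start, len(text)):
            node = trie[node].get(text[i])
            if node is None:
                break                   # failureless: the thread just exits
            if node in out:
                matches.append((start, out[node]))
    return matches
```

Because most threads terminate after very few transitions, the average work per thread is small, which is what makes the thread-per-position decomposition effective on a GPU.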


Bioinformatics

In the bioinformatics field, we have designed several CUDA programs for well-known bioinformatics tools and algorithms, including GPU-REMuSiC, CUDA-FRESCO, CUDA-ClustalW, Parallel Shellsort, CUDA-UPGMA, and CUDA-SWF. These programs can be used to solve fundamental biological problems, such as multiple sequence alignment, pattern search, next-generation sequencing, and classification. We are now designing a workbench (Bioinformatics Workbench in CUDA, B&Wc) with a user-friendly interface for biologists to use our CUDA programs; it will be released in the near future. We also plan to design a series of CUDA tools for computer-aided drug design.

Music Information Retrieval

Query by singing/humming (QBSH) is an intuitive method for music retrieval, in which users retrieve intended songs by singing or humming a portion of them. However, the computational load grows with the size of the database, so we have started to migrate our web-deployed QBSH system to the GPU. In our initial implementation, n blocks are used if there are n songs in the database, and the computational load for a query is divided equally among the threads in each block. We performed experiments and analysis to find the most suitable number of threads per block. With the best configuration, a speedup factor of 66 was achieved (see ref. [P11]), an important milestone for our web-deployed QBSH service.
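As a hypothetical illustration of the per-song work each GPU block performs (our system's actual matcher may differ), consider a minimal dynamic time warping (DTW) comparison between a query pitch contour and one song: comparing the query against every song in the database is embarrassingly parallel, which is why assigning one block per song scales well.

```python
import math

def dtw(query, song):
    """Dynamic time warping distance between two pitch contours.
    Warping absorbs tempo differences between the hummed query and
    the stored melody; this is the per-song work of one GPU block."""
    n, m = len(query), len(song)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - song[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a query note
                                 d[i][j - 1],      # skip a song note
                                 d[i - 1][j - 1])  # match both
    return d[n][m]
```

The database-wide minimum over all per-song distances then identifies the intended song.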

Computer Graphics

CUDA was used to build a real-time ray tracer in our research. It was used to generate the shadow rays, reflection rays, and refraction rays, as well as for texture processing. The peak performance reaches 34M rays per second on an NVIDIA 8800 GTS with CUDA 1.1. We found that memory access patterns dramatically affect performance; to reduce memory usage, we implemented a ray stack that efficiently reduces the maximum memory usage. We also compared our work with NVIDIA OptiX 2.0: in complex scenes, such as Sponza, our CUDA ray tracer is faster. CUDA is also used in our ongoing projects to accelerate fluid simulation and some global illumination processes.
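The ray-stack idea can be sketched as follows (hypothetical Python, with a simplified `Hit` record standing in for real scene intersection code): instead of recursive shading calls, secondary rays are pushed onto an explicit, bounded stack, so the maximum per-ray memory is fixed in advance, the property that keeps per-thread memory usage low on a GPU.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    emission: float       # light contributed at the hit point
    reflectivity: float   # fraction of energy carried by the secondary ray
    next_ray: object      # hypothetical secondary ray (None = absorbed)

def trace(primary_ray, intersect, max_depth=4, stack_size=16):
    """Iterative ray tracing with an explicit bounded stack: secondary
    rays are pushed instead of recursing, so the maximum memory per
    primary ray is fixed by `stack_size` rather than recursion depth."""
    color, stack = 0.0, [(primary_ray, 1.0, 0)]  # (ray, weight, depth)
    while stack:
        ray, weight, depth = stack.pop()
        hit = intersect(ray)
        if hit is None:
            continue                              # ray left the scene
        color += weight * hit.emission            # local shading term
        if (hit.next_ray is not None and depth + 1 < max_depth
                and len(stack) < stack_size):
            stack.append((hit.next_ray, weight * hit.reflectivity, depth + 1))
    return color
```

A full tracer would push both reflection and refraction rays at each hit; the bounded-stack mechanism is the same.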

HPC and GPU Performance Optimization

Besides adapting various applications to the GPU, developing new techniques to optimize CUDA performance on the GPU is also one of our research foci. In 2010, Lung-Sheng Chien, a former graduate student in our department, proposed the fastest hand-tuned SGEMM code on the GT200, achieving over 70% of peak performance. We also proposed two new techniques, data compression and data streaming, to reduce communication cost on the GPU by reducing the data size and overlapping communication with computation. With these two techniques, the rectangle intersection query problem achieves over a 30-fold speedup. Recently, we investigated the effect of synchronization on the GPU and discovered that artificial barrier synchronization can speed up some applications.
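The data-streaming technique follows the familiar double-buffering pattern: while chunk i is being computed, chunk i+1 is already in transit, hiding communication behind computation, much as CUDA streams overlap host-device copies with kernels. A host-side Python sketch of the pattern (illustrative only; `transfer` and `compute` are hypothetical callbacks):

```python
from concurrent.futures import ThreadPoolExecutor

def streamed_process(chunks, transfer, compute):
    """Pipeline a list of chunks: each transfer runs on a background
    thread while the previous chunk is computed, so communication and
    computation overlap (the double-buffering idea behind CUDA streams)."""
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(transfer, chunks[0])     # first transfer
        for nxt in chunks[1:]:
            data = pending.result()                  # wait for chunk i
            pending = io.submit(transfer, nxt)       # chunk i+1 in flight
            results.append(compute(data))            # overlapped compute
        results.append(compute(pending.result()))    # drain the pipeline
    return results
```

With transfer and compute times roughly balanced, the pipeline hides almost all of the communication cost, the same effect the project exploits on the GPU.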