

# FPGA ACCELERATED HPC AND DATA ANALYTICS

Mike Strickland, Data Center Solution Architect, Intel Programmable Solutions Group

December 2018

# **NOTICES AND DISCLAIMERS**

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

No product or component can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit <u>http://www.intel.com/benchmarks</u>.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <a href="http://www.intel.com/benchmarks">http://www.intel.com/benchmarks</a>.

Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX)\* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel<sup>®</sup> Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at <a href="http://www.intel.com/go/turbo">http://www.intel.com/go/turbo</a>.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

© 2018 Intel Corporation.







- Intro: Scale Out and Scale Up with FPGA
- Programmer Tool Chain
  - Financial Library API Example
- Libraries & Turnkey Solutions



## MICROSOFT EXA-OP WITH FPGAS (IGNITE SEPT 2016)



Screenshot of the Microsoft Translator demo (V1.00.23400.11102):

| Field            | Value                                                                                                                     |
|------------------|---------------------------------------------------------------------------------------------------------------------------|
| Data source      | Wikipedia (English version), publisher: Wikimedia Foundation                                                              |
| Description      | A free online encyclopedia that anyone can edit, and the largest and most popular general reference work on the Internet |
| Articles         | >5.2 million                                                                                                              |
| Words            | ~3.1 billion                                                                                                              |
| Translate to     | Spanish                                                                                                                   |
| Processor type   | Azure FPGA Server (SV4-D5-1U)                                                                                             |
| Model            | 10 CPU cores + 4 FPGAs, Stratix V D5 accelerator                                                                          |
| Peak power/unit  | 240 Watts                                                                                                                 |
| Compute capacity | 1 Exa-op (1,000,000 Tera-ops)                                                                                             |
| Estimated time   | 0.098 seconds                                                                                                             |
| Pages per second | 78,120,000                                                                                                                |
| Pages translated | 0                                                                                                                         |

## Translate every Wikipedia English page to another language in the blink of an eye

Link to video (~ minute 55)



## **MSFT SINGLE FPGA ALGORITHM, NETWORKING, & DATA ACCESS ACCELERATION**





Five day query throughput and latency of ranking service queries running in production, with and without FPGAs enabled.

#### Microsoft Scale Out FPGA Multi-Function Accelerator

- "Diversity of cloud workloads and ... rapid ... change" (weekly or monthly)
  - Search, SmartNIC, Machine Learning, Encrypt, Compress, Big Data Analytics,...
- Bing Search: 2X server level perf, 29% latency reduction, 10% increase in power<sup>(1)</sup>
- Networking Virtualization: 10X latency improvement, 2X perf for many DB and OLTP workloads <sup>(2)</sup>
- Machine Learning: Stratix 10 capable of 90 TFLOPs 8 bit floating point <sup>(3)</sup>



# **WHAT FPGA DOES WELL**

- Custom Processing Pipelines
- Variable Precision Arithmetic
- Heterogeneous Dataflows
- Diverse Memory Hierarchy

- Multiple Workloads
- High-bandwidth data caching
- Look-aside AND In-line Acceleration
- Parallel processing

| LOW LATENCY        | INHERENTLY PARALLEL | HIGH PERFORMANCE |
|--------------------|---------------------|------------------|
| VARIABLE PRECISION | REPROGRAMMABLE      | ENERGY EFFICIENT |




## Applications Acceleration Overview

Framework or APIs with OpenCL underneath option

### **INTEL<sup>®</sup> OPENVINO<sup>™</sup>**

With FPGA acceleration option

### DATA ANALYTICS

- Open Relational D/B
  - Data Warehouse, Real Time
- Cassandra NoSQL, ElasticSearch
- Hadoop/Spark

### **HPC: PROGRAMMER API**

- Genomics, Financial
- Government pattern matching
- Video transcode
- Emerging: oil & gas

#### Different Data Store Approaches







## Infrastructure Acceleration Overview

### **NETWORKING & DATA ACCESS**

- In-line advantage over look-aside
- Compression, Encryption, Dedupe
- Virtualization or complete network stack
- Torus & inline acceleration

### NVME OVER ROCE WITH ACCELERATORS

Attala CPU offload & inline acceleration





## SCALE OUT: NOVO-G 3D TORUS

### Developed and deployed at CHREC

- Most powerful reconfigurable computer in research community (2009-present)
  - 448 (soon 512) high-end Altera FPGAs with 3.5TB (soon 4.5TB) of FPGA-attached SDRAM
- 2012 Alexander Schwarzkopf Prize for Technology Innovation @ NSF

### App acceleration

 Computer vision, finance, bioinformatics, molecular dynamics, crypto, et al.

### Hardware emulation

 Behavioral emulation of future apps and systems, up to Exascale

### 2015 - 2016

- 64 Altera Stratix-V D8 FPGAs
  - On Gidel ProcE-V PCIe boards
  - Additional 64 in development
- 3D torus interconnect (4x4x4)
  - 6 links per Stratix-V (40 Gb/s per link)
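The 4x4x4 torus described above can be sketched in a few lines. This is only an illustrative model, assuming nodes are addressed by (x, y, z) coordinates and that each of the six links wraps around its ring; the function names and the addressing scheme are assumptions, not taken from the Novo-G design.

```python
from itertools import product

DIMS = (4, 4, 4)   # torus dimensions from the slide
LINK_GBPS = 40     # per-link rate from the slide


def neighbours(node, dims=DIMS):
    """Return the six wrap-around neighbours of node = (x, y, z)."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % dims[axis]
            result.append(tuple(coords))
    return result


def hop_distance(a, b, dims=DIMS):
    """Minimum hop count, taking the shorter way around each ring."""
    return sum(min((p - q) % d, (q - p) % d) for p, q, d in zip(a, b, dims))


nodes = list(product(*(range(d) for d in DIMS)))
# The torus is symmetric, so the farthest node from the origin gives the diameter.
diameter = max(hop_distance((0, 0, 0), n) for n in nodes)

print(len(nodes), "nodes,", len(neighbours((0, 0, 0))), "links per node")
print("aggregate link bandwidth per node:", 6 * LINK_GBPS, "Gb/s")
print("network diameter:", diameter, "hops")
```

The wrap-around links are what keep the worst-case path short: without them, a 4x4x4 mesh would need up to 9 hops corner to corner.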



- 2011: 96 Altera Stratix-IV E530 FPGAs, each with 8.50GB SDRAM, on Gidel ProcStar-IV cards
- 2012: 96 Altera Stratix-IV E530 FPGAs, each with 8.50GB SDRAM, on Gidel ProcStar-IV cards
- 2014: 32 Altera Stratix-V D8 FPGAs, each with 16GB SDRAM and 3D torus i/f, on Gidel ProcE-V card
- 2015: 32 Altera Stratix-V D8 FPGAs, each with 16GB SDRAM and 3D torus i/f, on Gidel ProcE-V card

3D torus interconnect



Made possible by support from UF, Altera, Gidel, NSF, and DOE





### 520N PCIe FPGA Board

#### Intel Stratix 10 FPGA

- GX2800 F1760 NF43
- -1 or -2 SerDes and FPGA speed grades

#### PCIe <sup>3</sup>/<sub>4</sub> Length, Dual Width

 Gen3 x16, standard height, <sup>3</sup>/<sub>4</sub> length

#### Four 100G QSFPs

 Range of line rates include 4x 40/100G or 16x 10/25G

#### 4x DDR4 Banks

2400MT/s, up to 32GB total

### **OpenCL HPC BSP**









SCALE YOUR INNOVATION







# INTEL<sup>®</sup> EMBEDDED MULTI-DIE INTERCONNECT BRIDGE (EMIB) TECHNOLOGY

[Figure: Intel Stratix 10 package cross-section. Labels: package lid, FPGA die (Intel HyperFlex FPGA Architecture), DRAM, EMIB, package substrate. Flip-chip pitch > 100 µm; microbump pitch 55 µm.]




## SCALE UP: STRATIX <sup>®</sup> 10 MX

**Development Kit Contents** 

- Development board
  - Stratix <sup>®</sup> 10 MX FPGA, 2.1M LE, 8GB HBM2
- 2x QSFP28 cages
- 2GB DDR4 onboard
- Hi-Lo and DIMM Connectors for DDR4
- PCIe Gen3 x16
  - Endpoint via edge connector
  - Rootport via slot

Intel.com page: (link)





## SCALE UP: FALCON MESA NEXT GENERATION 10 NM FPGAS

### **CONTINUING PRODUCT LEADERSHIP**

- Built on Intel Custom Foundry 10 nm platform
- 2<sup>nd</sup> Generation Intel<sup>®</sup> HyperFlex<sup>™</sup> Architecture
- 2<sup>nd</sup> Generation EMIB-based heterogeneous SiP
- Next Generation HBM Support
- Up to 112 Gbps Transceiver Rates
- PCI-Express Gen4 x16 Support

## Falcon Mesa

10nm FPGAs Built on World's Most Advanced FinFET Process

## **Delivering Industry Leading Performance and Power**

# **PROGRAMMER TOOL CHAIN**






## **COMMON HETEROGENEOUS PROGRAMMING ENVIRONMENT**

Common environment for heterogeneous programming

- Plugin to Intel<sup>®</sup> System Studio and Intel Parallel Studio
- CPU, GPU, FPGA, ...

Easy path to FPGA

- Already familiar environment
- Intel<sup>®</sup> Developer Zone






## **OPENCL<sup>™</sup> "PROGRAMMER FRIENDLY" ACCELERATION**

### **Software Programmers**

Need Logic and Data Management

By writing lines of code

### **OpenCL<sup>™</sup> Compiler Benefits**

- Ease of use
- Scalable
- Heterogeneous
- Leverage existing libraries
- Vendor choice w/open standards
- Foundation for HLS & SPIR

### **Channels/Pipe Extension**

- Kernel → Kernel
- External IO → Kernel
- Mix 'n Match HDL & Kernels
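The kernel-to-kernel channel idea can be illustrated with a software analogy. The sketch below is plain Python, not OpenCL: two threads stand in for kernels and a bounded queue stands in for the on-chip FIFO, so data flows from one stage to the next without a round trip through global memory. The stage names, the FIFO depth, and the sentinel are all assumptions made for illustration.

```python
import queue
import threading

FIFO_DEPTH = 8    # hypothetical channel depth
DONE = object()   # sentinel marking end of stream


def scale_kernel(samples, out_channel):
    """Stage 1: multiply each sample by 2 and push it into the channel."""
    for s in samples:
        out_channel.put(s * 2)   # blocks while the FIFO is full
    out_channel.put(DONE)


def accumulate_kernel(in_channel, results):
    """Stage 2: consume values as they arrive and keep a running sum."""
    total = 0
    while True:
        item = in_channel.get()  # blocks while the FIFO is empty
        if item is DONE:
            break
        total += item
    results.append(total)


def run_pipeline(samples):
    """Run both stages concurrently, connected by one bounded channel."""
    channel = queue.Queue(maxsize=FIFO_DEPTH)
    results = []
    stages = [
        threading.Thread(target=scale_kernel, args=(samples, channel)),
        threading.Thread(target=accumulate_kernel, args=(channel, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results[0]


print(run_pipeline(range(100)))
```

The bounded queue gives the same back-pressure behaviour a hardware FIFO would: the producer stalls when the consumer falls behind, and neither stage needs to know how the other is implemented.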








## **LEVERAGING SOFTWARE DEVELOPMENT ENVIRONMENT**





## **API CALL + OPENCL OPTION**

### FinLib Example

- High level C++ APIs
- OpenCL implementation
- "Quants" can use high level APIs
- Similar approach used for PairHMM




# **FPGA IN COMPUTING MAINSTREAM**

## **INGREDIENTS NEEDED TO MAKE FPGA IN COMPUTING "MAINSTREAM"**



Ecosystem Partners & Integrators

**OEM** Qualification

**User Application** 

Data Framework (e.g. Apache Spark\*)

Scalable Functions (e.g. PDE Solver)

Library Primitives (Mathematics, Statistics)

Developer SDKs (e.g. OpenCL<sup>™</sup>)

Acceleration Stack, Drivers, BSPs & Interface IP

**Boards and Platforms** 

FPGA Silicon & FPGA Design Tools




## **ACCELERATION STACK INGREDIENTS: OVERVIEW**



## **OPEN PROGRAMMABLE ACCELERATION ENGINE (OPAE) TECHNOLOGY**

Simplified FPGA Programming Model for Application Developers

#### **Consistent API across product generations and platforms**

- Abstraction for hardware-specific FPGA resource details

#### Designed for minimal software overhead and latency

- Lightweight user-space library (libfpga)


#### Open ecosystem for industry and developer community

License: FPGA API (BSD), FPGA driver (GPLv2)

#### FPGA driver being upstreamed into Linux\* kernel

Supports both virtual machines and bare metal platforms

Faster development and debugging of Accelerator Functions with the included AFU Simulation Environment (ASE)\*\*

Includes guides, command-line utilities and sample code



### Start developing for Intel<sup>®</sup> FPGAs with OPAE today: <u>http://github.com/OPAE</u>

\*\*ASE requires Acceleration Functions written in RTL and a properly installed RTL simulator: Synopsys VCS-MX, Mentor Graphics' ModelSim-SE\*/QuestaSim. Some names pending final approval and may change in the future. Red Hat Enterprise Linux\* 7.3 w/ kernel 4.7; Intel<sup>®</sup> Xeon<sup>®</sup> processors v4 or newer.



## INTEL® PROGRAMMABLE ACCELERATION CARD WITH INTEL® ARRIA® 10 GX FPGA (AKA "RUSH CREEK"<sup>1</sup>)



#### **Board Management Controller**

- Voltage, current, temperature monitoring
- Power sequencing and reset
- Platform Level Data Model (PLDM)

#### Power

- 70W TDP, 45W FPGA
- 650 LFM at Tla 55°C, passively cooled

Specifications preliminary and are subject to change

#### Arria<sup>®</sup> 10 GX FPGA [10AX115N2F40E2LG]

- High-perf, multi-gigabit SerDes transceivers up to 15 Gbps
- 1150K logic elements available (-2L speed grade)
- 53 Mb of embedded memory

#### **On-board Memory**

- 8 GB DDR4 memory with ECC (2 banks), 2400 Mbps
- 1 Gbit (128 MB) flash

#### **Interfaces & Dimensions**

- PCIe x8 Gen3 electrical, x16 mechanical \*
- USB 2.0 interface for debug and programming of FPGA and flash
- 1x QSFP with 4x 10GbE or 40GbE support
- ½ length, ½ height, 1RU

#### Software

- Acceleration Stack for Intel<sup>®</sup> Xeon<sup>®</sup> CPU with FPGAs
- FPGA Interface Manager installed



## INTEL® PROGRAMMABLE ACCELERATION CARD WITH INTEL® STRATIX® 10 FPGA GX (AKA "DARBY CREEK"<sup>2</sup>)

#### Versatile Workload Acceleration

• Customizable Hardware Architecture using Intel<sup>®</sup> Stratix<sup>®</sup> 10 FPGA GX

### High Performance with Intel® Stratix® 10 FPGA GX

- 2.8M logic elements available with 229Mb of embedded memory
- 32GB DDR4 Memory with ECC (4 banks), 2400 Mbps

### High Data Ingestion and Lower Latency

- PCIe\* x16 v3 with SRIOV support
- 2x QSFP with 100GbE support

### **PCIe\* Form Factor Compliant**

- Dual slot, three-fourths length, full height
- 225W TDP Passively Cooled
- Intel Max<sup>®</sup> 10 based Board Management Controller
  - Configuration, telemetry, and remote update

<sup>1</sup> Rush Creek: 1.1M LE A10, PCIe Gen 3 x8, 8G DDR4, 40GbE

<sup>2</sup> Darby Creek: 2.8M LE S10, PCIe Gen 3 x16, 32G DDR4, 100GbE


# FINANCIAL LIBRARY API EXAMPLE

## PDE SOLVER - DOUBLE NO-TOUCH (DNT) OPTION PRICING ENGINE

How much should the bank charge the Investor as **Option Price?** 
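To make the pricing question concrete, here is a minimal sketch of one way a double no-touch price can be computed with a PDE solver. It is not the solver described in this deck: it assumes a Black-Scholes model with constant volatility and rate, works in log-spot, and uses a simple explicit finite-difference scheme. The function name, parameters, and grid sizes are all hypothetical.

```python
import math


def dnt_price(spot, lower, upper, vol, rate, expiry, payout=1.0, n_space=100):
    """Explicit finite-difference price of a double no-touch option.

    Solves V_tau = 0.5*vol^2*V_xx + (rate - 0.5*vol^2)*V_x - rate*V in
    x = ln(S), with V = 0 on both barriers and V = payout at expiry
    everywhere strictly inside the corridor.
    """
    if not lower < spot < upper:
        return 0.0                      # a barrier has already been touched
    x_lo, x_hi = math.log(lower), math.log(upper)
    dx = (x_hi - x_lo) / n_space
    # Explicit schemes need dt <= dx^2 / vol^2; use a safety factor of 0.4.
    n_time = max(1, math.ceil(expiry * vol * vol / (0.4 * dx * dx)))
    dt = expiry / n_time
    drift = rate - 0.5 * vol * vol
    a = 0.5 * vol * vol * dt / (dx * dx)
    b = 0.5 * drift * dt / dx
    down, mid, up = a - b, 1.0 - 2.0 * a - rate * dt, a + b

    values = [payout] * (n_space + 1)
    values[0] = values[-1] = 0.0        # knocked out on the barriers
    for _ in range(n_time):
        new = [0.0] * (n_space + 1)
        for i in range(1, n_space):
            new[i] = down * values[i - 1] + mid * values[i] + up * values[i + 1]
        values = new

    # Linear interpolation of the grid solution at ln(spot).
    pos = (math.log(spot) - x_lo) / dx
    i = min(int(pos), n_space - 1)
    w = pos - i
    return (1.0 - w) * values[i] + w * values[i + 1]


print(round(dnt_price(spot=1.0, lower=0.9, upper=1.1,
                      vol=0.10, rate=0.02, expiry=0.25), 4))
```

The inner loop is the part that maps naturally onto an FPGA pipeline: every grid point applies the same three-point stencil, and a batch of independent options can be streamed through it back to back.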





## **CREATING A PDE SOLVER IN FPGA**

FPGA used to provide a solver for a particularly computationally challenging workload

- Intent: Improve time to results
  - More results (Present Value of Options (PVs)) in a given amount of time or compute resource
- Starting point: C-model implementation of PDE Solver created (880 lines of C code)
- End point: Optimised OpenCL<sup>™</sup> implementation (920 lines of OpenCL)

| Task                                  | Dev. Time | Result      |
|---------------------------------------|-----------|-------------|
| Convert C model to OpenCL             | 2 weeks   | 142 PV/s**  |
| Optimise pipeline                     | 1 week    | 174 PV/s**  |
| New C Code + OpenCL optimisations     | 2 weeks   | 387 PV/s**  |
| Scale infrastructure (4 x FPGA Cards) | 1 week    | 1511 PV/s** |
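The ratios implied by the table can be worked out directly. The short script below only rearranges the figures already shown; the variable names are illustrative, and the "scaling efficiency" is simply the four-card result divided by four times the single-card result.

```python
# (task, development weeks, throughput in PV/s) taken from the table above
steps = [
    ("Convert C model to OpenCL", 2, 142),
    ("Optimise pipeline", 1, 174),
    ("New C code + OpenCL optimisations", 2, 387),
    ("Scale infrastructure (4 x FPGA cards)", 1, 1511),
]

baseline = steps[0][2]
total_weeks = 0
for task, weeks, pv_per_s in steps:
    total_weeks += weeks
    print(f"week {total_weeks}: {pv_per_s:5d} PV/s "
          f"({pv_per_s / baseline:.2f}x the first OpenCL port) - {task}")

single_card = steps[2][2]
four_cards = steps[3][2]
scaling_efficiency = four_cards / (4 * single_card)
print(f"4-card scaling efficiency: {scaling_efficiency:.1%}")
```

Going from one card to four keeps roughly 98% of ideal linear scaling, and the end result is about 10.6 times the throughput of the first OpenCL port after six weeks of work in total.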



### Throughput Averaged Over 50 Consecutive Batches (1940 PDEs/batch)





# **INTEL LIBRARIES & TURNKEY EXAMPLES**

## **INTEL AI FOR COMPUTE**






## **WHY FPGAS WIN IN DEEP LEARNING**

### FIRST TO MARKET TO ACCELERATE EVOLVING AI WORKLOADS

- Adversarial Networks
- Reinforcement Learning
- Neuromorphic computing





### LOW LATENCY MEMORY CONSTRAINED WORKLOADS

- RNN
- LSTM
- Speech WL



### FLEXIBLE SYSTEM LEVEL FUNCTIONALITY WITH DETERMINISTIC LATENCY

- AI+I/O ingest
- AI+Networking
- AI+Security
- AI+Pre/Post processing

RNN: Recurrent Neural Network; LSTM: Long Short-Term Memory



## **PUBLIC INTEL FPGA MACHINE LEARNING SUCCESS**





**Microsoft:** Microsoft has revealed that Intel FPGAs have been installed across every Azure cloud server, creating what Microsoft is calling the world's first AI supercomputer.

**NEC:** To create the NeoFace Accelerator, the engine software IP was integrated into an Intel Arria 10 FPGA, which operates in Xeon processor–based servers.

**JD.COM:** Arria<sup>®</sup> 10 FPGA can achieve a significant improvement in LSTM accelerator card performance compared to GPU.

**Inspur/iFLYTEK:** Server vendor Inspur Group and Intel launched a speech recognition acceleration solution based on Intel's Arria<sup>®</sup> 10 FPGAs and a DNN algorithm from iFLYTEK.





## **EVOLVING DEEP LEARNING REQUIREMENTS**






## INTEL<sup>®</sup> FPGA DEEP LEARNING ACCELERATION SUITE




### **CPU + FPGA ACCELERATE AI INFERENCE**



#### GET AN EVEN BIGGER PERFORMANCE BOOST WITH INTEL® FPGA

<sup>1</sup>Depending on workload, quality/resolution for FP16 may be marginally impacted. A performance/quality tradeoff from FP32 to FP16 can affect accuracy; customers are encouraged to experiment to find what works best for their situation. Performance results are based on testing as of June 13, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit <u>www.intel.com/benchmarks</u>. **Configuration**: Testing by Intel as of June 13, 2018. Intel® Core™ i7-6700K CPU @ 2.90GHz fixed, GPU GT2 @ 1.00GHz fixed. Internal ONLY testing, Test v3.15.21 – Ubuntu\* 16.04, OpenVINO 2018 RC4, Intel® Arria® 10 FPGA 1150GX. Tests were based on various parameters such as model used (these are public), batch size, and other factors. Different models can be accelerated with different Intel hardware solutions, yet use the same Intel software tools.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice: revision #20110804





### AI FUTUREPROOFING WITH BFLOAT16 SUPPORT
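As background for this slide, bfloat16 keeps the sign bit and all 8 exponent bits of an IEEE FP32 value and shortens the mantissa to 7 bits, so it trades precision for FP32-like range. The sketch below illustrates that trade-off in software, using Python's `struct` module for the bit conversions. The helper names are hypothetical, NaN inputs are not handled, and none of this reflects how any Intel FPGA implements the format.

```python
import struct


def float_to_bfloat16_bits(value):
    """Round a float to bfloat16 and return the 16-bit pattern (NaN not handled)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    # Round to nearest even on the 16 low-order bits that are dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16):
    """Expand a bfloat16 bit pattern back to a float by zero-filling the low bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


def to_bfloat16(value):
    """Value after a round trip through bfloat16."""
    return bfloat16_bits_to_float(float_to_bfloat16_bits(value))


def to_float16(value):
    """Value after a round trip through IEEE half precision (FP16)."""
    return struct.unpack(">e", struct.pack(">e", value))[0]


for x in (1.001, 3.14159, 1000.5):
    print(f"{x}: bfloat16 -> {to_bfloat16(x)}, fp16 -> {to_float16(x)}")

# bfloat16 still represents magnitudes far beyond the FP16 maximum of 65504.
print("1e30 as bfloat16:", to_bfloat16(1e30))
```

Near 1.0 the bfloat16 spacing is 2^-7 while FP16 resolves 2^-10, which is why 1.001 collapses to 1.0 in bfloat16 but survives (approximately) in FP16.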








### FINLIB IS IN INTEL <sup>®</sup> QUARTUS <sup>®</sup> 17.0!

**Option Model and Statistical Library Technical Specifications** 

Numerical Libraries Group, Intel-PSG

March 20, 2017

Contents

- 1 Financial Library Functions
  - …
  - Simple Cox-Ross-Rubinstein binomial tree model for American exercise options on futures underlier
  - Bjerksund and Stensland 2002 closed-form American option pricing
  - Garman and Kohlhagen (1983) currency options
  - Curran's (1994) semi-closed form model for pricing average rate options
  - Kirk's (1995) semi-closed form model for pricing options on the spread between two asset prices
  - To find a value for FwdPrice given an option premium
- 2 Statistical Library Functions
  - 2.1 Binomial probability
  - 2.2 Binomial density
  - 2.3 Cumulative binomial
  - 2.4 Binomial coefficient
  - 2.5 Cumulative normal distribution
  - 2.6 Cumulative normal distribution - high precision
  - 2.7 Cumulative normal distribution
  - 2.8 Inverse standard normal distribution
  - 2.9 Uniform distribution

#### Option pricing

- European, American, Equities, Average rate, Spread normal, Spread lognormal
- Statistical functions
  - Norm\_std(), norm\_cdf(), norm\_icdf()....
- Working Demo accessible from the Intel labs
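To show what a "high-level API call" for European option pricing might look like from the quant's side, here is a small sketch. It borrows the name `norm_cdf` from the list above but is otherwise an assumption: the signatures, the Black-Scholes closed form, and the pure-Python implementation are illustrative and are not the FinLib API.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1_d2(spot, strike, rate, vol, expiry):
    """Black-Scholes d1 and d2 terms (no dividends)."""
    sigma_root_t = vol * math.sqrt(expiry)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * expiry) / sigma_root_t
    return d1, d1 - sigma_root_t


def european_call(spot, strike, rate, vol, expiry):
    """Black-Scholes price of a European call."""
    d1, d2 = _d1_d2(spot, strike, rate, vol, expiry)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * expiry) * norm_cdf(d2)


def european_put(spot, strike, rate, vol, expiry):
    """Black-Scholes price of a European put."""
    d1, d2 = _d1_d2(spot, strike, rate, vol, expiry)
    return strike * math.exp(-rate * expiry) * norm_cdf(-d2) - spot * norm_cdf(-d1)


call = european_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, expiry=1.0)
put = european_put(spot=100.0, strike=100.0, rate=0.05, vol=0.2, expiry=1.0)
print(f"call = {call:.4f}, put = {put:.4f}")
```

The caller only sees a function that takes market inputs and returns a price; whether the body runs on a CPU or is dispatched to an OpenCL kernel on an FPGA is hidden behind that interface.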




In 17.0 release

### **GENOMICS - GATK ACCELERATION**

#### **FPGA Acceleration in GATK**

- Targets PairHMM full integration
- Latest Intel Benchmark

| Configuration                                                             | PairHMM | CPU<br>Cores<br>Used | Peak Perf<br>(GCUPS) | Average Perf<br>(GCUPS) |
|---------------------------------------------------------------------------|---------|----------------------|----------------------|-------------------------|
| 2 Socket Intel® Xeon® Processor E5<br>v4 (Note 7)                         | AVX     | 1                    | 0.699                | 0.676                   |
| 2 Socket Intel® Xeon® Processor E5<br>v4 (Note 7)                         | AVX     | 44                   | 22.0                 | 21.2                    |
| 2 Socket Intel® Xeon® Processor E5<br>v4 + Intel® Arria® 10 FPGA (Note 7) | OpenCL  | 1                    | 44.1                 | 32.4                    |
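The speedups implied by the table can be read off with a few divisions. The script below uses only the GCUPS figures shown above; the dictionary keys are illustrative labels.

```python
# Peak and average PairHMM throughput in GCUPS, from the table above.
benchmarks = {
    "cpu_1_core_avx": {"peak": 0.699, "avg": 0.676},
    "cpu_44_cores_avx": {"peak": 22.0, "avg": 21.2},
    "fpga_opencl_1_core": {"peak": 44.1, "avg": 32.4},
}


def speedup(new, old, metric="avg"):
    """Ratio of two configurations on the chosen metric."""
    return benchmarks[new][metric] / benchmarks[old][metric]


fpga_vs_44_avg = speedup("fpga_opencl_1_core", "cpu_44_cores_avx", "avg")
fpga_vs_44_peak = speedup("fpga_opencl_1_core", "cpu_44_cores_avx", "peak")
fpga_vs_1_avg = speedup("fpga_opencl_1_core", "cpu_1_core_avx", "avg")
cpu_scaling = speedup("cpu_44_cores_avx", "cpu_1_core_avx", "avg") / 44

print(f"FPGA vs 44 AVX cores: {fpga_vs_44_avg:.2f}x average, {fpga_vs_44_peak:.2f}x peak")
print(f"FPGA vs 1 AVX core:   {fpga_vs_1_avg:.1f}x average")
print(f"CPU scaling efficiency across 44 cores: {cpu_scaling:.0%}")
```

On these numbers, one host core plus the FPGA delivers about 1.5 times the average throughput of all 44 AVX cores, which leaves the remaining cores free for the rest of the pipeline.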



Performance results are based on testing as of March 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. System Configuration: <u>(Click for top notes)</u>



## **PARTNER LIBRARIES & TURNKEY EXAMPLES**

### **FALCON ACCELERATED GENOMICS PIPELINE**

Adaptive cross-hardware platform to provide a path to efficient & cost-effective genome analysis



[Figure: Performance Comparison for GATK WGS from Alignment to Variant Calling (lower is better). Series: Original GATK 4.0, Intel GATK 3.4, Intel GATK 4.0, Falcon Accelerated Genomics Pipeline. Platforms: on-prem and Alibaba Cloud.]

Falcon Accelerated Genomics Pipelines

- End-to-end solution with performance optimization
- Accelerates GATK best practices; no proprietary pipelines
- Supports multiple GATK versions (3.8 & 4.0)
- Both germline and somatic best practices

"Falcon solution is so fast! What had taken me over a week to do on my computer cluster, I was able to do with the Falcon-accelerated Genomics pipeline in a few hours." (Amy Cummings, MD, UCLA Medicine)


### **FPGAS OFFER UNIQUE VALUE FOR ANALYTICS/STREAMING**

#### Single Multi-function Accelerator



Offloads algorithm, networking, and data access processing

#### Moderate Acceleration is common

PCIe lookaside acceleration (two data copies)

#### Significant Acceleration requires FPGA

- Multifunction and inline w/single FPGA
- Relational: 2.3X TPC-H w/Swarm64<sup>4</sup>
  - PostgreSQL, MariaDB, MySQL, ...
- NoSQL: 4X Cassandra<sup>5</sup> w/rENIAC (80/20 R/W)
- Hadoop/Spark: 3X+ data streaming<sup>6</sup> w/Bigstream, Megh



### **DIFFERENT DATA STORE APPROACHES**





### **SWARM64 RELATIONAL DATABASE ACCELERATION**


TWO WORKLOADS: TRADITIONAL DATA WAREHOUSING, REAL TIME DATA ANALYTICS

#### Database acceleration with a plugin





#### **Acceleration Overview**

- 20X+ single table inserts/s for real time data analytics
  - With modest tuning, 15M PostgreSQL INSERT/s
- 2.3X\* TPC-H data warehousing on Arria 10
  - 3X+ TPC-H for many CSP hosting configs
- 3X+ storage compression
  - Data & tables managed by Swarm64

Note: this is SQL to relational d/b, not SQL to semi/unstructured data.

Note \*: TPC-H SF1000, Dual Intel<sup>®</sup> Xeon<sup>®</sup> Gold 6130, 2.10 GHz, (12) 32GB DDR4-2166, (4) 960GB SSD RAID 0 HPE MK000960GWJPP

Source: Swarm64




### rENIAC

### NOSQL: SYSTEM & IO ACCELERATION OPPORTUNITY

Source: rENIAC CEO

- Connection Management
- Compression/Encryption
- Book Keeping
- Data Encode/Decode





Business Logic: **25%**
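One way to read the 25% business-logic share is through Amdahl's law. The sketch below assumes, purely for illustration, that the remaining 75% of time goes to the system and I/O work listed above and that all of it can be offloaded; the slide itself does not state this split, and the function name is hypothetical.

```python
def amdahl_speedup(offload_fraction, offload_speedup):
    """Overall speedup when only offload_fraction of the work is accelerated."""
    remaining = 1.0 - offload_fraction
    return 1.0 / (remaining + offload_fraction / offload_speedup)


BUSINESS_LOGIC = 0.25                 # share shown on the slide
OVERHEAD = 1.0 - BUSINESS_LOGIC       # assumed offloadable share

for factor in (2, 4, 10, float("inf")):
    overall = amdahl_speedup(OVERHEAD, factor)
    print(f"overhead accelerated {factor}x -> overall {overall:.2f}x")
```

Under that assumption the ceiling is 1 / 0.25 = 4x, reached only if the offloaded work becomes effectively free; anything beyond that would require speeding up the business logic itself.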




### **RENIAC DISTRIBUTED DATA ENGINE/SWITCH (RDS)**

### 4X+ CASSANDRA (80% R/20% W), POCS OF 2X, GOING TO 4X BY FEB



#### Overview

- No customer application change
- Plug-in card with 10GbE
  - Proxy tier or on database server
- Distributed cache, proxy for reads and writes
- Predictable latency for SLAs
- Roadmap for storage compaction

Significant Acceleration

- Networking/CQL acceleration
- Data access acceleration
- Compression
- Hashing

### **SPARK\*: SEVERAL ACCELERATION AREAS**




BigDL: Implemented as a Standalone Library on Apache Spark\*

- Ingest/Apache Kafka\*: Extract, transform, load and filtering (Bigstream, Megh)
- SQL over Apache Spark (Bigstream)
- BigDL: Deep learning acceleration (Megh)
- Machine learning MLlib: e.g. ALS (Megh)
- Hadoop/Spark: Shuffle phase (A3Cube)



### **BIGSTREAM HYPER-ACCELERATION**

### DATASTREAMING POCS NOW

### Frictionless acceleration: Arria 10 and Stratix 10

- Zero code changes
- Cross platform: Spark, Kafka, TensorFlow
- Cloud or on-prem

### Intelligent and adaptive

- Automatic profiling and partitioning of computation
  - Between CPU and FPGA
- Overlay dataflow execution on FPGA
- Kafka consumer speedup up to 13X
- Spark SQL TPC-DS results
  - 4X average speedup for 26 of the queries with Arria 10
- Industry targets: FinServ/FinTech, AdTech, Healthcare
- Use cases: Spark SQL analytics, ingest/ETL, EDW



http://bigstream.co/resources/video-strata







### REAL TIME ANALYTICS STACK OPTIMIZED FOR HETEROGENEOUS CPU+FPGA PLATFORM



CPU+FPGA platform for (1) in-line processing of streaming analytics and (2) offload processing of ML and DL, delivering >5x performance efficiency and actionable operational insights.






### INTEL APACHE YARN SUBMISSION: https://issues.apache.org/jira/browse/yarn-5983




# SUMMARY

### **SUMMARY**

- FPGAs are ready for scale out and scale up
- Intel<sup>®</sup> Acceleration Stack: driver, FPGA interfaces, virtualization, security, etc.
- Variety of Interfaces: OpenCL<sup>™</sup>, library call, framework level
- AI, Genomics, and Financial acceleration options
- Data Analytics acceleration with no change to application required
  - Relational DB, NoSQL, Spark\* shuffle phase, Kafka\* streaming, BigDL (deep learning)
- FPGA advantage of multiple concurrent functions and inline acceleration







- (1) A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, sec. 2.3: https://www.microsoft.com/en-us/research/publication/a-reconfigurable-fabric-for-accelerating-large-scale-datacenter-services/
- (2) Microsoft's Production Configurable Cloud (Mark Russinovich), slide 26: https://www.slideshare.net/ChrisGenazzio/microsofts-configurable-cloud
- (3) Accelerating Persistent Neural Networks at Datacenter Scale: https://www.hotchips.org/wp-content/uploads/hc\_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.60-NeuralNet1-Pub/HC29.22.622-Brainwave-Datacenter-Chung-Microsoft-2017\_08\_11\_2017.pdf
- (4) TPC-H SF1000, Dual Intel® Xeon® Gold 6130, 2.10 GHz, (12) 32GB DDR4-2166, (4) 960GB SSD RAID 0 HPE MK000960GWJPP, CentOS 7.4.1708, Kernel 3.10.0-693.21.1.el7.x86\_64, Docker 18.03.0.ce, Swarm64 DB 1.4.1-PREVIEW, PostgreSQL 10.3
- (5) Cassandra Stress Test (80% R/20% W): Dual Xeon E5-2670, 2.6 GHz, 32 cores total, 64 GB RAM, 1 TB NVMe, CentOS 7.4
- (6) https://bigstream.co/wp-content/uploads/2017/03/Bigstream-whitepaper-v1.4.pdf



### <sup>(7)</sup> CONFIGURATION DETAILS FOR PAIRHMM COMPARISON - XEON/FPGA


| INTEL               |                                       |         |
|---------------------|---------------------------------------|---------|
| CPU                 | Intel® Xeon CPU E5-2699 v4, 2.20 GHz  |         |
| FPGA                | Intel Arria® 10 GX, 10AX115S2F45I1SG2 |         |
|                     | ALM                                   | 427,200 |
|                     | Memory                                | 53.0 Mb |
|                     | DSP Blocks                            | 1,518   |
| Systolic Array      | 208 Processing Elements (PEs)         |         |
| FPGA Resource Usage | Logic                                 | 55%     |
|                     | Memory                                | 50%     |
|                     | DSP Blocks                            | 99%     |
| Frequency           | 230 MHz                               |         |
| Input Data          | Chromosome 21 from 30x WGS NA128      |         |




### <sup>(8)</sup> \*\*SYSTEM CONFIGURATION FOR PERFORMANCE TESTING

Server configuration: Dell PowerEdge R740, 2 x Intel<sup>®</sup> Xeon<sup>®</sup> Gold 6132 @ 2.6 GHz, 192GB (12 x 16GB) RDIMM, 2666MT/s, Dual Rank

Operating System: Red Hat Enterprise Linux: Release 7.5 with Linux kernel 3.10.0-862.el7.x86\_64

FPGA: Intel Programmable Acceleration Card with Intel Arria® 10 GX FPGA, Acceleration Stack version 1.0

Tests were performed in August 2018. OpenCL code was developed within Intel<sup>®</sup> Programmable Solutions Group. Functional correctness was verified by comparing against single-precision floating-point results from the CPU, using the "==" operator in C/C++.

Tests were performed with pre-production, proof-of-concept code.

Not all capabilities are part of shipping products.
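The functional check described above (exact "==" comparison against single-precision CPU results) could look like the following minimal sketch. It is written in Python rather than C/C++, uses `array("f")` to round values to single precision, and the function name `first_mismatch` and the sample data are hypothetical, not taken from the actual test code.

```python
from array import array


def first_mismatch(cpu_results, fpga_results):
    """Return the index of the first element that differs under ==,
    or None when every single-precision value matches exactly."""
    # array("f") stores each value as a 32-bit float, like `float` in C.
    cpu = array("f", cpu_results)
    fpga = array("f", fpga_results)
    if len(cpu) != len(fpga):
        raise ValueError("result lengths differ")
    for index, (expected, actual) in enumerate(zip(cpu, fpga)):
        # Exact comparison with no tolerance. As in C, NaN == NaN is
        # false, so a NaN result is reported as a mismatch.
        if not (expected == actual):
            return index
    return None


# Hypothetical sample data for illustration.
reference = [0.1, 0.25, 3.0e-5]
identical = list(reference)
perturbed = [0.1, 0.2500001, 3.0e-5]

print(first_mismatch(reference, identical))
print(first_mismatch(reference, perturbed))
```

Exact equality is a stricter criterion than a tolerance-based check: it passes only when the accelerator reproduces the CPU's rounding at every step.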

