### ALL PROGRAMMABLE



5G Wireless • Embedded Vision • Industrial IoT • Cloud Computing



FPGA accelerated processing in the cloud

Cathal McCabe, Xilinx Ireland

New concepts in ultra fast data acquisition workshop

© Copyright 2018 Xilinx

### Overview

- Computing after Moore's law
- Big data
- The rise of AI
- From cloud to edge and back

### The legal speak

### Moore's Law

- Number of transistors doubles every 1/1.5/2 years
- Moore's Second Law (Rock's law)
  - Cost of fabs increases exponentially
- Moore's Law's Law
  - Number of people predicting the end of Moore's law doubles every year!

### Dennard's Law (Scaling)

 As transistors get smaller their power density remains constant
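A rough numerical check of this statement, as a sketch only. It assumes the usual first-order model in which dynamic power is proportional to C·V²·f, and an ideal shrink by a factor `k` where capacitance and voltage fall by `k`, frequency rises by `k`, and area falls by `k²`. The function name and the normalised units are illustrative and do not come from the slides.

```python
def scaled_power_density(k, c=1.0, v=1.0, f=1.0, area=1.0):
    """Power density after an ideal Dennard shrink by linear factor k.

    Assumes dynamic power ~ C * V^2 * f (first-order model, normalised units).
    """
    c_s = c / k          # capacitance shrinks with dimensions
    v_s = v / k          # supply voltage is scaled down
    f_s = f * k          # shorter gates switch faster
    area_s = area / k**2 # transistor area shrinks quadratically
    power = c_s * v_s**2 * f_s
    return power / area_s

baseline = scaled_power_density(1.0)
for k in (1.4, 2.0, 4.0):
    # Under ideal scaling the density stays at the baseline value
    print(k, scaled_power_density(k) / baseline)
```

Once voltage can no longer be scaled down with feature size, the same model gives a rising power density, which is the breakdown the following slides refer to.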



### Transistors, Clock Speed, Power




### Computing performance increase

Slowing to 3% per year



\*John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018


## Data scaling



## The world around us hasn't stopped scaling





| Application  | March 2015 (%) | September 2014 (%) |
|--------------|----------------|--------------------|
| Netflix      | 36.48          | 34.89              |
| YouTube      | 15.56          | 14.04              |
| HTTP         | 6.02           | 8.62               |
| iTunes       | 3.36           | 2.77               |
| BitTorrent   | 2.76           | 2.8                |
| Facebook     | 2.65           | 2.98               |
| MPEG - Other | 2.07           | 2.66               |
| Amazon Video | 1.97           | 2.58               |
| Hulu         | 1.91           | 1.41               |
| SSL - Other  | 1.91           | 2.14               |

#### Trends driving the need for Data Centers



Market Realist

Source: Cisco, Gartner, IDC

\*Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2015–2020 White Paper


### Scientific data growth





### Data centres



Hohhot Data Center in China: 7,750,015 square feet



- More than 7,500 data centres worldwide, +21% per year in 2018
- By 2020, 1/3 of all data will pass through the cloud
- Energy is 40% of total operating costs

> 3% of global electricity production

## AI and Machine Learning



### Dark data

Less than 1% of all data that is produced every day is mined for valuable information

- "We cannot solve our problems with the same thinking we used when we created them."
  - Albert Einstein






# The Great A.I. Awakening

How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself.

BY GIDEON LEWIS-KRAUS DEC. 14, 2016




## Google Translate Significant Quality Upgrade through Machine Learning



[Figure: Google Translate screenshot, English (detected) to French]

Input text (154/500 characters): "It seems every day there is news of another breakthrough in artificial intelligence. How will the detector community take advantage of these new advances?"

### Machine Learning



"Machine learning can solve fundamental business problems that are really hard to create hardwired solutions to"

Danny Lange (Uber)



AI saves \$1 Billion on content every year"

AI help

TayTweets (@TayandYou), 24/03/2016, 08:59: "@UnkindledGurg @PooWithEyes chill im a nice person! i just hate everybody"

*"Twitter taught Microsoft's AI chatbot to be a racist ... in less than a day"*

### AI in big science





"It took us several years to convince people that this is not just some magic, hocus-pocus, black box stuff." Boaz Klima (Fermilab)

Application of machine learning techniques to lepton energy reconstruction in water Cherenkov detectors







Inter-experimental LHC Machine Learning (IML) Working Group @ CERN

GFA Accelerator Seminars

Applications of Machine Learning in Particle Accelerators

by Rasmus Ischebeck (Paul Scherrer Institut)

Monday, 19 March 2018 from **16:00** to **17:00** (Europe/Zurich) at PSI (WBGB/019)

Luengo I, et al. "SuRVoS: Super-Region Volume Segmentation workbench." *Journal of Structural Biology* (2017). DOI:10.1016/j.jsb.2017.02.007

Ge Wang, "A Perspective on Deep Imaging"


### Latest Research: Increasing Accuracy of Reduced Precision CNNs & BNNs



### Xilinx technology



### **FPGA** Architecture Overview




### Adaptable Architecture Advantage





- Algorithm to implement
  - Control / Dataflow graph
- CPU implementation
  - Sequential (Von Neumann) execution (SIMD for GPU)
  - Memory access bottleneck
  - Fixed data size (e.g. 64 bits)
  - Poor decision handling (breaks ALU pipeline)
- FPGA implementation
  - Custom dataflow / pipeline / decision handling / widths
  - Custom memory hierarchy
  - Energy efficient computation
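The "custom widths" point can be illustrated in software. The sketch below emulates signed fixed-point arithmetic at a few bit widths to show the accuracy cost of narrower datapaths. It is an illustration only: `quantize` and the sample values are hypothetical and are not taken from any Xilinx tool or library.

```python
def quantize(x, total_bits, frac_bits):
    """Round x to signed fixed point (total_bits wide, frac_bits fractional),
    saturating instead of wrapping on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

samples = [0.1, -0.37, 0.5, 0.9]  # arbitrary test values in [-1, 1)
for bits in (4, 8, 16):
    frac = bits - 1  # one sign bit, the rest fractional
    worst = max(abs(s - quantize(s, bits, frac)) for s in samples)
    print(f"{bits:2d}-bit worst-case error: {worst:.6f}")
```

On a CPU the narrow formats still occupy a full machine word. On an FPGA the datapath can be built at exactly the chosen width, which is where the area and energy savings come from.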


### **FPGA** evolution

2.5 micron XC2064 1984 85,000 transistors



#### Maximum Density of Xilinx FPGA by node



7nm Everest 2018 50,000,000,000+ transistors
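As a worked check against the Moore's Law slide, the two data points above (85,000 transistors in 1984 and 50,000,000,000+ in 2018) imply an average doubling period. The sketch below computes it; the function name is illustrative, and because the 2018 figure is a lower bound, the result is an upper bound on the period.

```python
import math

def doubling_period_years(count_start, year_start, count_end, year_end):
    """Average time per doubling between two (year, transistor count) points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings

period = doubling_period_years(85_000, 1984, 50_000_000_000, 2018)
print(f"{period:.2f} years per doubling")
```

The result is a little under two years, inside the 1/1.5/2 year range quoted earlier.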






### **FPGA** Scaling

#### Maximum Xilinx SerDes Rate per pin



High End PC per pin DDR rate



#### Aggregate Xilinx SerDes Bandwidth



Liam Madden, "The FPGA, an engine for innovation in silicon and packaging technology", FPL 2014


### Xilinx Product Families: A Broad Portfolio




### UltraScale+<sup>™</sup> Capabilities



## UltraScale Re-Architects the Core: Highest Utilization at Maximum Performance

### Next Generation Routing

- Re-designed routing architecture
- 2X routing, agile switching
- Co-Optimized with Vivado

### ASIC-Like Clocking

- Regional, segmented structure
- Flexible clock placement
- Scales w/density to balance skew

### System Logic Cells

- Higher utilization enabled by routing
- Shorter net delays for performance
- Less wire switching for lower power




### SSI Harnesses Proven Technology in a Unique Way



### Introducing Virtex UltraScale+ HBM Devices: 20X more bandwidth than a DDR4 DIMM



### Covering the Full Spectrum of Memory Solutions



### **Next generation Adaptive Compute Acceleration Platform**

- New Device Category for Adaptive Workload-Specific Acceleration
- HW/SW programmable engines
- IP subsystems and a network-on-chip
- Highly integrated programmable I/O





### Software tools and libraries



### Up-leveling the Programming Model



### Development Stacks for SoC & FPGA



### Advantages of Xilinx Devices

- Reduced System Power Costs
  - Highly efficient compute across range of workloads/applications
  - Single device for range of processing & interfacing needs
- Reduced System Hardware Costs
  - FPGAs' massive compute replaces CPU
  - Future-proofed hardware thanks to massive flexibility
- Platform solutions
  - Massive flexibility allows the processing needs of a range of applications to be met
  - Massive IO flexibility ensures FPGA/SoC can hook into system easily
  - Large range of devices to fit into system power envelope
- Ultimate Flexibility/Future Proofed HW
  - Xilinx's flexibility & ability to handle diverse processing requirements
  - GPUs need high data locality, massive parallelism & specific data types + limited IO support
  - Tomorrow's algorithms will run efficiently on Xilinx devices

### FPGAs in the cloud



## FPGA as a Service Expanding Worldwide



### Amazon F1 Instances



| Model       | #FPGA | Mem    | SSD Storage | FPGA DDR4   | Price / hour |
|-------------|-------|--------|-------------|-------------|--------------|
| f1.2xlarge  | 1     | 122 GB | 470 GB      | 4x16 GB     | \$1.65       |
| f1.16xlarge | 8     | 976 GB | 8 x 470 GB  | 8 x 4x16 GB | \$13.20      |
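The table can be turned into a quick cost estimate. The sketch below uses only the figures listed above (prices as shown on the slide, which may differ from current pricing). The dictionary layout and helper names are illustrative, and the 100-hour job is a made-up example.

```python
# Figures copied from the table above; prices are per instance-hour as listed.
INSTANCES = {
    "f1.2xlarge":  {"fpgas": 1, "price_per_hour": 1.65,  "ddr4_per_fpga_gb": 4 * 16},
    "f1.16xlarge": {"fpgas": 8, "price_per_hour": 13.20, "ddr4_per_fpga_gb": 4 * 16},
}

def price_per_fpga_hour(name):
    """Hourly price divided by the number of FPGAs in the instance."""
    spec = INSTANCES[name]
    return spec["price_per_hour"] / spec["fpgas"]

def total_fpga_ddr4_gb(name):
    """Total FPGA-attached DDR4 across all FPGAs in the instance."""
    spec = INSTANCES[name]
    return spec["fpgas"] * spec["ddr4_per_fpga_gb"]

def job_cost(name, hours):
    """Cost of running one instance for the given number of hours."""
    return INSTANCES[name]["price_per_hour"] * hours

for name in INSTANCES:
    print(name, price_per_fpga_hour(name), total_fpga_ddr4_gb(name), job_cost(name, 100))
```

At these list prices the cost per FPGA-hour is the same for both sizes, so the choice between them comes down to how many FPGAs a single host needs to drive.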

### F1 Users

| User            | Need                                                                                     | Tools                                                                |
|-----------------|------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| F1 end user     | "I wasn't aware the service I am using involved F1 and FPGAs."                           | User's front-end application leverages F1 transparently              |
| F1 developer #1 | "I need to accelerate an application. I don't know RTL and hardware design."             | SDAccel. Host: Xilinx OpenCL runtime. Kernel: C, C++ or OpenCL       |
| F1 developer #2 | "I want to create or reuse RTL kernels while using **standard APIs** whenever possible." | SDAccel. Host: Xilinx OpenCL runtime. Kernel: RTL leveraging Vivado  |
| F1 developer #3 | "I want to create or reuse my **RTL designs** while designing HW and SW middleware."     | AWS HDK/SDK. Host: Custom API. Kernel: RTL or HLx                    |

### **AWS EC2 F1 Instance SDAccel Flow**



### Xilinx tools in the cloud



Currently selected: c4.8xlarge (132 ECUs, 36 vCPUs, 2.9 GHz, Intel Xeon E5-2666v3, 60 GiB memory, EBS only)

Note: The vendor recommends using a c4.4xlarge instance (or larger) for the best experience with this product.

| Family            | Type       | vCPUs | Memory (GiB) |
|-------------------|------------|-------|--------------|
| Compute optimized | c5.large   | 2     | 4            |
| Compute optimized | c5.xlarge  | 4     | 8            |
| Compute optimized | c4.xlarge  | 4     | 7.5          |
| Compute optimized | c4.2xlarge | 8     | 15           |
| Compute optimized | c4.4xlarge | 16    | 30           |
| Compute optimized | c4.8xlarge | 36    | 60           |

### FPGA developer AMI

- Includes all Xilinx tools required for F1 development
- Run on standard AWS compute instances
- Spin up as many instances as you need on-demand
  - No Xilinx software/license maintenance
  - Reduces IT infrastructure requirements


## **Amazon F1 Development Flow**



### Benefits of the AWS F1 Cloud Compute Platform

- Accelerated computation
- Makes leading-edge FPGA acceleration available to a large <u>community of developers</u>, and to millions of potential <u>users</u>
- Provides dedicated and large amounts of leading-edge <u>FPGA logic with elasticity</u> to scale to multiple instances
- Simplifies the development process by providing <u>cloud-based tools</u> for FPGA development
- Ideal platform for <u>collaborative research and development</u>: build a prototype and share instantly with partners

# Workloads






## FireSim - Cycle-accurate, FPGA-accelerated data center simulation project based on RISC-V

- Uses public-cloud F1
  - No upfront cost to purchase and deploy FPGA hardware
  - Distribute pre-built images
    - Easy to reproduce experiments\*
    - Automates FPGA simulation
  - Scale out experiments by spinning up additional EC2 instances
  - "Saves hundreds of thousands of dollars on large FPGA clusters"



\*Try for yourself:



### FireSim Demo v1.0

Sold by: Berkeley Architecture Research

Latest Version: 1.0

This image includes an AMI and AFI to demo FireSim, a fast, cycle-accurate FPGA-accelerated hardware simulation tool. This release can simulate a single-node or eight-node cluster of




### From cloud to edge and back





## "Python Productivity for Zynq"

PYNQ-Z1



#### BNN on Pynq

This notebook covers how to use Binary Neural Networks on Pynq. It shows an example of image recognition with a binarized neural network inspired by VGG-16, featuring 6 convolutional layers, 3 Max Pool layers and 3 Fully connected layers.
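To make "binarized" concrete, here is a minimal sketch in plain Python of the arithmetic trick such networks rely on. When weights and activations are restricted to +1/-1 and stored as single bits, a dot product reduces to an XNOR followed by a population count. This is an illustration of the general technique only, not the `bnn` package's implementation (which runs in the FPGA bitstream); `pack_bits` and `xnor_popcount_dot` are hypothetical helper names.

```python
import random

def pack_bits(values):
    """Pack a list of +1/-1 values into an int, one bit each (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two n-element +1/-1 vectors given as packed bits.

    XNOR marks positions where the signs agree. Each agreement contributes +1
    and each disagreement -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * matches - n

random.seed(0)
n = 64
weights = [random.choice((-1, 1)) for _ in range(n)]
activations = [random.choice((-1, 1)) for _ in range(n)]

reference = sum(w * a for w, a in zip(weights, activations))
fast = xnor_popcount_dot(pack_bits(weights), pack_bits(activations), n)
print(reference == fast)
```

XNOR gates and popcount trees map directly onto FPGA lookup tables, which is why this class of network suits a small device such as the one on a PYNQ board.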

#### 1. Instantiate a Classifier

Creating a classifier will automatically download the correct bitstream onto the device and load the weights trained on the specified dataset. By default there are three sets of weights to choose from - this example uses the CIFAR10 set.

In [1]:

    import bnn
    print(bnn.available_params(bnn.NETWORK_CNV))

    classifier = bnn.CnvClassifier('cifar10')

Output:

    ['streetview', 'road-signs', 'cifar10']

#### 2. List the available classes

The CIFAR10 dataset has 10 classes of images, the names of which are accessible through the classifier.

In [3]:

    print(classifier.bnn.classes)

Output:

    ['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']

#### 3. Open image to be classified

Download a JPEG image (an airplane in this example) and place it in the home directory for the xilinx user. The image can then be loaded and displayed through the notebook.

In [4]:

    from PIL import Image
    import numpy as np

    im = Image.open('/home/xilinx/jupyter_notebooks/bnn/airplane.jpg')






## Computing Landscape: ... A Linear Spectrum from IoT to Cloud

IoTT .. IIoT .. Embedded .. Mobile .. Desktop .. Server .. Data Center .. Cloud

- IIoT: Industrial Internet of Things
- IoTT: Internet of Tiny Things (aka motes, ultra low-power/energy)

## Computing Landscape: ... From Linear Spectrum to Continuum




## PYNQ enables hardware, software and analytics




### Summary

- Delivering more than Moore
  - UltraScale+, Everest
- Adaptable computing for AI
- Cloud solutions for big data problems
  - AWS EC2 F1
- From cloud to edge and back with PYNQ



https://www.xilinx.com/products/design-tools/acceleration-zone/aws.html





