#### Physical Design of a 3D-Stacked Heterogeneous Multi-core Processor

W. Rhett Davis, Randy Widialaksono, Rangeen Basu Roy Chowdhury, Zhenqian Zhang, Joshua Schabel, Steve Lipa, Eric Rotenberg, Paul Franzon

#### Overview

- Motivation for 3D-IC HMP
- Physical Design Methodology
  - Floorplanning
  - Powerplanning
  - Face-to-face via to signal assignment
  - Cross-tier timing analysis
  - 3D-LVS, DRC
- Comparative Analysis: 2D vs. 3D
- Conclusion & Future Work

#### Thread Migration in Heterogeneous Multi-core Processors



# **3D Integration Enables Fast Thread Migration (FTM) and Cache-Core Decoupling (CCD)**





#### **2D Implementation Challenges**

- Wide inter-core interconnect consumes large amounts of routing resources
  - Mostly consumed by bus for communication between caches
- Low-latency requirement
  - Using the existing inter-core bus would not satisfy performance requirements
- Requires major floorplan changes to the core
  - Register file and L1 caches must be placed at the core boundary, which may conflict with intra-core timing requirements

Vertical interconnect in 3D integration enables a shorter, direct path between the cores' internal structures.

# NCSU 3D Processor Timeline: 2D Chip

- Mid-2011: Architecture/circuit design, RTL verification.
- May 2013: 2D prototype tape-out in IBM 8RF 130 nm



2D test chip for testing functionality of cores, thread transfer, and cache-core decoupling logic.

## **3D Stacked Design**

- High-performance 'big' core
- Low-power 'little' core

| Parameter               | High-Performance (Top Die)                | Low-Power (Bottom Die)                    |
|-------------------------|-------------------------------------------|-------------------------------------------|
| Frontend Width          | 2                                         | 1                                         |
| Issue Width             | 3                                         | 3                                         |
| Pipeline Depth          | 9                                         | 9                                         |
| Issue Queue Size        | 32                                        | 16                                        |
| Physical Reg. File Size | 96                                        | 64                                        |
| Load/Store Queue Size   | 16/16                                     | 16/16                                     |
| Reorder Buffer Size     | 64                                        | 32                                        |
| L1 I-Cache              | private, 4 KB, 1-way, 8 B block, 1 cycle  | private, 4 KB, 1-way, 8 B block, 1 cycle  |
| L1 D-Cache              | private, 8 KB, 4-way, 16 B block, 2 cycle | private, 8 KB, 4-way, 16 B block, 2 cycle |



- Process:
  - GF 130 nm
  - Ziptronix face-to-face bonding: 8 µm via pitch, 3 µm via diameter
  - MPW with Princeton Univ.

#### **NC STATE UNIVERSITY**

## **Physical Design Flow**



- The flow begins with a partitioned netlist; each partition is synthesized separately
- This is followed by floorplanning, powerplanning, and placement of the first tier
- Placement of the second tier depends on the placement of the first tier
- The second tier contains the 'little' core and converges more easily

#### Custom tool/flow Developed in-house

# Floorplan



(a) Top Die



(b) Bottom Die



# Powerplan

- Robust power delivery network
  - Based on static IR drop analysis of 2D prototype
  - Wider power rings/stripes, more power stripes
  - Additional metal layers for power ring
- Maximize cross-tier power delivery through the F2F interface
  - Distances between power rings and stripes were multiples of the F2F via pitch
  - This ensures exact alignment of F2F vias and power stripes
- A custom "power via stack" cell connects the F2F bonds to the power grid
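The alignment rule can be illustrated with a short sketch. It assumes the 8 µm F2F via pitch listed for the process and a via grid whose origin is at coordinate 0; the stripe spacing, start position, and function names are hypothetical and are not the values used on the chip.

```python
VIA_PITCH_NM = 8000  # 8 um F2F via pitch, expressed in nm to keep integer math


def snap_up(distance_nm, pitch_nm=VIA_PITCH_NM):
    """Round a distance up to the next multiple of the via pitch."""
    return -(-distance_nm // pitch_nm) * pitch_nm


def stripe_positions(first_nm, spacing_nm, count, pitch_nm=VIA_PITCH_NM):
    """Stripe centre lines that start on a via column and repeat at a pitch multiple."""
    start = snap_up(first_nm, pitch_nm)
    step = snap_up(spacing_nm, pitch_nm)
    return [start + i * step for i in range(count)]


def all_on_via_grid(positions_nm, pitch_nm=VIA_PITCH_NM):
    """True if every stripe centre line coincides with a via column."""
    return all(p % pitch_nm == 0 for p in positions_nm)


# Hypothetical request: first stripe near 20 um, then roughly every 100 um.
stripes = stripe_positions(first_nm=20_000, spacing_nm=100_000, count=4)
print(stripes, all_on_via_grid(stripes))
```

Rounding the requested 100 µm spacing up to 104 µm (13 via pitches) is what keeps every stripe over a via column; an unsnapped 100 µm spacing would drift off the grid after the first stripe.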

Maximum current draw for a FabScalar core: **154.17 mA** (**185 mW / 1.2 V**)

Current-carrying capacity through the 30,796 power vias: **3,880.29 mA**
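The current budget can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted on this slide; treating all 30,796 power vias as a single pool is a simplifying assumption, since the slide does not break down how they are split between supply and ground.

```python
def core_current_ma(power_w, vdd_v):
    """Supply current in mA from core power and supply voltage (I = P / V)."""
    return power_w / vdd_v * 1000.0


def per_via_capacity_ma(total_capacity_ma, via_count):
    """Average current-carrying capacity attributed to one power via."""
    return total_capacity_ma / via_count


# Figures quoted on the slide.
current = core_current_ma(power_w=0.185, vdd_v=1.2)
per_via = per_via_capacity_ma(total_capacity_ma=3880.29, via_count=30_796)
margin = 3880.29 / current  # capacity relative to maximum core draw

print(f"core current: {current:.2f} mA")
print(f"capacity per via: {per_via:.3f} mA")
print(f"capacity / demand: {margin:.1f}x")
```

Under these assumptions the via capacity exceeds the maximum draw of one core by roughly a factor of 25.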

# Face-to-face Via Assignment



- The first priority is to assign F2F vias for power delivery
  - Every F2F via located above a power stripe was allocated for power
  - Vias located above memory macros were excluded
- Inter-tier signals were assigned using a greedy nearest-neighbor algorithm as a heuristic approximation of the optimal assignment
- Nearest-neighbor queries were sped up with a k-d tree structure [7], implemented with the Scientific Python (SciPy) library
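A minimal sketch of a greedy nearest-neighbor assignment is shown below. It is not the in-house tool: the signal names, coordinates, Manhattan distance metric, and processing order are assumptions, and the nearest-via search is a plain linear scan to keep the example dependency-free. In a real flow that scan is the part a k-d tree (for example `scipy.spatial.cKDTree`) would replace.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def greedy_assign(pins, free_vias):
    """Assign each inter-tier signal to the nearest via that is still free.

    pins:      dict of signal name -> (x, y) of its source/sink pin
    free_vias: iterable of (x, y) via sites left after power vias and
               excluded vias (e.g. above memory macros) were removed
    Returns a dict of signal name -> via (x, y).
    The result depends on processing order and is not guaranteed optimal.
    """
    free = set(free_vias)
    if len(pins) > len(free):
        raise ValueError("not enough free F2F vias for all signals")
    assignment = {}
    for name, pin in pins.items():
        # Ties are broken by coordinate so the result is deterministic.
        best = min(free, key=lambda v: (manhattan(pin, v), v))
        assignment[name] = best
        free.remove(best)
    return assignment


# Hypothetical 4 x 4 patch of via sites on an 8 um grid (units: um).
vias = [(x, y) for x in range(0, 32, 8) for y in range(0, 32, 8)]
pins = {"trf_d[0]": (9, 1), "trf_d[1]": (7, 2), "trf_d[2]": (30, 30)}

result = greedy_assign(pins, vias)
for name, via in result.items():
    print(name, "->", via, "distance", manhattan(pins[name], via))
```

In this example the second signal loses its closest via to the first one and falls back to the next-nearest site, which is the kind of order dependence a greedy heuristic accepts in exchange for speed.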

# Face-to-face Via Assignment



- The main inputs to the assignment problem are:
  - Pin locations/Cell placement of inter-tier signal sink/source
  - 3D (F2F) via locations
- Possible enhancements to the assignment algorithm:
  - Congestion awareness [Neela, 3D-IC '14] (our approach was to exclude vias in congested regions)
  - Timing slack awareness for prioritizing timing critical nets [8]

# **Cross-tier Timing Analysis**

- Each core operates with its own independent clock
  - Except during thread migration: synchronous state transfer between the Teleport Register Files
- Clock forwarding requires inter-tier timing synchronization
  - Need to consider process variations across wafers (wafer-to-wafer stacking)
- Post-layout timing analysis using PrimeTime
  - Two dies wrapped into a single system
  - Analyzed cross-tier paths with the two dies at opposite timing corners
- Performed manual hold timing fixes through ECO
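The idea of checking cross-tier paths with the dies at opposite corners can be sketched as plain arithmetic. This is not a PrimeTime script: the path model, the delay values, and the corner scale factors (1.3 for slow, 0.8 for fast) are hypothetical. Only the 15 ns clock period comes from this deck.

```python
from dataclasses import dataclass


@dataclass
class CrossTierPath:
    launch_clk_ns: float    # clock latency to the launching flop (launch die)
    launch_data_ns: float   # data delay on the launch die, up to the F2F via
    capture_data_ns: float  # data delay on the capture die, after the F2F via
    capture_clk_ns: float   # clock latency to the capturing flop (capture die)
    setup_ns: float
    hold_ns: float


def slacks(p, period_ns, launch_scale, capture_scale):
    """Setup and hold slack with each die's delays scaled by its own corner."""
    arrival = (launch_scale * (p.launch_clk_ns + p.launch_data_ns)
               + capture_scale * p.capture_data_ns)
    capture_edge = capture_scale * p.capture_clk_ns
    setup = (period_ns + capture_edge - p.setup_ns) - arrival
    hold = arrival - (capture_edge + p.hold_ns)
    return setup, hold


def worst_opposite_corners(p, period_ns, slow=1.3, fast=0.8):
    """Worst setup and hold slack over both opposite-corner combinations."""
    results = [slacks(p, period_ns, slow, fast), slacks(p, period_ns, fast, slow)]
    return min(s for s, _ in results), min(h for _, h in results)


# Hypothetical paths: one long datapath and one short flop-to-flop transfer.
long_path = CrossTierPath(1.0, 4.0, 2.0, 1.5, setup_ns=0.2, hold_ns=0.1)
short_path = CrossTierPath(0.5, 0.3, 0.2, 1.5, setup_ns=0.2, hold_ns=0.1)

long_setup, long_hold = worst_opposite_corners(long_path, period_ns=15.0)
short_setup, short_hold = worst_opposite_corners(short_path, period_ns=15.0)
print(f"long path:  setup {long_setup:.2f} ns, hold {long_hold:.2f} ns")
print(f"short path: setup {short_setup:.2f} ns, hold {short_hold:.2f} ns")
```

With these made-up numbers the short path meets setup easily but has negative hold slack when the launch die is fast and the capture die is slow, which is the kind of violation a hold-fixing ECO would address.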

# **Physical Verification: 3D-LVS, DRC**

- 3D LVS verifies the inter-tier signal assignment
  - Connectivity verification was necessary because of manual post-place-and-route changes for DRC cleanup and timing ECO
  - DRC cleanup includes adding more antenna diodes
    - Automated insertion was performed during place and route
    - Post-P&R antenna violations occur on a handful of long wires
- 3D DRC: custom Calibre rules were developed to verify:
  - The top metal layer consists of F2F via grid shapes with correct dimension, offset, and pitch
  - Correct dimensions of every shape in TSV-related layers
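The dimension, offset, and pitch check can be illustrated outside of Calibre with a small sketch. It is not the rule deck used for the chip: the 8 µm pitch and 3 µm via size come from the process description, while modelling each via as a square, the 1 µm grid offset, and the shape list are assumptions for illustration.

```python
def check_f2f_grid(shapes, pitch_nm, size_nm, offset_nm):
    """Flag top-metal shapes that are not legal F2F via grid shapes.

    shapes:    list of rectangles (x_min, y_min, x_max, y_max) in nm
    offset_nm: (x, y) of the lower-left corner of the first legal shape
    Returns a list of (shape, reason) tuples for every violation.
    """
    errors = []
    for shape in shapes:
        x0, y0, x1, y1 = shape
        if (x1 - x0, y1 - y0) != (size_nm, size_nm):
            errors.append((shape, "dimension"))
        elif (x0 - offset_nm[0]) % pitch_nm or (y0 - offset_nm[1]) % pitch_nm:
            errors.append((shape, "off-grid"))
    return errors


# Hypothetical shapes on an 8 um pitch with a (1 um, 1 um) offset.
shapes = [
    (1000, 1000, 4000, 4000),    # legal
    (9000, 1000, 12000, 4000),   # legal, one pitch to the right
    (17000, 1500, 20000, 4500),  # shifted 0.5 um in y
    (25000, 1000, 27000, 4000),  # only 2 um wide
]
violations = check_f2f_grid(shapes, pitch_nm=8000, size_nm=3000, offset_nm=(1000, 1000))
for shape, reason in violations:
    print(shape, reason)
```

Checking that each lower-left corner lands on the grid covers both the offset and the pitch requirement in one test.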


#### 2D vs 3D Register File Layout



- Heavy routing congestion shown in routing inter-core signals out from the partition to the right edge
- This routing congestion increases power consumption and area
- Wide bus signals are prone to crosstalk
- Exacerbated by distance between inter-core structures

# **Comparative analysis: 2D Floorplans**



2D-Inter: floorplan optimized for inter-core structures

2D-Intra: floorplan from a 3D tier, optimized for intra-core timing

# **Average Wirelength Comparison**





- Overall 3D wirelength benefits:
  - 8.8% vs. 2D-inter, 18% vs. 2D-intra
- Average wirelength of Teleport Register File (TRF) inter-tier signals reduced by
  - ~1 mm vs. 2D-inter
    - 2D-inter requires more area and routing resources for a DRC-clean design due to congestion and crosstalk.
- The available F2F vias could be leveraged further by extending inter-core state transfer to more core structures (e.g. branch target buffer, map tables).
  - F2F via utilization of the 3D chip is 25% in the core area (21% for power delivery).
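The utilization figures can be cross-checked against the Physical Design Metrics table in the backup slides. The sketch below assumes one via site per 8 µm × 8 µm cell of the 9.57 mm² core area and one F2F via per inter-tier signal net; both are simplifications rather than statements about the actual via grid.

```python
def via_sites(core_area_mm2, pitch_um):
    """Number of F2F via sites on a square grid covering the core area."""
    return int(core_area_mm2 * 1e6 / pitch_um ** 2)


# Values from the process description and the Physical Design Metrics table.
sites = via_sites(core_area_mm2=9.57, pitch_um=8.0)
power_vias = 30_796
signal_nets = 6_077  # assumed: one F2F via per inter-tier signal net

power_util = power_vias / sites
signal_util = signal_nets / sites
total_util = power_util + signal_util

print(f"via sites in core area: {sites}")
print(f"power:  {power_util:.1%}")
print(f"signal: {signal_util:.1%}")
print(f"total:  {total_util:.1%}")
```

Under these assumptions the estimate lands close to the quoted 21% for power delivery and 25% overall, leaving roughly three quarters of the via sites unused.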

#### **CCD Path Delay Comparison**



Path delays of inter-core cache datapaths (ns)

- With a target clock period of 15 ns, the 3D implementation yields ~5 ns lower path delay.
- Comparison between 2D-intra with/without signal integrity analysis shows crosstalk effects in a 2D implementation

# Impact of Vertical Interconnect on Routing Congestion

- Vertical via stacks can cause routing congestion, since they consume routing resources on every layer from the bottom to the top.
- Lessons learned:
  - Monitor cell density and via assignment for routability. Watch for routing detours during timing closure.
  - Analyze the cell placement of inter-tier signal sources/sinks. Not every fan-out cell can be clustered near the via; cells may be spread out due to internal timing constraints.
  - Consider both the area and routing impact of antenna diode insertion, for example by allocating more area for the partition.

# Wirelength Benefits of Finer F2F Via Pitch



Fig. 4. Impact of face-to-face via pitch on wirelength of *Teleport Register File* inter-tier signals.

## Conclusion

- 3D integration mitigates the conflict between intra-core and inter-core timing constraints.
- 3D integration can reduce total/average wirelength, but may introduce routing congestion due to the routing resources consumed by vertical via stacks.
- Antenna/ESD diodes for face-to-face vias incur area and routing overhead. These diodes may also increase load capacitance and system power consumption.
- Wirelength reduction showed diminishing returns at finer F2F via pitches.

#### **Future Work**

- 3D-IC EDA tool development for 3D power delivery network, physical verification
- Static timing analysis tool support to conduct inter-tier timing analysis and cross-tier timing ECO
- Model to help determine ideal F2F via pitch based on design parameters (e.g. connectivity, standard cell size)
- Enhancing inter-tier signal-via assignment by exploring/combining heuristics (total wirelength, congestion, timing)

#### References

[1] E. Rotenberg, B. H. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. B. R. Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon, "Rationale for a 3D heterogeneous multi-core processor," in Computer Design (ICCD), 2013 IEEE 31st International Conference on, pp. 154–168, 2013.

[2] E. Forbes, Z. Zhang, R. Widialaksono, B. Dwiel, R. B. R. Chowdhury, V. Srinivasan, S. Lipa, E. Rotenberg, W. R. Davis, and P. D. Franzon, "Under 100-cycle thread migration latency in a single-ISA heterogeneous multi-core processor," in 2015 IEEE Hot Chips 27 Symposium (HCS), pp. 1–1, Aug 2015.

[3] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg, "FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA-38, pp. 11–22, June 2011.

[4] P. Enquist, "Scalable direct bond technology and applications driving adoption," in 3D Systems Integration Conference (3DIC), 2011 IEEE International, pp. 1–5, Jan 2012.

[5] D. Chapman, "DiRAM architecture overview," Tezzaron Semiconductors, 2014.

[6] V. Srinivasan, "Phase II implementation and verification of the H3 processor," Master's thesis, North Carolina State University, 2015.

[7] R. Widialaksono, W. Zhao, W. R. Davis, and P. Franzon, "Leveraging 3D-IC for on-chip timing uncertainty measurements," in 3D Systems Integration Conference (3DIC), 2014 International, pp. 1–4, Dec 2014.

[8] R. Widialaksono, Three-Dimensional Integration of Heterogeneous Multi-Core Processors. PhD thesis, North Carolina State University, Raleigh, June 2016.

[9] Z. Zhang and P. Franzon, "TSV-based, modular and collision detectable face-to-back shared bus design," in 3D Systems Integration Conference (3DIC), 2013 IEEE International, pp. 1–5, Oct 2013.

[10] Z. Zhang, Design of On-chip Bus of Heterogeneous 3DIC Micro-processors. PhD thesis, North Carolina State University, Raleigh, June 2016.

[11] G. Neela and J. Draper, "Techniques for assigning inter-tier signals to bondpoints in a face-to-face bonded 3DIC," in 3D Systems Integration Conference (3DIC), 2013 IEEE International, 2013, pp. 1–6.

# **Q & A**

| Physical Design Metric              | Value             |
|-------------------------------------|-------------------|
| Die Dimensions                      | 3.92 mm x 3.92 mm |
| Core Area per die                   | 9.57 mm²          |
| Standard Cells (top die)            | 886,361           |
| Standard Cells (bottom die)         | 678,854           |
| Memory macros                       | 34                |
| Nets (top die)                      | 482,479           |
| Nets (bottom die)                   | 328,535           |
| Average net length (top die)        | 64.6 µm           |
| Average net length (bottom die)     | 66.9 µm           |
| Inter-tier F2F signal nets          | 6,077             |
| Inter-tier power vias               | 30,796            |
| Average F2F net length (top die)    | 86 µm             |
| Average F2F net length (bottom die) | 140.3 µm          |

- 3D-IC cost
  - Engineering effort
    - 3D clock distribution, power, thermal issues, design for test
    - Develop new design automation tools/flows

# **Register File**

Architectural RF and Teleport RF placement were adjacent

Subsequently called PRF





# Detailed 3D-IC flow for multiple experiments

