Hardwiring the OS Kernel into a Java Application Processor

Author: Aubrey Wade

1 Hardwiring the OS Kernel into a Java Application Processor
Chun-Jen Tsai, Cheng-Ju Lin, Cheng-Yang Chen, Yan-Hung Lin, Wei-Jhong Ji, and Sheng-Di Hong
National Chiao Tung University, Hsinchu, Taiwan
10 July 2017

2 Outline
- Introduction
- Proposed Application Processor SoC
- Experimental Results
- Future Work and Conclusions

3 Introduction
- An Operating System (OS) is composed of the runtime libraries and the kernel; the latter manages resources:
  - Processor core(s)
  - Memory
  - File systems
  - I/O and accelerator devices
- Since the creation of the first OS in 1956, operating systems have always been implemented in software
- Is this the only option?

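4 Traditional vs Proposed System Arch.
[Figure: two layered system stacks, side by side]
- Traditional: Applications, the System Middleware (APP lifecycle control, UI, Media/Comm. services, Events handling), and the OS kernel are software; the Processor SoC is the hardware
- Proposed: Applications and the System Middleware (APP lifecycle control, UI, Media/Comm. services, Events handling) are software; the hardware contains the Processor SoC and the OS kernel HW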

5 Rationales for a HW OS Kernel (1/3)
Advantages in thread management:
- Context preparation can run concurrently with thread execution
- Processor runtime statistics can be fed into the scheduler for better decisions
- Ideally, the concept of Simultaneous Multi-Threading (SMT) can be extended from the microarchitecture level to the system level, across multiple cores

6 Rationales for a HW OS Kernel (2/3)
Advantages in memory management:
- The memory allocator can search for a fitting memory block in parallel
- De-fragmentation of free memory blocks can be done concurrently with program execution
- Programmer-transparent support of a multi-level, multi-bank memory subsystem becomes possible
- Possibly, the MMU and the cache controller (2nd-level and above) can be merged into the memory manager

7 Rationales for a HW OS Kernel (3/3)
Advantages in device management:
- The device manager can adopt a point-to-point communication paradigm
- The overhead of expensive interrupt-driven operations can be significantly reduced
- The processor and the accelerators can be more tightly coupled
- In the extreme case, the device manager circuit acts like a dedicated low-power I/O processor, without any instruction-set architecture (ISA) overhead

8 HW OS Kernel for RISC Core?
A HW OS kernel for a RISC core is possible, but current ISAs have some issues:
- No instructions for thread creation/registration
- No instructions for memory allocation/deallocation
- Event handling only through interrupt handling routines
Question: Are there alternatives?

9 Java ISA for Hardwiring the OS Kernel
The Java ISA was selected to realize a hardwired OS kernel in this work for several reasons:
- The Java language defines APIs for OS kernel services
- Java adopts dynamic resolution of “code” and “data”
- Java enforces the object-oriented programming model
  - This opens up more potential for a HW memory manager
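As an illustration of the first reason, the sketch below uses only standard Java language features. Thread creation, memory allocation, and synchronization appear as java.lang.Thread calls and as bytecodes (new, newarray, monitorenter/monitorexit), so a hardwired kernel has well-defined points to attach to. The class name KernelServicesDemo and its method are hypothetical and are not part of JAIP.

```java
// Hypothetical demo class; only standard java.lang features are used.
class KernelServicesDemo {
    private static final Object LOCK = new Object();
    private static int counter = 0;

    // Starts `threads` workers; each allocates an array and increments a shared counter.
    static int run(int threads, final int incrementsPerThread) {
        counter = 0;
        Thread[] workers = new Thread[threads];            // memory allocation (anewarray bytecode)
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {       // thread creation through java.lang.Thread
                public void run() {
                    int[] scratch = new int[16];           // memory allocation (newarray bytecode)
                    for (int k = 0; k < incrementsPerThread; k++) {
                        synchronized (LOCK) {              // monitorenter / monitorexit bytecodes
                            counter += 1;
                            scratch[k % scratch.length] = counter;
                        }
                    }
                }
            });
            workers[i].start();                            // hands the thread to the scheduler
        }
        try {
            for (Thread t : workers) {
                t.join();                                  // wait until every worker has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

None of these operations needs a system call or an interrupt handler at the source level, which is the contrast with the RISC ISA issues listed on the previous slide.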

10 Proposed Application Processor SoC
JAIP: Java Application IP core

11 The Boot Process of JAIP
The Power-On Boot Logic initializes the JAIP system.
[Figure: system state set up by the Power-On Boot Logic]
- JAIP Core blocks: Dynamic Resolution Tables, Bytecode Execution Engine, Initial Symbol Table, Symbol Cache, Method Cache
- Pre-parsed Classes (to support the class-parser class and the boot class): 43.5 KB
- Main memory (DDR3 DRAM), initial heap memory state: 2nd-level method area (loaded class runtime images), symbol table of all loaded classes, heap space for Java objects

12 Dynamic Resolution in JAIP
- The Dynamic Symbol Resolution Unit (DSRU) of a Java VM is similar to the MMU of a RISC/CISC processor:
  - The MMU translates virtual addresses to physical addresses
  - The DSRU translates Java literal names to physical handles
[Figure: the DSRU maps names in the Java name space (getfield abc, putfield def, invokevirtual pqr, invokevirtual xyz) to locations in the physical code/data space (DDR memory, on-chip SRAM, accelerator)]
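The MMU analogy can be sketched in software as a lookup table from literal names to handles. This is a minimal model under stated assumptions, not the JAIP circuit: the class SymbolResolver, the Handle layout (region plus offset), and the offsets used are all hypothetical. Only the three regions come from the figure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical software model of a symbol resolution table; not the JAIP hardware design.
class SymbolResolver {
    // Where a resolved symbol lives; the three regions mirror the slide's figure.
    enum Region { DDR_MEMORY, ON_CHIP_SRAM, ACCELERATOR }

    // A "physical handle": a region plus an offset inside it (this layout is an assumption).
    static final class Handle {
        final Region region;
        final int offset;

        Handle(Region region, int offset) {
            this.region = region;
            this.offset = offset;
        }
    }

    private final Map<String, Handle> table = new HashMap<String, Handle>();

    // Registers a symbol, as would happen when a class image is loaded.
    void define(String name, Region region, int offset) {
        table.put(name, new Handle(region, offset));
    }

    // Translates a literal name to its handle, the way an MMU translates an address.
    Handle resolve(String name) {
        Handle h = table.get(name);
        if (h == null) {
            throw new IllegalArgumentException("unresolved symbol: " + name);
        }
        return h;
    }

    public static void main(String[] args) {
        SymbolResolver dsru = new SymbolResolver();
        dsru.define("def", Region.ON_CHIP_SRAM, 64);   // offsets are made up for the demo
        dsru.define("xyz", Region.ACCELERATOR, 0);
        System.out.println(dsru.resolve("xyz").region);
    }
}
```

Because the caller only ever sees a handle, the same bytecode (for example invokevirtual xyz) can end up in DDR memory, on-chip SRAM, or an accelerator without any change to the program.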

13 Thread Manager in JAIP
JAIP uses a ping-pong stack buffer to enable zero-cycle context-switching overhead.
[Figure: a runtime stack selector connects one of two buffers, Java Stack 0 and Java Stack 1, to the bytecode execution engine for bytecode execution; the thread manager uses the other buffer for state saving and state loading]
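The ping-pong idea can be sketched as two buffers and a one-bit selector. This is a software model under assumptions: the class PingPongStack and its method names are hypothetical, and real hardware would run the save/load transfers in parallel with execution rather than as method calls.

```java
// Hypothetical model of a ping-pong (double-buffered) stack; all names are illustrative.
class PingPongStack {
    private final int[][] buffers;   // buffers[0] and buffers[1] stand for Java Stack 0 and Java Stack 1
    private int active = 0;          // selector: which buffer the execution engine currently sees

    PingPongStack(int words) {
        buffers = new int[2][words];
    }

    // Buffer currently wired to the bytecode execution engine.
    int[] executionStack() {
        return buffers[active];
    }

    // Thread manager side: copy the next thread's saved stack into the idle buffer.
    void loadNext(int[] savedState) {
        int[] idle = buffers[1 - active];
        System.arraycopy(savedState, 0, idle, 0, Math.min(savedState.length, idle.length));
    }

    // Thread manager side: copy out the idle buffer, which holds the previous thread's stack.
    int[] saveIdle() {
        return buffers[1 - active].clone();
    }

    // The context switch itself only flips the selector; no stack data moves here.
    void switchContext() {
        active = 1 - active;
    }

    public static void main(String[] args) {
        PingPongStack stack = new PingPongStack(4);
        stack.executionStack()[0] = 11;            // thread A runs on buffer 0
        stack.loadNext(new int[] {22, 0, 0, 0});   // thread B is prepared in buffer 1 meanwhile
        stack.switchContext();                     // B now runs; A's stack sits in the idle buffer
        System.out.println(stack.executionStack()[0] + " " + stack.saveIdle()[0]);
    }
}
```

All copying happens on the idle buffer, so the switch seen by the execution engine is a selector flip, which is how the overhead can be zero cycles.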

14 Time Quantum of JAIP
- The time quantum defaults to 20 secs. During this period, the thread manager must:
  - Store the state of the previous thread to the main memory
  - Restore the state of the next thread from the main memory
- If the thread store/restore takes longer than 20 secs, the time quantum of the current thread is extended
- The context switch itself still takes 0 clock cycles
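The extension rule amounts to taking the larger of the default quantum and the background state-transfer time. A minimal sketch, assuming both are measured in the same unit; the class QuantumModel and its parameter names are hypothetical.

```java
// Hypothetical model of the quantum-extension rule; both times use one common, arbitrary unit.
class QuantumModel {
    // Returns how long the current thread runs before the switch can happen.
    static long effectiveQuantum(long defaultQuantum, long stateTransferTime) {
        // The switch waits until the default quantum has elapsed AND the background
        // store/restore of thread state has finished, whichever comes later.
        return Math.max(defaultQuantum, stateTransferTime);
    }

    public static void main(String[] args) {
        System.out.println(effectiveQuantum(20, 12));   // transfer fits inside the quantum
        System.out.println(effectiveQuantum(20, 27));   // transfer overruns, so the quantum is extended
    }
}
```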

15 Memory Manager in JAIP
The memory manager is composed of a memory allocator and a garbage collector.
[Figure: Heap Management Unit. Labels: Dynamic Symbol Resolution Unit, Object Allocation Controller, Memory Allocation Table (dual-port SRAM; fields: address, object size), Garbage Collection Stack Memory, GC Controller, Bytecode Execution Engine, heap request]

16 Features of Allocator and Collector
- The memory allocation table (MAT) maintains a list of free memory blocks
- The Object Allocator (OA) and the Garbage Collector (GC) can access the MAT concurrently
- The OA searches for the first free block whose size S1 is larger than the request size S2, and splits it if S1 - S2 > Δ
- The GC collects any local object created in a method when the method returns, provided the object is no longer referenced
  - A call-stack memory is used for this purpose
- During GC, two consecutive free memory blocks are merged into one
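The first-fit search, the split rule, and the merging of adjacent free blocks can be sketched as a sequential software model. It is not the JAIP design: the class FirstFitHeap, the use of a sorted map as the table, the choice to accept an exact fit, and the threshold value passed as delta are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical software model of a free-block table; not the JAIP circuit.
class FirstFitHeap {
    private final TreeMap<Integer, Integer> free = new TreeMap<Integer, Integer>(); // address -> free size
    private final Map<Integer, Integer> used = new HashMap<Integer, Integer>();     // address -> granted size
    private final int delta;                                                        // split threshold (assumed value)

    FirstFitHeap(int heapSize, int delta) {
        this.delta = delta;
        free.put(0, heapSize);
    }

    // Returns the address of the granted block, or -1 if no free block fits.
    int allocate(int s2) {
        Integer addr = null;
        for (Map.Entry<Integer, Integer> e : free.entrySet()) {   // first fit, in address order
            if (e.getValue() >= s2) {                             // assumption: an exact fit is accepted
                addr = e.getKey();
                break;
            }
        }
        if (addr == null) {
            return -1;
        }
        int s1 = free.remove(addr);
        int granted = s1;
        if (s1 - s2 > delta) {                                    // split only if the remainder exceeds delta
            free.put(addr + s2, s1 - s2);
            granted = s2;
        }
        used.put(addr, granted);
        return addr;
    }

    // Frees a block and merges it with directly adjacent free blocks.
    void release(int addr) {
        Integer granted = used.remove(addr);
        if (granted == null) {
            throw new IllegalArgumentException("not allocated: " + addr);
        }
        int start = addr;
        int size = granted;
        Integer nextSize = free.remove(start + size);             // free block right after this one
        if (nextSize != null) {
            size += nextSize;
        }
        Map.Entry<Integer, Integer> prev = free.lowerEntry(start);
        if (prev != null && prev.getKey() + prev.getValue() == start) {   // free block right before
            start = prev.getKey();
            size += prev.getValue();
        }
        free.put(start, size);
    }

    int freeBlockCount() {
        return free.size();
    }

    public static void main(String[] args) {
        FirstFitHeap heap = new FirstFitHeap(100, 4);
        int a = heap.allocate(10);
        int b = heap.allocate(20);
        heap.release(a);
        heap.release(b);
        System.out.println(heap.freeBlockCount());
    }
}
```

The model scans the table one entry at a time; the rationale slides suggest that a hardware allocator can instead compare several entries in parallel and let the collector update the table concurrently.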

17 Device Manager in JAIP
The device manager in JAIP is tied to the DSRU.
[Figure labels: JAIP Core, Cache Controller, Bus Arbiter, Memory-Mapped I/O Devices, DSRU, Native HW Interface, Mem. Copy Accelerator, String Accelerator, JAIPPointer, External Access Unit, DDR3 Memory Controller, AXI4 System Bus, CST, XRT, dynamic resolution tables]

18 Experimental Results
- The JAIP SoC has been implemented on a Xilinx Kintex-7 device with a target clock of 100 MHz
- The comparison point is a 100 MHz PowerPC running CVM-JIT on Linux

19 Benchmark Programs
- The JemBench benchmark suite† is used to test the multi-threading performance
- Additional programs are used to test memory allocation and exception handling performance
- JemBench contains several benchmark groups: micro, kernel, application, parallel, and stream
† M. Schoeberl, T. B. Preusser, and S. Uhrig, “The Embedded Java Benchmark Suite JemBench,” Proc. of JTRES'10, Prague, Czech Republic, Aug. 2010.

20 Kernel and Application Benchmarks
These benchmarks test the efficiency of the bytecode execution engine and the cache subsystem:

           Sieve    Bubble Sort   Kfl      Lift     UdpIp
JAIP       9,858    6,272         30,681   30,173   16,078
CVM-JIT    15,398   8,587         44,401   49,017   19,469

Bytecode-operations performance (ops/sec):

Bytecodes            JAIP         CVM-JIT
iload_3 iadd         99,864,000   77,672,000
iload_3 idiv         3,153,000    2,713,000
ldc iadd             12,483,000   48,913,000
aload iaload         14,266,000   9,742,000
if_cmplt not taken   33,288,000   32,704,000
if_cmplt taken       19,972,000   24,385,000

21 Parallel and Stream Benchmark
The Parallel and Stream benchmarks test multi-threading performance:

            JAIP                         CVM-JIT
#threads    1     4     8      16        1     4      8      16
Dummy       181   148   110    108       385   61.3   31.2   15.4
Matrix      238   233   193    144       269   62.1   31.1   15.6
N-Queens    134   115   109    84.6      117   30.6   7.32 *
AES         286   146   69.1   34.4      798   180    1.55   0.84
* Only three CVM-JIT values are present for N-Queens; their thread counts cannot be assigned with certainty.

Synchronizations are much faster with JAIP:

Operations            JAIP        CVM-JIT
Synchronized (this)   2,325,000   977,237
Synchronized method   2,853,000   744,727

22 Benchmarking “new array” Operations
- Average number of clock cycles per “new array object” operation
- Based on 1024 creations of array objects of random length

Array size   JAIP    CVM-JIT   Speed up
1 ~ 32       391     1,270     3.25×
1 ~ 64       586     1,563     2.67×
1 ~ 128      1,074   2,344     2.18×
1 ~ 256      2,441   3,613     1.48×
1 ~ 512      4,590   7,324     1.60×
1 ~ 1024     8,790   12,794    1.46×
Numbers are in clock cycles
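The speed-up column is the ratio of CVM-JIT cycles to JAIP cycles, rounded to two decimals. A short sketch that reproduces it from the cycle counts above; the class SpeedUp is a hypothetical helper, not part of the benchmark code.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the speed-up column from the cycle counts on the slide.
class SpeedUp {
    // Speed-up of JAIP over CVM-JIT: how many times fewer clock cycles JAIP needs.
    static double ratio(long jaipCycles, long cvmJitCycles) {
        return (double) cvmJitCycles / jaipCycles;
    }

    // Two-decimal text form, with a fixed locale so the decimal separator is always a dot.
    static String format(long jaipCycles, long cvmJitCycles) {
        return String.format(Locale.ROOT, "%.2f", ratio(jaipCycles, cvmJitCycles));
    }

    public static void main(String[] args) {
        long[][] rows = { {391, 1270}, {586, 1563}, {1074, 2344}, {2441, 3613}, {4590, 7324}, {8790, 12794} };
        for (long[] row : rows) {
            System.out.println(format(row[0], row[1]));
        }
    }
}
```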

23 Java Exception Handling Performance
Java exception handling is more efficient with JAIP.

Single-level exception routine search (# clock cycles):

# Searched Routines   JAIP   CVM     CVM-JIT
1                     11     2,017   3,920
5                     73     2,300   4,487
10                    158    2,773   4,818
15                    243    3,260   5,082

Multiple-level exception routine search:

# Searched Callers    JAIP   CVM     CVM-JIT
5                     197    3,741   7,548
10                    435    6,224   11,837
15                    675    8,481   15,567
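The two searches measured above can be sketched as follows. The entry fields follow the exception table of the JVM class-file format (start_pc, end_pc, handler_pc, catch_type, with end_pc exclusive). Everything else is an assumption for illustration: the class HandlerSearch, the Frame type, and the simplification that a handler matches only on an exact type name, whereas a real JVM also accepts subclasses of the catch type.

```java
import java.util.List;

// Hypothetical model of exception-handler lookup; not how JAIP implements it in hardware.
class HandlerSearch {
    // One exception-table entry: covers pcs in [startPc, endPc) and jumps to handlerPc.
    static final class Entry {
        final int startPc;
        final int endPc;
        final int handlerPc;
        final String catchType;

        Entry(int startPc, int endPc, int handlerPc, String catchType) {
            this.startPc = startPc;
            this.endPc = endPc;
            this.handlerPc = handlerPc;
            this.catchType = catchType;
        }
    }

    // One activation on the call stack: the method's table and the pc where the exception passes through.
    static final class Frame {
        final List<Entry> table;
        final int pc;

        Frame(List<Entry> table, int pc) {
            this.table = table;
            this.pc = pc;
        }
    }

    // Single-level search: scan one method's routines in order; returns the handler pc or -1.
    static int findHandler(List<Entry> table, int pc, String thrownType) {
        for (Entry e : table) {
            // Simplification: exact type-name match only.
            if (pc >= e.startPc && pc < e.endPc && e.catchType.equals(thrownType)) {
                return e.handlerPc;
            }
        }
        return -1;
    }

    // Multiple-level search: walk from the innermost frame out through the callers;
    // returns the depth of the first frame with a matching handler, or -1.
    static int findHandlingFrame(List<Frame> callStack, String thrownType) {
        for (int depth = 0; depth < callStack.size(); depth++) {
            Frame f = callStack.get(depth);
            if (findHandler(f.table, f.pc, thrownType) >= 0) {
                return depth;
            }
        }
        return -1;
    }
}
```

In this model the cost grows with the number of entries scanned and the number of frames walked, which matches the two independent variables in the tables.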

24 “Future” Work
- A new multi-level priority queue scheduler is being implemented
- A new memory manager with a reference-count-based garbage collector is being implemented
- A four-core JAIP-MP will be integrated soon
[Figure labels: JAIP 0, JAIP 1, JAIP 2, JAIP 3, Thread Manager, System bus, Multicore coordinator, Global Thread Manager, Data-Coherence Controller, Power-On Boot Logic, Memory controller, DDR3 SDRAM, IPC]

25 Conclusions
- Better system-level optimizations can be achieved when the OS kernel and the processor cores are designed together in hardware
- The object-oriented memory model may prove important for data-intensive parallel applications
- Hardwired OS kernels enable parallel operation of system tasks and more flexibility in power-gating SoC components