Hi, my name is

Michalis Papadimitriou.

I run Java on GPUs.

I'm a software engineer building GPU-accelerated runtimes, compilers, and LLM inference engines for the JVM. Currently a Senior Software Engineer at Neo4j and a Research Fellow at the University of Manchester, where I help build TornadoVM and GPULlama3.java.

Get In Touch

About Me

Hello! I'm Michalis — a software engineer who likes making the JVM go fast on hardware it was never designed for. My work sits at the intersection of compilers, managed runtimes, and heterogeneous computing: getting Java programs to run efficiently on GPUs, FPGAs, and multi-core CPUs.

Today I'm a Senior Software Engineer at Neo4j and a Research Fellow at the University of Manchester, where I'm a core maintainer of TornadoVM and the lead author of GPULlama3.java — GPU-accelerated LLM inference written in pure Java.

Previously I was a Senior Software Engineer at OctoAI (now part of NVIDIA) working on the Apache TVM compiler, and I earned my PhD at the University of Manchester on performance optimisations for heterogeneous managed runtimes. Earlier, I worked at Huawei's research centre in Paris and at Ortec Finance in the Netherlands.

Here are some of the technologies I work with:

Java, C++, Python, Bash
CUDA, OpenCL, OpenMP
JVM internals & GraalVM
TornadoVM & Tensor IR
Apache TVM, ONNX Runtime
TensorRT, PyTorch
Nsight & VTune profiling
GitHub Actions (GPU runners), Docker

Michalis Papadimitriou speaking at Devoxx Belgium 2025 — Devoxx Belgium · photo by BeJUG (CC BY-NC-ND)

Where I've Worked

Senior Software Engineer @ Neo4j

May 2026 - Present

Building runtime and compiler components for high-performance graph data processing on the JVM.
Working across the stack on performance optimisation, hardware acceleration, and developer tooling.

Some Things I've Built

Featured Project

GPULlama3.java

GPU-accelerated inference for Llama3, Mistral, Qwen, Phi-3 and Granite models written in pure Java and accelerated with TornadoVM — no native bindings required. Runs across OpenCL, PTX/CUDA and Metal backends, integrates with LangChain4j and Quarkus, and is published on Maven Central.
- Java
- TornadoVM
- CUDA / PTX
- OpenCL
- Metal
Featured Project

TornadoVM

A Java framework for transparently running JVM applications on heterogeneous hardware — GPUs, FPGAs and multi-core CPUs — without rewriting them in CUDA or OpenCL. I am a core maintainer, focusing on the optimising compiler, the memory model, and multi-device execution.
- Java
- GraalVM
- GPUs / FPGAs
- OpenCL / PTX
Featured Project

MTMD Dynamic Scheduling

A Multiple-Tasks on Multiple-Devices (MTMD) dynamic scheduling approach for exploiting concurrency in heterogeneous managed runtimes.
- Java
- GPUs
- Multithread
- Machine Learning
Featured Project

GPU Memory with JIT Compilation

An alternative approach based on Just-In-Time (JIT) compilation to automatically and transparently exploit local memory allocation and data locality on GPUs.
- Java
- GPUs
- JIT Compilation
- GraalVM

Talks & Podcasts

Selected speaking

NVIDIA GTC 2026
Devoxx Greece 2026
Devoxx Belgium 2025
Oracle GraalVM Summit 2023–2025
JCON GenAI 2025
Devoxx London 2024
Apache TVM Conf 2021
Google Compiler Summit 2018/19

Invited 2026: Netflix HQ · Microsoft · Intel

Selected Publications


From JVM to GPU: Running LLMs natively in Java
- M. Papadimitriou
A feature article in JavaPro’s Java in the Age of AI issue on running large language models directly on GPUs from Java, using TornadoVM and GPULlama3.java.
- JavaPro Magazine — Java in the Age of AI (04/2026)
Scaling Up Performance of Managed Applications on NUMA Systems
- O. Papadakis
- A. Andronikakis
- N. Foutris
- M. Papadimitriou
- A. Stratikopoulos
- F. Zakkak
- P. Xekalakis
- C. Kotselidis
Scaling up the performance of managed applications on Non-Uniform Memory Access (NUMA) architectures has been a challenging task, as it requires a good understanding of the underlying architecture and managed runtime environments (MREs). Prior work has studied this problem from the scope of specific components of the managed runtimes, such as the Garbage Collectors, as a means to increase the NUMA awareness in MREs. In this paper, we follow a different approach that complements prior work by studying the behavior of managed applications on NUMA architectures during mutation time. At first, we perform a characterization study that classifies several Dacapo and Renaissance applications as per their scalability-critical properties. Based on this study, we propose a novel lightweight mechanism in MREs for optimizing the scalability of managed applications on NUMA systems, in an application-agnostic way. Our experimental results show that the proposed mechanism can result in relative performance ranging from 0.66x up to 3.29x, with a geometric mean of 1.11x, against a NUMA-agnostic execution.
- International Symposium on Memory Management (ISMM23)
A Multifaceted Memory Analysis of Java Benchmarks
- O. Papadakis
- A. Andronikakis
- N. Foutris
- M. Papadimitriou
- A. Stratikopoulos
- F. Zakkak
- P. Xekalakis
- C. Kotselidis
Java benchmarking suites like Dacapo and Renaissance are employed by the research community to evaluate the performance of novel features in managed runtime systems. These suites encompass various applications with diverse behaviors in order to stress test different subsystems of a managed runtime. Therefore, understanding and characterizing the behavior of these benchmarks is important when trying to interpret experimental results. This paper presents an in-depth study of the memory behavior of 30 Dacapo and Renaissance applications. To realize the study, a characterization methodology based on a two-faceted profiling process of the Java applications is employed. The two-faceted profiling offers comprehensive insights into the memory behavior of Java applications, as it is composed of high-level and low-level metrics obtained through a Java object profiler (NUMAProfiler) and a microarchitectural event profiler (PerfUtil) of MaxineVM, respectively. By using this profiling methodology we classify the Dacapo and Renaissance applications regarding their intensity in object allocations, object accesses, LLC, and main memory pressure. In addition, several other aspects such as the JVM impact on the memory behavior of the application are discussed.
- International Conference on Managed Programming Languages and Runtimes (MPLR23)
Multiple-Tasks on Multiple-Devices (MTMD): Exploiting Concurrency in Heterogeneous Managed Runtimes
- M. Papadimitriou
- E. Markou
- J. Fumero
- F. Blanaru
- A. Stratikopoulos
- C. Kotselidis
In this work, we propose a novel approach for enabling a Java-based heterogeneous managed runtime to automatically and efficiently deploy multiple tasks on multiple devices. We extend TornadoVM with parallel execution of bytecode interpreters to dynamically and concurrently manage and execute arbitrary tasks across multiple OpenCL-compatible devices. In addition, in order to achieve an efficient device-task allocation, we employ a machine learning approach with a multiple-classification architecture of Extra-Trees-Classifiers. Our proposed solution has been evaluated over a suite of 12 applications split into three different groups. Our experimental results showcase performance improvements of up to 83% compared to all tasks running on the single best device, while reaching up to 91% of the oracle performance.
- Virtual Execution Environments (VEE21)
Automatically Exploiting the Memory Hierarchy of GPUs through Just-in-Time Compilation
- M. Papadimitriou
- J. Fumero
- A. Stratikopoulos
- C. Kotselidis
An alternative approach based on Just-In-Time (JIT) compilation to automatically and transparently exploit local memory allocation and data locality on GPUs.
- Virtual Execution Environments (VEE21)
Transparent Compiler and Runtime Specializations for Accelerating Managed Languages on FPGAs
- M. Papadimitriou
- J. Fumero
- A. Stratikopoulos
- C. Kotselidis
In recent years, heterogeneous computing has emerged as the vital way to increase computers’ performance and energy efficiency by combining diverse hardware devices, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The rationale behind this trend is that different parts of an application can be offloaded from the main CPU to diverse devices, which can efficiently execute these parts as co-processors. FPGAs are a subset of the most widely used co-processors, typically used for accelerating specific workloads due to their flexible hardware and energy-efficient characteristics. These characteristics have made them prevalent in a broad spectrum of computing systems ranging from low-power embedded systems to high-end data centers and cloud infrastructures.
- The Art, Science, and Engineering of Programming (AOSA)
Efficient compilation and execution of JVM-based data processing frameworks on heterogeneous co-processors
- C. Kotselidis
- et al.
This paper addresses the fundamental question of how modern Big Data frameworks can dynamically and transparently exploit heterogeneous hardware accelerators. After presenting the major challenges that have to be addressed towards this goal, we describe our proposed architecture for automatic and transparent hardware acceleration of Big Data frameworks and applications. Our vision is to retain the uniform programming model of Big Data frameworks and enable automatic, dynamic Just-In-Time compilation of the candidate code segments that benefit from hardware acceleration to the corresponding format. In conjunction with machine learning-based device selection that respects user-defined constraints (e.g., cost, time, etc.), we enable dynamic code execution on GPUs and FPGAs transparently to the user.
- Design, Automation & Test in Europe Conference & Exhibition (DATE2020)
Heterogeneous Computing Architectures: Challenges and Vision

Heterogeneous Computing Architectures: Challenges and Vision provides an updated vision of the state-of-the-art of heterogeneous computing systems, covering all the aspects related to their design: from the architecture and programming models to hardware/software integration and orchestration to real-time and security requirements. The transitions from multicore processors, GPU computing, and Cloud computing are not separate trends, but aspects of a single trend: mainstream computers, from desktops to smartphones, are being permanently transformed into heterogeneous supercomputer clusters. The reader will get an organic perspective of modern heterogeneous systems and their future evolution.
- CRC Press
Dynamic application reconfiguration on heterogeneous hardware
- J. Fumero
- M. Papadimitriou
- F. Zakkak
- M. Xekalaki
- J. Clarkson
- C. Kotselidis
By utilizing diverse heterogeneous hardware resources, developers can significantly improve the performance of their applications. Currently, in order to determine which parts of an application suit a particular type of hardware accelerator better, an offline analysis that uses a priori knowledge of the target hardware configuration is necessary. To make matters worse, the above process has to be repeated every time the application or the hardware configuration changes.
- Virtual Execution Environments (VEE19)
Towards Prototyping and Acceleration of Java Programs onto Intel FPGAs
- M. Papadimitriou
- J. Fumero
- A. Stratikopoulos
- C. Kotselidis
In this work, we propose an approach for transparent compilation and execution of Java programs onto Intel FPGA devices. In detail, we showcase how a managed runtime environment can leverage the Intel OpenCL SDK to generate specialized FPGA code, enabling prototyping and acceleration of Java programs onto FPGAs. Finally, we describe our implementation in the context of TornadoVM with a clear objective to ease FPGA programmability, allowing integration with existing frameworks.
- 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Where I've Studied

Ph.D. in Computer Science @ University of Manchester

January 2018 - June 2021

Thesis: Performance Optimisations for Heterogeneous Managed Runtime Systems
Supervisor: Dr. Christos Kotselidis

What's Next?

Get In Touch

My inbox is always open — whether it's about GPU-accelerated Java, compilers and runtimes, a talk, or a new opportunity. I'll do my best to get back to you.

Say Hello

Where I've Worked

Senior Software Engineer @ Neo4j

Research Fellow @ University of Manchester

Senior Software Engineer @ OctoAI (now NVIDIA)

Software Engineer / PhD Candidate @ University of Manchester

Research Software Engineer @ Huawei Technologies

Research Software Engineer @ Ortec Finance

Talks & Podcasts

Building & Running LLMs on GPUs Directly from Java with TornadoVM & GPULlama3

GPULlama3.java: Beyond CPU Inference with Modern Java

TornadoVM: The Need for GPU Speed

Revolutionizing Java-based LLMs: GPUs with TornadoVM

From SIMD to CUDA with TornadoVM

Where I've Studied

Ph.D. in Computer Science @ University of Manchester

M.Sc. in Embedded Systems @ Delft University of Technology

B.Eng. in Computer Systems Engineering @ The University of Sheffield
