

Building Podish: An iOS-optimized Linux x86 container (that happens to be faster than iSH)

TL;DR: Podish is a high-performance Linux x86 user-space container optimized specifically for iOS and Apple Silicon. I wrote an i686 interpreter core in C++ and a Linux compatibility layer in C#. On an iPhone 17 (A19), it scores ~3400 on CoreMark, which is about twice as fast as iSH.

Web Demo: https://podish.meokit.com

GitHub: https://github.com/meokit/podish


Project Overview

Podish’s goal is simple: run x86 Linux programs as efficiently as possible on JIT-restricted iOS. It is not a full-system emulator (like UTM), but a user-space container — similar to iSH, but written completely from scratch, and roughly twice as fast on CoreMark (up to ~8.7× on some workloads).

What It Can Run

| Category | Representative Program | Status |
| --- | --- | --- |
| Shell / Base Userland | busybox / bash / vim | Stable |
| Scripting Runtime | python3 / LuaJIT | Verified; stable benchmarks |
| Build Toolchain | gcc / make | Verified; make compile works |
| Network / Dev Tools | git / OpenSSH | Manually verified; git clone works |
| Heavy Runtime | Node.js / Gemini CLI | Boots; occasional crashes (V8 JIT related) |

Screenshots

Terminal environment on iPhone:

Fastfetch on iPhone

Vim on iPhone

Yazi on iPhone

Browser Demo (no installation needed):

Browser Demo

CoreMark on iPhone:

CoreMark on iPhone

Running Gemini CLI:

Gemini CLI on iPhone

Gemini CLI on macOS

Performance at a Glance

| Workload | Podish (A19) | iSH (A19) | Speedup |
| --- | --- | --- | --- |
| CoreMark 1.0 | 3447 | 1692 | ~2.0× |
| python primes.py | 78.3 s | 684.4 s | ~8.7× |
| luajit -joff primes | 14.5 s | 27.5 s | ~1.9× |
| sh -lc true warm start | 20 ms | 30 ms | ~1.5× |

Full benchmarks and testing environment details are at the end in “Performance Data & Optimization Journey”.


The following sections dive into technical implementation details. If you only want to know the project’s current state, the overview above is sufficient. If you want to see how the interpreter gets this fast, keep reading.


Motivation & Background

iOS has a famous restriction: no JIT compilation. Specifically, the system enforces W^X (Write XOR Execute): a memory page can never be mapped writable and executable at the same time, and unsigned code cannot execute. This means you cannot download a JIT-accelerated UTM from the App Store — only the painfully slow UTM SE. You also cannot run LuaJIT in JIT mode.

My question: Just how fast can an interpreter get? Can a carefully designed, hardware-aware interpreter rival JIT performance without using JIT at all?

The main inspiration for the interpreter was LuaJIT Remake. Through it, I learned about Clang’s preserve_none, preserve_all, and [[musttail]] ABI features — tools that might let the compiler generate interpreter hot paths comparable to hand-written assembly.

The project started with hello world (week one), got CoreMark running (about a month later), and now stably runs Busybox, Bash, Python, LuaJIT, GCC, and even boots Node.js and Gemini CLI.


Overall Architecture

Podish is not a monolithic design; it has clear layers:

flowchart TB
    subgraph Guest["Guest x86 Linux Binary"]
        A["i686 ELF / libc / app"]
    end

    subgraph Core["Emulation Core (C++)"]
        B["Predecode / Block Builder"]
        C["Interpreter Dispatch + Semantics"]
        D["MMU / MicroTLB / SoftTLB"]
    end

    subgraph Runtime["Linux Runtime (C#)"]
        E["Syscall Layer"]
        F["VFS / Process / Signal / Network"]
        G["Netstack (smoltcp)"]
    end

    subgraph Orchestration["Container Layer (C#)"]
        H["Podish.Core / OCI"]
        I["Podish.Cli"]
    end

    subgraph Host["Host Platform"]
        J["iOS / macOS / Browser"]
        K["SwiftUI / SDL"]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    H --> I
    I --> J
    J --> K
%%endmermaid

The repository is roughly organized into these layers, mirroring the diagram above:

  - Emulation core (C++): predecode / block builder, interpreter dispatch and instruction semantics, MMU / MicroTLB / SoftTLB.
  - Linux runtime (C#): syscall layer, VFS / process / signal / network, smoltcp-based netstack.
  - Container layer (C#): Podish.Core / OCI and Podish.Cli.
  - Host platform frontends: iOS / macOS / Browser, SwiftUI / SDL.

Why C++ for the core and C# for the runtime?

This is a deliberate engineering trade-off.

Why C#? Early on, I needed to rapidly implement the semantics for about 200 Linux syscalls, a VFS layer, a network stack, and container lifecycle management. C#’s cross-platform I/O, string handling, async model, and rich standard library allowed me to push from “hello world” to a running CoreMark in just a month. If I had used pure C++, I’d probably still be writing wrappers for std::filesystem.

Controlling boundary overhead: To prevent P/Invoke from becoming a bottleneck, I made a few key design choices around how often, and how cheaply, the emulation core crosses into C#.

Profiling shows that in compute-heavy workloads (like CoreMark), language boundary overhead accounts for less than 1%. In I/O-heavy workloads (like git clone), the bottlenecks are the network and VFS, not P/Invoke.


Interpreter Core: Four Key Design Decisions

The core goal of the interpreter is: push the overhead of x86 emulation as low as possible without using JIT. Here are the four most critical design decisions.

1. Pre-decoding + Tail-call Dispatch

Traditional interpreters usually have a central dispatch loop (while (1) { decode; dispatch; execute; }), returning to the loop after every instruction. My approach shifts dispatch forward:

flowchart LR
    A["x86 Binary"]
    B["BasicBlock"]
    C["Handler: mov"]
    D["Handler: add"]
    E["Handler: jmp"]
    Ret["Return to X86_Run"]
    A -->|Predecode| B
    B -->|Entry| C
    C -->|musttail call| D
    D -->|musttail call| E
    E -->|Block end| Ret
%%endmermaid

Specifically:

  1. Pre-decode: Variable-length x86 instructions are decoded into a fixed-length Intermediate Representation (DecodedOp, 32 bytes), stored sequentially in a Basic Block.
  2. Tail-call chain: Each IR carries its own handler function pointer. After executing the current instruction, instead of returning to a central dispatcher, it directly tail-calls into the next instruction’s handler via [[musttail]].
  3. Caching: Decoded results are cached in BasicBlock, avoiding repeated decode.

The result: roughly 10× faster than QEMU TCI. (A big shoutout to Justine Tunney’s blink project, which introduced me to Intel’s XED and helped me build the table-driven decoder).
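To make the pattern concrete, here is a minimal sketch of the dispatch chain (illustrative types and names, not Podish’s actual code; it assumes Clang, since [[clang::musttail]] and preserve_none are Clang features):

#include <cstdint>

struct CpuState;
struct DecodedOp;

// Handlers carry the hot state in registers. Podish additionally marks
// them with Clang's preserve_none calling convention so nothing spills
// across the chain; omitted here to keep the sketch portable.
using HandlerFunc = void (*)(CpuState*, const DecodedOp*, uint64_t);

struct DecodedOp {
    HandlerFunc handler;   // entry point for this instruction's semantics
    uint32_t    next_eip;  // PC of the next instruction
    // ... operand fields and the 16-byte extension area (32 bytes total) ...
};

struct CpuState {
    uint32_t regs[8];
    uint64_t flags_state;  // committed copy of the register-held flags_cache
};

static void op_nop(CpuState* s, const DecodedOp* op, uint64_t flags_cache) {
    // Ops are stored sequentially in the block, so "next" is just op + 1.
    // [[clang::musttail]] guarantees a jump rather than a call: no stack
    // growth and no return to a central dispatch loop.
    [[clang::musttail]] return (op + 1)->handler(s, op + 1, flags_cache);
}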

DecodedOp is fixed at 32 bytes, aligned to a 16-byte boundary:

| Field | Type | Offset | Size | Description |
| --- | --- | --- | --- | --- |
| handler | HandlerFunc* (function pointer) | 0 | 8 | Entry point for the instruction’s semantics |
| next_eip | uint32_t | 8 | 4 | PC of the next instruction |
| len | uint8_t | 12 | 1 | Instruction byte length |
| modrm | uint8_t | 13 | 1 | ModRM byte |
| prefixes | Prefixes (union with bitfields) | 14 | 1 | Prefix info (lock/rep/segment…) |
| meta | Meta (union with bitfields) | 15 | 1 | Meta info (has_mem / has_imm / ext_kind…) |
| ext.data | DecodedMemData | 16 | 16 | Memory operand description (imm / ea_desc / disp) |
| ext.link.next_block | BasicBlock* | 24 | 8 | Cached successor block pointer after block linking |
| ext.control | DecodedControlFlowData | 16 | 16 | Control-flow target (target_eip / cached_target) |

If we take a simple memory ALU instruction as an example:

Guest:
  add eax, [ebx+4]

Predecoded IR:
  entry      = op_add_r32_rm32
  next_pc    = 0x...
  operands   = { dst=eax, src=mem(base=ebx, disp=4, size=4) }
  flags_mode = arithmetic
  control    = fallthrough

2. Parity-Only Lazy Flags

Implementing x86 EFLAGS accurately is tedious. QEMU uses a very elegant lazy-evaluation system: it stores CC_SRC, CC_DST, and CC_OP, and only computes the flags when they are explicitly requested.

I tried implementing this, and it was slower than calculating them eagerly. Why? Because writing those three variables to memory and reading them back was killing me — I was already memory-bound.

My compromise: I only evaluate the Parity Flag (PF) lazily.

flowchart TB
    A["ALU Instruction"] --> B{"Writes Flags?"}
    B -->|Yes| C["Update CF/ZF/SF/OF/AF/DF in flags_cache[31:0]"]
    C --> D["SetParityState: store result byte to flags_cache[47:40]"]
    B -->|No| E["Skip flags update"]

    F["jcc / cmov / setcc"] --> G{"Cond depends on PF?"}
    G -->|No| H["Read flags_cache[31:0] directly"]
    G -->|Yes| I["EvaluatePF: compute parity from flags_cache[47:40]"]
%%endmermaid

During execution, flags_cache is a uint64_t passed through the tail-call chain: bits [31:0] hold the materialized CF/ZF/SF/OF/AF/DF, and bits [47:40] hold the last ALU result byte, from which PF can be recomputed on demand.

Why only lazy PF? CF/ZF/SF/OF are written by almost every ALU instruction and read by almost every jcc/cmov/setcc; doing them lazily adds memory overhead that isn’t worth it. But PF is expensive to compute (it requires counting bits) and rarely read (only by JP/JNP, cond 10/11), so it is worth deferring.

Static layer: Within a basic block, I do def-use analysis on flags. If an instruction writes flags but no one reads them later, it is replaced directly with a no-flags handler variant. This variant is expressed through template parameters like AluAdd<T, false> and AluSub<T, false>; the compiler completely elides the entire flags-update path.

Dynamic layer: During execution of each instruction, flags_cache is passed as a register argument through the tail-call chain and is never written back to memory; only PF is deferred.

Commit semantics: CommitFlagsCache is only called once at handler chain exit boundaries (ExitOnCurrentEIP, ExitOnNextEIP, restart/retry, resolver miss), writing the register-held flags_cache back to state->ctx.flags_state. During successful chaining it is never committed — because the next instruction’s handler will continue passing the same flags_cache as a register argument. X86_Run() and X86_Step() also do not commit after the handler returns, to avoid overwriting updated state with a stale caller-side copy.

An interesting detail is the CheckCondition LUT path: the vast majority of conditional jumps (cond 0-9, 12-15) do not depend on PF, so they read the low 32 bits via GetFlags32Raw(flags_cache) and use a LUT lookup to decide the branch direction; only cond 10/11 (JP/JNP) branch separately to EvaluatePF(). This design makes the extra cost of PF laziness nearly zero.
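Under the bit layout described above, the PF primitives reduce to a few lines. The function names mirror the post; the bodies are my reconstruction:

#include <bit>
#include <cstdint>

constexpr int kPfByteShift = 40;

// Defer PF: just remember the low byte of the ALU result.
inline uint64_t SetParityState(uint64_t flags_cache, uint8_t result) {
    flags_cache &= ~(0xFFull << kPfByteShift);
    return flags_cache | (uint64_t(result) << kPfByteShift);
}

// The low 32 bits hold the already-materialized CF/ZF/SF/OF/AF.
inline uint32_t GetFlags32Raw(uint64_t flags_cache) {
    return static_cast<uint32_t>(flags_cache);
}

// Only JP/JNP (cond 10/11) ever pay for this.
inline bool EvaluatePF(uint64_t flags_cache) {
    unsigned byte = unsigned(flags_cache >> kPfByteShift) & 0xFFu;
    return (std::popcount(byte) & 1) == 0;  // x86 PF: even parity of low byte
}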

3. Memory Access is the Real Bottleneck: MicroTLB and SoftTLB

I spent so much time analyzing dispatch overhead, only to realize during profiling that my biggest issue was shockingly basic: address translation.

flowchart LR
    A["Guest Address"] --> B{"MicroTLB Hit?"}
    B -->|Yes| C
    B -->|No| D{"SoftTLB Hit?"}
    D -->|Yes| E
    D -->|No| F
    F --> G
    G --> E
    E --> C
    C["host_ptr = guest_addr + addend"]
    E["Refill MicroTLB"]
    F["Page Table Walk"]
    G["Permission Check"]
%%endmermaid

SoftTLB consists of three direct-mapped, fixed-size tables, one per access kind:

| Field | Type | Offset | Size | Description |
| --- | --- | --- | --- | --- |
| read_tlb | std::array<TlbEntry, 256> | 0 | 4096 | Fast read-permission mapping table |
| write_tlb | std::array<TlbEntry, 256> | 4096 | 4096 | Fast write-permission mapping table |
| exec_tlb | std::array<TlbEntry, 256> | 8192 | 4096 | Fast execute-permission mapping table |

TlbEntry is aligned to 16 bytes, fixed at 16 bytes:

| Field | Type | Offset | Size | Description |
| --- | --- | --- | --- | --- |
| tag | uint32_t | 0 | 4 | Guest page tag (high 20 bits) |
| perm | Property (enum) | 4 | 4 | Permission bits (Read / Write / Exec / Dirty …) |
| addend | std::uintptr_t | 8 | 8 | host_ptr = guest_addr + addend |

Indexing is also very direct:

idx = (guest_addr >> PAGE_SHIFT) & 255
tag = guest_addr & ~PAGE_MASK
host_ptr = guest_addr + addend

In other words, SoftTLB’s job is to compress already-looked-up guest page mappings into a fast tag + addend entry. On a hit, you just check the tag and do one addition to get the host address; on a miss, you fall back to the slow path to check permissions, fill the entry, and handle exceptions.
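Putting the layout and the indexing together, a read lookup is roughly the following (an illustrative sketch, not the project’s exact API):

#include <array>
#include <cstdint>

constexpr uint32_t kPageShift = 12;
constexpr uint32_t kPageMask  = (1u << kPageShift) - 1;

struct TlbEntry {
    uint32_t  tag;     // guest page base: guest_addr & ~kPageMask
    uint32_t  perm;    // Read / Write / Exec / Dirty ...
    uintptr_t addend;  // host_ptr = guest_addr + addend
};

struct SoftTlb {
    std::array<TlbEntry, 256> read_tlb;  // write_tlb / exec_tlb look the same

    void* LookupRead(uint32_t guest_addr) {
        TlbEntry& e = read_tlb[(guest_addr >> kPageShift) & 255];
        if (e.tag == (guest_addr & ~kPageMask))  // hit: one compare, one add
            return reinterpret_cast<void*>(uintptr_t{guest_addr} + e.addend);
        return nullptr;  // miss: page walk, permission check, refill
    }
};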

The “shockingly basic” issue from that profiling session turned out to be a very silly TLB refill bug. Fixing it immediately bumped my CoreMark from 600 to 800, and it made me realize: address translation is the true bottleneck.

I introduced a MicroTLB — a tiny 16-byte structure kept permanently in a host register during the execution chain. If it hits, it’s just a quick tag match and an addition (host_ptr = guest_addr + addend).

MicroTLB is a resident register structure, aligned to 16 bytes, fixed at 16 bytes:

| Field | Type | Offset | Size | Description |
| --- | --- | --- | --- | --- |
| tag_r | uint32_t | 0 | 4 | Read-permission tag, default 0xFFFFFFFF (miss state) |
| tag_w | uint32_t | 4 | 4 | Write-permission tag, default 0xFFFFFFFF (miss state) |
| addend | std::uintptr_t | 8 | 8 | host_ptr = guest_addr + addend |

Note the data layout: tag_r and tag_w pack into a single 64-bit register, so the whole structure occupies just two host registers.

Its hit/miss path can be represented as:

guest address
  -> check read_tag / write_tag
  -> hit: host_ptr = guest_addr + host_guest_addend
  -> miss: full translation + permission check + refill MicroTLB

During refill, permissions are checked; if there is no read permission, read_tag is cleared, and vice versa.

This design may seem odd: both tags share a single addend, so if reads and writes repeatedly touch different pages, the MicroTLB will ping-pong and the hit rate drops to zero.

But most of the time it works. My measurements show a hit rate above 50%. As long as there is some hit rate, reducing the frequency of reads from the in-memory SoftTLB provides a performance boost.
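For comparison with the SoftTLB path, here is an illustrative sketch of the register-resident fast path:

#include <cstdint>

constexpr uint32_t kPageMask = 0xFFFu;

struct MicroTlb {
    uint32_t  tag_r  = 0xFFFFFFFFu;  // read tag, initialized to miss state
    uint32_t  tag_w  = 0xFFFFFFFFu;  // write tag
    uintptr_t addend = 0;            // shared by both tags
};

inline void* MicroTranslateRead(const MicroTlb& u, uint32_t guest_addr) {
    if ((guest_addr & ~kPageMask) == u.tag_r)  // one compare, one add
        return reinterpret_cast<void*>(uintptr_t{guest_addr} + u.addend);
    return nullptr;  // miss: fall back to SoftTLB, then refill tag_r/addend
}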

4. Block Linking and Superopcodes

In this interpreter, BasicBlock is an object with a fixed-size header followed by a variable-length instruction stream. Its header looks roughly like this:

BasicBlock is aligned to 16 bytes; the header is fixed at 48 bytes, followed by a variable-length DecodedOp array:

| Field | Type | Offset | Size | Description |
| --- | --- | --- | --- | --- |
| chain.start_eip | uint64_t bitfield (32 bits) | 0 | 4 (embedded in BasicBlockChainPrefix) | Block start guest PC |
| chain.inst_count | uint64_t bitfield (8 bits) | 4 | 1 | Number of instructions in the block |
| chain.valid | uint64_t bitfield (1 bit) | 5 | 1 bit | Whether the block is valid (for invalidation marking) |
| entry | HandlerFunc* | 8 | 8 | Semantic entry point of the first instruction; the interpreter jumps here directly |
| end_eip | uint32_t | 16 | 4 | Block end address (next_eip of the last instruction) |
| slot_count | uint32_t | 20 | 4 | Total slot count (including trailing sentinel) |
| sentinel_slot_index | uint32_t | 24 | 4 | Index of the sentinel in the slots array |
| branch_target_eip | uint32_t | 28 | 4 | Branch target address (valid for conditional/direct jumps) |
| fallthrough_eip | uint32_t | 32 | 4 | Fallthrough address |
| terminal_kind_raw | uint8_t | 36 | 1 | Terminal kind (None / DirectJmpRel / DirectJccRel / Other) |
| block_padding0 | uint8_t | 37 | 1 | Padding |
| block_padding1 | uint16_t | 38 | 2 | Padding |
| exec_count | uint64_t | 40 | 8 | Number of times the block has been executed (for profile-guided superopcodes) |
| slots[] | DecodedOp[] | 48 | variable (32 bytes each) | Pre-decoded instruction stream |

A few fields are especially critical:

  - entry: the interpreter jumps straight to the first instruction’s handler, with no per-block dispatch.
  - chain.valid: lets SMC invalidation mark a block dead without tearing it down immediately.
  - exec_count: feeds profile-guided superopcode generation.
  - branch_target_eip / fallthrough_eip / terminal_kind_raw: give block linking the successor information it needs.

DecodedOp itself is not a minimal “handler-pointer-only” structure. Beyond handler and next_eip, it also stores memory operand info, control-flow targets, or the next_block pointer cached after block linking in its extension area.

So what BasicBlock really does is bind three things together: a guest PC range (start_eip to end_eip), the pre-decoded instruction stream (slots[]), and the control-flow metadata needed for linking (branch targets, cached next_block pointers).

When I later implemented block linking, I stitched and reused directly on top of the existing BasicBlock.

flowchart LR
    A["Block A"] -->|"fallthrough"| B["Block B"]
    A -->|"branch taken"| C["Block C"]

    D["Superopcode"] -->|"fused"| E["pop ebx ; pop esi"]
    D -->|"fused"| F["test eax, eax ; je"]
    D -->|"fused"| G["mov eax, [esp] ; sub reg, eax"]
%%endmermaid

Block Linking: If a basic block is short enough, its instructions don’t cross page boundaries, and its control flow is simple, the successor block is stitched directly to the end of the current block. This reduces some indirect memory access during inter-block jumps.

Superopcode: At first I tried simple bigram statistics, but the results were mediocre. The reason is obvious in hindsight: the most frequent bigrams in the static instruction stream aren’t necessarily the most frequently executed ones, and the total number of fixed instruction pairs isn’t large to begin with. Additionally, because we do instruction specialization, the specialized instructions become extremely sparse, leaving even fewer pairs to fuse.

The more effective approach is:

  1. Rank anchors by actual execution counts: use each block’s exec_count and block traces to find instructions that are genuinely hot, not just statically frequent.
  2. Grow candidates along dependences: extend each anchor through def-use / RAW filtering into a short fusable sequence.
  3. Generate and verify: emit a fused handler for each candidate, then keep only those that pass regression verification and show a measurable benefit.

This strategy ultimately generated about 256 superopcodes, and was one of the key steps in stably pushing performance past 3000+ CoreMark.

Representative examples in the current generation set include (one of them is sketched in code after the pipeline below):

  - pop ebx ; pop esi (two pops fused into one paired load)
  - test eax, eax ; je (compare and branch fused)
  - mov eax, [esp] ; sub reg, eax (load plus dependent ALU op)

The candidate discovery pipeline is straightforward:

block trace
  -> hotspot statistics
  -> anchor selection
  -> def-use / RAW filtering
  -> code generation
  -> regression verification and benefit review

SMC (Self-Modifying Code) Handling

SMC (Self-Modifying Code) is standard in modern JIT engines (V8, LuaJIT, .NET JIT): they write machine code to memory and immediately jump to it. The emulator must handle this correctly.

My approach is not to monitor every write operation globally, but to reuse the MMU’s permission system, pushing detection costs into permission bits:

flowchart TB
    A["mmap/mprotect\nPROT_EXEC | PROT_WRITE"] --> B["Register External Alias"]
    B --> C{"Live BasicBlock\non this page?"}
    C -->|Yes| D["Set ForceWriteSlow\non guest page entries"]
    C -->|No| E["Normal page"]

    F["Guest Write Instruction"] --> G{"ForceWriteSlow?"}
    G -->|Yes| H["invalidate_code_cache_page"]
    H --> I["Mark blocks invalid"]
    G -->|No| J["Fast write path"]
%%endmermaid

When a host memory page is mapped as both Executable and Writable, the MMU internally marks it as an External Alias. If BasicBlocks have already been pre-decoded and cached on this page, the MMU tags the write permission for this page with a ForceWriteSlow flag. Any subsequent write to this page will miss the fast TLB, hit ForceWriteSlow, and be forced into a slow path. The slow path calls invalidate_code_cache_page(addr), invalidating all BasicBlocks associated with that page.

Execution-time race: Merely invalidating the block cache on write is not enough — if the current EIP happens to fall on the page being written, the tail-call jump to the next instruction may already have been overwritten. For this, I implemented ShouldInterceptExecWriteForSmc: once it detects that the page containing the current EIP is being modified, it immediately yields and switches to a single-instruction safe mode (max_insts = 1), guaranteeing a clear instruction boundary between the write operation and the jump to new code, preventing races.

Multi-Engine TLB Consistency: clone(CLONE_VM) creates new threads that share the same MmuCore, but each Engine has its own SoftTLB. For this, I implemented a RuntimeTlbShootdownRing (1024-slot ring buffer): the party that changed the page table writes the flushed guest page into the ring, and other Engines consume from this ring and flush their local TLBs before their next execution.

This mechanism allows LuaJIT’s -joff mode to run stably and even lets Node.js/V8 start up. Occasional crashes remain a known limitation, likely due to subtle bugs in my instruction set implementation or incomplete Linux syscalls.

SMC Mechanics in Detail

The overview above covers the high-level idea, but if you’re building something similar, the details below are worth reading. The core philosophy is to push all detection costs into the MMU permission bits, avoiding global write monitoring or disassembly scans.

External Alias Tracking. MmuCore maintains an external_aliases map internally (key is the host page pointer; value contains exec_count, write_count, and the set of associated guest pages). When the C# layer maps a host page as External + Write + Exec via mmap, that page gets registered in this table. This table is only updated on page-table changes (mmap/mprotect/munmap); the hot path never touches it.

Lazy Arming of ForceWriteSlow. What actually determines whether an external page needs SMC detection is not whether it has Exec permission, but whether there are cached BasicBlocks on it. refresh_smc_armed_for_host_page() is called in two situations:

  1. A new block has been decoded, and we need to check if any of the host pages it covers have writable external mappings;
  2. An external mapping is established or modified (e.g., mprotect adds PROT_WRITE).

If (exec_count > 0) && (there is a live block on this host page), all guest page entries for this page are tagged with ForceWriteSlow. This bit is a runtime temporary mark; it is not persisted, not copied by fork, and does not appear in deep copies of the page table.

Write-path Interception Chain. When the guest executes a write instruction:

write() -> MicroTLB miss -> SoftTLB miss -> resolve_slow()
  -> look up page directory -> find ForceWriteSlow in permissions
  -> call invalidate_code_cache_page(guest_addr)
  -> mark all BasicBlocks on the associated host page as invalid
  -> continue with the actual memory write

invalidate_code_cache_page does not traverse the entire block cache. Instead, it uses a page_to_blocks reverse index (host page → block list) to achieve O(number of affected blocks) local invalidation. Blocks marked invalid are automatically re-decoded the next time X86_Run tries to enter them.
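A sketch of that reverse index (shapes assumed; the real structure lives inside MmuCore, and the real entry point takes a guest address and resolves the host page first):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct BasicBlock { bool valid = true; /* 48-byte header + slots ... */ };

// host page base -> blocks whose bytes were decoded from that page
std::unordered_map<uintptr_t, std::vector<BasicBlock*>> page_to_blocks;

void InvalidateCodeCacheHostPage(uintptr_t host_page) {
    auto it = page_to_blocks.find(host_page);
    if (it == page_to_blocks.end()) return;  // no cached code on this page
    for (BasicBlock* b : it->second)
        b->valid = false;  // X86_Run re-decodes the block on next entry
    page_to_blocks.erase(it);
}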

Execution-time Race: Old Code Currently Running vs. New Code Just Written. Merely invalidating the block cache on write is not enough — if the current EIP happens to fall on the page being written, the tail-call jump to the next instruction may already have been overwritten. For this, mmu_impl.h has ShouldInterceptExecWriteForSmc:

if (state->intercept_exec_write_for_smc && !state->allow_write_exec_page) {
    uint32_t current_page = state->ctx.eip >> 12;
    uint32_t target_page  = addr >> 12;
    if (target_page == current_page || target_page == current_page + 1)
        return true;  // trigger SMC yield
}

Once this condition hits, the write does not execute immediately. Instead, state->smc_write_to_exec is set, and the current handler returns a special yield flow. When X86_Run’s main loop detects smc_write_to_exec, it enters single-instruction safe mode: the pending write is completed at the yield boundary, and execution resumes with max_insts = 1, so at most one guest instruction runs per loop iteration while the affected page is in play.

This guarantees a clear instruction boundary between the write operation and the jump to new code, preventing races.

TLB Consistency Under Multi-Engine Sharing. When clone(CLONE_VM) creates a new thread, multiple Engines share the same MmuCore, but each Engine has its own SoftTLB and block_lookup_cache. Engine A’s invalidate_code_cache_page does not automatically flush Engine B’s local TLB. For this, MmuCore has a RuntimeTlbShootdownRing (1024-slot ring buffer): the Engine that changed the page table writes each flushed guest page into the ring, and every other Engine drains the ring and flushes the corresponding local TLB entries before its next execution slice.

This ring’s capacity is large enough to cover the vast majority of single mprotect/munmap page ranges; if it overflows, a full flush is issued directly.
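A sketch of the ring’s contract as I understand it from the above (illustrative, not the actual implementation):

#include <array>
#include <atomic>
#include <cstdint>

struct RuntimeTlbShootdownRing {
    static constexpr uint64_t kSlots = 1024;
    std::array<std::atomic<uint32_t>, kSlots> pages{};
    std::atomic<uint64_t> head{0};  // total pages ever published

    // Producer: the engine that changed the page table.
    void Publish(uint32_t guest_page) {
        uint64_t h = head.fetch_add(1, std::memory_order_acq_rel);
        pages[h % kSlots].store(guest_page, std::memory_order_release);
    }

    // Consumer: each engine drains from its private cursor before running.
    // Returns false if entries were overwritten -> caller does a full flush.
    template <typename FlushFn>
    bool Drain(uint64_t& cursor, FlushFn&& flush_page) {
        uint64_t h = head.load(std::memory_order_acquire);
        if (h - cursor > kSlots) return false;  // overflow: entries lost
        for (; cursor < h; ++cursor)
            flush_page(pages[cursor % kSlots].load(std::memory_order_acquire));
        return true;
    }
};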


A Failed Experiment with Copy-and-Patch JIT

The concept of a Copy-and-Patch JIT is beautifully simple: pre-compile opcode handlers into binary stencils, and at runtime, just copy the template and patch in the constants/addresses. Theoretically it bypasses the complexity of a full backend, register allocation, and traditional machine-code generation (see paper: Copy-and-Patch Compilation).

Since I was already using preserve_none + [[musttail]] to push interpreter hot paths incredibly low, I tried turning handlers into stencils, patching operands, and stringing them together into machine code. I was hoping for a 200%+ performance boost. It was actually slightly slower than my interpreter.

The reason: my direct-threaded interpreter had already pushed dispatch overhead incredibly low. The real bottlenecks were memory access, address translation, state maintenance, and I-Cache pressure. By generating stencils, I eliminated some bytecode reading, but the resulting code bloat absolutely wrecked the I-Cache.

It was a humbling lesson: a poorly fitted JIT is just garbage, and loading immediates isn’t necessarily faster than reading bytecode if your cache is thrashing.

This forced me to profile more carefully, and it confirmed where the real costs live: memory access, address translation, state maintenance, and I-Cache pressure, not dispatch.


Threading Model

flowchart TB
    subgraph Scheduler["KernelScheduler (1 Host Thread)"]
        S["Scheduler Loop"]
    end

    subgraph TaskA["FiberTask A"]
        A1["Engine A"]
        A2["SoftTLB A"]
        A3["block_lookup_cache A"]
    end

    subgraph TaskB["FiberTask B"]
        B1["Engine B"]
        B2["SoftTLB B"]
        B3["block_lookup_cache B"]
    end

    subgraph Shared["Shared (Atomic RC)"]
        M["MmuCore"]
        R["RuntimeTlbShootdownRing"]
    end

    S -->|"yield / switch"| A1
    S -->|"yield / switch"| B1
    A1 --> M
    B1 --> M
    M --> R
%%endmermaid

I implemented clone/fork/vfork and basic pthread semantics, but the concurrency model is not OS-level multi-threaded parallelism: a single host thread runs the KernelScheduler loop; each guest thread or process becomes a FiberTask with its own Engine, SoftTLB, and block_lookup_cache; all tasks share one atomically reference-counted MmuCore plus the RuntimeTlbShootdownRing; and the scheduler switches between tasks cooperatively at yield points.

The ceiling of this design is that multiple guest threads are serialized at the host level and cannot leverage multi-core parallelism. For shells, Python, and compilation tasks inside the container, this is usually enough; but if the guest runs heavily parallel scientific computing, performance will be noticeably limited.
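The scheduling loop itself reduces to something like this (illustrative interface with a stub Run; real yields also happen on syscalls, sleeps, and SMC interception):

#include <cstdint>
#include <vector>

struct Engine {
    // Runs up to max_insts guest instructions; returns false once the
    // guest task has exited. Stubbed here; the real engine executes code.
    bool Run(uint32_t /*max_insts*/) { return false; }
};

struct FiberTask {
    Engine engine;       // owns its SoftTLB and block_lookup_cache
    bool   exited = false;
};

void KernelSchedulerLoop(std::vector<FiberTask>& tasks) {
    constexpr uint32_t kBudget = 100000;  // instruction budget per slice
    for (bool live = true; live;) {
        live = false;
        for (FiberTask& t : tasks) {
            if (t.exited) continue;
            live = true;
            t.exited = !t.engine.Run(kBudget);  // cooperative time slice
        }
    }
}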


Linux Compatibility Layer & Usability

Beyond the CPU core, the project also implements a Linux compatibility environment, including a syscall layer covering roughly 200 Linux syscalls, a VFS, process and signal management, and a user-space network stack built on smoltcp.

This layer is what determines whether the project can move from benchmarks to real-world workloads.

| Category | Example / Evidence | Current Status | Main Limitation |
| --- | --- | --- | --- |
| Shell | /bin/sh -lc true | Verified | Currently covers shell startup and short commands |
| Coreutils / Archive | grep -R / tar cf / tar xf | Verified | |
| Scripting | python (primes.py) | Verified | Workload currently has only a few samples; easily affected by phone temperature |
| JIT-heavy | LuaJIT (primes_jit.lua) | Verified | Results were once polluted by a notify-socket / network-path bug, now fixed |
| Build-oriented mixed workload | make compile on CoreMark tree | Verified | Current main table uses a compile-only metric |
| Network / tooling compat | git clone | Compatibility verified | High network variance |

Graphics & Audio: Experimental Wayland and PulseAudio Support

The codebase already has two experimental multimedia pipelines, a Wayland graphics path (enough to run the foot terminal, pictured below) and a PulseAudio audio path, but neither has matured enough for daily testing:

Wayland Foot Terminal

These two features are currently macOS only; iOS / Web adaptations have not been done yet.


Performance Data & Optimization Journey

The Journey from 600 to 3000

| Phase | Key Change | CoreMark | Notes |
| --- | --- | --- | --- |
| Initial | Baseline interpreter | ~600 | ABI / TLB design was in place, but a TLB bug ruined the hot path |
| Bugfix | Fixed address translation | ~800 | Confirmed address translation was the biggest issue |
| Hot Path | PC / block budget / template specialization | ~1500 | Stopped writing redundant state to memory |
| Memory | Data layout / paired loads | ~2000 | Shifted focus from dispatch to memory access |
| Lazy Flags | Static pruning + PF lazy | ~2200 | Full lazy flags trial was slower; chose the middle ground |
| Linking | Append simple successor blocks | ~2500 | Lowered block boundary costs |
| Superops | Profile-guided fused ops | ~3000 | ~256 anchor + RAW superopcodes stabilized at ~3000 |

Key takeaways:

  - Address translation, not dispatch, was the single biggest win.
  - Memory behavior (data layout, avoiding redundant state writes) mattered more than dispatch tricks.
  - Partially lazy flags beat both fully eager and fully lazy schemes.
  - Profile-guided superopcodes provided the final stable push past 3000.

Full Benchmarks

Testing environment:

  - iPhone 17 (A19) for all A19 columns.
  - A MacBook with an M3 Max for all M3 columns, including the native baselines.

Startup & Compute Intensive:

| Workload | Podish (A19) | Podish (M3) | iSH (A19) | Podman i386 (QEMU JIT) | QEMU TCI | Native (M3) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| sh -lc true warm start | 20 ms | 20 ms | 30 ms | 30 ms | 30 ms | 10 ms | End-to-end startup cost; Podish uses Podish.Cli run; QEMU uses explicit qemu-i386 -L <rootfs> |
| CoreMark 1.0 | 3447 | 2967 | 1692 | 11456 | 325 | 38087 | |
| python primes.py | 78.3 s | 89.4 s | 684.4 s | 40.9 s | 787.3 s | 1.8 s | Scripting benchmark from kostya/benchmarks |
| luajit primes | 3.1 s | 4.0 s | 46.6 s | 1.7 s | 39.3 s | 0.2 s | |
| luajit -joff | 14.5 s | 17.3 s | 27.5 s | 7.1 s | 152.7 s | 0.7 s | |

File I/O & Mixed Workloads:

| Workload | Podish (A19) | Podish (M3) | iSH (A19) | Podman i386 (QEMU JIT) | Native (M3, arm64) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| grep -R on CoreMark tree | 50 ms | 50 ms | 70 ms | 70 ms | 10 ms | Excluding .git, GNU grep |
| tar cf CoreMark tree | 40 ms | 30 ms | 40 ms | 97 ms | 10 ms | GNU tar |
| tar xf CoreMark tree | 250 ms | 250 ms | 170 ms | 161 ms | 30 ms | GNU tar |
| make compile on CoreMark tree | 9470 ms | 10460 ms | 11430 ms | 7330 ms | 210 ms | make compile; native is single-process make -j1 compile CC=clang |
| git clone CoreMark | 3230 ms | 3300 ms | 5190 ms | 2660 ms | 2980 ms | Compatibility verification only; high network variance |

Both tables above, except where explicitly noted, are medians of 5 runs. QEMU TCI is omitted from the File I/O table (too slow to be meaningful here).

Two “Counter-Intuitive” Data Points

Why is the A19 faster than the M3 Max?

In this specific project, the interpreter is entirely bound by single-thread IPC and memory latency. It can’t leverage the M3 Max’s massive multi-core architecture or 300GB/s bandwidth. CoreMark fits easily in the cache, so the A19’s slight lead is just a reflection of architectural generation differences in single-core latency.

luajit -joff is faster than the JIT on iSH

On iSH, luajit -joff running primes_jit.lua took 27530 ms, while JIT mode took 46650 ms: the JIT path was actually slower. I haven’t deeply studied iSH’s internal implementation, so I can only speculate about the cause. If you’re familiar with iSH’s source, I’d love to hear the real explanation in the comments.

Comparison with Wasm3

As another pure-interpreter reference, I ran wasm3 coremark.wasm five times on the same MacBook; the median score landed in the same range as Podish’s CoreMark results.

In other words, Podish’s pure compute performance is already competitive with Wasm3.


Conclusion

Podish proves one thing: on platforms where JIT is forbidden, a carefully designed, profiled, and hardware-aware interpreter can achieve surprising performance. It’s not a silver bullet — complex applications like Node.js still have significant bottlenecks, and multi-threading cannot leverage multiple cores — but for command-line tools, scripting runtimes, and lightweight compilation tasks, it is already a usable daily tool.

If you’re interested in the project, check out the source on GitHub or try the Podish Web Demo directly in your browser.

