Out of curiosity, I rewrote their prime counter example to use a sieve instead of being a silly maximally-computation-dense example.
To make it work with taichi, I had to change the declaration of the sieve from sieve = [0] * N to sieve = ti.field(ti.i8, shape=N) but the rest of the code remained the same.
Ordinary Python:
time elapsed: 0.444s
Taichi ignoring compile time, I believe:
time elapsed: 0.119s
A slightly more realistic example than the 10x+ improvement they show on the really toy example with results that aren't too bad. I'd take a 3x improvement for tiny changes. Pretty neat!
(I tried some other trivial things like using np.int8 and it was slower. One can obviously make this a ton faster but I was interested in seeing how the toy was if we just made it slightly more memory-bound).
A negative was that throwing list comprehensions in made the python version faster - about 0.3 seconds - (and shorter and arguably more "pythonic") and simultaneously broke the port to taichi.
I tried implementing the same and I am getting 500ms vs 20ms with wrong answer in the first call in taichi but correct in subsequent calls. I guess I found some bug in taichi: https://imgur.com/a/lpK2iVF
Could you share your code as well.
N = 1000000
isnotprime = [0] * N
def count_primes(n: int) -> int:
count = 0
for k in range(2, n):
if isnotprime[k] == 0:
count += 1
for l in range(2, n // k):
isnotprime[l * k] = 1
return count
import taichi as ti
ti.init(arch=ti.cpu)
isnotprime = ti.field(ti.i8, shape=(N, ))
@ti.kernel
def count_primes(n: ti.i32) -> int:
count = 0
for k in range(2, n):
if isnotprime[k] == 0:
count += 1
for l in range(2, n // k):
isnotprime[l * k] = 1
return count
import time
import math
import numpy as np
N = 1000000
SN = math.floor(math.sqrt(N))
sieve = [False] * N
def init_sieve():
for i in range(2, SN):
if not sieve[i]:
k = i*2
while k < N:
sieve[k] = True
k += i
def count_primes(n: int) -> int:
return (N-2) - sum(sieve)
start = time.perf_counter()
init_sieve()
print(f"Number of primes: {count_primes(N)}")
print(f"time elapsed: {time.perf_counter() - start}/s")
Taichi:
import taichi as ti
import time
import math
ti.init(arch=ti.cpu)
N = 1000000
SN = math.floor(math.sqrt(N))
sieve = ti.field(ti.i8, shape=N)
@ti.kernel
def init_sieve():
for i in range(2, SN):
if sieve[i] == 0:
k = i*2
while k < N:
sieve[k] = 1
k += i
@ti.kernel
def count_primes(n: int) -> int:
count = 0
for i in range(2, N):
if (sieve[i] == 0):
count += 1
return count
start = time.perf_counter()
init_sieve()
print(f"Number of primes: {count_primes(N)}")
print(f"time elapsed: {time.perf_counter() - start}/s")
(The difference of using 0 vs False is tiny; I had just been poking at the python code to think about how I'd make it more pythonic and see if that made it worse to do taichi)
Might be worthwhile to run the same code with an appropriate numba decorator. My guess is that you'd get at least as much speed up but without having to change the sieve declaration, but I'm not sure.
To make it work with taichi, I had to change the declaration of the sieve from sieve = [0] * N to sieve = ti.field(ti.i8, shape=N) but the rest of the code remained the same.
Ordinary Python:
time elapsed: 0.444s
Taichi ignoring compile time, I believe:
time elapsed: 0.119s
A slightly more realistic example than the 10x+ improvement they show on the really toy example with results that aren't too bad. I'd take a 3x improvement for tiny changes. Pretty neat!
(I tried some other trivial things like using np.int8 and it was slower. One can obviously make this a ton faster but I was interested in seeing how the toy was if we just made it slightly more memory-bound).
A negative was that throwing list comprehensions in made the python version faster - about 0.3 seconds - (and shorter and arguably more "pythonic") and simultaneously broke the port to taichi.