My current work involves some agent based simulation written in Python. I now wanted teach it some new tricks including running on multiple cores. While my normal approch for multithreading in Python is to just run several simulations in parallel to get some average, I decided it would now make sense to also have single simulations run on multiple cores. One reason for that was to make it scale more easily on the computing cluster, the other was that I assumed it would be quite easy with the agent based stucture.
For the same reasons I decided to use MPI (mpi4py): it seamed like a good fit for an agent based model (agents sending messages around) and it was available on the computing cluster. It turned out that in the end I did not use any mpi code in the agents to keep them more flexible.
Anyway, the simulations is structured in the following way: on each core/process a controller is initalized which handels not only the communication but is also in charge of the agents. This is a good fit because the simulations involves global state as base of the agents decisions.
I decide in the beginning which controller controlls which agents. This might lead to some unused processing time during the sync between the controllers because they might have to wait to get the globals state. But since each agent has a lot of data attatched, I assumed it would not make sense to depatch the agents to a different process every timestep. This would mean to shuffle around big python objects and therefore presumeably slow.
Instead only the global state is shared between the controllers. I am not sure if this is the best solutions since I did not try any other but it turned out to have a nice structure and work quite well.
One problem I sumbled upon was the that I could not gurante that each group of agents would be the same size, because when you want to simulate 100 agents on 16 cores it does not add up. This meant I could not use the normal allgather methods of MPI to collect the state but had to use allgatherv which can deal with the different sizes. Sadly there is no documentation about allgatherv in MPI4PY. With some googling I worked out an example of how it works I wanted to share so here it is:
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
mpi_size = comm.Get_size()
# work_size = 127 # arbitrary prime number
work_size = 100
work = np.zeros(work_size)
base = work_size / mpi_size
leftover = work_size%mpi_size
sizes = np.ones(mpi_size)*base
sizes[:leftover]+=1
offsets = np.zeros(mpi_size)
offsets[1:]=np.cumsum(sizes)[:-1]
start = offsets[rank]
local_size = sizes[rank]
work_local = np.arange(start,start+local_size,dtype=np.float64)
print "local work: {} in rank {}".format(work_local,rank)
comm.Allgatherv(work_local,[work,sizes,offsets,MPI.DOUBLE])
summe = np.empty(1,dtype=np.float64)
comm.Allreduce(np.sum(work_local),summe,op=MPI.SUM)
print "work {} vs {} in rank {}".format(np.sum(work),summe,rank)