Notes from MPUG, July 2013

23 people braved a winter’s night on Monday for the June meeting of the Melbourne Python User Group, which featured four talks, including previews of PyCon AU talks being given later this week.

Andrew Walker previewed part of his PyCon AU talk on “Managing scientific simulations with RQ (Redis Queue)”

Andrew works with scientists in diverse fields who solve large problems; they are typically not Python programmers.

Scientific simulators can vary significantly in complexity and in his work:

  • assume the job is too hard for a single process or machine
  • also assume the job is too small for a supercomputer
  • can’t solve in the cloud
  • assume 20 – 50 cores on 5 – 10 machines

There is a range of good tools for simulations that need parallelism. IPython parallel is the best place to start, and works well as long as you are not calling native functions that crash or leak memory.

There are tools that aren’t traditionally used for scientific simulation and can be very hard for scientists to set up, configure, use, monitor:

  • celery
  • Parallel Python
  • pymq
  • snakeMQ

There are some good talks from the last two years of PyCon US that are worth watching for more information on these, but Andrew wanted something you could get running easily in an afternoon, which led him to Redis and RQ:

  • Lets you associate a data structure with a key
  • Server is written in C
  • Python bindings (as well as other languages)
  • Python example code is very straightforward
  • Demo on one machine with 4 queues
  • Can set worker priorities
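The basic usage pattern can be sketched in a few lines. This is a hedged example, not code from the talk: the `simulate` function and the queue name are made up, and it assumes the `redis` and `rq` packages are installed with a Redis server running on localhost.

```python
def simulate(parameter):
    """A stand-in for one simulation task. It must live in a module that
    workers can import, so RQ can locate it by name."""
    return parameter ** 2


def submit(n_jobs):
    """Enqueue jobs (requires a running Redis server, so it is not
    executed here)."""
    from redis import Redis
    from rq import Queue

    # A named queue; workers listen on queues in priority order, e.g.
    # started with:  rq worker high default low
    queue = Queue("default", connection=Redis())
    return [queue.enqueue(simulate, i) for i in range(n_jobs)]
```

Each `enqueue()` call returns a job object whose result can be polled later; any machine running an `rq worker` process against the same Redis server will pick jobs up.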

There was an example with the travelling salesman problem, starting with a naive iteration through all possible tours, then improving it with an approximation that finds the smallest solution from a set of random possibilities, which is faster. You can scale this up by running the approximation as parallel tasks: take the best result from each queue, then amalgamate them to get the best overall answer. He showed us how to do this with RQ.
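The idea can be sketched without any queueing machinery; this is my own illustration of the approach described (exact brute force versus several independent randomised searches whose results are amalgamated), not Andrew’s code. Each `random_search` call is the kind of task you would hand to an RQ worker.

```python
import itertools
import math
import random


def tour_length(points, order):
    """Total length of a closed tour visiting points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force(points):
    """Naive exact search: try every possible tour (O(n!))."""
    n = len(points)
    return min(
        tour_length(points, (0,) + p)
        for p in itertools.permutations(range(1, n))
    )


def random_search(points, tries, seed):
    """Approximate search: best of `tries` random tours.
    One such task would be enqueued per worker."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    best = math.inf
    for _ in range(tries):
        rng.shuffle(order)
        best = min(best, tour_length(points, order))
    return best


# Amalgamate: each parallel task returns its best tour length;
# take the overall minimum across tasks.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
exact = brute_force(points)
approx = min(random_search(points, 200, seed) for seed in range(4))
```

The approximation can only ever do as well as the exact answer, but it scales: adding more workers just means more seeds searched in the same wall-clock time.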

RQ comes with an inbuilt monitoring tool for checking on queues, worker nodes, failures, etc., all via a web interface. The user still has to ensure the code is robust, distribute the source code (e.g. via NFS) and use data structures that are serialisable. Potential issues include memory consumption, the cost of spinning up new Python instances, and cleanly killing workers.

Ed Schofield and Chris Boesch tag-teamed for another PyCon AU preview on “Big data analytics in Python”

This talk focused on using Python in a number of big data domains, including denoising, optimisation, image interpolation, prediction, compression, clustering, classification and anomaly detection. There were plenty of interesting applications using libraries such as scipy, scikit-learn, PyTables, pandas, Hadoop and PySpark.

For me, one of the interesting parts was the discussion on scaling with MapReduce. Some random notes:

  • Be aware of deadlocks, race conditions, etc
  • pandas: great tool once you get data into it and can generate schemas on the fly?
  • Can access EC2 instances with IPython notebooks via a web browser or SSH – cool for quick jobs
  • Keeping data gets cheaper every year
  • Music Machinery blog posting
    • 20M songs, 300GB data
    • wanted to analyse songs with fastest beats per minute
    • 1M songs in 20 minutes in Python?
  • Configure cloud images once, snapshot, can then deploy as many times as needed (and tweak)
    • Amazon EC2: can choose # cores, amount of RAM, storage, etc
    • Pick the hardware you need, when you need it
    • “On demand” pricing
  • MapReduce
    • robust, builds in a lot of parallelism: build it for 2 cores, can then run on 200 cores
    • if a task fails, it can get re-run
    • MapReduce is idempotent, can chase down slow processes
    • need to shard data, map partitions, combine (optional), shuffle & sort, reduce
    • map() -> do something in parallel on every item (eg filter)
    • reduce() -> do something on all of the maps to produce a single result
    • mrjob: on PyPI, very nice! Can run:
      • single thread
      • multiple local threads
      • EMR (Amazon Elastic MapReduce) on the Amazon cloud: up to 20 instances with up to 16 cores per instance
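The shard / map / shuffle & sort / reduce steps above can be sketched in plain stdlib Python (this is an illustrative word-count sketch, not mrjob itself; a real framework would fan the map and reduce phases out across workers):

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """map(): emit (key, value) pairs independently for one record."""
    return [(word.lower(), 1) for word in line.split()]


def reducer(key, values):
    """reduce(): combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    # Map phase: each record is processed independently, so this map()
    # call is the part a framework would run in parallel.
    mapped = map(mapper, records)

    # Shuffle & sort: group every emitted value by its key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)

    # Reduce phase: one reducer call per key, also parallelisable.
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


counts = map_reduce(["big data big ideas", "data beats ideas"])
```

Because each mapper sees only its own record and each reducer only its own key, the same code structure runs unchanged on 2 cores or 200, which is the “build it for 2 cores, run it on 200” point from the talk.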

Noon Silk was defeated by HDMI technologies so spoke briefly (and without his laptop) about PhoneGap, which is not Python based, but still worth mentioning as it took 2 hours to get an app running for iPhone and/or Android. Noon also discussed Okular, the KDE PDF (and other format) file viewer, and manipulating the XML output of embedded annotations (I missed the rest of this as I was organising the arrival of pizzas!)

Chris Boesch spoke again about “Making learning more fun”, covering SingPath, a web-based tool for (originally) learning Python. It now covers Python, Java, Ruby, JavaScript and R, and uses a range of pedagogical concepts including gamification, adaptive difficulty, “drag-n-drop” (which helps with reading Python) and quests to teach programming.

It has been used for lots of programming tournaments over the last few years, with:

  • Collaborative learning
  • Fun rounds (everyone can participate)
  • Prize rounds
  • Pair programming tournaments (you code half, your buddy codes half)

SingPath runs in Google App Engine, is freely available (on GitHub) and has been built by students for students.
