I have a script that pulls in data from the net and does some work on it before storing it in the Datastore. The script is kicked off by cron every now and then, but it was running into the 30-second request time limit set by App Engine. The obvious solution was to split the work into smaller pieces and hand those to the Task Queue.
However, after reading the documentation a few times I was more confused than before I started. The Task Queue API itself is actually great, but someone should revise that documentation... Above all, it is not at all clear how the different configuration pieces relate to each other, and the examples in the documentation are over-complicated, IMHO.
Here is Task Queue explained for Dummies (like myself):
First of all, there is the default queue, and in addition you can create your own queues using the queue.yaml config file. The reason to create your own queue would be that you are not happy with the default queue's execution rate of 5 tasks per second. Let's start by using just the default queue, and later on we will expand with creating our own queue too.
In this example the original script was doing work for seven days, and we will split it into seven smaller tasks:
- When using the default queue, we do not need to create any queue.yaml file at all.
- To start with, we need a URL that cron, or yourself, can use to kick off the whole affair, in app.yaml add for example:
- url: /update
  script: scripts/all-to-q.py
  login: admin
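For completeness: if you let cron kick off /update, the matching cron.yaml entry could look like the sketch below (the description and the every-24-hours schedule are just examples; pick whatever interval suits your data source):

```yaml
cron:
- description: kick off the seven per-day tasks
  url: /update
  schedule: every 24 hours
```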
- Now, create the all-to-q.py file in the scripts directory with content like:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import logging
from google.appengine.api.labs import taskqueue

for i in range(7):
    taskqueue.add(url='/one-day', params={'dayI': i}, countdown=i)
    logging.info('Adding day ' + str(i) + ' to the Task Queue.')
The countdown parameter adds a little delay for each new task before it is executed.
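To make that concrete, here is a plain-Python illustration (not App Engine code) of the rule countdown implies: a task added with countdown=i becomes eligible to run roughly i seconds after the add() call, so our seven tasks are staggered one second apart.

```python
# Plain-Python sketch (not App Engine code): countdown=i means task i's
# earliest execution time (its ETA) is the enqueue time plus i seconds.
import datetime

def eta_for(enqueue_time, countdown):
    # Mirrors the Task Queue rule: ETA = time of add() + countdown seconds.
    return enqueue_time + datetime.timedelta(seconds=countdown)

enqueued = datetime.datetime(2009, 9, 1, 12, 0, 0)
etas = [eta_for(enqueued, i) for i in range(7)]
# etas[0] is the enqueue time itself; etas[6] is six seconds later.
```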
- Now, go back to the app.yaml file and add that new URL you need for each task:
- url: /one-day
  script: scripts/one-day.py
  login: admin
Simple, isn't it?
- And now the essential parts of the one-day.py file; mainly those that will pick up the POST parameters (here just one called 'dayI'):
import wsgiref.handlers
from google.appengine.ext import webapp

class OneDay(webapp.RequestHandler):
    def post(self):
        i = int(self.request.get('dayI'))
        # ... and here you get your hands dirty; use i and do the work.

def main():
    application = webapp.WSGIApplication([
        (r'/one-day', OneDay),
    ], debug=True)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
    main()
...and that's it.
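As for "the work" itself, here is a hypothetical sketch in plain Python. The function name and the one-day-per-task split are my assumptions for illustration, not from the original script; the point is simply that dayI tells each task which slice of the week it owns.

```python
# Hypothetical sketch: use the dayI parameter to compute which calendar
# day this task is responsible for. Task 0 handles today, task 1
# yesterday, and so on back through the seven days.
import datetime

def day_for_task(day_index, today=None):
    if today is None:
        today = datetime.date.today()
    return today - datetime.timedelta(days=day_index)

# Inside OneDay.post() you might then do:
#     day = day_for_task(i)
#     ...fetch and store the data for that day...
```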
- I don't understand why the official documentation could not explain something this simple...; I believe my example above makes it fairly clear how the execution logic flows.
PS. Note that I also included some logging above; it is really useful... Expand on it yourself.
Now, let's say we do not want to overload the sites we pull data from, so we will create our own queue and use that instead of the default queue. All we need to do is:
- Create that queue.yaml file, with for example:
queue:
- name: one-full-day
  rate: 1/s
  bucket_size: 1
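For intuition on what rate and bucket_size mean together: the queue throttles like a token bucket. Here is a simplified plain-Python sketch of the idea (the real scheduler is more involved; this is just the model, not App Engine code):

```python
# Simplified token-bucket model of queue throttling: tokens refill at
# 'rate' per second up to 'bucket_size', and starting a task costs one
# token. With rate 1/s and bucket_size 1, at most one task starts per
# second and there is no burst capacity.
class TokenBucket(object):
    def __init__(self, rate, bucket_size):
        self.rate = float(rate)           # tokens added per second
        self.capacity = float(bucket_size)
        self.tokens = float(bucket_size)  # bucket starts full
        self.last = 0.0                   # time of the previous check

    def allow(self, now):
        # Refill tokens for the elapsed time, capped at capacity,
        # then try to spend one token to start a task.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```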
- Now, in order to use that queue, change one single line in all-to-q.py so it reads:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import logging
from google.appengine.api.labs import taskqueue

for i in range(7):
    taskqueue.Task(url='/one-day', params={'dayI': i}, countdown=i).add(queue_name='one-full-day')
    logging.info('Adding day ' + str(i) + ' to the Task Queue.')
Done. (The taskqueue.Task line above may wrap in the blog layout, but it is a single line of code.)
Wasn't that easy...