Save ScraperWiki Data to Local SQLite Database November 9, 2011

Update: https://github.com/christophermanning/scraperwiki_local_python is now available and makes it easier to set up your local ScraperWiki environment.

ScraperWiki is great for aggregating data and making it publicly available. When developing a script locally, you can use https://github.com/onyxfish/fakerwiki to simulate saving to the database, but it doesn't actually write to a local SQLite database.

Luckily, you can rewire your local copy of the Python development library for ScraperWiki to create and save to a local SQLite database in Three Easy Steps™.

First, overwrite scraperlibs/python/scraperwiki/datastore.py with a version that routes save calls to the local datalib module copied in the next step, instead of to the remote ScraperWiki datastore.

Second, copy https://bitbucket.org/ScraperWiki/scraperwiki/src/dd2217221fc3/services/datastore/datalib.py to scraperlibs/python/scraperwiki/datalib.py.

Finally, replace line 7 of scraperlibs/python/scraperwiki/__init__.py with the following two lines, which send the library's log output to unbuffered stdout:

import os
logfd = os.fdopen(1, 'w', 0)

Now, running scraperwiki.sqlite.save() will create a local database at working_directory/script_name/defaultdb.sqlite, with the data stored in the default table swdata.
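For illustration, a throwaway scraper along these lines (the field names and values here are just placeholders, not anything the library requires) will exercise the local setup:

import scraperwiki

# A row is just a dictionary; unique_keys tells ScraperWiki which fields
# identify a record so repeated runs update it rather than duplicate it.
record = {'id': 1, 'city': 'Chicago', 'population': 2695598}
scraperwiki.sqlite.save(unique_keys=['id'], data=record)

After the run, the data should be sitting in the swdata table of defaultdb.sqlite under a directory named after the script.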

If, after setting this up, you notice your script is spending a lot of time writing to the database, check how many times you are calling scraperwiki.sqlite.save. Each call commits the data to SQLite, which has its share of overhead, and if you call it once for each row in a large file, the script will take much longer than necessary.

Instead, you can pass a list of dictionaries to scraperwiki.sqlite.save to significantly cut the time spent saving your data to the database. This approach works great locally, but on ScraperWiki you have to consider how much memory that large list of dictionaries consumes, since resources there are limited. On ScraperWiki it may make sense to call scraperwiki.sqlite.save once per row, or to send many smaller batches of records instead of one large batch that could exhaust your resources and cause a SIGKILL (EXECUTIONSTATUS: [Run was interrupted]) to appear in your run history.
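As a rough sketch (the row structure and batch size below are arbitrary examples, not values ScraperWiki prescribes), the three approaches look like this:

import scraperwiki

rows = [{'id': i, 'value': i * i} for i in range(10000)]  # placeholder data

# Slow: one commit per row.
for row in rows:
    scraperwiki.sqlite.save(unique_keys=['id'], data=row)

# Fast locally: a single commit for the whole list of dictionaries.
scraperwiki.sqlite.save(unique_keys=['id'], data=rows)

# A middle ground for ScraperWiki's limited memory: modest batches.
batch_size = 500  # arbitrary; tune to your data size and memory limits
for start in range(0, len(rows), batch_size):
    scraperwiki.sqlite.save(unique_keys=['id'], data=rows[start:start + batch_size])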
