NewsXperiment.com - Tech Stuff - Episode 1
August 10th, 2008 | by Ozgur Cem Sen
As stupid as it looks, and although it “does NOT make any sense” from many angles, NewsXperiment involves a few interesting software technologies and paradigms.
The NewsXperiment project consists of two parts: the NewsXperiment Scrambler Engine (NSE) and the web frontend.
The NewsXperiment Scrambler Engine runs offline: it gathers, processes, and scrambles news items, then outputs a zip file of pickled scrambled news items.
Once executed, NSE goes through its categorized feed repository and retrieves the feeds, thanks to Mark Pilgrim’s excellent “feedparser” library.
Now that the feeds are read, the engine performs the following:
- randomly picks a certain number of news items from each category as base feeds.
- randomly associates a certain number of scrambler feeds to each base feed.
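The two random picks above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the function name `pick_feeds`, its parameters, and the choice to draw scramblers from every category are all assumptions made for the example.

```python
import random

def pick_feeds(items_by_category, n_base, n_scramblers, rng=None):
    """Pick base items per category and attach random scramblers to each.

    Returns {category: [(base_item, [scrambler_item, ...]), ...]}.
    All names here are hypothetical; the real engine may differ.
    """
    rng = rng or random.Random()
    # Assumption: scramblers may come from any category.
    all_items = [item for items in items_by_category.values() for item in items]
    picks = {}
    for category, items in items_by_category.items():
        bases = rng.sample(items, min(n_base, len(items)))
        picks[category] = []
        for base in bases:
            # A base item is never used to scramble itself.
            candidates = [item for item in all_items if item != base]
            scramblers = rng.sample(candidates, min(n_scramblers, len(candidates)))
            picks[category].append((base, scramblers))
    return picks

# Tiny made-up repository: two categories with three headlines each.
feeds = {"tech": ["t1", "t2", "t3"], "sports": ["s1", "s2", "s3"]}
picks = pick_feeds(feeds, n_base=2, n_scramblers=3, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a run repeatable, which is handy when you want to inspect the same selection twice.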
At this point, the engine has the initial data in place. Here comes the scrambling… However, before anything is scrambled, all the entries picked for scrambling need to be tagged, chunked, and chinked.
- Using NLTK, all the titles and summaries that were read are tagged, chunked, and chinked. (I love this part.)
- According to the chunkie, chinkie data, each base feed item’s title and summary are scrambled with the set that was destined to be the scrambler for that base. Of course, this does not always result in a well-constructed sentence.
- At some point, the scrambling process is complete and it is time to generate the output file.
- The output file is created from the scrambled items, and consists of a list of titles, summaries, and links back to the news items that were used to create them. This file is a pickle dump of dictionary elements.
- The output file is datestamped and zipped. Zip file because, doh!, it’s compressed. Plus, I couldn’t find a way to upload the raw pickle content to Google AppEngine. Very likely a MIME type issue, but I didn’t dig deep into that. A zipped pickle dump was all I needed, and I had it.
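The last two steps (pickle dump, then a datestamped zip) could look something like the sketch below. It is written for Python 3, and the file naming scheme, the dictionary keys, and the function name `build_zipped_pickle` are assumptions for illustration only.

```python
import datetime
import io
import pickle
import zipfile

def build_zipped_pickle(items, day=None):
    """Return (zip_name, zip_bytes) holding one pickle dump of `items`.

    The "news_YYYYMMDD" naming is a made-up convention for this sketch.
    """
    day = day or datetime.date.today()
    stamp = day.strftime("%Y%m%d")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        # One archive member containing the pickled list of dictionaries.
        zf.writestr("news_%s.pickle" % stamp, pickle.dumps(items))
    return "news_%s.zip" % stamp, buffer.getvalue()

# Hypothetical scrambled item: a title, a summary, and links to the sources.
items = [{
    "title": "Scrambled headline",
    "summary": "Scrambled summary",
    "links": ["http://example.com/a", "http://example.com/b"],
}]
zip_name, zip_bytes = build_zipped_pickle(items, day=datetime.date(2008, 8, 10))

# Writing the bytes to disk is then a one-liner:
# open(zip_name, "wb").write(zip_bytes)
```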
Very well, I have the zipped pickles. What do I do with them? If I cannot get them up to Google AppEngine’s datastore, how could I possibly share them?
I came across two major issues at this point.
- Getting the file up to AppEngine hosting.
  - There is no means of uploading a file to the Google AppEngine file system other than appcfg.py. So be it. The way around it? Copy the zipped output file into the application directory and push it out to AppEngine with appcfg. That’s why my app’s revision number is 1.781. Since only the changed files are uploaded, that’s not a big concern to me at this point.
  - After uploading the application to Google, I have my file sitting there nicely, waiting to be accessed and processed. But how?
- Being able to access it, read it, and process it (get ’em into the AppEngine datastore).
  - Again, I had no means of accessing the file locally. However, there was nothing stopping me from doing a “urlfetch”. Hint, hint…
  - urlfetch was the way to go, but how in the “nebula” was I going to uncompress the thing I had just fetched?!
Here is a code snippet of how I did it.
import pickle
import StringIO
import zipfile
from google.appengine.api import urlfetch

feed_dict = []  # despite the name, a list of unpickled items
result = urlfetch.fetch(picklepath, payload=None, method=urlfetch.GET, headers={'Content-Type': 'application/zip'}, allow_truncated=False)
if result.status_code == 200:
    # wrap the fetched bytes in a file-like object so zipfile can read them
    output = StringIO.StringIO(result.content)
    zf = zipfile.ZipFile(output, 'r')  # 'r' is the documented read mode
    # assume we can read multiple pickles from the zipped pickle
    for name in zf.namelist():
        feed_dict.append(pickle.loads(zf.read(name)))
*WordPress gave me quite a bit of grief over the indentation. grrr…
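For readers on Python 3, the same unzip-in-memory trick can be tried without AppEngine at all. The sketch below is an assumption-laden stand-in, not the site's code: `load_zipped_pickles` is a hypothetical helper, and the "fetched" bytes are built locally instead of coming from urlfetch.

```python
import io
import pickle
import zipfile

def load_zipped_pickles(zip_bytes):
    """Unpickle every member of a zip archive held entirely in memory."""
    loaded = []
    # BytesIO gives zipfile the seekable file-like object it expects.
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for name in zf.namelist():
            loaded.append(pickle.loads(zf.read(name)))
    return loaded

# Stand-in for the fetched response body: a zip holding two pickles.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.pickle", pickle.dumps({"title": "first"}))
    zf.writestr("b.pickle", pickle.dumps({"title": "second"}))

feed_items = load_zipped_pickles(buffer.getvalue())
```

Keep in mind that unpickling runs whatever the pickle tells it to, so this is only safe for files you produced yourself, as is the case here.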
Yay! Perfect. I have everything I need. Not really!!! That data has to go into the actual datastore to become a part of the web. During each run, NewsXperiment generates about 5 news items per category. Each news item is usually scrambled with 8 other items, meaning 9 links including itself. And this happens for 12 categories. 5 * 9 * 12 = 540. That is 540 things to be inserted into the Google AppEngine datastore for each batch.
Folks, this is not as easy as it sounds. I know, it’s not even funny, but the reality turned out to be this: I was not able to simply loop through my 12 categories and insert 5 news items with 9 links each in a single shot. Sad but true. I’ve seen so many of the “dreaded” “Deadline Exceeded” errors that I thought Google would just freeze my account at some point. They didn’t, at least not yet.
This is where I almost gave up on this fabulous project.
Luckily, I read something about spreading the load across multiple requests with “redirects”. So I did. The biggest challenge I faced at this stage was:
- How could I store my state in the stateless www realm?
- “memcache” it was. I dumped my fetched and unzipped pickle data into the memcache and started my “redirect” hopping.
- Each redirect process inserts about 30 items into AppEngine datastore. Each insertion removes the inserted item from the “memcache”, and this goes on until nothing is left in the memcache.
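The redirect hopping can be simulated in plain Python to see how the numbers work out. Everything below is a stand-in: the `cache` dictionary plays the role of memcache, the `datastore` list plays the role of the AppEngine datastore, and `handle_request` is a hypothetical handler, so none of it is the real AppEngine API.

```python
BATCH_SIZE = 30   # "about 30 items" per redirect, as described above

cache = {}        # stand-in for memcache (a plain dict, not the real API)
datastore = []    # stand-in for the AppEngine datastore

def handle_request():
    """Handle one request: insert up to BATCH_SIZE pending items.

    Returns True if items remain, meaning the handler should redirect
    to itself for another hop.
    """
    pending = cache.get("pending", [])
    batch, rest = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
    datastore.extend(batch)    # the "insert" step
    cache["pending"] = rest    # inserted items leave the cache
    return bool(rest)

# 5 items * 9 links * 12 categories = 540 things to insert.
cache["pending"] = list(range(5 * 9 * 12))

requests_made = 1
while handle_request():
    requests_made += 1   # each extra iteration plays the role of one redirect
```

With 540 items and 30 per request, the work is spread over 18 short requests instead of one long one. A real memcache may evict entries at any time, so a production version would want a fallback for a missing key.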
Phew… Everything finally made it to the datastore.
Now the visitors can visit the site, and laugh their bottoms off to “the news of the hour”.
The next article will focus on the frontend development of NewsXperiment.com: Django templates, the YUI Library, Flickr photo fetching, etc.
I hope this article sheds some light on some of the funky problems that anyone may experience with Google AppEngine. Also, wishful thinking, there might be a few folks who actually like NewsXperiment.com and become active voters, promoters, story creators, perhaps even developers of the project.
I conclude this article here, hoping you made it all the way down. *fast scrolling doesn’t count.
Side Note: The NewsXperiment Scrambler Engine can run as a “cron” job or periodic process, but I still like to run it manually and take a look at the generated titles before I let them out into the world.













2 Responses to “NewsXperiment.com - Tech Stuff - Episode 1”
By Ozgur Cem Sen on Aug 11, 2008
The AppEngine Group thread about this post is here; http://groups.google.com/group/google-appengine/browse_thread/thread/814ea56ae15e9b97
By Arpee on Aug 11, 2008
such a cool technology for something so irrelevant,yet totally cool. what else are you up to these days cem?