RSS to iCloud, part 2

This week Google announced that they are indeed scrapping Google Reader.1 You can imagine that I’m glad I wrote my own RSS solution two weeks ago. In the last two weeks I’ve run into a couple of bugs, so I’ll post an updated and completely rewritten version here.

In my previous post I wrote that there was a potential bug for feeds that don’t set the updated time properly, because we assume that the newest items are parsed first. I found out there is a more insidious issue: if a feed has the updated property on some, but not all, of its items, the ones without an updated time always get ignored. It appears that the best option is to save a list of previously parsed items, which I initially thought would be ‘overkill’.
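
The bookkeeping this requires is just set arithmetic. In miniature, with made-up links, the comparison looks like this:

old = set(['http://example.com/a', 'http://example.com/b'])  # the saved backlog
new = ['http://example.com/b', 'http://example.com/c']       # links just parsed
for link in set(new) - old:
    print link  # only http://example.com/c is reported as new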

The second bug in my previous script was that it saved the wrong last-checked time for ‘bozo’ feeds. In fact, there was another mistake in that regard: my assumption that ‘bozo’ somehow meant that the feed was unavailable or couldn’t be parsed. In my testing, bozo feeds (feeds that are malformed, according to the docs) get parsed perfectly fine. The only useful test seems to be checking whether the parsed feed contains any items, so that’s what I’ll do in the new version.
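
A quick illustration of why the bozo flag isn’t a useful test, using one of my feeds as a guinea pig:

import feedparser

d = feedparser.parse('http://www.leancrew.com/all-this/feed/')
print d.bozo          # 1 if the feed is malformed, 0 otherwise
print len(d.entries)  # the test that actually matters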

The final issue I’ll address is that the previous script ran in a single thread, which made it very slow. I’m hoping that threading will solve this.
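
Stripped to its essentials, the threaded fetch looks something like this, assuming the futures backport for Python 2 that the full script below also uses:

import feedparser
import futures

urls = ['http://www.leancrew.com/all-this/feed/',
        'http://feeds.feedburner.com/drbunsenblog']
with futures.ThreadPoolExecutor(max_workers=10) as e:
    futs = [e.submit(feedparser.parse, url) for url in urls]
for fut in futures.as_completed(futs):
    print len(fut.result().entries)  # each feed was fetched on its own thread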

The outline of the updated system is this:

  1. A .conf file that contains the URLs of a bunch of feeds, and for each feed a list of previously saved links (the backlog).
  2. A Python script that reads the .conf file and returns links to the new items in the feeds.
  3. An AppleScript that imports the links into Safari’s Reading List.

I’ve structured the .conf file so that adding a new feed is as simple as adding a line to a text file. Furthermore, I want to be able to force the script to reload any or all items in a feed by simply deleting previously saved items.2 The first few lines of my .conf file looked like this:

http://www.phdcomics.com/gradfeed_justcomics.php
http://feeds.feedburner.com/drbunsenblog
http://www.leancrew.com/all-this/feed/

After a run of the script, they look like this:

% 2013-03-17, 23:12

http://feeds.feedburner.com/drbunsenblog

http://feeds.feedburner.com/OnlyAModel
+   http://onlyamodel.com/2013/before-and-after-earthquake-photos/
+   http://onlyamodel.com/2013/the-dr-bunsen-custom-notebook/
+   http://onlyamodel.com/2013/responding-to-skepticism-toward-your-model/
+   http://onlyamodel.com/2013/vortex-shedding-around-skyscrapers/
+   http://onlyamodel.com/2013/winter-is-coming/
+   http://onlyamodel.com/2013/commenting-excel-files/
+   http://onlyamodel.com/2013/wordpress-blog-migration-notes/
+   http://onlyamodel.com/2013/live-from-only-a-model-dot-com/
+   http://onlyamodel.com/2013/preventing-instapaper-bankruptsy/
+   http://onlyamodel.com/2013/bay-bridge-cable-walk/

http://feedproxy.google.com/DilbertDailyStrip?format=xml
% -> http://feeds.feedburner.com/DilbertDailyStrip?format=xml
+   http://feedproxy.google.com/~r/DilbertDailyStrip/~3/NvwtMywM7KQ/

The pluses denote new links. They would be minuses for old links, or stars if the parsed feed was empty. Apparently, there were no items in the parsed feed for Seth Brown’s blog Dr. Bunsen, so I’ll keep an eye on what’s going on with that feed over the next few weeks.3 The redirected URL http://feeds.feedburner.com/DilbertDailyStrip?format=xml is saved as a comment: lines in the .conf file that start with a percent sign % get ignored by the script. The time when the script was run is saved as a comment as well.

This is the updated script. Use it like rss2safari.py foo.conf.

#! /usr/local/bin/python
#-*- coding: utf-8 -*-

import feedparser
import sys
import time
import futures  # concurrent.futures backport for Python 2 ('pip install futures')

def parseRSS(feed):
    # Parse one feed and record the links it currently contains
    content = feedparser.parse(feed['url'])
    feed['new'] = [e.link for e in content['entries']]
    try:
        # feedparser reports the final URL after any redirects in 'href'
        feed['url*'] = content['href']
    except KeyError:
        feed['url*'] = feed['url']
    return feed

def checkRSS(argv):
    if len(argv) != 2:
        sys.exit("ERROR: Usage: "+argv[0]+" [filename]")
    conf_file = argv[-1]
    feeds = []
    with open(conf_file,'r') as conf:
        for line in conf:
            # Blank lines and %-comments are ignored
            if line == '\n' or line[0] == '%':
                continue
            # Tab-indented lines belong to the backlog of the preceding feed;
            # the marker character (+, -, *) before the tab is optional
            if line[0] == '\t':
                feeds[-1]['old'].append(line[1:-1])
            elif line[1] == '\t':
                feeds[-1]['old'].append(line[2:-1])
            else:
                feeds.append({'url': line[0:-1], 'old': []})

    # Parse all feeds in parallel; leaving the with block waits for the threads
    with futures.ThreadPoolExecutor(max_workers=10) as e:
        futs = [e.submit(parseRSS,feed) for feed in feeds]

    with open(conf_file,'w') as conf:
        parsed_feeds = []
        conf.write(time.strftime('%% %Y-%m-%d, %H:%M')+'\n')
        for fut in futures.as_completed(futs):
            try:
                feed = fut.result()
            except Exception:
                continue  # written back, unmarked, at the bottom
            parsed_feeds.append(feed['url'])
            conf.write('\n'+feed['url']+'\n')
            if feed['url*'] != feed['url']:
                conf.write('% -> '+feed['url*']+'\n')
            if feed['new'] == []:
                # Empty parse: keep the backlog, but flag it with stars
                for link in feed['old']:
                    conf.write('*\t'+link+'\n')
            else:
                # New links go to stdout (for the AppleScript) and get a plus
                for link in set(feed['new'])-set(feed['old']):
                    print link
                    conf.write('+\t'+link+'\n')
                for link in set(feed['new'])&set(feed['old']):
                    conf.write('-\t'+link+'\n')
        # Feeds whose parse raised an exception are written back (without backlog)
        for url in set([feed['url'] for feed in feeds]) - set(parsed_feeds):
            conf.write(url+'\n')

if __name__ == '__main__':
    checkRSS(sys.argv)

Finally, I’ve updated the AppleScript part as well, mainly to stop myself from accidentally adding those 880 Order of the Stick comics again:

-- Run the Python script and collect its output, one new link per line
set theScriptOutput to (do shell script "~/Dropbox/bin/rss2safari.py ~/Dropbox/bin/rss2safari.conf")
set theItems to paragraphs of theScriptOutput
-- Ask before adding a suspiciously large batch; cancelling the dialog stops the script
if (count of theItems) is greater than 10 then
    display dialog "Trying to add " & (count of theItems) & " items. Are you sure?"
end if
tell application "Safari"
    repeat with theItem in theItems
        add reading list item theItem
    end repeat
end tell

I’m fairly sure that this setup is enough to replace Google Reader for me, but I can’t guarantee that it will work for anyone else. Comments are welcome.


  1. No word about FeedBurner, but I’m guessing it’s not long for this world either. That worries me a little, because a third of the feeds I subscribe to are hosted on FeedBurner. 

  2. Note that this list of previously saved items can be huge. For example, the highly recommended webcomic Order of the Stick has links to all 880 issues in its RSS feed. Saving the backlog for the 31 feeds I subscribe to caused my .conf file to grow from 1 KB to 58 KB on the initial run. If you are worried about disk space, I suggest you stop living in 1992. 

  3. I just saw there is a different feed URL on drbunsen.org, which feedparser doesn’t parse properly either.