Anyway, I'm intimidated by perl (too hard to read, too many non-alphanumeric characters), so I rewrote the script in python. The first run will take a long-ish time, since it downloads all 450+ existing episodes of the program. Subsequent runs will be faster, since the script only has to download new episodes. Enjoy!
By the way, AFAIK, this type of web crawling is completely legal. The content is already streamable from the TAL website; you're just downloading it, er, a little faster than usual.
That said, if you use this script, I'd recommend making a tax-deductible contribution to This American Life -- it's a great program, worthy of support. The "donate" button is in the upper-right corner of the This American Life webpage.
#!/usr/bin/python
# Adapted from: http://www.seanfurukawa.com/?p=246
# Translated from perl to python by Abe Gong
# Dec. 2011
# Written for Python 2 (print statement, file(), urllib.urlopen)
import urllib, glob, datetime

def now():
    """Get the current date and time as a string."""
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def log( S ):
    """Write a line to the log file, and print it for good measure."""
    logfile.write(S + '\n')
    print S

#Start up a log file
logfile = file( 'tal_log.txt', 'a' )

#Load all the episodes that have already been downloaded; keep the filenames in a list
#(the episodes/ folder must already exist, or saving will fail)
episodes = [ f.split('/')[-1] for f in glob.glob('episodes/*.mp3') ]
#print episodes

#As of today (12/11/2011) there are 452 episodes, so a count up to 500 should last a long while.
for i in range(1,500):
    #Choose the appropriate filename
    filename = str(i)+'.mp3'
    #Add the URL prefix
    url = 'http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/'+filename
    #Check to see if the file has already been downloaded
    if filename not in episodes:
        #Log the attempt
        log( now() + '\ttrying\t' + url )
        #Try to download it
        code = urllib.urlopen( url ).getcode()
        if code == 200:
            urllib.urlretrieve( url, filename='episodes/'+filename )
            #Log the result -- success!
            log( now() + '\tsaved\t' + filename )
        else:
            log( now() + '\tfile not found' )
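
If you only have Python 3 around, here is a rough sketch of the same "skip what's already downloaded" idea. It is not a drop-in replacement for the script above: the names missing_episodes, fetch_missing and BASE_URL, and the folder and last_episode parameters, are made up for illustration, logging is left out, and the URL prefix is simply copied from the script, so it may no longer work.

```python
# Python 3 sketch; the script above targets Python 2.
# missing_episodes, fetch_missing and BASE_URL are illustrative names,
# not part of the original script.
import os
import urllib.error
import urllib.request

# URL prefix copied from the script above (assumed to still be valid).
BASE_URL = 'http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/'


def missing_episodes(downloaded, last_episode):
    """Return filenames for episodes 1..last_episode not in `downloaded`."""
    have = set(downloaded)
    wanted = (str(i) + '.mp3' for i in range(1, last_episode + 1))
    return [name for name in wanted if name not in have]


def fetch_missing(folder='episodes', last_episode=500):
    """Download every missing episode into `folder`; return the saved names."""
    # Create the folder on the first run instead of assuming it exists.
    os.makedirs(folder, exist_ok=True)
    downloaded = [f for f in os.listdir(folder) if f.endswith('.mp3')]
    saved = []
    for name in missing_episodes(downloaded, last_episode):
        target = os.path.join(folder, name)
        try:
            urllib.request.urlretrieve(BASE_URL + name, target)
        except urllib.error.URLError:
            # HTTPError (e.g. a 404 for an episode that does not exist yet)
            # is a subclass of URLError; skip it and move on.
            continue
        saved.append(name)
    return saved
```

Calling fetch_missing() does the actual downloading. missing_episodes() is kept separate so you can check what would be fetched without touching the network.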