Scraping SEPTA with Python
I was trying to get geographic data for bus and trolley routes on SEPTA (Southeastern Pennsylvania Transportation Authority) and came across their data page. I noticed that they host all of the KML files on their site but don’t offer a page or zip file that collects them; rather, they give a URL format ending in the route number or letter: http://www3.septa.org/transitview/kml/17.kml.
I didn’t want to download each file by hand, so I started playing around with the Beautiful Soup library in Python to build a list of bus and trolley routes. I’m a Python novice, so I turned it into a learning exercise and extended the script to download all of the KML files, convert them into separate GeoJSON files with the help of some Node.js modules, and then aggregate those files into one giant GeoJSON file of all bus and trolley routes. I then foolishly loaded that giant file with Leaflet.js; so if you want to see a 10 MB GeoJSON file that will probably crash your browser, click here.
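For the aggregation step, here is a minimal sketch of how the separate GeoJSON files could be combined into one. It is not the original script: it assumes each converted file is a GeoJSON FeatureCollection, and the function names and directory layout are made up for illustration.

```python
import json
from pathlib import Path


def merge_feature_collections(collections):
    """Combine several GeoJSON FeatureCollections into a single one."""
    features = []
    for collection in collections:
        # Assumption: each input is a FeatureCollection with a "features" list.
        features.extend(collection.get("features", []))
    return {"type": "FeatureCollection", "features": features}


def merge_geojson_dir(in_dir, out_file):
    """Merge every *.geojson file in in_dir and write the result to out_file.

    Keep out_file outside in_dir, otherwise a second run would pick up
    the merged file as an input.
    """
    paths = sorted(Path(in_dir).glob("*.geojson"))
    collections = [json.loads(p.read_text(encoding="utf-8")) for p in paths]
    merged = merge_feature_collections(collections)
    Path(out_file).write_text(json.dumps(merged), encoding="utf-8")
    return len(merged["features"])
```

Calling merge_geojson_dir("geojson", "all_routes.geojson") (both paths are hypothetical) would write the combined file and return the number of features in it. Nothing here makes the result any smaller, so the browser warning still applies.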
The script, while basic, does what it should, and leaves you with a directory of KML files you can load in Google Earth. You can even open them all at once without crashing it.
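The download step that produces that directory can be sketched in a few lines. This is a rough illustration rather than the script itself: it uses only the standard library, the URL pattern is the one quoted above, and the function names, the "kml" output directory and the timeout are my own assumptions.

```python
import urllib.request
from pathlib import Path

# URL pattern quoted in the post; only the route number or letter changes.
KML_URL = "http://www3.septa.org/transitview/kml/{route}.kml"


def kml_url(route):
    """Build the KML URL for a route number or letter."""
    return KML_URL.format(route=route)


def download_kml(route, out_dir="kml"):
    """Fetch one route's KML file and save it as <out_dir>/<route>.kml."""
    target = Path(out_dir) / f"{route}.kml"
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(kml_url(route), timeout=30) as response:
        target.write_bytes(response.read())
    return target


# Example use (hits the network, so it is left commented out):
# for route in [17]:  # stand-in for the scraped list of routes
#     download_kml(route)
```

The route list is passed in rather than hard-coded because, in the script, it comes from the Beautiful Soup scrape; the single route 17 in the comment is just a stand-in taken from the example URL.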