Extracting OpenStreetMap Data in RapidMiner

A few weeks ago I wanted to play with the Enrich by Webservice operator. The operator is part of the RapidMiner Web Mining extension and is accessible through the Marketplace. I wanted to do reverse lookups based on latitude and longitude. In my searching I came across this post on how to do it using XPath and via Google. That post was most informative and I used it as a starting point for extracting OpenStreetMap data in RapidMiner.

Why OSM? OSM is an open database of geographic (GIS) data and is rich with information. Plus, it’s a bit easier to use than Google.

After a few minutes of tinkering, I was successful. I built a process to go out to the USGS Earthquake site, grab the current CSV, load it, and then do a reverse lookup using the latitude and longitude. The process then creates a column with the country via the XPath //reversegeocode/addressparts/country/text().
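The XPath step can be sketched outside RapidMiner too. The following is a minimal Python sketch, assuming the lookup goes to OpenStreetMap’s Nominatim reverse-geocoding endpoint with XML output (the post doesn’t name the exact URL). The function names and the User-Agent string are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: Nominatim's public reverse-geocoding service.
NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_url(lat, lon):
    """Build a reverse-lookup URL that asks for an XML response."""
    query = urllib.parse.urlencode({"format": "xml", "lat": lat, "lon": lon})
    return f"{NOMINATIM_REVERSE}?{query}"


def extract_country(xml_text):
    """Return the country from a <reversegeocode> document, or None if absent.

    Mirrors the XPath //reversegeocode/addressparts/country/text();
    the path is relative because <reversegeocode> is the root element.
    """
    root = ET.fromstring(xml_text)
    return root.findtext("addressparts/country")


def reverse_country(lat, lon, user_agent="osm-reverse-demo"):
    """Fetch the XML for one coordinate pair and pull out the country."""
    request = urllib.request.Request(
        build_reverse_url(lat, lon),
        headers={"User-Agent": user_agent},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_country(response.read())
```

Calling reverse_country once per earthquake row would give the same country column the process builds. For a public service it is worth pausing between requests rather than looping as fast as possible.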

Here’s what the process looks like:


and the results:

[Image: osmresults]

Extracting OpenStreetMap Data in RapidMiner Process

I exported the example process and zipped it up. You can download it here! Make sure to check out my other Geospatial tutorials in RapidMiner by visiting my Tutorials page!

Geo Distance in RapidMiner and Python

In my previous post, I showed how you can use the Enrich by Webservice operator and OpenStreetMap to do reverse geocoding lookups. This post will show how to calculate Geo Distance in RapidMiner between two latitude and longitude points, first using RapidMiner and then using the GeoPy Python module.

This was fun because it touched on my civil engineering classes. I used to calculate distances from latitude and longitude in my land surveying classes.

My first step was to select a “home” location, which was 1 Penn Plaza, New York, NY. Then I downloaded the latest list of earthquakes from the USGS website. The last step was to calculate the distance from home to each earthquake location.

The biggest time suck for me was building all the formulas in RapidMiner’s Generate Attributes (GA) operator. That took about 15 minutes. Then I had to backcheck the calculations with a website to make sure they matched. RapidMiner excelled in the speed of building and analyzing this process, but I did notice the results were a bit off from the GeoPy Python process.

There was a variance of about +/- 4 km in each distance. This is because I hard-coded the Earth’s radius as 6,371,000 m for the RapidMiner process, but the radius of the Earth changes based on your location: the Earth isn’t a sphere but more of an ellipsoid, so the radius isn’t uniform. GeoPy’s ellipsoidal distance calculation (geodesic in current releases) accounts for this. Note that its great_circle function assumes a sphere, much like a hand-built formula does.
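To illustrate the fixed-radius approach, here is a minimal sketch of a great-circle calculation in Python. It assumes the haversine formula and a mean radius of 6371 km. The post doesn’t spell out which formulas went into the Generate Attributes operator, so treat the function and its names as illustrative.

```python
import math

# Mean Earth radius in km (the same value as the hard-coded 6,371,000 m).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)

    # Central angle times the sphere's radius gives the arc length.
    return 2 * radius_km * math.asin(math.sqrt(a))
```

Passing a different radius_km (for example 6378.137, the WGS84 equatorial radius) scales every distance proportionally, which shows how sensitive the result is to the radius you pick.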

For a proof of concept, both work just fine.

Get the Geo Distance in RapidMiner Process

There were a few snags in my Python code that took me longer to finish, and I chalk this up to my novice ability at writing Python. I didn’t realize that I had to create a tuple out of the lat/long columns and then use a for loop to iterate over the entire tuple list. But this was something that my friend solved in 5 minutes. Other than that, the Python code works well. Here’s the XML of the process:


Here’s the python process:
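As a rough sketch of the tuple-and-loop approach described above (not the original script): it assumes the USGS CSV has latitude and longitude columns, uses an approximate coordinate for the home location, and calls GeoPy’s great_circle. The helper names are hypothetical.

```python
import csv
import io

# Approximate coordinates for 1 Penn Plaza, New York (an assumption for illustration).
HOME = (40.751, -73.993)


def read_points(csv_text):
    """Turn USGS-style CSV text into a list of (latitude, longitude) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["latitude"]), float(row["longitude"])) for row in reader]


def distances_from_home(home, points):
    """Great-circle distance in km from home to each point, using GeoPy."""
    # Imported here so read_points still works where GeoPy isn't installed.
    from geopy.distance import great_circle

    distances = []
    for point in points:  # each point is a (lat, lon) tuple
        distances.append(great_circle(home, point).km)
    return distances
```

With GeoPy installed, distances_from_home(HOME, read_points(text)) returns one distance per earthquake row, in the same order as the CSV.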