Neural Market Trends |||

Geo Distance in RapidMiner and Python

In my previous post, I showed how you can use the Enrich by Webservice operator and OpenStreetMaps to do reverse geocoding lookups. This post will show how to calculate Geospatial distances between two latitude and longitude points. First using a RapidMiner and then using the GeoPy Python module.

This was a fun because it touched on my civil engineering classes. I used to calculate distances from latitude and longitude in my land surveying classes.

My first step was to select a home” location, which was 1 Penn Plaza, NY NY. Then I downloaded the latest list of earthquakes from the USGS website. The last step was to calculate the distance from home to each earthquake location.

The biggest time suck for me was building all the formulas in RapidMiner’s Generate Attribute (GA) operator. That took about about 15 minutes. Then I had to backcheck the calculations with a website to make sure they matched. RapidMiner excelled in the speed of building and analyzing this process but I did notice the results were a bit off from the GeoPy python process.

There was a variance of about +/- 4km in each distance. This is because I hard coded in the earth’s diameter as 6371000 km for the RapidMiner process, but the diameter of the Earth changes based on your location. This is because the earth isn’t a sphere but more of an ellipsoid and the diameter isn’t uniform. The GeoPy great_circle calculation accounts for this by adjusting the calculation.

For a proof of concept, both work just fine.

There were a few snags in my python code that took me longer to finish and I chalk this up to my novice ability at writing python. I didn’t realize that I had to create a tuple out of the lat/long columns and then use a for loop to iterate over the entire tuple list. But this was something that my friend solved in 5 minutes. Otherwise than that, the python code works well. Here’s the XML of the process:


        <?xml version="1.0" encoding="UTF-8" standalone="no"?>
        <process version="6.5.002">
          <context>
            <input/>
            <output/>
            <macros/>
          </context>
          <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
            <process expanded="true">
              <operator activated="true" class="open_file" compatibility="6.5.002" expanded="true" height="60" name="Open File" width="90" x="45" y="30">
                <parameter key="resource_type" value="URL"/>
                <parameter key="url" value="http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv"/>
                <description align="center" color="transparent" colored="false" width="126">Open Earthquake USGS URL</description>
              </operator>
              <operator activated="true" class="read_csv" compatibility="6.5.002" expanded="true" height="60" name="Read CSV" width="90" x="179" y="30">
                <parameter key="column_separators" value=","/>
                <list key="annotations"/>
                <list key="data_set_meta_data_information"/>
                <description align="center" color="transparent" colored="false" width="126">Read CSV file</description>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="6.5.002" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="latitude|longitude|mag"/>
                <description align="center" color="transparent" colored="false" width="126">Select Magnitude, Lat, and Long</description>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="6.5.002" expanded="true" height="94" name="Filter Examples" width="90" x="447" y="30">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="mag.gt.4"/>
                </list>
                <description align="center" color="transparent" colored="false" width="126">Filter for quakes &amp;gt; mag 4</description>
              </operator>
              <operator activated="true" class="rename" compatibility="6.5.002" expanded="true" height="76" name="Rename" width="90" x="581" y="30">
                <parameter key="old_name" value="latitude"/>
                <parameter key="new_name" value="Latitude"/>
                <list key="rename_additional_attributes">
                  <parameter key="longitude" value="Longitude"/>
                </list>
                <description align="center" color="transparent" colored="false" width="126">Rename columns</description>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="6.5.002" expanded="true" height="76" name="Generate Attributes" width="90" x="715" y="30">
                <list key="function_descriptions">
                  <parameter key="Rad_Lat" value="Latitude*(pi/180)"/>
                  <parameter key="Rad_Long" value="Longitude*(pi/180)"/>
                  <parameter key="Lat_Home" value="40.750938"/> 
                  <parameter key="Long_Home" value="-73.991594"/>
                  <parameter key="Rad_Lat_Home" value="Lat_Home*(pi/180)"/>
                  <parameter key="Rad_Long_Home" value="Long_Home*(pi/180)"/>
                  <parameter key="Rad_Diff_Lat" value="(Latitude-Lat_Home)*(pi/180)"/>
                  <parameter key="Rad_Diff_Long" value="(Longitude-Long_Home)*(pi/180)"/>
                  <parameter key="a" value="(sin(Rad_Diff_Lat/2))^2 + cos(Rad_Lat) * cos(Rad_Lat_Home) * (sin(Rad_Diff_Long/2))^2"/>
                  <parameter key="c" value="2 * atan2(sqrt(a), sqrt(1-a) )"/>
                  <parameter key="distance_km" value="(6371000*c)/1000"/>
                  <parameter key="distance_miles" value="distance_km*0.621371"/>
                </list>
                <description align="center" color="transparent" colored="false" width="126">Make lots of calculations&lt;br/&gt;</description>
              </operator>
              <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
              <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Rename" to_port="example set input"/>
              <connect from_op="Rename" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
            </process>
          </operator>
        </process>

Here’s the python process:



        #!/usr/bin/python
        import pandas as pd
        from geopy.geocoders import Nominatim
        from geopy.distance import great_circle

        geolocator = Nominatim()

        location = geolocator.geocode("1 Penn Plaza, NY, NY")
        home = (location.latitude, location.longitude) #Set Home Location

        earthquake = pd.read_csv('http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv') #Read CSV file

        selection = (earthquake['mag'] >= 4) #Filter for earthquakes > mag 4

        earthquake = earthquake[selection].dropna(how = 'any', subset = ['latitude', 'longitude']).drop(['time', 'depth', 'mag', 'magType', 'nst','gap','dmin', 'rms','net', 'id','updated','place','type'], axis=1)

        earthquake = earthquake.convert_objects(convert_numeric=True)

        earthquake.describe(include='all') #not necessary but I like to see a description of the data I'm pushing downstream

        earthquake['combined'] = zip(earthquake.latitude, earthquake.longitude) #create tuple from pandas dataframe

        print earthquake.combined #double check the list

        print [great_circle(home, (lt,lng)) for (lt,lng) in earthquake.combined] #brackets are a short form of loop
Up next Extracting OpenStreetMap Data in RapidMiner A few weeks ago I wanted to play with the Enrich by Webservice operator. The operator is part of the RapidMiner Web Mining extension and is How I got started brewing beer I was the ripe old age of 23 when I learned how to brew beer. I had just moved to New Mexico to start my first “real” job when I met my first office
Latest posts Democratising Machine learning with H2O — Towards Data Science Getting started with Python datatable | Kaggle Phone Addiction Version 12 Launches Today! Machine Learning Making Pesto Tastier 5 Dangerous Things You Should Let Your Kids Do The Pyschology of Writing TensorFlow and High Level APIs Driving Marketing Performance with H2O Driverless AI Machine Learning and Data Munging in H2O Driverless AI with datatable Making AI Happen Without Getting Fired Latest Musings from a Traveling Sales Engineer The Night before H2O World 2019 Why Forex Trading is Frustrating Functional Programming in Python Automatic Feature Engineering with Driverless AI Ray Dalio's Pure Alpha Fund What's new in Driverless AI? Latest Writings Elsewhere - December 2018 House Buying Guide for Millennials Changing Pinboard Tags with Python Automate Feed Extraction and Posting it to Twitter Flux: A Machine Learning Framework for Julia Getting Started in Data Science Part 2 Makers vs Takers How Passive Investing Saved My Life Startups and Open Source The Process of Writing H2O AI World 2018 in London Ray Dalio's Pure Alpha Fund Isolation Forests in H2O.ai