August 12, 2017

Use RapidMiner to Discover Twitter Content

Welcome to this new tutorial on how to use RapidMiner to discover Twitter content. I created this process as a way to monitor what’s going on in the Twitter universe and see what topics are being tweeted about. Could I do this with Python scripts? Yes, but that would be a big waste of time for me. RapidMiner makes complex ETL and other tasks simple, so I live and breathe it.

Why this process?

Back when I was in Product Marketing, I had to come up with many different blog posts and “collateral” to help push the RapidMiner cause. I monitor what goes on at KDnuggets, Data Science Central, and of course Twitter. I thought it would be fun to extract key terms and subjects from Twitter (and later websites) to see what’s currently popular and help make a bigger “splash” when we publish something new. I’ve since applied this model to my new website Yeast Head to see what beer brewing lifestyle bloggers are posting about. The short end of that discussion is that the terms “#recipes” and “#homebrew_#recipes” are the most popular, so I need to make sure to include some recipes going forward. Interestingly enough, there are a lot of retweets with respect to Homebrewer’s Association, so I’ll be exploiting that for sure.

The Process Design

This process uses RapidMiner’s text processing extension, X-means clustering, association rules, and a bunch of averaged attribute weighting schemes. Since I’m not scoring any incoming tweets to see whether they are important or not (that will be a later task), I didn’t do any classification analysis. I did create a temporary label called “Important/Not Important” based on a simple rule: if Retweets > 10, then the tweet has to be important. (The process shared below reads this threshold from the retweetcount macro, which is set to 5.) This is a problem because I don’t know what the actual retweet threshold is for important (aka viral) tweets, so my attribute weight chart (as above) will be a bit suspect, but it’s a start I suppose.
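
For readers who think in code rather than operators, here is a minimal Python sketch of that temporary labeling rule. It is not part of the RapidMiner process; the function name, the dictionary layout, and the sample tweets are made up for illustration. The default threshold of 5 mirrors the retweetcount macro in the process, and, like the Generate Attributes expression there, a tweet whose retweet count equals the threshold counts as Important.

```python
def label_tweets(tweets, retweet_threshold=5):
    """Attach a temporary Important / Not Important label based on retweet count.

    `tweets` is a list of dicts with a "retweet_count" key (hypothetical layout).
    """
    labeled = []
    for tweet in tweets:
        # Same logic as the process: below the threshold means Not Important.
        if tweet["retweet_count"] < retweet_threshold:
            label = "Not Important"
        else:
            label = "Important"
        labeled.append({**tweet, "label": label})
    return labeled


# Made-up sample data, only to show the rule in action.
sample = [
    {"text": "Intro to #machinelearning", "retweet_count": 2},
    {"text": "New #datascience survey results", "retweet_count": 5},
    {"text": "#ai demo making the rounds", "retweet_count": 40},
]
labeled = label_tweets(sample)
labels = [t["label"] for t in labeled]
```

Changing `retweet_threshold` is the equivalent of editing the macro, which is exactly the knob I’m unsure about.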

The Process

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="82" name="Retrieve Twitter Data" width="90" x="45" y="34">
        <process expanded="true">
          <operator activated="true" class="set_macros" compatibility="7.5.003" expanded="true" height="68" name="Set Macros" width="90" x="45" y="34">
            <list key="macros">
              <parameter key="keyword1" value="#machinelearning"/>
              <parameter key="keyword2" value="#datascience"/>
              <parameter key="keyword3" value="#ai"/>
              <parameter key="date" value="2017.08.08"/>
              <parameter key="retweetcount" value="5"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Set global variables here. Such as keyword search.</description>
          </operator>
          <operator activated="false" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Twitter Content Ideas" width="90" x="179" y="340">
            <parameter key="repository_entry" value="../data/%{keyword1} Twitter Content Ideas"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter for Keyword3" width="90" x="179" y="238">
            <parameter key="connection" value="Twitter - Studio Connection"/>
            <parameter key="query" value="%{keyword3}"/>
            <parameter key="limit" value="3000"/>
            <parameter key="language" value="en"/>
            <parameter key="until" value="%{date} 23:59:59 -0500"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter for Keyword2" width="90" x="179" y="136">
            <parameter key="connection" value="Twitter - Studio Connection"/>
            <parameter key="query" value="%{keyword2}"/>
            <parameter key="limit" value="3000"/>
            <parameter key="language" value="en"/>
            <parameter key="until" value="%{date} 23:59:59 -0500"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter for Keyword 1" width="90" x="179" y="34">
            <parameter key="connection" value="Twitter - Studio Connection"/>
            <parameter key="query" value="%{keyword1}"/>
            <parameter key="limit" value="3000"/>
            <parameter key="language" value="en"/>
            <parameter key="until" value="%{date} 23:59:59 -0500"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="145" name="Append Data Set together" width="90" x="447" y="34"/>
          <operator activated="true" class="remove_duplicates" compatibility="7.5.003" expanded="true" height="103" name="Remove Duplicate IDs" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Id"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store Data for later reuse" width="90" x="715" y="34">
            <parameter key="repository_entry" value="../data/%{keyword1} Twitter Content Ideas"/>
          </operator>
          <connect from_op="Search Twitter for Keyword3" from_port="output" to_op="Append Data Set together" to_port="example set 3"/>
          <connect from_op="Search Twitter for Keyword2" from_port="output" to_op="Append Data Set together" to_port="example set 2"/>
          <connect from_op="Search Twitter for Keyword 1" from_port="output" to_op="Append Data Set together" to_port="example set 1"/>
          <connect from_op="Append Data Set together" from_port="merged set" to_op="Remove Duplicate IDs" to_port="example set input"/>
          <connect from_op="Remove Duplicate IDs" from_port="example set output" to_op="Store Data for later reuse" to_port="input"/>
          <connect from_op="Store Data for later reuse" from_port="through" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Retrieves Twitter Data, Appends, and Stores</description>
      </operator>
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="82" name="ETL Subprocess" width="90" x="179" y="34">
        <process expanded="true">
          <operator activated="true" class="remove_duplicates" compatibility="7.5.003" expanded="true" height="103" name="Remove Duplicates" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="From-User"/>
            <description align="center" color="transparent" colored="false" width="126">Remove Duplicate Tweets from same user</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="7.5.003" expanded="true" height="82" name="Generate Arbitrary Label" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="label" value="if([Retweet-Count]&lt;eval(%{retweetcount}),&quot;Not Important&quot;,&quot;Important&quot;)"/>
            </list>
          </operator>
          <operator activated="false" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
            <parameter key="invert_filter" value="true"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Text.contains.RT"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
            <parameter key="attribute_name" value="label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
            <description align="center" color="transparent" colored="false" width="126">Set Role for Label</description>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Text|label"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="715" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Text"/>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro (3)" width="90" x="849" y="34">
            <parameter key="macro" value="label_count"/>
            <parameter key="macro_type" value="statistics"/>
            <parameter key="statistics" value="count"/>
            <parameter key="attribute_name" value="label"/>
            <parameter key="attribute_value" value="Important"/>
            <list key="additional_macros"/>
          </operator>
          <connect from_port="in 1" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Generate Arbitrary Label" to_port="example set input"/>
          <connect from_op="Generate Arbitrary Label" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Extract Macro (3)" to_port="example set"/>
          <connect from_op="Extract Macro (3)" from_port="example set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Binning for Label subprocess - suspect</description>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Links for later use" width="90" x="45" y="34">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="Tweet Links" value="http.*"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace http links" width="90" x="179" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" .!;:[,' ?]"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="715" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="849" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="983" y="34"/>
          <connect from_port="document" to_op="Extract Links for later use" to_port="document"/>
          <connect from_op="Extract Links for later use" from_port="document" to_op="Replace http links" to_port="document"/>
          <connect from_op="Replace http links" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="447" y="34"/>
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="103" name="Clustering Stuff" width="90" x="581" y="34">
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Remove Tweet Links" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Tweet Links"/>
            <parameter key="attributes" value="Tweet Links"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="x_means" compatibility="7.5.003" expanded="true" height="82" name="X-Means" width="90" x="179" y="34">
            <parameter key="measure_types" value="BregmanDivergences"/>
            <parameter key="divergence" value="SquaredEuclideanDistance"/>
          </operator>
          <operator activated="true" class="extract_prototypes" compatibility="7.5.003" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="313" y="136"/>
          <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store Cluster Model" width="90" x="447" y="34">
            <parameter key="repository_entry" value="../results/%{keyword1} Twitter Content Cluster Model"/>
          </operator>
          <connect from_port="in 1" to_op="Remove Tweet Links" to_port="example set input"/>
          <connect from_op="Remove Tweet Links" from_port="example set output" to_op="X-Means" to_port="example set"/>
          <connect from_op="X-Means" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
          <connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Store Cluster Model" to_port="input"/>
          <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="out 2"/>
          <connect from_op="Store Cluster Model" from_port="through" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store WordList" width="90" x="447" y="289">
        <parameter key="repository_entry" value="../results/%{keyword1} Twitter Content Ideas Wordlist"/>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="289"/>
      <operator activated="true" class="sort" compatibility="7.5.003" expanded="true" height="82" name="Sort" width="90" x="715" y="289">
        <parameter key="attribute_name" value="total"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Remove Tweet Links (2)" width="90" x="581" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Tweet Links"/>
        <parameter key="attributes" value="Tweet Links"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="82" name="Determine Influence Factors" width="90" x="715" y="136">
        <process expanded="true">
          <operator activated="true" class="weight_by_correlation" compatibility="7.5.003" expanded="true" height="82" name="Weight by Correlation" width="90" x="45" y="34"/>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data" width="90" x="179" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;Correlation&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="weight_by_gini_index" compatibility="7.5.003" expanded="true" height="82" name="Weight by Gini Index" width="90" x="45" y="120"/>
          <operator activated="true" class="weight_by_information_gain" compatibility="7.5.003" expanded="true" height="82" name="Weight by Information Gain" width="90" x="45" y="210"/>
          <operator activated="true" class="weight_by_information_gain_ratio" compatibility="7.5.003" expanded="true" height="82" name="Weight by Information Gain Ratio" width="90" x="45" y="300"/>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data (2)" width="90" x="179" y="120"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="313" y="120">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;Gini&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data (3)" width="90" x="179" y="210"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="313" y="210">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;InfoGain&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data (4)" width="90" x="179" y="300"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (5)" width="90" x="313" y="300">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;InfoGainRatio&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="145" name="Append" width="90" x="447" y="30"/>
          <operator activated="true" class="pivot" compatibility="7.5.003" expanded="true" height="82" name="Pivot" width="90" x="581" y="30">
            <parameter key="group_attribute" value="Attribute"/>
            <parameter key="index_attribute" value="Method"/>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="6.5.002" expanded="true" height="82" name="Generate Aggregation" width="90" x="715" y="30">
            <parameter key="attribute_name" value="Importance"/>
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="aggregation_function" value="average"/>
          </operator>
          <operator activated="true" class="normalize" compatibility="7.5.003" expanded="true" height="103" name="Normalize" width="90" x="849" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Importance"/>
            <parameter key="method" value="range transformation"/>
          </operator>
          <operator activated="true" class="sort" compatibility="7.5.003" expanded="true" height="82" name="Sort again" width="90" x="983" y="34">
            <parameter key="attribute_name" value="Importance"/>
            <parameter key="sorting_direction" value="decreasing"/>
          </operator>
          <operator activated="true" class="order_attributes" compatibility="7.5.003" expanded="true" height="82" name="Reorder Attributes" width="90" x="1117" y="34">
            <parameter key="attribute_ordering" value="Attribute|Importance"/>
            <parameter key="handle_unmatched" value="remove"/>
          </operator>
          <operator activated="true" class="filter_example_range" compatibility="7.5.003" expanded="true" height="82" name="Select Top 20" width="90" x="1251" y="34">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="20"/>
          </operator>
          <connect from_port="in 1" to_op="Weight by Correlation" to_port="example set"/>
          <connect from_op="Weight by Correlation" from_port="weights" to_op="Weights to Data" to_port="attribute weights"/>
          <connect from_op="Weight by Correlation" from_port="example set" to_op="Weight by Gini Index" to_port="example set"/>
          <connect from_op="Weights to Data" from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Weight by Gini Index" from_port="weights" to_op="Weights to Data (2)" to_port="attribute weights"/>
          <connect from_op="Weight by Gini Index" from_port="example set" to_op="Weight by Information Gain" to_port="example set"/>
          <connect from_op="Weight by Information Gain" from_port="weights" to_op="Weights to Data (3)" to_port="attribute weights"/>
          <connect from_op="Weight by Information Gain" from_port="example set" to_op="Weight by Information Gain Ratio" to_port="example set"/>
          <connect from_op="Weight by Information Gain Ratio" from_port="weights" to_op="Weights to Data (4)" to_port="attribute weights"/>
          <connect from_op="Weights to Data (2)" from_port="example set" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Weights to Data (3)" from_port="example set" to_op="Generate Attributes (4)" to_port="example set input"/>
          <connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Append" to_port="example set 3"/>
          <connect from_op="Weights to Data (4)" from_port="example set" to_op="Generate Attributes (5)" to_port="example set input"/>
          <connect from_op="Generate Attributes (5)" from_port="example set output" to_op="Append" to_port="example set 4"/>
          <connect from_op="Append" from_port="merged set" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
          <connect from_op="Generate Aggregation" from_port="example set output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="Sort again" to_port="example set input"/>
          <connect from_op="Sort again" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
          <connect from_op="Reorder Attributes" from_port="example set output" to_op="Select Top 20" to_port="example set input"/>
          <connect from_op="Select Top 20" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store Influence Wrds" width="90" x="849" y="136">
        <parameter key="repository_entry" value="../results/%{keyword1} Twitter Content Influence Words"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="7.5.003" expanded="true" height="82" name="Write Important Words" width="90" x="983" y="136">
        <parameter key="excel_file" value="C:\Users\Thomas Ott\Dropbox\Twitter Influencers\%{keyword1} Todays Powerful Words to use in your Tweets.xlsx"/>
      </operator>
      <connect from_op="Retrieve Twitter Data" from_port="out 1" to_op="ETL Subprocess" to_port="in 1"/>
      <connect from_op="ETL Subprocess" from_port="out 1" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Store WordList" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Clustering Stuff" to_port="in 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Remove Tweet Links (2)" to_port="example set input"/>
      <connect from_op="Clustering Stuff" from_port="out 1" to_port="result 1"/>
      <connect from_op="Clustering Stuff" from_port="out 2" to_port="result 2"/>
      <connect from_op="Store WordList" from_port="through" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 4"/>
      <connect from_op="Remove Tweet Links (2)" from_port="example set output" to_op="Determine Influence Factors" to_port="in 1"/>
      <connect from_op="Determine Influence Factors" from_port="out 1" to_op="Store Influence Wrds" to_port="input"/>
      <connect from_op="Store Influence Wrds" from_port="through" to_op="Write Important Words" to_port="input"/>
      <connect from_op="Write Important Words" from_port="through" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="63"/>
      <portSpacing port="sink_result 3" spacing="126"/>
      <portSpacing port="sink_result 4" spacing="84"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>
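
The text preparation inside Process Documents from Data can be sketched roughly in Python as follows. This is an approximation under stated assumptions, not what the Text Processing extension does internally: the stopword list is a tiny stand-in, the token length limits of 4 and 25 are my assumption about the operator defaults, n-grams are limited to bigrams joined with an underscore, and the link pattern `http\S*` is deliberately narrower than the `http.*` used in the process so that only the URL itself is replaced.

```python
import re

# Tiny illustrative stopword list; the real operator uses a full English list.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "have"}


def preprocess(text, min_len=4, max_len=25):
    """Turn one tweet into a list of lowercase terms and bigrams."""
    # Replace links with a placeholder token.
    text = re.sub(r"http\S*", "link", text)
    # Tokenize on the same characters the Tokenize operator is configured with.
    tokens = [t for t in re.split(r"[ .!;:\[,'?\]]+", text) if t]
    tokens = [t.lower() for t in tokens]
    # Keep tokens within the (assumed) length limits.
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    # Add bigrams of neighbouring tokens, joined with an underscore.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    tokens = tokens + bigrams
    # Drop anything that contains the link placeholder, then stopwords.
    tokens = [t for t in tokens if "link" not in t]
    return [t for t in tokens if t not in STOPWORDS]


terms = preprocess("Great #homebrew #recipes for the weekend! https://t.co/abc")
```

Note that hashtags survive as terms because `#` is not one of the split characters, which is how a term like #homebrew_#recipes can show up in the word list.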

For this particular process I shared, I used the Set Macros operator to set the search terms to #machinelearning, #datascience, and #ai. When you run this process over and over, you’ll see some interesting Tweeters emerge.
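
The Determine Influence Factors subprocess in the process above combines four weighting schemes (correlation, Gini index, information gain, and information gain ratio) by averaging them per term, rescaling the average to a 0 to 1 range, and keeping the top 20. Here is a hedged Python sketch of just that combining step. It assumes the per-method weights have already been computed; the function name, the nested dictionary layout, and the sample numbers are invented for illustration.

```python
def average_importance(weights_by_method, top_n=20):
    """Average per-term weights across methods, rescale to 0..1, return the top terms.

    `weights_by_method` maps a method name to a {term: weight} dict (hypothetical layout).
    """
    terms = set()
    for weights in weights_by_method.values():
        terms.update(weights)
    if not terms:
        return []

    # Average each term over the methods that produced a weight for it.
    averaged = {}
    for term in terms:
        values = [w[term] for w in weights_by_method.values() if term in w]
        averaged[term] = sum(values) / len(values)

    # Range transformation: map the smallest average to 0 and the largest to 1.
    lo, hi = min(averaged.values()), max(averaged.values())
    span = hi - lo
    normalized = {t: (v - lo) / span if span else 0.0 for t, v in averaged.items()}

    # Sort by importance, highest first, and keep the top entries.
    ranked = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Made-up weights, only to show the shape of the input and output.
sample_weights = {
    "Correlation": {"#ai": 1.0, "python": 0.5, "data": 0.0},
    "Gini": {"#ai": 0.5, "python": 0.25, "data": 0.0},
    "InfoGain": {"#ai": 0.75, "python": 0.25, "data": 0.0},
    "InfoGainRatio": {"#ai": 0.75, "python": 0.5, "data": 0.0},
}
ranked = average_importance(sample_weights)
```

Averaging several schemes is a cheap way to avoid leaning on any single measure, but it inherits the weakness of the label itself: if the retweet threshold is off, every one of these weights is off with it.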

Next Steps

My next steps are to figure out the actual retweet number that truly indicates whether a tweet is important and viral or not. I might write a one-class auto-labeling process or just hand-label some important and non-important tweets. That will hone the process and let me really figure out what the best number to watch is.

Posted by Thomas Ott
