Tag: Tutorials


Extract Ernest Hemingway Quotes from Goodreads

Here's a fast and simple process to extract Ernest Hemingway quotes from Goodreads. The process isn't finished yet: I still need to loop over each quote and add one day to the %{now} macro. The goal is to write each quote out in Markdown with a %{now}+1 day date and auto-schedule the posts on my other website (thomasott.io).

Right now Goodreads.com's page structure is easy to parse, but I suspect they'll make it harder to scrape one day.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
            <list key="attribute_values">
              <parameter key="get_date" value="date_now()"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="date_to_nominal" compatibility="8.1.000" expanded="true" height="82" name="Date to Nominal" width="90" x="179" y="34">
            <parameter key="attribute_name" value="get_date"/>
            <parameter key="date_format" value="yyyy-MM-dd"/>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="8.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="34">
            <parameter key="macro" value="now"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="get_date"/>
            <parameter key="example_index" value="1"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="447" y="34">
            <parameter key="url" value="https://www.goodreads.com/work/quotes/2459084-a-moveable-feast"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <description align="center" color="transparent" colored="false" width="126">Read Moveable Feast Quotes</description>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="581" y="34">
            <list key="string_machting_queries">
              <parameter key="MoveableFeastQuotes" value="<div class=&quot;quoteText&quot;>.</div>"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="<div class =&quot;quoteText&quot;>" value="</div>"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
            <process expanded="true">
              <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34"/>
              <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="179" y="34">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries">
                  <parameter key="QuoteText" value="&quot;.&quot;"/>
                </list>
                <list key="regular_expression_queries">
                  <parameter key="QuoteText" value="\&quot;.*\&quot;"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
              </operator>
              <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
                <parameter key="text_attribute" value="ExtractedText"/>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="8.1.000" expanded="true" height="68" name="Extract Quote" width="90" x="447" y="34">
                <parameter key="macro" value="Quote"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="ExtractedText"/>
                <parameter key="example_index" value="1"/>
                <list key="additional_macros"/>
              </operator>
              <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="648" y="34">
                <parameter key="text" value="Title: Hemingway Quote for %{now}&#10;Date: %{now}&#10;&#10;#Hemingway Quote for %{now}&#10;&#10;%{Quote}"/>
              </operator>
              <connect from_port="segment" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Documents to Data" from_port="example set" to_op="Extract Quote" to_port="example set"/>
              <connect from_op="Create Document" from_port="output" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Date to Nominal" to_port="example set input"/>
          <connect from_op="Date to Nominal" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
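
For reference, here's roughly what the same idea looks like in plain Python, including the date-offset loop the RapidMiner process still lacks. Treat it as a sketch: the URL is the one from the process above, but the requests/BeautifulSoup approach and the file naming are my own assumptions, not part of the process.

    # Rough Python analogue of the process above -- not part of the RapidMiner XML.
    # Assumes the requests and beautifulsoup4 packages are installed.
    import re
    from datetime import date, timedelta

    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.goodreads.com/work/quotes/2459084-a-moveable-feast"
    soup = BeautifulSoup(requests.get(URL, timeout=30).text, "html.parser")

    quotes = []
    for div in soup.select("div.quoteText"):     # the same div Cut Document targets
        text = div.get_text(" ", strip=True)
        m = re.search(r'".*"', text)             # same idea as the \".*\" regex above
        quotes.append(m.group(0) if m else text)

    # The "to do" part: one Markdown post per quote, scheduled one day apart.
    for offset, quote in enumerate(quotes, start=1):
        post_date = date.today() + timedelta(days=offset)
        body = (f"Title: Hemingway Quote for {post_date}\n"
                f"Date: {post_date}\n\n"
                f"#Hemingway Quote for {post_date}\n\n{quote}\n")
        with open(f"hemingway-quote-{post_date}.md", "w", encoding="utf-8") as fh:
            fh.write(body)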


Use RapidMiner to Auto-label a Twitter Training Set

I've been struggling with how to separate the signal from noise in Twitter data. There's great content and sentiment there but it's buried by nonsense tweets and noise. How do you find that true signal within the noise?

This question wracked my brain until I solved it with a one-class SVM application in RapidMiner.

Autolabeling a Training Set

If you read and use my RapidMiner Twitter Content process from here, you'll have noticed that the process didn't include any labels. The labels were something left "to do" at the end of the tutorial, and I spent a few days thinking about how to go about it. My first method was to label tweets based on Retweets and the second was to label them based on Binning. Both of these methods are easy, but they didn't solve the problem at hand.

Labeling based on Retweets

With this method I created a label class of "Important" and "Not Important" based on the number of retweets a post had. This was the simplest way to cut the training set into two classes, but I had to choose an arbitrary Retweet value. What was the right number of Retweets? If you look at the tweets surrounding #machinelearning, #ai, and #datascience, you'll notice that a large share of the retweets come from a small handful of 'Twitterati'. Not to pick on @KirkDBorne, but when he Tweets something, bots and people Retweet it like crazy.

A large percentage of the tweets he sends link back to content that's been posted or generated elsewhere. He happens to have a large following that Retweets his stuff like crazy. His Retweet counts can run into the hundreds, so does that mean those Tweets are 'Important', or just a lot of noise? If a Tweet only has 10 Retweets but it's a great link, does that mean it's "Not Important"? So what's the right number of retweets? One? Ten? One hundred? There was no good answer here because I didn't know what the right number was.
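
In code, this kind of threshold labeling is basically a one-liner. Here's a quick pandas sketch of it; the file name, column names, and the cutoff of 10 are placeholders, not something from my actual process.

    import pandas as pd

    tweets = pd.read_csv("tweets.csv")   # hypothetical export with a Retweet-Count column

    # Arbitrary cutoff -- the whole problem is that nobody knows the right number.
    CUTOFF = 10
    tweets["label"] = (tweets["Retweet-Count"] >= CUTOFF).map(
        {True: "Important", False: "Not Important"})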

Labeling based on Binning

My next thought was to bin the tweets based on their Retweets into two buckets. Bucket one would be "Not Important" and bucket two would be "Important." When I did this, I started getting a distribution that looked better. It wasn't until I examined the buckets that I realized this method glossed over a lot of good tweets.

In essence I was repeating the same mistakes as labeling based on Retweets. So if I trained a model on this, I'd still get shit.
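
For completeness, the binning version looks something like this in pandas (again a sketch with made-up names). Two equal-width buckets on the retweet count have the same blind spot, because the bucket edges are still driven purely by Retweets.

    import pandas as pd

    tweets = pd.read_csv("tweets.csv")   # hypothetical export

    # Two equal-width bins on Retweet-Count, then name the buckets.
    tweets["label"] = pd.cut(tweets["Retweet-Count"], bins=2,
                             labels=["Not Important", "Important"])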

Labeling based on a One Class SVM

After trying the above two methods, I realized there was no easy way to do it. I wanted to find a lazy way of autolabeling but soon came back to what really matters: the training set.

The power and accuracy of any classification model depends on how good its training set is. Never overlook this!

The solution was to use a One-Class SVM process in RapidMiner. I would get a set of 100 to 200 Tweets, read through them, and then ONLY label the 'Important' ones. What counted as 'Important'? Any Tweet I thought would be interesting to me and my followers.

After I marked the Important Tweets, I imported that data set into RapidMiner and built my process. The process is simple.

The top process branch loads the hand-labeled data, does some Text Processing on it, and feeds it into an SVM whose type is set to one-class. Now the following is important!

To use a One-Class SVM in RapidMiner, you have to train it on only one class, 'Important' being that class here. When you apply the model to out-of-sample (OOS) data, it generates an 'inside' or 'outside' prediction with confidence values. These values show how confidently the new data point falls inside the 'Important' class (meaning it's Important) or outside of it. I end up renaming the 'inside' and 'outside' predictions to 'Important' and 'Not Important'.

The bottom process branch takes the OOS data, text-processes it, and applies the model for prediction. At the end I do some cleanup where I merge the Tweets back together so I can feed the result into my Twitter Content model, find my Important words, and actually build a classification model now!

Within a few seconds, I had an autolabeled data set! YAH!
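
If you'd rather see the idea in code than in RapidMiner XML, here's a stripped-down scikit-learn sketch of the same workflow. It's my rough translation, not the process itself: the file names are placeholders, and a TF-IDF vectorizer with pruning-like min_df/max_df settings stands in for the Process Documents operators.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    # Hypothetical inputs: the hand-labeled tweets and a fresh out-of-sample pull.
    labeled = pd.read_excel("tempfile.xlsx")
    important = labeled.loc[labeled["label"] == "Important", "Text"]
    oos = pd.read_csv("new_tweets.csv")

    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english",
                                 ngram_range=(1, 2), min_df=0.05, max_df=0.5)
    X_train = vectorizer.fit_transform(important)        # train ONLY on the one class
    X_oos = vectorizer.transform(oos["Text"])

    svm = OneClassSVM(kernel="rbf", gamma=0.001, nu=0.5)  # same gamma guess as in the text
    svm.fit(X_train)

    # +1 means "inside" the Important class, -1 means "outside" -- rename them.
    oos["prediction(label)"] = pd.Series(svm.predict(X_oos)).map(
        {1: "Important", -1: "Not Important"}).values

If you want a confidence-like score instead of a hard inside/outside call, decision_function on the fitted OneClassSVM gives you the signed distance to the boundary.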

Caution

While this process is a GREAT first start, there is more work to do. For example, I selected an RBF kernel and a gamma of 0.001 as a starting point. This was a guess, and I need to put together an optimization process to help me figure out the right parameters to use to get a better autolabeling model. I'm also interested in using @mschmitz_'s LIME operator to help me understand the potential outliers when using this autolabeling method.
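
Building on the sketch above, the optimization could be as simple as a small grid search. This is a placeholder of my own, not a finished process: it holds out some of the hand-labeled 'Important' tweets and uses a random slice of unlabeled tweets as a rough stand-in for the 'outside' class.

    from itertools import product

    from sklearn.model_selection import train_test_split
    from sklearn.svm import OneClassSVM

    train_txt, holdout_txt = train_test_split(important, test_size=0.3, random_state=42)
    X_tr = vectorizer.fit_transform(train_txt)
    X_ho = vectorizer.transform(holdout_txt)
    # Rough proxy for "outside" examples (assumes oos has at least 200 rows).
    X_rand = vectorizer.transform(oos["Text"].sample(200, random_state=42))

    best = None
    for gamma, nu in product([1e-4, 1e-3, 1e-2, 1e-1], [0.1, 0.3, 0.5]):
        model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X_tr)
        inside_holdout = (model.predict(X_ho) == 1).mean()   # want this high
        inside_random = (model.predict(X_rand) == 1).mean()  # want this low(ish)
        score = inside_holdout - inside_random
        if best is None or score > best[0]:
            best = (score, gamma, nu)

    print("best (score, gamma, nu):", best)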

The Process

As I noted above, this process is a 'work in progress', so use it with caution. It's a great blueprint because applying One-Class SVMs in RapidMiner is easy but sometimes confusing.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.6.001" expanded="true" height="68" name="Open File" width="90" x="45" y="136">
        <parameter key="resource_type" value="URL"/>
        <parameter key="url" value="http://www.neuralmarkettrends.com/public/2017/tempfile.xlsx"/>
      </operator>
      <operator activated="true" class="read_excel" compatibility="7.6.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="238">
        <parameter key="connection" value="Twitter - Studio Connection"/>
        <parameter key="query" value="machinelearning &amp;&amp; ai &amp;&amp; datascience"/>
        <parameter key="result_type" value="recent"/>
        <parameter key="limit" value="3000"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select label and text 2" width="90" x="179" y="238">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text|label|Retweet-Count"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="289"/>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text 2" width="90" x="447" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter missing labels" width="90" x="179" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="label.is_not_missing."/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select label and text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text|label|Retweet-Count"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Label" width="90" x="581" y="34">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="34">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="Text" value="2.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="45" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" .!;:[,"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="715" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="849" y="34"/>
          <connect from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
          <connect from_op="Replace Tokens (2)" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Missing Att" width="90" x="849" y="34">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="support_vector_machine_libsvm" compatibility="7.6.001" expanded="true" height="82" name="SVM" width="90" x="983" y="34">
        <parameter key="svm_type" value="one-class"/>
        <parameter key="gamma" value="0.001"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="715" y="187">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="Text" value="2.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens (3)" width="90" x="45" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" .!;:[,"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (2)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="715" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="849" y="34"/>
          <connect from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
          <connect from_op="Replace Tokens (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Missing att 2" width="90" x="849" y="187">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="1117" y="34">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="join" compatibility="7.6.001" expanded="true" height="82" name="Join" width="90" x="1184" y="289">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Final Set" width="90" x="1318" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="prediction(label)|Text|Retweet-Count"/>
      </operator>
      <operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store" width="90" x="1452" y="289">
        <parameter key="repository_entry" value="../data/Twitter Content Enriched Data Set"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read Excel" to_port="file"/>
      <connect from_op="Read Excel" from_port="output" to_op="Filter missing labels" to_port="example set input"/>
      <connect from_op="Search Twitter" from_port="output" to_op="Select label and text 2" to_port="example set input"/>
      <connect from_op="Select label and text 2" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Nominal to Text 2" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
      <connect from_op="Nominal to Text 2" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Filter missing labels" from_port="example set output" to_op="Select label and text" to_port="example set input"/>
      <connect from_op="Select label and text" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Label" to_port="example set input"/>
      <connect from_op="Set Label" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Missing Att" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Filter Missing Att" from_port="example set output" to_op="SVM" to_port="training set"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Filter Missing att 2" to_port="example set input"/>
      <connect from_op="Filter Missing att 2" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Join" to_port="left"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 1"/>
      <connect from_op="Join" from_port="join" to_op="Select Final Set" to_port="example set input"/>
      <connect from_op="Select Final Set" from_port="example set output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


Use RapidMiner to Discover Twitter Content

Welcome to this new tutorial on how to use RapidMiner to discover Twitter content. I created this process as a way to monitor what's going on in the Twitter universe and see what topics are being tweeted about. Could I do this in Python scripts? Yes, but that would be a big waste of time for me. RapidMiner makes complex ETL and modeling tasks simple, so I live and breathe it.

Why this process?

Back when I was in Product Marketing, I had to come up with many different blog posts and 'collateral' to help push the RapidMiner cause. I monitored what went on at KDnuggets, Data Science Central, and of course Twitter. I thought it would be fun to extract key terms and subjects from Twitter (and later websites) to see what's currently popular and help make a 'bigger splash' when we published something new. I've since applied this model to my new website Yeast Head to see what beer-brewing lifestyle bloggers are posting about. The short version of that discussion is that the terms '#recipes' and '#homebrew_#recipes' are the most popular, so I need to make sure to include some recipes going forward. Interestingly enough, there are a lot of retweets with respect to the Homebrewer's Association, so I'll be exploiting that for sure.

The Process Design

This process utilizes RapidMiner's Text Processing extension, X-Means clustering, association rules, and a bunch of averaged attribute-weighting schemes. Since I'm not scoring any incoming tweets yet (that will be a later task) to see whether new tweets are important or not, I didn't do any classification analysis. I did create a temporary label called "Important/Not Important" based on a simple rule: if Retweets > 10, then the tweet has to be important. This is a problem because I don't know what the actual retweet threshold is for important (aka viral) tweets, and my attribute-weight chart will be a bit suspect, but it's a start I suppose.
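
To make the "averaged attribute weighting" idea concrete, here's roughly what the Determine Influence Factors subprocess in the XML below boils down to, translated into a scikit-learn sketch of my own (file and column names are made up): score each term against the temporary label with a couple of weighting methods, normalize and average the scores into one Importance value, and keep the top 20 terms.

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import chi2, mutual_info_classif

    tweets = pd.read_csv("tweets.csv")   # hypothetical export with Text and Retweet-Count
    # The temporary label: Retweets > 10 means "Important".
    y = np.where(tweets["Retweet-Count"] > 10, "Important", "Not Important")

    vec = TfidfVectorizer(stop_words="english", min_df=0.05, max_df=0.5)
    X = vec.fit_transform(tweets["Text"])

    # Two weighting schemes standing in for the four used in the subprocess.
    chi2_scores, _ = chi2(X, y)
    mi_scores = mutual_info_classif(X, y, discrete_features=True)

    weights = pd.DataFrame({"term": vec.get_feature_names_out(),
                            "chi2": chi2_scores, "mutual_info": mi_scores})
    for col in ("chi2", "mutual_info"):   # range-normalize each method to 0-1
        weights[col] = (weights[col] - weights[col].min()) / (weights[col].max() - weights[col].min())
    weights["Importance"] = weights[["chi2", "mutual_info"]].mean(axis=1)

    print(weights.nlargest(20, "Importance")[["term", "Importance"]])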

The Process

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="82" name="Retrieve Twitter Data" width="90" x="45" y="34">
        <process expanded="true">
          <operator activated="true" class="set_macros" compatibility="7.5.003" expanded="true" height="68" name="Set Macros" width="90" x="45" y="34">
            <list key="macros">
              <parameter key="keyword1" value="#machinelearning"/>
              <parameter key="keyword2" value="#datascience"/>
              <parameter key="keyword3" value="#ai"/>
              <parameter key="date" value="2017.08.08"/>
              <parameter key="retweetcount" value="5"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Set global variables here. Such as keyword search.</description>
          </operator>
          <operator activated="false" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Twitter Content Ideas" width="90" x="179" y="340">
            <parameter key="repository_entry" value="../data/%{keyword1} Twitter Content Ideas"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter for Keyword3" width="90" x="179" y="238">
            <parameter key="connection" value="Twitter - Studio Connection"/>
            <parameter key="query" value="%{keyword3}"/>
            <parameter key="limit" value="3000"/>
            <parameter key="language" value="en"/>
            <parameter key="until" value="%{date} 23:59:59 -0500"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter for Keyword2" width="90" x="179" y="136">
            <parameter key="connection" value="Twitter - Studio Connection"/>
            <parameter key="query" value="%{keyword2}"/>
            <parameter key="limit" value="3000"/>
            <parameter key="language" value="en"/>
            <parameter key="until" value="%{date} 23:59:59 -0500"/>
          </operator>
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter for Keyword 1" width="90" x="179" y="34">
            <parameter key="connection" value="Twitter - Studio Connection"/>
            <parameter key="query" value="%{keyword1}"/>
            <parameter key="limit" value="3000"/>
            <parameter key="language" value="en"/>
            <parameter key="until" value="%{date} 23:59:59 -0500"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="145" name="Append Data Set together" width="90" x="447" y="34"/>
          <operator activated="true" class="remove_duplicates" compatibility="7.5.003" expanded="true" height="103" name="Remove Duplicate IDs" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Id"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store Data for later reuse" width="90" x="715" y="34">
            <parameter key="repository_entry" value="../data/%{keyword1} Twitter Content Ideas"/>
          </operator>
          <connect from_op="Search Twitter for Keyword3" from_port="output" to_op="Append Data Set together" to_port="example set 3"/>
          <connect from_op="Search Twitter for Keyword2" from_port="output" to_op="Append Data Set together" to_port="example set 2"/>
          <connect from_op="Search Twitter for Keyword 1" from_port="output" to_op="Append Data Set together" to_port="example set 1"/>
          <connect from_op="Append Data Set together" from_port="merged set" to_op="Remove Duplicate IDs" to_port="example set input"/>
          <connect from_op="Remove Duplicate IDs" from_port="example set output" to_op="Store Data for later reuse" to_port="input"/>
          <connect from_op="Store Data for later reuse" from_port="through" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Retrieves Twitter Data, Appends, and Stores</description>
      </operator>
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="82" name="ETL Subprocess" width="90" x="179" y="34">
        <process expanded="true">
          <operator activated="true" class="remove_duplicates" compatibility="7.5.003" expanded="true" height="103" name="Remove Duplicates" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="From-User"/>
            <description align="center" color="transparent" colored="false" width="126">Remove Duplicate Tweets from same user</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="7.5.003" expanded="true" height="82" name="Generate Arbitrary Label" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="label" value="if([Retweet-Count]&lt;eval(%{retweetcount}),&quot;Not Important&quot;,&quot;Important&quot;)"/>
            </list>
          </operator>
          <operator activated="false" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
            <parameter key="invert_filter" value="true"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Text.contains.RT"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
            <parameter key="attribute_name" value="label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
            <description align="center" color="transparent" colored="false" width="126">Set Role for Label</description>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Text|label"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="715" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Text"/>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro (3)" width="90" x="849" y="34">
            <parameter key="macro" value="label_count"/>
            <parameter key="macro_type" value="statistics"/>
            <parameter key="statistics" value="count"/>
            <parameter key="attribute_name" value="label"/>
            <parameter key="attribute_value" value="Important"/>
            <list key="additional_macros"/>
          </operator>
          <connect from_port="in 1" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Generate Arbitrary Label" to_port="example set input"/>
          <connect from_op="Generate Arbitrary Label" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Extract Macro (3)" to_port="example set"/>
          <connect from_op="Extract Macro (3)" from_port="example set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Binning for Label subprocess - suspect</description>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Links for later use" width="90" x="45" y="34">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="Tweet Links" value="http.*"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace http links" width="90" x="179" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" .!;:[,' ?]"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="715" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="849" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="983" y="34"/>
          <connect from_port="document" to_op="Extract Links for later use" to_port="document"/>
          <connect from_op="Extract Links for later use" from_port="document" to_op="Replace http links" to_port="document"/>
          <connect from_op="Replace http links" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="447" y="34"/>
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="103" name="Clustering Stuff" width="90" x="581" y="34">
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Remove Tweet Links" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Tweet Links"/>
            <parameter key="attributes" value="Tweet Links"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="x_means" compatibility="7.5.003" expanded="true" height="82" name="X-Means" width="90" x="179" y="34">
            <parameter key="measure_types" value="BregmanDivergences"/>
            <parameter key="divergence" value="SquaredEuclideanDistance"/>
          </operator>
          <operator activated="true" class="extract_prototypes" compatibility="7.5.003" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="313" y="136"/>
          <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store Cluster Model" width="90" x="447" y="34">
            <parameter key="repository_entry" value="../results/%{keyword1} Twitter Content Cluster Model"/>
          </operator>
          <connect from_port="in 1" to_op="Remove Tweet Links" to_port="example set input"/>
          <connect from_op="Remove Tweet Links" from_port="example set output" to_op="X-Means" to_port="example set"/>
          <connect from_op="X-Means" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
          <connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Store Cluster Model" to_port="input"/>
          <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="out 2"/>
          <connect from_op="Store Cluster Model" from_port="through" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store WordList" width="90" x="447" y="289">
        <parameter key="repository_entry" value="../results/%{keyword1} Twitter Content Ideas Wordlist"/>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="289"/>
      <operator activated="true" class="sort" compatibility="7.5.003" expanded="true" height="82" name="Sort" width="90" x="715" y="289">
        <parameter key="attribute_name" value="total"/>
        <parameter key="sorting_direction" value="decreasing"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Remove Tweet Links (2)" width="90" x="581" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Tweet Links"/>
        <parameter key="attributes" value="Tweet Links"/>
        <parameter key="invert_selection" value="true"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="7.5.003" expanded="true" height="82" name="Determine Influence Factors" width="90" x="715" y="136">
        <process expanded="true">
          <operator activated="true" class="weight_by_correlation" compatibility="7.5.003" expanded="true" height="82" name="Weight by Correlation" width="90" x="45" y="34"/>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data" width="90" x="179" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;Correlation&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="weight_by_gini_index" compatibility="7.5.003" expanded="true" height="82" name="Weight by Gini Index" width="90" x="45" y="120"/>
          <operator activated="true" class="weight_by_information_gain" compatibility="7.5.003" expanded="true" height="82" name="Weight by Information Gain" width="90" x="45" y="210"/>
          <operator activated="true" class="weight_by_information_gain_ratio" compatibility="7.5.003" expanded="true" height="82" name="Weight by Information Gain Ratio" width="90" x="45" y="300"/>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data (2)" width="90" x="179" y="120"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="313" y="120">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;Gini&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data (3)" width="90" x="179" y="210"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="313" y="210">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;InfoGain&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="weights_to_data" compatibility="7.5.003" expanded="true" height="68" name="Weights to Data (4)" width="90" x="179" y="300"/>
          <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (5)" width="90" x="313" y="300">
            <list key="function_descriptions">
              <parameter key="Method" value="&quot;InfoGainRatio&quot;"/>
            </list>
          </operator>
          <operator activated="true" class="append" compatibility="7.5.003" expanded="true" height="145" name="Append" width="90" x="447" y="30"/>
          <operator activated="true" class="pivot" compatibility="7.5.003" expanded="true" height="82" name="Pivot" width="90" x="581" y="30">
            <parameter key="group_attribute" value="Attribute"/>
            <parameter key="index_attribute" value="Method"/>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="6.5.002" expanded="true" height="82" name="Generate Aggregation" width="90" x="715" y="30">
            <parameter key="attribute_name" value="Importance"/>
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="aggregation_function" value="average"/>
          </operator>
          <operator activated="true" class="normalize" compatibility="7.5.003" expanded="true" height="103" name="Normalize" width="90" x="849" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Importance"/>
            <parameter key="method" value="range transformation"/>
          </operator>
          <operator activated="true" class="sort" compatibility="7.5.003" expanded="true" height="82" name="Sort again" width="90" x="983" y="34">
            <parameter key="attribute_name" value="Importance"/>
            <parameter key="sorting_direction" value="decreasing"/>
          </operator>
          <operator activated="true" class="order_attributes" compatibility="7.5.003" expanded="true" height="82" name="Reorder Attributes" width="90" x="1117" y="34">
            <parameter key="attribute_ordering" value="Attribute|Importance"/>
            <parameter key="handle_unmatched" value="remove"/>
          </operator>
          <operator activated="true" class="filter_example_range" compatibility="7.5.003" expanded="true" height="82" name="Select Top 20" width="90" x="1251" y="34">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="20"/>
          </operator>
          <connect from_port="in 1" to_op="Weight by Correlation" to_port="example set"/>
          <connect from_op="Weight by Correlation" from_port="weights" to_op="Weights to Data" to_port="attribute weights"/>
          <connect from_op="Weight by Correlation" from_port="example set" to_op="Weight by Gini Index" to_port="example set"/>
          <connect from_op="Weights to Data" from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Weight by Gini Index" from_port="weights" to_op="Weights to Data (2)" to_port="attribute weights"/>
          <connect from_op="Weight by Gini Index" from_port="example set" to_op="Weight by Information Gain" to_port="example set"/>
          <connect from_op="Weight by Information Gain" from_port="weights" to_op="Weights to Data (3)" to_port="attribute weights"/>
          <connect from_op="Weight by Information Gain" from_port="example set" to_op="Weight by Information Gain Ratio" to_port="example set"/>
          <connect from_op="Weight by Information Gain Ratio" from_port="weights" to_op="Weights to Data (4)" to_port="attribute weights"/>
          <connect from_op="Weights to Data (2)" from_port="example set" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Weights to Data (3)" from_port="example set" to_op="Generate Attributes (4)" to_port="example set input"/>
          <connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Append" to_port="example set 3"/>
          <connect from_op="Weights to Data (4)" from_port="example set" to_op="Generate Attributes (5)" to_port="example set input"/>
          <connect from_op="Generate Attributes (5)" from_port="example set output" to_op="Append" to_port="example set 4"/>
          <connect from_op="Append" from_port="merged set" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
          <connect from_op="Generate Aggregation" from_port="example set output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="Sort again" to_port="example set input"/>
          <connect from_op="Sort again" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
          <connect from_op="Reorder Attributes" from_port="example set output" to_op="Select Top 20" to_port="example set input"/>
          <connect from_op="Select Top 20" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="7.5.003" expanded="true" height="68" name="Store Influence Wrds" width="90" x="849" y="136">
        <parameter key="repository_entry" value="../results/%{keyword1} Twitter Content Influence Words"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="7.5.003" expanded="true" height="82" name="Write Important Words" width="90" x="983" y="136">
        <parameter key="excel_file" value="C:\Users\Thomas Ott\Dropbox\Twitter Influencers\%{keyword1} Todays Powerful Words to use in your Tweets.xlsx"/>
      </operator>
      <connect from_op="Retrieve Twitter Data" from_port="out 1" to_op="ETL Subprocess" to_port="in 1"/>
      <connect from_op="ETL Subprocess" from_port="out 1" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Store WordList" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Clustering Stuff" to_port="in 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Remove Tweet Links (2)" to_port="example set input"/>
      <connect from_op="Clustering Stuff" from_port="out 1" to_port="result 1"/>
      <connect from_op="Clustering Stuff" from_port="out 2" to_port="result 2"/>
      <connect from_op="Store WordList" from_port="through" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_port="result 4"/>
      <connect from_op="Remove Tweet Links (2)" from_port="example set output" to_op="Determine Influence Factors" to_port="in 1"/>
      <connect from_op="Determine Influence Factors" from_port="out 1" to_op="Store Influence Wrds" to_port="input"/>
      <connect from_op="Store Influence Wrds" from_port="through" to_op="Write Important Words" to_port="input"/>
      <connect from_op="Write Important Words" from_port="through" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="63"/>
      <portSpacing port="sink_result 3" spacing="126"/>
      <portSpacing port="sink_result 4" spacing="84"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

For this particular process I shared, I used a Macro to set the search terms to #machinelearning, #datascience, and #ai. When you run this process over and over, you'll see some interesting Tweeters emerge.

Next Steps

My next steps are to figure out the actual retweet number that truly indicates whether a tweet is important and viral or not. I might write a one-class auto-labeling process or just hand-label some important and non-important tweets. That will narrow down the process and let me really figure out what the best number actually is.


Mashing Up Julia Language with RapidMiner

If you want to execute any Python in RapidMiner, you have to use the Execute Python operator. This operator makes things so simple that people use the hell out of it. However, it wasn’t so simple in the “old days.” It could still be done, but it required more effort, and that’s what I did with the Julia Language. I mashed up the Julia Language with RapidMiner with only a few extra steps.

The way we mashed up other programs and languages in the old days was to use the Execute Program operator. That operator lets you execute arbitrary programs from within a RapidMiner process. Want to kick off some Java at run time? You could do it. Want to use Python? You could do that (and I did) too!
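
To give you a taste of what that looks like with Python, here's a minimal stand-in script of my own (not the Julia example below): Execute Program sends whatever arrives at its "in" port to the program's standard input, so the script just reads stdin and writes a file wherever you point it.

    # read_stdin.py -- hypothetical Python counterpart to the read.jl script below.
    # In Execute Program the command would be something like "python read_stdin.py".
    import csv
    import sys

    rows = list(csv.reader(sys.stdin))   # the CSV piped in from Write CSV

    # Write the data back out in the operator's working directory.
    with open("output.csv", "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)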

The best part? You can still use this operator today, and that’s what I did with Julia. Mind you, this tutorial is a simple proof of concept. I didn’t do anything fancy, but it works.

What I did was take a RapidMiner sample data set (Golf) and pass it to a Julia script that writes it out as a CSV file. I save the CSV file to a working directory defined by the Julia script.

Tutorial Processes

A few prerequisites: you’ll need RapidMiner and Julia installed. Make sure your Julia path is set correctly in your environment variables. I had some trouble with this in Windows, but it worked fine after I fixed it.

Below you’ll find the XML for the RapidMiner process and the simple Julia script. I named the script read.jl and call it from my Dropbox folder; you’ll need to adjust the path on your computer.

The RapidMiner Process

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="7.4.000" expanded="true" height="82" name="Write CSV" width="90" x="246" y="34">
        <parameter key="csv_file" value="read"/>
        <parameter key="column_separator" value=","/>
      </operator>
      <operator activated="true" class="productivity:execute_program" compatibility="7.4.000" expanded="true" height="103" name="Execute Program" width="90" x="447" y="34">
        <parameter key="command" value="julia read.jl"/>
        <parameter key="working_directory" value="C:\Users\ThomasOtt\Dropbox\Julia"/>
        <list key="env_variables"/>
      </operator>
      <connect from_op="Retrieve Golf" from_port="output" to_op="Write CSV" to_port="input"/>
      <connect from_op="Write CSV" from_port="file" to_op="Execute Program" to_port="in"/>
      <connect from_op="Execute Program" from_port="out" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The Julia Language script

using DataFrames

# Read the CSV that RapidMiner pipes to this script's standard input.
df = readtable(STDIN)

# Write it back out as output.csv in the working directory.
writetable("output.csv", df, separator = ',', header = true)

Note: You’ll need to run Pkg.add("DataFrames") in Julia first. (readtable and writetable come from the older DataFrames API; newer Julia versions use the CSV package for this instead.)

Of course, the next step is to write a more refined Julia script, pass the data back INTO RapidMiner, and then continue processing it downstream.


Latest Writings Elsewhere for September 2016

Just a quick list of the content I've created someplace other than this blog. The current list is 100% RapidMiner-related, but I'd like to branch out into guest posting. If any of my readers would like a guest post from me on their blog or site, contact me (tom %at% neuralmarkettrends %dot% com).
