September 9, 2017

Use RapidMiner to Auto Label a Twitter Training Set

I’ve been struggling with how to separate the signal from noise in Twitter data. There’s great content and sentiment there but it’s buried by nonsense tweets and noise. How do you find that true signal within the noise?

This question wracked my brain until I solved it with a one-class SVM application in RapidMiner.

Autolabeling a Training Set

If you read and used my RapidMiner Twitter Content process from here, you would’ve noticed that the process didn’t include any labels. The labels were something “left to do” at the end of the tutorial, and I spent a few days thinking about how to go about it. My first method was to label tweets based on Retweets and the second method was to label tweets based on Binning. Both of these methods are easy but they didn’t solve the problem at hand.

Labeling based on Retweets

With this method I created a label class of “Important” and “Not Important” based on the number of retweets a post had. This was the simplest way to cut the training set into two classes but I had to choose an arbitrary Retweet value. What was the right number of Retweets? If you look at the tweets surrounding #machinelearning, #ai, and #datascience you’ll notice that a large number of retweets come from a small handful of ‘Twitterati’. Not to pick on @KirkDBorne but when he Tweets something, bots and people Retweet it like crazy.

A large percentage of the tweets he sends link back to content that’s been posted or generated elsewhere. He happens to have a large following that Retweets his stuff like crazy. His Retweets can range in the hundreds, so does this mean those Tweets are ‘Important’ or a lot of noise? If some Tweet only has 10 Retweets but it’s a great link, does that mean it’s ‘Not Important’? So what’s the right number of retweets? One? Ten? One hundred? There was no good answer here because I didn’t know what the right number was.
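To make the problem concrete, here is a minimal Python sketch of threshold labeling. The tweets and retweet counts are made up for illustration, and `label_by_retweets` is a hypothetical helper, not part of the RapidMiner process. It shows how the set of ‘Important’ tweets swings with whatever cutoff you happen to pick.

```python
# Hypothetical sample of (tweet text, retweet count). Not real data.
tweets = [
    ("Big account shares a link to someone else's post", 150),
    ("Great tutorial on one-class SVMs", 10),
    ("Bot spam with trending hashtags", 40),
    ("Thoughtful thread on model validation", 1),
]


def label_by_retweets(rows, threshold):
    """Label a tweet Important when its retweet count meets the threshold."""
    return [
        (text, "Important" if retweets >= threshold else "Not Important")
        for text, retweets in rows
    ]


# Try the three cutoffs asked about in the text: one, ten, one hundred.
for threshold in (1, 10, 100):
    labeled = label_by_retweets(tweets, threshold)
    important = sum(1 for _, label in labeled if label == "Important")
    print(f"threshold={threshold}: {important} of {len(tweets)} labeled Important")
```

With a cutoff of 100 only the big-account tweet survives, and the good tutorial with 10 Retweets is thrown away. With a cutoff of 10 the bot spam comes along for the ride.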

Labeling based on Binning

My next thought was to bin the tweets based on their Retweets into two buckets. Bucket one would be “Not Important” and bucket two would be “Important.” When I did this, I started getting a distribution that looked better. It wasn’t until I examined the buckets that I realized that this method glossed over a lot of good tweets.

In essence I was repeating the same mistakes as labeling based on Retweets. So if I trained a model on this, I’d still get shit.
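A small Python sketch shows why binning repeats the same mistake. It assumes equal-width bins, which is only one common way to bin (the text above doesn’t say which binning was used), and the retweet counts are hypothetical.

```python
def equal_width_bins(values, n_bins=2):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything lands in the first bin.
        return [0] * len(values)
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical retweet counts with one heavily retweeted outlier.
retweets = [0, 1, 2, 3, 5, 8, 10, 12, 40, 300]

bins = equal_width_bins(retweets)
labels = ["Important" if b == 1 else "Not Important" for b in bins]

for count, label in zip(retweets, labels):
    print(count, label)
```

One heavily retweeted post stretches the range, so the split lands at 150 Retweets and everything below it is ‘Not Important’. The bin edge is still a Retweet threshold, just one chosen by the data instead of by me.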

Labeling based on a One Class SVM

I realized after trying the above two methods that there was no easy way to do it. I wanted to find a lazy way of autolabeling but soon came back to what is important: the training set.

The power and accuracy of any classification model depends on how good its training set is. Never overlook this!

The solution was to use a One Class SVM process in RapidMiner. I would get a set of 100 to 200 Tweets, read through them, and then ONLY label the ‘Important’ ones. What were the ‘Important’ ones? Any Tweet that I thought was interesting to me and my followers.

After I marked the Important Tweets, I imported that data set into RapidMiner and built my process. The process is simple.

The top process branch loads the hand-labeled data, does some Text Processing on it, and feeds it into an SVM with its type set to one-class. Now the following is important!

To use a One Class SVM in RapidMiner, you have to train it on only one class, ‘Important’ being that class. When you apply the model to out of sample (OOS) data, it will generate an ‘inside’ or ‘outside’ prediction with confidence values. These values show how far a new data point sits inside the ‘Important’ class (meaning it’s Important) or outside of that class. I end up renaming the ‘inside’ and ‘outside’ predictions to ‘Important’ and ‘Not Important’.
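The same train-on-one-class, predict-inside-or-outside idea can be sketched outside RapidMiner with scikit-learn, whose `OneClassSVM` is also built on LIBSVM. Treat this as a rough illustration under assumptions, not a port of my process: the tweets are made up, TF-IDF stands in for the text processing branch, and `nu` is left at scikit-learn’s default. scikit-learn returns +1 and -1 where RapidMiner says ‘inside’ and ‘outside’.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical hand-labeled tweets: ONLY the Important class is present.
important_tweets = [
    "New tutorial on one-class SVM for anomaly detection",
    "How to tune gamma for an RBF kernel in a support vector machine",
    "Text mining tweets with TF-IDF and n-grams",
    "Building a training set for sentiment classification",
    "Explaining model predictions with LIME",
    "Feature selection tips for text classification models",
]

# Hypothetical out-of-sample tweets to autolabel.
new_tweets = [
    "Tutorial: training a support vector machine on text data",
    "Follow me for giveaways and free followers",
    "Win a free phone, click now",
]

# Fit the vocabulary on the training tweets only, then reuse it for new data
# (the same role the shared word list plays between the two branches).
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))
X_train = vectorizer.fit_transform(important_tweets)
X_new = vectorizer.transform(new_tweets)

# RBF kernel and gamma=0.001 mirror the starting guess used in the process.
model = OneClassSVM(kernel="rbf", gamma=0.001)
model.fit(X_train)

raw = model.predict(X_new)               # +1 = inside the class, -1 = outside
scores = model.decision_function(X_new)  # signed distance to the boundary
labels = ["Important" if p == 1 else "Not Important" for p in raw]

for text, label, score in zip(new_tweets, labels, scores):
    print(f"{label:>13}  {score:+.4f}  {text}")
```

With a handful of toy tweets the labels themselves mean little. The point is the shape of the workflow: one labeled class in, two predicted classes out.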

The bottom process branch takes the OOS data, text processes it, and applies the model for prediction. At the end I do some cleanup where I merge the Tweets together so I can feed them into my Twitter Content model, find my Important words, and actually build a classification model now!

Within a few seconds, I had an autolabeled data set! YAH!

Caution

While this process is a GREAT first start, there is more work to do. For example, I selected an RBF kernel and a gamma of 0.001 as a starting point. This was a guess and I need to put together an optimization process to help me figure out the right parameters to use to get a better autolabeling model. I’m also interested in using @mschmitz_’s LIME operator to help me understand the potential outliers when using this autolabeling method.
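I haven’t built that optimization yet, but here is one possible shape for it, again sketched in scikit-learn with made-up tweets. Since only ‘Important’ examples are labeled, the sketch holds a few of them out and checks what fraction each gamma still places inside the class. That check is my assumption for illustration. It says nothing about how well noise is rejected, so it is a rough signal at best.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical hand-labeled Important tweets.
important = [
    "New tutorial on one-class SVM for anomaly detection",
    "How to tune gamma for an RBF kernel in a support vector machine",
    "Text mining tweets with TF-IDF and n-grams",
    "Building a training set for sentiment classification",
    "Explaining model predictions with LIME",
    "Feature selection tips for text classification models",
    "Tuning a support vector machine for text classification",
    "A tutorial on building a training set from tweets",
]

# Hold out the last two Important tweets to test each gamma.
train, held_out = important[:6], important[6:]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train)
X_held = vectorizer.transform(held_out)


def inside_rate(gamma):
    """Fraction of held-out Important tweets predicted inside the class."""
    model = OneClassSVM(kernel="rbf", gamma=gamma).fit(X_train)
    preds = model.predict(X_held)
    return float((preds == 1).mean())


# Sweep gamma on a log scale, starting from the 0.001 guess.
rates = {gamma: inside_rate(gamma) for gamma in (0.001, 0.01, 0.1, 1.0)}

for gamma, rate in rates.items():
    print(f"gamma={gamma}: {rate:.2f} of held-out Important tweets predicted inside")
```

A real version would also need some hand-checked ‘Not Important’ tweets to score against, otherwise the sweep cannot tell a good boundary from one that simply accepts everything.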

The Process

As I noted above, this process is a ‘work in progress’ so use with caution. It’s a great blueprint because applying One Class SVMs in RapidMiner is easy but sometimes confusing.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.6.001" expanded="true" height="68" name="Open File" width="90" x="45" y="136">
        <parameter key="resource_type" value="URL"/>
        <parameter key="url" value="http://www.neuralmarkettrends.com/public/2017/tempfile.xlsx"/>
      </operator>
      <operator activated="true" class="read_excel" compatibility="7.6.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="238">
        <parameter key="connection" value="Twitter - Studio Connection"/>
        <parameter key="query" value="machinelearning &amp;&amp; ai &amp;&amp; datascience"/>
        <parameter key="result_type" value="recent"/>
        <parameter key="limit" value="3000"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select label and text 2" width="90" x="179" y="238">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text|label|Retweet-Count"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="289"/>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text 2" width="90" x="447" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter missing labels" width="90" x="179" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="label.is_not_missing."/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select label and text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text|label|Retweet-Count"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Label" width="90" x="581" y="34">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="34">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="Text" value="2.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="45" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" .!;:[,"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="715" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="849" y="34"/>
          <connect from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
          <connect from_op="Replace Tokens (2)" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Missing Att" width="90" x="849" y="34">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="support_vector_machine_libsvm" compatibility="7.6.001" expanded="true" height="82" name="SVM" width="90" x="983" y="34">
        <parameter key="svm_type" value="one-class"/>
        <parameter key="gamma" value="0.001"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="715" y="187">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="Text" value="2.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens (3)" width="90" x="45" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="34">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" .!;:[,"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (2)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="715" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="849" y="34"/>
          <connect from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
          <connect from_op="Replace Tokens (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Missing att 2" width="90" x="849" y="187">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="1117" y="34">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="join" compatibility="7.6.001" expanded="true" height="82" name="Join" width="90" x="1184" y="289">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Final Set" width="90" x="1318" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="prediction(label)|Text|Retweet-Count"/>
      </operator>
      <operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store" width="90" x="1452" y="289">
        <parameter key="repository_entry" value="../data/Twitter Content Enriched Data Set"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read Excel" to_port="file"/>
      <connect from_op="Read Excel" from_port="output" to_op="Filter missing labels" to_port="example set input"/>
      <connect from_op="Search Twitter" from_port="output" to_op="Select label and text 2" to_port="example set input"/>
      <connect from_op="Select label and text 2" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Nominal to Text 2" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
      <connect from_op="Nominal to Text 2" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Filter missing labels" from_port="example set output" to_op="Select label and text" to_port="example set input"/>
      <connect from_op="Select label and text" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Label" to_port="example set input"/>
      <connect from_op="Set Label" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Missing Att" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Filter Missing Att" from_port="example set output" to_op="SVM" to_port="training set"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Filter Missing att 2" to_port="example set input"/>
      <connect from_op="Filter Missing att 2" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Join" to_port="left"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 1"/>
      <connect from_op="Join" from_port="join" to_op="Select Final Set" to_port="example set input"/>
      <connect from_op="Select Final Set" from_port="example set output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Posted by Thomas Ott
