Extract Ernest Hemingway Quotes from Goodreads

Here’s a fast and simple process to extract Ernest Hemingway quotes from Goodreads. The process is not done yet: I still need to loop over each quote and add 1 day to the %{now} macro. The goal is then to write the quotes out in markdown, each dated %{now} + 1 day, and auto-schedule them on my other website (thomasott.io).
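The unfinished scheduling step might look roughly like the sketch below, written in Python rather than as a RapidMiner process. It assumes the quotes have already been extracted into a list. The function name, the front-matter layout, and the sample strings are placeholders, not output of the actual process.

```python
from datetime import date, timedelta


def schedule_quotes(quotes, start):
    """Return (publish_date, markdown) pairs, one quote per day after `start`."""
    posts = []
    for offset, quote in enumerate(quotes, start=1):
        # First quote goes out on start + 1 day, the next on start + 2 days, ...
        publish_on = start + timedelta(days=offset)
        markdown = (
            "---\n"
            f"date: {publish_on.isoformat()}\n"
            "---\n\n"
            f"> {quote}\n"
        )
        posts.append((publish_on, markdown))
    return posts


if __name__ == "__main__":
    # Placeholder strings, not real extracted quotes.
    sample = ["First placeholder quote.", "Second placeholder quote."]
    for publish_on, markdown in schedule_quotes(sample, date.today()):
        print(publish_on, markdown, sep="\n")
```

Using timedelta keeps the month and year rollover correct, so a run that starts on the 30th still lands on valid dates.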

Right now the Goodreads.com page structure is easy to extract from, but I suspect they’ll make it harder one day.


Use RapidMiner to Discover Twitter Content

Welcome to this new tutorial on how to use RapidMiner to discover Twitter content. I created this process as a way to monitor what’s going on in the Twitter universe and see what topics are being tweeted about. Could I do this in Python scripts? Yes, but that would be a big waste of time for me. RapidMiner makes complex ETL and other tasks simple, so I live and breathe it.

Why this process?

Back when I was in Product Marketing, I had to come up with many different blog posts and ‘collateral’ to help push the RapidMiner cause. I monitor what goes on at KDnuggets, Data Science Central, and of course Twitter. I thought it would be fun to extract key terms and subjects from Twitter (and later websites) to see what’s currently popular and help make a ‘bigger splash’ when we publish something new.

I’ve since applied this model to my new website Yeast Head to see what beer brewing lifestyle bloggers are posting about. The short version is that the terms ‘#recipies’ and ‘#homebrew_#recipes’ are most popular, so I need to make sure to include some recipes going forward. Interestingly enough, there are a lot of retweets with respect to the Homebrewer’s Association, so I’ll be exploiting that for sure.

The Process Design

This process uses RapidMiner’s text processing extension, X-means clustering, association rules, and a bunch of averaged attribute weighting schemes. Since I’m not yet scoring incoming tweets to see whether they are important or not (that will be a later task), I didn’t do any classification analysis. I did create a temporary label called “Important/Not Important” based on a simple rule: if Retweets > 10, then the tweet has to be important.

This is a problem because I don’t know what the actual retweet threshold is for important (aka viral) tweets, so my attribute weight chart (as above) will be a bit suspect, but it’s a start I suppose.
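As a rough illustration of that labeling rule outside RapidMiner, here is a small Python sketch. The tweet dictionaries, the "retweets" field name, and the sample values are made up for the example. The threshold is kept as a parameter because the right cutoff is still unknown.

```python
def label_tweets(tweets, retweet_threshold=10):
    """Attach a temporary 'label' to each tweet based on its retweet count."""
    labeled = []
    for tweet in tweets:
        # Strictly greater than the threshold, matching "Retweets > 10".
        important = tweet["retweets"] > retweet_threshold
        label = "Important" if important else "Not Important"
        labeled.append({**tweet, "label": label})
    return labeled


if __name__ == "__main__":
    # Made-up tweets, not real data.
    sample = [
        {"text": "placeholder tweet one", "retweets": 3},
        {"text": "placeholder tweet two", "retweets": 42},
    ]
    for row in label_tweets(sample):
        print(row["retweets"], row["label"])
```

Changing retweet_threshold and re-running the weighting is a cheap way to see how sensitive the attribute weights are to the cutoff.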

The Process

For this particular process I shared, I used a Macro to set the search terms to #machinelearning, #datascience, and #ai. When you run this process over and over, you’ll see some interesting Tweeters emerge.

Next Steps

My next steps are to figure out the actual retweet count that truly indicates whether a tweet is important and viral or not. I might write a one-class auto-labeling process or just hand-label some important and non-important tweets. That will hone the process and let me really figure out what the best number to watch is.

Mashing Up Julia Language with RapidMiner

If you want to execute any Python in RapidMiner, you have to use the Execute Python operator. This operator makes things so simple that people use the hell out of it. However, it wasn’t so simple in the “old days.” It could still be done, but it required more effort, and that’s what I did with the Julia Language. I mashed up the Julia Language with RapidMiner with only a few extra steps.

The way we mashed up other programs and languages in the old days was to use the Execute Program operator. That operator lets you execute arbitrary programs from within a RapidMiner process. Want to kick off some Java at run time? You could do it. Want to use Python? You could do that (and I did) too!

The best part? You can still use this operator today, and that’s what I did with Julia. Mind you, this tutorial is a simple proof of concept. I didn’t do anything fancy, but it works.

What I did was take a RapidMiner sample data set (Golf) and pass it to a Julia script that writes it out as a CSV file. I save the CSV file to a working directory defined by the Julia script.
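The general pattern (hand tabular data to an external program and let it do the work) can be sketched in Python with the standard subprocess module. This is not the RapidMiner process itself. It assumes the data reaches the external program on standard input, which is an assumption about the setup, and it uses a Python one-liner as a stand-in child program so the example runs without Julia installed. The function name and the sample rows are made up.

```python
import subprocess
import sys


def run_external(command, csv_text):
    """Run an external program, send csv_text to its stdin, return its stdout."""
    result = subprocess.run(
        command,
        input=csv_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Made-up rows in the style of a small sample table, not the real Golf data.
    sample_csv = "Outlook,Play\nsunny,no\novercast,yes\n"
    # Stand-in child program that simply echoes what it receives.
    # With Julia on the PATH, the command might be ["julia", "read.jl"] instead.
    echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    print(run_external(echo, sample_csv))
```

Passing the command as a list avoids shell quoting problems, which matters once the script path contains spaces.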

Tutorial Processes

A few prerequisites: you’ll need RapidMiner and Julia installed. Make sure your Julia path is correct in your environment variables. I had some trouble with this in Windows, but it worked fine after I fixed it.

Below you’ll find the XML for the RapidMiner process and the simple Julia script. I named the script read.jl and called it from my Dropbox, so you’ll need to change the path to match your computer.

The RapidMiner Process


The Julia Language script

Note: You’ll need to run Pkg.add("DataFrames") in Julia first.

Of course, the next step is to write a more defined Julia script, pass the data back INTO RapidMiner, and then continue processing it downstream.