Keras and NLTK

Lately I’ve been doing a lot more Python hacking, especially around text mining with the deep learning library Keras and with NLTK. Normally I’d do most of my work in RapidMiner, but I wanted to do some grunt work and learn something along the way. It was really about educating myself on Recurrent Neural Networks (RNNs) and, I guess, doing it the hard way.

As usual, I went to Google to do some sleuthing about how to text mine using an LSTM implementation in Keras, and boy did I find some goodies.

The best tutorials are easy to understand and follow along with. My introduction to Deep Learning with Keras was via Jason Brownlee’s excellent tutorial, Text Generation with LSTM Recurrent Neural Networks in Python with Keras.

Jason took a very easy-to-digest approach to implementing Keras: read in the Alice in Wonderland book character by character, then try to generate new text in the ‘style’ of what was written before. It was a great proof of concept, but fraught with some strange results. He acknowledges that and offers some additional guidance at the end of the tutorial, mainly removing punctuation and training for more epochs.

The text processing is one thing, but the model optimization is another. Since I have a crappy laptop, I can just forget about optimizing a Keras script, so I went the text-processing route and used NLTK.

Now that I’ve been around the text mining/processing block a bunch of times, the NLTK Python library makes more sense in this application. I much prefer the RapidMiner Text Processing implementation for 90% of what I do with text, but every so often you need something special and atypical.

Initial Results

The first results were terrible, as my tweet can attest to!

So I added a short function to Jason’s script that preprocesses a new file loaded with haikus, removing all punctuation and stop words, with the express goal of generating haiku.
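
A minimal sketch of what that preprocessing function looks like, assuming the haikus sit in a plain-text file (the haiku.txt filename and the function name are illustrative, not from the original script):

```python
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')

def preprocess_haiku(path='haiku.txt'):
    """Load raw haiku text, lowercase it, then strip punctuation and stop words."""
    with open(path, encoding='utf-8') as f:
        raw = f.read().lower()
    # Remove every punctuation character
    raw = raw.translate(str.maketrans('', '', string.punctuation))
    # Tokenize and drop English stop words
    stops = set(stopwords.words('english'))
    tokens = [word for word in word_tokenize(raw) if word not in stops]
    return ' '.join(tokens)
```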

While this script was learning, I started to dig around the Internet for some other interesting and related posts on LSTMs, NLTK, and text generation, until I found Click-O-Tron. That cracked me up. Leave it to us humans to take some cool piece of technology and implement it for lulz.

Implementation

I have grandiose dreams of using this script, so I would need to put it in production one day. This is where everything got to be a pain in the ass. My first thought was to run the training on a smaller machine and then use the trained weights to autogenerate new haikus in a separate script. This is not an atypical implementation. Right now I don’t care if this will take days to train.
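
In Keras, that split comes down to saving the trained weights to disk and reloading them from a second script that rebuilds the identical architecture. A rough sketch of the pattern (the layer sizes and the haiku_weights.h5 filename are placeholders, not my actual setup):

```python
from keras.layers import Dense, LSTM
from keras.models import Sequential

def build_model(seq_length, n_vocab):
    """Both scripts must build the exact same architecture,
    or the saved weights won't line up with the layers."""
    model = Sequential()
    model.add(LSTM(256, input_shape=(seq_length, 1)))
    model.add(Dense(n_vocab, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

# train.py: fit once on the training box, then persist the weights
#   model = build_model(seq_length, n_vocab)
#   model.fit(X, y, epochs=20, batch_size=128)
#   model.save_weights('haiku_weights.h5')

# generate.py: rebuild, reload, and sample new haikus anywhere
#   model = build_model(seq_length, n_vocab)
#   model.load_weights('haiku_weights.h5')
```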

While Python is great in many ways, dealing with libraries on one machine can be different from dealing with them on another machine and its hardware, especially GPUs and stuff like that. It gets tricky and annoying, considering I work on many different workstations these days. I have a crappy little Acer laptop with an AMD processor that I use to cron Python scripts for my Twitter-related work.

I do most of my ‘hacking’ on a larger laptop that happens to have an Intel processor. To transfer my scripts from one machine to another, I always have to make sure that every single Python package is installed on each machine. PITA!

Despite these annoyances, I ended up learning A LOT about Deep Learning architectures, their applications, and their shortcomings. In the end, it’s another tool in a Data Science toolkit; just don’t expect it to be a miracle savior.

Additional reading list

  • http://h6o6.com/2013/03/using-python-and-the-nltk-to-find-haikus-in-the-public-twitter-stream/
  • https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

The Python Script
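
A condensed sketch of the script: Jason’s character-level LSTM setup with the NLTK cleanup bolted on in front. The haiku.txt filename, the hyperparameters, and the weights filename are illustrative, not tuned values:

```python
import string

import numpy as np
from keras.layers import Dense, LSTM
from keras.models import Sequential
from nltk.corpus import stopwords

# Load and clean the corpus: lowercase, no punctuation, no stop words
# (the same cleanup idea as the preprocessing function above)
with open('haiku.txt', encoding='utf-8') as f:
    raw = f.read().lower()
raw = raw.translate(str.maketrans('', '', string.punctuation))
stops = set(stopwords.words('english'))
text = ' '.join(word for word in raw.split() if word not in stops)

# Map each distinct character to an integer index
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
n_vocab = len(chars)

# Slice the text into fixed-length sequences, each paired with
# the single character that follows it
seq_length = 40
dataX, dataY = [], []
for i in range(len(text) - seq_length):
    dataX.append([char_to_int[c] for c in text[i:i + seq_length]])
    dataY.append(char_to_int[text[i + seq_length]])

# Reshape to [samples, time steps, features] and normalize inputs
X = np.reshape(dataX, (len(dataX), seq_length, 1)) / float(n_vocab)
# One-hot encode the target characters
y = np.zeros((len(dataY), n_vocab))
y[np.arange(len(dataY)), dataY] = 1.0

# A single-layer LSTM, as in Jason's tutorial
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(n_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=20, batch_size=128)
model.save_weights('haiku_weights.h5')
```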


Using Python with RapidMiner

Just a quick note: I recently recorded a new video on how to use Python with RapidMiner Studio. This was part of the “Everyday Data Science with Tom Ott” series, and you can check out more of my RapidMiner videos here.

In this video I show you how to use the Twython package and RapidMiner’s Text Mining extension to load Twitter tweets, text process them, and post a retweet based on your processed data!

Update: This video and process were made prior to the introduction of the native Twitter operators, but the original strategy still applies.
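
For reference, the Twython side of that process looks roughly like this; the search term is arbitrary, the credential placeholders come from your own Twitter developer app, and the scoring step happens in RapidMiner in the video:

```python
from twython import Twython

# Placeholder credentials from your Twitter developer app
APP_KEY, APP_SECRET = 'xxx', 'xxx'
OAUTH_TOKEN, OAUTH_TOKEN_SECRET = 'xxx', 'xxx'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# Pull recent tweets matching a search term
results = twitter.search(q='RapidMiner', count=25)
for status in results['statuses']:
    print(status['id'], status['text'])

# After text processing the tweets (done in RapidMiner in the video),
# retweet the one that passes your filter:
# twitter.retweet(id=some_status_id)
```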

Sentiment Analysis in RapidMiner

This year, in late October, RapidMiner released an update to their Wordnet Extension. Granted, it’s December and I’m just getting around to playing with it now, but this particular update incorporated the SentiWordnet dictionary as a new operator. Here’s how to do Sentiment Analysis in RapidMiner.

SentiWordnet

This is really cool because it gives RapidMiner deeper access to the world of Opinion Mining. RapidMiner has always been able to do Sentiment Analysis from a statistical approach, but accessing the realm of Opinion Mining always required you to call processes (e.g. Sentiwordnet) outside of RapidMiner and integrate them into the workflow.

So how do you use it? First, you need to have the latest Wordnet Extension installed and download the latest Wordnet (3.0.0) dictionary onto your machine. When you process your text files, it will generate a new sentiment column with numerical values ranging from -1 to +1. What these values mean is best explained here.
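
If you want to poke at the underlying SentiWordnet scores directly, NLTK happens to ship the same dictionary in Python; a quick sketch (the word choice is arbitrary):

```python
from nltk.corpus import sentiwordnet as swn

# One-time setup: nltk.download('wordnet') and nltk.download('sentiwordnet')

# Each synset carries positive, negative, and objectivity scores in [0, 1];
# positive minus negative falls in the same -1 to +1 range
good = swn.senti_synset('good.a.01')
print(good.pos_score(), good.neg_score(), good.obj_score())
```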

Below are some screenshots of a simple Twitter process that searches for iPhone6 tweets and generates sentiment based on the tweet itself.

Overall Process

[Screenshot: 2014-12-05-Senti_Image1]

The New SentiWordnet Operator

[Screenshot: 2014-12-05-Senti_Image2]

Results

[Screenshot: 2014-12-05-Senti_Image3]

Bear in mind, the results I show here are very rudimentary, and tweets are generally very messy. What you would need to do is evaluate the results and apply the appropriate ETL (e.g. removing Tweetbots) or modeling in another process, but you get the idea here. For another great look at the new SentiWordnet operator, check out this post by Andrew.

Pro Tip: Check out the Tutorials page for more great RapidMiner tutorials.