What Works; What Doesn’t Work

The most important lesson I’ve learned while working at a startup is to do more of what works and jettison what doesn’t work, quickly. That’s the way to success; the rest is just noise and a waste of time. This lesson can be applied to everything in life.

Data is your friend

We generate data all the time, whether or not it’s captured in a database or spreadsheet; just by being alive you throw off data points. The trick is to take notice of the data, capture it, and then do something with it.

It’s the “do something with it” part that determines your success or failure. Success can be anything of value to you: time, money, weight loss, stock trading, whatever. You just need to start capturing data, evaluating it, and taking action on it.

This is where you fail

Many people fail by taking no action on the data they’ve captured and evaluated. They hope that things are going to get better or that things are going to change. Maybe they will, maybe they won’t, but you must act on what the data is telling you now.

NOW!

My Examples: What Works/Doesn’t Work

  1. My $100 Forex experiment worked really well for a time; then it started to flag. The data was telling me that my trading method was no longer working. Did I listen? Nope. I blew up that account. This didn’t work for me.
  2. Writing RapidMiner tutorials on this blog ended up getting me a great job with RapidMiner. This led to an amazing career in Data Science. Writing, and taking an interest in things, works.
  3. Day trading doesn’t work for me. I blow up all the time. What works for me is swing and trend trading. Do more of that and no day trading.

Keep it simple, stupid

Another thing I’ve learned working at a startup is to keep things simple and stupid. You’re running so fast trying to make your quarter that you have no time for complex processes. Strip things down to their minimum and go as light as you can. That way you can adjust your strategy and make changes quickly: do more of what works and jettison what doesn’t.


Keras and NLTK

Lately I’ve been doing a lot more Python hacking, especially around text mining, using the deep learning library Keras together with NLTK. Normally I’d do most of my work in RapidMiner, but I wanted to do some grunt work and learn something along the way. It was really about educating myself on Recurrent Neural Networks (RNNs) and doing it the hard way, I guess.


As usual, I went to Google to do some sleuthing about how to text mine using an LSTM implementation in Keras, and boy did I find some goodies.

The best tutorials are easy to understand and follow along with. My introduction to Deep Learning with Keras was via Jason’s excellent tutorial called Text Generation with LSTM Recurrent Neural Networks in Python with Keras.

Jason took an easy, bite-sized approach to implementing Keras: read in the Alice in Wonderland book character by character, then try to generate some text in the ‘style’ of what was written before. It was a great proof of concept, but fraught with some strange results. He acknowledges that and offers some additional guidance at the end of the tutorial, mainly removing punctuation and adding more training epochs.
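To make that concrete, the character-by-character setup looks roughly like this. This is a sketch, not Jason’s exact code, and the file name is a placeholder:

```python
# Load the raw text and build a character-level vocabulary
raw_text = open("wonderland.txt").read().lower()  # placeholder file name
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Slide a fixed window over the text: each run of 100 characters becomes
# one input sequence, and the single character after it is the target
seq_length = 100
dataX, dataY = [], []
for i in range(len(raw_text) - seq_length):
    dataX.append([char_to_int[c] for c in raw_text[i:i + seq_length]])
    dataY.append(char_to_int[raw_text[i + seq_length]])
```

The model never sees words, only integer-coded characters, which is why the generated text can drift into near-English gibberish.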

The text processing is one thing, but model optimization is another. Since I have a crappy laptop, I can just forget about optimizing a Keras script, so I went the text-processing route and used NLTK.

Now that I’ve been around the text mining/processing block a bunch of times, the NLTK Python library makes more sense in this application. I much prefer the RapidMiner Text Processing implementation for 90% of what I do with text, but every so often you need something special and atypical.

Initial Results

The first results were terrible, as my tweet can attest!

So I added a short function to Jason’s script that preprocesses a new file loaded with haikus. I removed all punctuation and stop words with the express goal of generating haiku.
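A sketch of that preprocessing, give or take the exact file and function names:

```python
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time fetch of NLTK's stop word list


def preprocess_haikus(path):
    """Lowercase a plain-text haiku file, then strip punctuation and stop words."""
    stops = set(stopwords.words("english"))
    text = open(path).read().lower()
    # Delete every punctuation character in one pass
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only the words that are not English stop words
    words = [w for w in text.split() if w not in stops]
    return " ".join(words)


clean_text = preprocess_haikus("haikus.txt")  # placeholder file name
```

Stripping stop words is a blunt instrument for haiku, since it throws away the short function words haiku often lean on, but it shrinks the vocabulary the network has to learn.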

While this script was learning, I started to dig around the Internet for some other interesting and related posts on LSTMs, NLTK, and text generation until I found Click-O-Tron. That cracked me up. Leave it to us humans to take some cool piece of technology and implement it for lulz.

Implementation

I have grandiose dreams of using this script, so I would need to put it into production one day. This is where everything got to be a pain in the ass. My first thought was to run the training on a smaller machine and then use the trained weights to autogenerate new haikus in a separate script. This is not an atypical implementation. Right now I don’t care if this will take days to train.
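That split is straightforward in Keras: checkpoint the weights to disk during training, then load them in a lightweight generation script. A rough sketch, where `model`, `X`, and `y` are the LSTM and training arrays from the full script further down, and the weights file name is a placeholder:

```python
# train.py -- the slow half; run it once and let it grind away
from keras.callbacks import ModelCheckpoint

# Write the best weights seen so far to disk as training progresses
checkpoint = ModelCheckpoint("weights-best.h5", monitor="loss",
                             save_best_only=True, mode="min")
model.fit(X, y, epochs=20, batch_size=128, callbacks=[checkpoint])

# generate.py -- the cheap half; rebuild the identical architecture on
# any machine, load the saved weights, and skip training entirely
model.load_weights("weights-best.h5")
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

The only requirement is that both scripts build the exact same architecture; otherwise the saved weights won’t line up.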

While Python is great in many ways, libraries that work on one machine might behave differently on another machine and its hardware, especially when GPUs are involved. It gets tricky and annoying, considering I work on many different workstations these days. I have a crappy little Acer laptop, which happens to have an AMD processor, that I use to cron Python scripts for my Twitter-related work.

I do most of my ‘hacking’ on a larger laptop that happens to have an Intel processor. To transfer my scripts from one machine to another, I always have to make sure that every single Python package is installed on each machine. PITA!

Despite these annoyances, I ended up learning A LOT about Deep Learning architectures, their applications, and their shortcomings. In the end, it’s another tool in the Data Science toolkit; just don’t expect it to be a miracle savior.

Additional reading list

  • http://h6o6.com/2013/03/using-python-and-the-nltk-to-find-haikus-in-the-public-twitter-stream/
  • https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

The Python Script
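What follows is a minimal sketch along those lines: Jason’s character-level LSTM with the haiku preprocessing bolted on. File names, the window length, and the training settings are placeholders, not the exact values I used.

```python
import string

import numpy as np
import nltk
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

nltk.download("stopwords")

# --- Preprocess: lowercase, strip punctuation and stop words ---
stops = set(stopwords.words("english"))
text = open("haikus.txt").read().lower()  # placeholder file name
text = text.translate(str.maketrans("", "", string.punctuation))
text = " ".join(w for w in text.split() if w not in stops)

# --- Map characters to integers and build sliding-window sequences ---
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}
n_vocab = len(chars)

seq_length = 40  # haikus are short, so a short window
dataX, dataY = [], []
for i in range(len(text) - seq_length):
    dataX.append([char_to_int[c] for c in text[i:i + seq_length]])
    dataY.append(char_to_int[text[i + seq_length]])

# Reshape to [samples, time steps, features] and normalize
X = np.reshape(dataX, (len(dataX), seq_length, 1)) / float(n_vocab)
y = np_utils.to_categorical(dataY)

# --- The LSTM, the same shape as the model in Jason's tutorial ---
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(n_vocab, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Checkpoint the best weights so generation can run in a separate script
checkpoint = ModelCheckpoint("weights-best.h5", monitor="loss",
                             save_best_only=True, mode="min")
model.fit(X, y, epochs=20, batch_size=128, callbacks=[checkpoint])

# --- Generate: seed with a random training sequence, then sample ---
pattern = list(dataX[np.random.randint(len(dataX))])
for _ in range(100):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
    index = int(np.argmax(model.predict(x, verbose=0)))
    print(int_to_char[index], end="")
    pattern = pattern[1:] + [index]
```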


Use RapidMiner to Discover Twitter Content

Welcome to this new tutorial on how to use RapidMiner to discover Twitter content. I created this process as a way to monitor what’s going on in the Twitter universe and see what topics are being tweeted about. Could I do this in Python scripts? Yes, but that would be a big waste of time for me. RapidMiner makes complex ETL tasks simple, so I live and breathe it.

Why this process?

Back when I was in Product Marketing, I had to come up with many different blog posts and ‘collateral’ to help push the RapidMiner cause. I monitored what went on at KDnuggets, Data Science Central, and of course Twitter. I thought it would be fun to extract key terms and subjects from Twitter (and later, websites) to see what was currently popular and help make a ‘bigger splash’ when we published something new.

I’ve since applied this model to my new website Yeast Head to see what beer brewing lifestyle bloggers are posting about. The short of it is that the terms ‘#recipies’ and ‘#homebrew_#recipes’ are the most popular, so I need to make sure to include some recipes going forward. Interestingly enough, there are a lot of retweets with respect to the Homebrewer’s Association, so I’ll be exploiting that for sure.

The Process Design

This process utilizes RapidMiner’s text processing extension, X-means clustering, association rules, and a bunch of averaged attribute weighting schemes. Since I’m not scoring any incoming tweets (this will be a later task) to see whether new tweets are important or not, I didn’t do any classification analysis. I did create a temporary label called “Important/Not Important” based on a simple rule: if a tweet has more than 10 retweets, it has to be important.

This is a problem because I don’t know the actual retweet threshold for an important (aka viral) tweet, so my attribute weight chart will be a bit suspect, but it’s a start, I suppose.
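For anyone replicating the labeling rule outside RapidMiner, a rough Python equivalent might look like this; the column names are guesses, not the actual attribute names from my process:

```python
import pandas as pd

# Hypothetical export of tweets with a retweet count column
tweets = pd.DataFrame({
    "text": ["a quiet tweet", "a tweet that took off"],
    "retweets": [3, 42],
})

# The temporary label: more than 10 retweets counts as "Important"
tweets["label"] = tweets["retweets"].apply(
    lambda n: "Important" if n > 10 else "Not Important"
)
```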

The Process

For the particular process I shared, I used a macro to set the search terms to #machinelearning, #datascience, and #ai. When you run this process over and over, you’ll see some interesting Tweeters emerge.

Next Steps

My next step is to figure out the actual retweet number that truly indicates which tweets are important and viral and which are not. I might write a one-class auto-labeling process or just hand-label some important and non-important tweets. That will hone the process and let me really figure out what the best number to watch is.