Recently in RapidMiner Category

Python's NLTK Module

Whenever I have a few minutes I've been fooling around with Python's Natural Language Toolkit (NLTK) and have found it to be incredibly fascinating, powerful, and easy to use.  Mostly I've been following the examples from the NLTK Book as I learn to navigate around.  There is a section in the book about classifying text data, which I still need to dig through, but I found the section on "tagging" word data fascinating.  

The goal for me is to use Python to scrape the data together and then let Rapidminer mine the data. For ease of use of powerful operators, Rapidminer wins hands down. If only they would extend Rapidminer with Python, like what they did for R, then I'd be happy as a nerd in new data!

Looking for my Rapidminer Videos

Ever since I switched my CMS from WordPress to Movable Type, my Rapidminer video tutorial links have been broken. I haven't had time to clean them up, much less create any new ones.

Fear not! The videos exist on YouTube and you can access them at my Neuralmarkettrends1 channel.

Sentiment Mining in Rapidminer

I came across a great SlideShare presentation from RCOMM 2011 about how to use Rapidminer for sentiment mining. While I wasn't there for this presentation, you can get a good idea of how Bruno Ohana and Brendan Tierney applied the various operators to the IMDb movie database.

The Secret Life of Pronouns

I've been traveling a lot lately and have managed to catch up on a bit of reading while cruising at 30,000 feet. On my Nook right now is a fascinating book that all text miners should at least browse in a bookstore. It's called "The Secret Life of Pronouns," by James Pennebaker. The premise of the book is that your social status, sex, personality, and secret intentions can be determined by analyzing pronouns (I, you, they), articles (a, an, the), and a few other function words.

In the beginning of his research, James used the Linguistic Inquiry and Word Count (LIWC) program, but he appears to have modified it with proprietary word dictionaries. On the surface, LIWC looks similar to the word frequency routine that Rapidminer runs in the Process Documents operator, but they went further and added a bit more "intelligence" to the analysis. They also rolled out a fun service called Analyze Words. You just enter your Twitter handle, click the button, and it gives you a snapshot of your tweet sentiment.

So how does this work? I suspect that James and his team use their dictionaries to categorize incoming text documents and then score them for the author's sex, social status, personality, and sentiment. I'm sure that a lot of hard "up front" work went into building these dictionaries. A lot of "up front" work is the norm with text mining, and if you try taking shortcuts, you'll likely get crappy models. I think a model like his can be built quite easily in Rapidminer, especially if you build a good crawling and sentiment system to test against. All that it requires is a bit of thought and the will to do it. Isn't the data-driven world we live in cool?
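To make the dictionary idea concrete, here is a rough Python sketch of counting function words by category. The two word lists are tiny and made up for illustration; they are not the LIWC dictionaries, and I'm only guessing that the real thing works along these lines.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny dictionaries (NOT the real LIWC word lists).
CATEGORIES = {
    "pronouns": {"i", "you", "they", "we", "he", "she", "it", "me", "my"},
    "articles": {"a", "an", "the"},
}

def function_word_rates(text):
    """Return each category's share of all word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in CATEGORIES.items():
            if token in words:
                counts[name] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {name: counts[name] / total for name in CATEGORIES}

rates = function_word_rates("I think the model is good, and you know it.")
print(rates)
```

The rates per category could then be fed into a learner as ordinary numeric attributes, which is roughly what a word vector out of Process Documents gives you.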

Heavy Crunch Time

Hi Readers, it's been a while since my last post, but life has been ridiculously busy for me. I have a mountain of questions and comments to answer from readers, and I apologize that it's taking so long. On top of all that, I still need to start writing my chapter in the Rapidminer Book! I ask for your patience as I work this all out in the coming month. Thanks.

Rapidminer Sample Process: Multiply Data

I often use the Multiply operator to make copies of my data set and feed them into different learners. I do this because sometimes I don't know whether a Neural Net operator or an SVM operator will give me better performance. Once I know which operator performs my task better, I then use the parameter optimization process to see if I can squeeze more accuracy out of it. The sample process below uses the Iris data set; just switch it out with your data set and enjoy. Multiply Sample Process.txt

Rapidminer Sample Process: Financial Text Mining

This is the sample Rapidminer process I used in Video #14. Just download the text file and import it into RM using the import process function. Please note, you will need to create the Excel spreadsheet yourself, as I show you in the video. Just save the Excel file in the 2003 format and you're done. Enjoy! Financial Textmining.txt

Rapidminer Sample Process: Parameter Optimization

Below is a simple parameter optimization process in Rapidminer using the Iris data set.  Download the TXT file and import it into Rapidminer.  Of course, you may use whatever data set you want and switch out the learner.  Make sure to update the parameter optimization operator parameters. :) RM-Parameter Optimization.txt
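If you want to see the idea behind a grid-style parameter optimization outside of Rapidminer, here is a minimal Python sketch. It is not the process in the TXT file: the toy data, the threshold "learner", and the names `accuracy` and `grid_search` are all made up for illustration.

```python
from itertools import product

# Toy data (made up): one numeric feature and a binary label.
DATA = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 1), (3.5, 1)]

def accuracy(threshold, flip):
    """Score a trivial threshold 'learner' on DATA for one parameter setting."""
    correct = 0
    for x, label in DATA:
        predicted = int(x > threshold)
        if flip:
            predicted = 1 - predicted
        correct += predicted == label
    return correct / len(DATA)

def grid_search(grid, score):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:  # ties keep the first combination found
            best_params, best_score = params, current
    return best_params, best_score

grid = {"threshold": [1.0, 2.25, 3.0], "flip": [False, True]}
best_params, best_score = grid_search(grid, accuracy)
print(best_params, best_score)
```

Note that this sketch scores each setting on the same data it was tuned on, which flatters the result. In a real process you would wrap the learner in a validation step before trusting the winning parameters.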
So I finally got around to downloading some keyword data from Google Analytics for the period 2/17/11 through 3/17/11, just to see what's driving my site traffic. I ran a simple text mining process in Rapidminer to build my keyword frequency list (it took me a few minutes) and generated keyword similarities. Of course I already know the biggest draw to my site, my tutorials about Rapidminer, BUT what I'm looking for are subtler patterns in the keywords relative to the bounce rates and site visits.

I generated three bubble charts from that one month of keyword data. Each plots site visits against bounce rate, with the bubble size set by how often a particular keyword appears. The first uses the keyword Rapidmi (the stemmed form of Rapidminer), the second uses Tutorial, and the third uses Stock.

It appears from this exercise that the keywords Rapidminer and Tutorial drive a lot of traffic, but they have a relatively even keyword frequency distribution across the bounce rate: some people bounce immediately while others stick. The keyword Stock has an interesting bounce rate per visit distribution relative to the keyword frequency: it's either 100%, 30 to 50%, or almost 0%. What I find fascinating is the stickiness of the Rapidminer and Tutorial keyword frequencies relative to the 50% bounce rate and site visits. There's a strong site visit (45 to 60) component for those keywords in the data, but I knew that already.

I'm attaching the Rapidminer process file in case you want to mine your own keywords (you have to supply your own data). KeywordSimilarity
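For anyone who would rather prototype the keyword counting in Python first, here is a rough sketch of the frequency list and the per-keyword points behind a visits vs bounce rate bubble chart. The rows are invented stand-ins for a Google Analytics export, there is no stemming, and the function names are my own, so treat it as an illustration rather than a copy of the attached process.

```python
import re
from collections import Counter

# Made-up rows standing in for a keyword export:
# (search phrase, visits, bounce rate as a fraction).
ROWS = [
    ("rapidminer tutorial", 50, 0.50),
    ("rapidminer video tutorial", 12, 0.25),
    ("stock trend neural net", 3, 1.00),
]

def tokenize(phrase):
    """Lowercase a phrase and split it into simple word tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", phrase.lower())

# Overall keyword frequency list across all search phrases.
frequency = Counter(tok for phrase, _, _ in ROWS for tok in tokenize(phrase))

def bubble_points(term):
    """Return (visits, bounce_rate, term_count) for each phrase containing `term`."""
    points = []
    for phrase, visits, bounce in ROWS:
        count = tokenize(phrase).count(term)
        if count:
            points.append((visits, bounce, count))
    return points

print(frequency.most_common(3))
print(bubble_points("stock"))
```

Each tuple from `bubble_points` maps onto one bubble: visits on one axis, bounce rate on the other, and the term count as the bubble size.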
I've been teaching myself R now that I finally got Rapidminer's R plugin to work. It's a pretty slick program and easy to learn; I've picked up many things quickly. I use the PerformanceAnalytics, quantmod, and tseries packages for R extensively, and on top of that I've started to recreate A Physicist on Wall Street's awesome Rapidminer + R Example for Trading tutorial. So far so good. It's fantastic that I can now download stock quotes, using the R plugin, right into Rapidminer and then model those time series. Yes, native R has a few learning algorithms, but they in no way match Rapidminer's breadth and depth. Rapidminer's ability to handle large data sets efficiently, together with R's statistical and graphing power, makes the Rapidminer and R combination a disruptive technology in my book. Download it today and play with it; it will make your data shine in ways you can only dream of.

About this Archive

This page is an archive of recent entries in the RapidMiner category.