April 30, 2009

Data Mining Social Networks

All the stuff you post about yourself and what you like in Facebook or some other social network is a marketer’s wet dream.  Data mining companies are now capitalizing on the free information you post about yourself, mining it, and then selling statistically significant data relationships to marketers via the social network’s APIs.

A company called Colligent mines social networks for data that it sells to record labels to help them decide which demographics or individual fans might like a particular artist, and those are just the very first nuggets marketers pull out of profiles.

This monitoring of publicly-available data has already paid dividends. Disney’s Hollywood Records label had noticed more Latin American fans at Jonas Brothers concerts than it expected to see, but until Colligent’s data revealed a “statistically significant” correlation between that band and the Latin American community, it hadn’t capitalized on that observation. Data from social networks convinced them to increase their marketing budget in Latin American communities, and when the next Jonas Brothers album came out, Nagarajan says, the label saw a significant uptick in sales to Latin Americans.

There’s a lesson here: If you want to participate in social networks and interact with free content online, there’s a clear privacy trade-off. In a way, it’s a fair deal: we get free data in the form of social networks and free entertainment, while marketers get free data about who we are — and what we can’t resist. [By Eliot Van Buskirk]

The best piece of advice is NOT to use social networks if you want to maintain your privacy.

April 10, 2009

Using ClassifierXL to Find the Right Stock to Buy

I recently downloaded the new version of TraderXL and was surprised to see a major update to the ClassifierXL module (part of the NeuroXL suite). I’ve used this module before to classify groups of like stocks and identify (per my requirements) the right stock to buy out of a group of many. Major updates to the module include a better GUI and the inclusion of five neural net functions, namely the Threshold, Hyperbolic Tangent, Zero-based Log-sigmoid, Log-sigmoid and Bipolar Sigmoid functions.

To see what it can do, I’m attaching a recently classified ADR stock scan spreadsheet from www.aaii.com. I downloaded this scan from AAII, used the zero-based log-sigmoid function, and classified the stocks into 5 similar groupings. After it crunched the data, it created two charts and a color coded spreadsheet from the data. If you flip to the charts in the spreadsheet, you’ll notice that clusters 1 and 5 have large groupings of similar stocks. These clusters represent the most interesting of the stock groups and should clue in the data modeler to some possible opportunities in the data.

Let’s say you are interested in investing in a China based company and you have lots of data from a stock scan to go through. How can you identify a good candidate for more due diligence? First open the spreadsheet and then use the pull down data sorting menus to select China as your country of choice.

The data in the spreadsheet will sort and show 7 China based stocks, with 5 being in Cluster 1 and 2 being in Cluster 5. This is an interesting revelation to me because not all 7 of these China based stocks are classified the same way. If you further drill down the data by selecting the Top 10 EPS Growth Estimate, then you are left with 4 China based stocks in Cluster 1: LFC, JOBS, BIDU, and MR. These 4 companies should give you a good, smaller list of stocks for further review.

Granted, this example was a fast way of doing a complex data analysis, but the ClassifierXL module helped simplify the process. The neat thing about this module is that it does all the heavy lifting for you and organizes the data in an easy to use spreadsheet!
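For readers curious about the neural net functions named in this post, here is a minimal Python sketch of the common textbook definitions of four of them. These are assumptions on my part, not NeuroXL's actual formulas, and the "zero-based log-sigmoid" is left out because its definition isn't given here.

```python
import math

def threshold(x):
    """Step function: outputs 1 at or above zero, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def hyperbolic_tangent(x):
    """Squashes any input into the range (-1, 1)."""
    return math.tanh(x)

def log_sigmoid(x):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """Sigmoid rescaled to the range (-1, 1)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

# Compare the four functions on a few sample inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), hyperbolic_tangent(x), log_sigmoid(x), bipolar_sigmoid(x))
```

The practical difference is mostly the output range: some functions map to 0..1 and others to -1..1, which can change how the resulting clusters separate.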

April 8, 2009

Welcome Back

Hi all! The stresses in my personal and work life are subsiding now and I’m happy to say that I’ll be posting again at some level of frequency. I was lucky to be extremely busy with infrastructure work in the past few months, which turned out to be a blessing in disguise. Why? It kept me out of this insane market! So I’ve been busy updating my models and getting ready to do some “light” trading.

Before I go back to updating my models I want to share with you a piece of exciting news: the developers of Stock Neuromaster have fixed the signal “flip flop” issue in version 1.33! For all those who stayed away from using this product because of the comments you read on my site, please feel free to try it out now and see for yourself.

I’m glad to be back!

March 6, 2008

A Top Down and Bottoms Up Approach to Neural Net Models

I was inspired to write this post after I read Foquant’s (formerly CPP Trader) post on Inductive versus Deductive Algorithms. He hits the nail on the head when he ends his article with:

Personally, I have used deductive reasoning to develop frameworks for money management methods, and then tested data in a similar, inductive method, to create the details.

There’s a reason why I like Foquant’s blog: he and I think the same way when we approach model building, but he just likes to use fancy terms! :)

Because I’ve been in the corporate world too long, I prefer to use the terms “top down” and “bottoms up,” when building a financial/neural net model. However you can easily replace those terms with “deductive reasoning” and “inductive reasoning” respectively.

The bottoms up approach is where you have oodles of data and you spend time cluster mining for relationships or for statistically significant patterns. Over time you start building a model that will lead to your output variable. This method is very rigorous and time consuming, but the final model should be very robust.

The other approach, top down, is where you spend a lot of time observing your output variable (e.g. stock, currency, index) behaving in the market environment and try to figure out what makes it work. Once you think you have an idea of what makes your output variable function, you gather the appropriate input variables and then statistically try to prove their relevance. Of course if you find that your inputs aren’t as robust as you’d like them to be, you’ll have to spend additional time looking for the right ones.

Both approaches have their strengths and weaknesses, and figuring out which approach to use to build a model really depends on the individual. While one method tends to be more trial and error (top down), the other tends to be more hypothesis testing (bottoms up).

Personally I start with the top down approach to build a model and then check it using a bottoms up approach, just like Foquant. This is perhaps the most time consuming way of building a financial model but it’s led to great success for me and I continue to use it to this day!

March 2, 2008

New RapidMiner Tutorials Coming

Yes, you read that right, I’m working on a new set of RapidMiner tutorial posts (I’m bagging the videos for now). I hope to share with my readers two new tutorials over the coming weeks/months and I expect to post the first installment this week.

The first tutorial will be about using RapidMiner’s Evolutionary Weighting and Genetic Algorithms to build a Market Timing Model. This tutorial will follow the same methodology I used to build my S&P500 Market Timing Model. If you ever wondered how to build a timing model using RapidMiner then make sure to check back often for these free and valuable lessons.

After I’m done with that tutorial, I’ll delve into the world of Web and Text Mining. I’ll show you how to web mine blogs and websites for interesting bits of data. I’ll make sure that all new tutorials will have downloadable templates and data for you to play with.

On top of this, over the next few months I’ll be updating my original YALE tutorials to be compatible with the new RapidMiner format. The formats are close but not close enough. I’m sure that many readers downloaded the YALE templates, opened them with the new RapidMiner software, and found some “run time errors.”

February 12, 2008

Trimming Outliers In RapidMiner

I was inspired to write a short post about trimming outliers in RapidMiner after a comment from dc yesterday. Although I’ve never used this particular set of data pre-processing operators (I always inspect my data visually), I find them to be interesting and worth a look.

If you right click and select “New Operator”, you’ll find many parent category operator selections. Choose the “Pre-Processing” category, then “Data”, and then “Outlier.”

Once in the outlier directory you’ll find three operators: densitybasedoutlierdetection, distancebasedoutlierdetection, LOFoutlierdetection.

Here’s what each of them does in brief:

  • The densitybasedoutlierdetection operator scans your data set and looks for outliers based on a density function (squared distance, euclidean distance, angle);
  • The distancebasedoutlierdetection operator uses a k-nearest neighbor algorithm to find outliers, and;
  • The LOFoutlierdetection operator uses minimal upper and lower bounds (with a density function) to find outliers.

These operators, in an experiment, will automatically “snip” the outlier data records and then build your neural net model from the remaining data. Check out RapidMiner’s “Pre-Processing” category for more great data “cleaning” goodies!
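To give a feel for the distance-based idea, here is a minimal Python sketch that scores each record by the distance to its k-th nearest neighbor and snips the highest scoring ones. This is only an illustration of the general technique, not RapidMiner's implementation. The function names, the parameters `k` and `n_outliers`, and the sample data are all made up for this example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def trim_outliers(points, k=2, n_outliers=1):
    """Drop the n_outliers points with the largest k-NN distance scores."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    drop = set(ranked[:n_outliers])
    return [p for i, p in enumerate(points) if i not in drop]

# Made-up data: four records close together and one far away.
data = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.95), (8.0, 9.0)]
cleaned = trim_outliers(data, k=2, n_outliers=1)
print(cleaned)
```

A record that sits far from all of its neighbors gets a large score, so it is the first to be trimmed before the remaining data goes to the learner.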

August 9, 2007

Modeling Robust Data

I mentioned in my supervised learning post that your data model will only be as good as your training data. Too often, and I’ve been guilty of this as well, data modelers throw all kinds of input variables together into a training data set, figuring the neural net learner needs them all. They click “run”, the learner magically creates a model, and the prediction sets start spitting out predictions.

Let me ask you this, Mr./Ms. Modeler: how confident are you that the model itself is significant? Would you be willing to bet your job on it? After all, we’ve all heard the saying, “garbage in, garbage out.” How do we separate the good data from the garbage data or prevent modeling garbage data in the first place?

To prevent embarrassment, or the loss of your job, there are two main statistical measures that you can use to check your training data for significance. I do this almost every time before I run a neural net learner because throwing in unnecessary input variables slows down your analysis and takes up valuable memory resources. The two measures I’m talking about are the coefficient of determination (R2) and t-statistics.

Coefficient of determination (R2)

The coefficient of determination (R2) measures how much of the variation in your output is explained by your inputs. Taking my previous post’s example of plankton growth (PG), how do fluctuations in the measures of sea temperature (ST), sunlight intensity (SI), and whale population (WP) truly affect the output PG?

Running this statistical analysis can be an eye opener because you can easily see if your inputs truly do drive your output. This simple test helped me determine that the original option volatility model I built for my client (recreating the model in a research paper at his request) was a piece of junk. The R2 measure for the original model was a mere 5%, meaning that the inputs were driving only 5% of the volatility prediction. My newer volatility model is now showing an R2 in the high 80% range. Just knowing this makes me feel confident that the model I’m building isn’t garbage.

For reference, a measure of 0% means your model is insignificant and 100% means your model is perfect (you’ll never get this high, nothing’s perfect).

Tip: If you don’t have a statistical package or the means to calculate R2, may I suggest using the LINEST function in Excel. This Excel function allows you to calculate the R2 value and t-stats, which will be discussed next.
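If you’d rather see the arithmetic than trust a spreadsheet, here is a minimal Python sketch of the usual R2 calculation: one minus the ratio of unexplained variation to total variation. The function name and the numbers are made up for illustration, and the sketch assumes you already have your model’s predictions in hand.

```python
def r_squared(actual, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up numbers: one set of predictions tracks the output closely,
# the other just guesses the average every time.
actual = [2.0, 4.0, 6.0, 8.0]
good = [2.1, 3.9, 6.2, 7.8]
poor = [5.0, 5.0, 5.0, 5.0]

print(r_squared(actual, good))
print(r_squared(actual, poor))
```

Predictions that track the output closely score near 100%, while always guessing the average scores 0%, which matches the reference points above.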

T-statistics

A t-statistic is nothing more than a measure of each input’s statistical significance to the output. It could be that the variables ST, SI, and WP together explain 80% of PG’s growth, but when you test each variable, you might find that ST and SI are very significant but WP isn’t. As a rule of thumb, a measure of greater than 2 or less than -2 indicates significance. The greater the number (or lesser, as the case may be), the more significant that input variable is in explaining your output.

This measure is really good at identifying the weakest input from your training data. In some cases you can actually delete a few input variables from your training data without affecting your coefficient of determination. If that’s the case, then you just saved yourself some CPU time when you build your model.

Tip: Once again you can calculate T-statistics using Excel’s LINEST function (just read the help section for explanation).
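For the curious, here is a rough Python sketch of how t-statistics fall out of an ordinary least squares fit: each coefficient is divided by its standard error. This is a plain-Python illustration under my own assumptions, not what LINEST does internally. The helper names and the data (a hypothetical sea temperature and whale population series, already centered) are invented for the example.

```python
import math

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def t_statistics(inputs, output):
    """Return (coefficients, t-stats) for a least squares fit with intercept."""
    x = [[1.0] + list(row) for row in inputs]
    n, p = len(x), len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(x, output)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(x, output)]
    sigma2 = sum(e * e for e in resid) / (n - p)  # residual variance
    t = []
    for i in range(p):
        # The i-th diagonal entry of inverse(X'X) scales the standard error.
        unit = [1.0 if j == i else 0.0 for j in range(p)]
        inv_diag = solve(xtx, unit)[i]
        t.append(beta[i] / math.sqrt(sigma2 * inv_diag))
    return beta, t

# Made-up data: (sea_temp, whale_pop) rows and a plankton growth output
# that was constructed to depend on sea_temp only.
inputs = [(1, 1), (2, -1), (3, -1), (4, 1), (5, 1), (6, -1), (7, -1), (8, 1)]
output = [5.5, 6.5, 9.5, 10.5, 12.5, 15.5, 16.5, 19.5]

beta, t = t_statistics(inputs, output)
print("coefficients:", beta)
print("t-stats:", t)
```

In this toy data the sea temperature input lands far outside the +/-2 band while the whale population input sits near zero, so the second one is the candidate to drop.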

There you have it, two good ways to check if the data you are about to model is robust or weak. As always, if you have questions please drop me a comment.

Update: For clarification purposes as to why I’m using linear statistics to test data that will be used in a nonlinear model, I’m posting a screen shot from the book Data Mining and Business Intelligence by Stephan Kudyba and Richard Hoptroff.

Discussions with the author yielded the following important information: the coefficient of determination is the best indicator for data robustness and works for both linear and nonlinear models.  T-stats will be less reliable in a non-linear model but are important as a “check” for your overall model’s robustness.  If a low scoring t-stat input variable is removed and your R2 barely moves, then your model (linear and nonlinear) is very stable.  Conversely if you remove a low scoring t-stat input from your model and the R2 swings wildly, you have a very unstable model.

July 25, 2007

S&P500 Volatility Perspective

I took the S&P500 daily % changes, squared them, and put them into an Excel chart. The result is an interesting perspective on daily volatility and its spikes over the past 14 years. Most notable is that little spike we had on February 27, 2007. Something tells me that was a very significant day.
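The same calculation is easy to reproduce outside of Excel. Here is a minimal Python sketch, assuming you have a list of daily closing prices. The function name and the prices are made up for illustration and are not actual S&P500 data.

```python
def squared_pct_changes(closes):
    """Daily % change of each close versus the prior close, squared."""
    changes = [(today / prev - 1.0) * 100.0
               for prev, today in zip(closes, closes[1:])]
    return [c * c for c in changes]

# Made-up closes: a 2% gain, a 5% drop, then a flat day.
closes = [100.0, 102.0, 96.9, 96.9]
print(squared_pct_changes(closes))
```

Squaring throws away the direction of the move and exaggerates the big days, which is why the spikes stand out so clearly on a chart.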

July 19, 2007

Using Cross Validation in YALE

Another important building block of any neural net model is the creation of training and validation data sets for your model. The data you feed your neural net model is typically called “training data” and you use it to train the neural net model to learn the relationships from this data. The question that then arises is: how do you know if the neural net is being trained correctly? Is it learning the right data relationships?

The way to overcome this problem and test the model as the neural net learns is to introduce something called validation data sets. A validation data set is just a random sample from your training data that is taken and then applied to the model. Once the validation data is applied to the model, the model calculates a predicted value. This predicted value is then compared to the actual data value and the error between them is determined. The neural net does this for every validation data point, adjusting the weights (more on this later) in the model each time to minimize the validation error. When the errors converge or can’t be minimized any more, the model has been trained.

YALE has a great operator called the Cross Validation operator that creates a validation data set on the fly for you. The Cross Validation operator allows you to tell just how much of your training data should be used for validation data and if you should use all the data (training and validation) to rebuild your final model.

Tip: YALE has a few other validation operators, explore them when you have time. This particular operator is useful if you want to check performance measures, which I suggest you do always.

The way to use it is to load it right before you place your neural net operators (more on this later) and after your data loading operator. Once you have it in your experiment tree, you can then tweak the parameters to your liking. Some important parameters for this operator are “Create Complete Model”, “Number of Validations”, and “Sampling Type.”

The Create Complete Model parameter tells the Cross Validation operator to create a validation data set for testing and simultaneously use it to build the model. If left unchecked, the operator will only use the validation data for testing and essentially remove it from the training data set.

The Number of Validations parameter is just the quantity of data points you want to use to test your model. If you have 100 data points, I suggest using 10%, or 10 data points, for validation. If you have 10,000 data points, maybe 20%. It all depends on your comfort level and the complexity of the data you’re modeling. The last important parameter is the Sampling Type. This pull down menu allows you to choose how to sample your training data for validation data points. You have three choices: linear, shuffled, and stratified sampling (more on this later).
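To make the idea concrete outside of YALE, here is a minimal Python sketch of the textbook k-fold scheme: the rows are split into k groups and each group takes one turn as the validation set while the rest are used for training. This is an assumption about the general technique and may not match exactly how YALE’s operator interprets its parameters. The function name and its arguments are invented for this example, and stratified sampling is not shown.

```python
import random

def k_fold_splits(n_rows, k=10, shuffled=True, seed=42):
    """Yield (training, validation) row-index lists, one pair per fold."""
    indices = list(range(n_rows))
    if shuffled:
        # Shuffle once up front so each fold is a random sample.
        random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation

# 100 rows and 10 folds: each round holds out 10 rows (10%) for validation.
for training, validation in k_fold_splits(100, k=10):
    print(len(training), "training rows,", len(validation), "validation rows")
```

Because every row lands in a validation set exactly once, you get a performance estimate from all of your data without ever testing the model on rows it was trained on in that round.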

That’s it, another important building block explained. I hope that these smaller, but more detailed, tutorials are helpful to you. If they are, how about subscribing to my feed? As always, feel free to drop me an email or comment if you have questions.

July 18, 2007

Finding Relationships in IM

A great blog that I read, and have recently added to my blog roll, is Data Mining: Text Mining, Visualization, and Social Media written by Matthew Hurst. Recently Matthew posted an interesting heatmap (below) that shows relationships of IM across the world. This map was a result of work detailed in a paper written by Jure Leskovec and Eric Horvitz called World Wide Buzz: Planetary-scale Views on an Instant Messaging Network.

IM Heat Map

To interpret the map, Matthew says, “A line is drawn between the locations of the participants in the conversations. When a line crosses a cell it increases its value (red).” I haven’t read the paper yet but it’s on my “must read pile.” Fascinating stuff alright!

July 14, 2007

RapidMiner 4.0Beta vs YALE 3.4

A few months ago YALE introduced its new version called RapidMiner 4.0beta. YALE is now called RapidMiner and for what it’s worth, I think it’s a better name.

I performed the upgrade and shortly thereafter I noticed that several of my existing models would produce different results, so I began to investigate. I found that there are some bugs in the beta version and have since downgraded back to YALE 3.4. I highly suggest that my readers who use YALE remain with version 3.4 until RapidMiner is out of beta development!

Still though, RapidMiner, like YALE before it, is a fantastic open source data modeling environment!

July 12, 2007

Web and Text Mining for the Masses!

I installed the Word & Web Vector plugin for YALE (RapidMiner) this week and have been pleasantly surprised with it. However, as with any YALE plugin or software, it takes a lot of time to figure out how to use it. Despite the steep learning curve, I’ve been able to web mine a few websites and build a preliminary word list.

Now, no structured web data source is safe from the clutches of Neural Market Trends!

The Word & Web Vector Tool is a flexible Java library for statistical language modeling and integration of Web and Webservice based data sources. It supports the creation of word vector representations of text documents in the vector space model that is the point of departure for many text processing applications (e.g. text classification or information retrieval). Furthermore, it offers convenient interactive methods to extract data from structured sources, such as HTML or XML files. Finally, it allows to integrate external data by using Webservice APIs in a mashup-like way (e.g. for geo-mapping). [nemoz.org]
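If “word vector representation” sounds abstract, here is a minimal Python sketch of the basic idea: build a word list from all the documents, then describe each document by how often each word appears in it. This is a bare-bones illustration and not the plugin’s code. The function names and the two sample sentences are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_vectors(documents):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors

# Two made-up headlines to turn into vectors.
docs = ["Stocks rally as volatility falls", "Volatility spikes as stocks fall"]
vocabulary, vectors = word_vectors(docs)
print(vocabulary)
for vec in vectors:
    print(vec)
```

Once every document is a row of numbers over the same word list, it can be fed to a classifier or clustered like any other data set.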

I’m looking forward to becoming the new Google! :)
