This afternoon I took the first of three free webinars run by Michelle Leder of footnoted. I’ve poked around in the filings on the SEC website before, but found the documents onerous and lengthy. The short of it is that my eyes glazed over in horror when I saw them. Now, after Michelle’s class, I have a clearer picture of what to look for in a 10-K annual report and feel like the research task isn’t that bad after all.
The trick is how to feed this unstructured information into a text mining model, because the changes in filings from one period to the next can be really subtle. The language also changes from one industry to the next, so any model I build is going to be sector/industry specific. From what I learned this afternoon, building a model is going to be trickier than I first assumed, so I’ll have to keep brainstorming.
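To make the "subtle changes between periods" idea concrete, here is a minimal sketch in Python that lines up the sentences of two filing excerpts and reports only the ones that differ. The function names and the two sample excerpts are made up for illustration, the sentence splitter is deliberately naive, and a real 10-K would need far more careful cleanup before a comparison like this means anything.

```python
import difflib
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def changed_sentences(old_text, new_text):
    """Return (tag, old_sentences, new_sentences) for every non-matching span."""
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes


# Hypothetical excerpts, not taken from any real filing.
last_year = "We lease our headquarters. We have no material legal proceedings."
this_year = "We lease our headquarters. We are party to two material legal proceedings."

for tag, before, after in changed_sentences(last_year, this_year):
    print(tag, before, "->", after)
```

Comparing whole sentences rather than single words keeps the output readable, but it also means a one-word edit shows up as a full sentence replacement. That is probably fine for a first pass at spotting where to look.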
I’ve been traveling a lot lately and managed to catch up on a bit of reading while cruising at 30,000 feet. On my nook right now is a fascinating book that all text miners should at least browse in a bookstore. It’s called “The Secret Life of Pronouns,” by James Pennebaker.
The premise of the book is that your social status, sex, personality, and secret intentions can be determined by analyzing pronouns (I, you, they), articles (a, an, the), and a few other function words. In the beginning of his research, James used the Linguistic Inquiry and Word Count (LIWC) program, but he appears to have modified it with proprietary word dictionaries.
On the surface, LIWC looks similar to the word frequency routine that RapidMiner runs in the Process Documents operator, but they went further and added a bit more “intelligence” to the analysis. What they did was roll out a fun service called Analyze Words. You just enter your Twitter handle, click the button, and it gives you a snapshot of your tweet sentiment.
So how does this work? I suspect that James and team use their dictionaries to categorize incoming text documents, then test the results for the author’s sex, social status, personality, and sentiment. I’m sure that a lot of hard “up front” work went into building these dictionaries. That kind of “up front” work is the norm with text mining, and if you try to take shortcuts, you’ll likely get crappy models.
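Here is a rough Python sketch of what I imagine the dictionary step looks like: count how often words from each category show up, as a share of all words in the text. To be clear, this is my guess at the general idea and not how LIWC or Analyze Words actually works. The tiny word lists and the `category_rates` name are made up for illustration, and the real dictionaries are much larger.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries (an assumption), nowhere near the real thing.
CATEGORIES = {
    "first_person": {"i", "me", "my", "mine", "we", "us", "our"},
    "second_person": {"you", "your", "yours"},
    "third_person": {"he", "she", "they", "him", "her", "them", "their"},
    "articles": {"a", "an", "the"},
}


def category_rates(text):
    """Return each category's share of the total word count in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for name, vocab in CATEGORIES.items():
            if word in vocab:
                counts[name] += 1
    total = len(words)
    # Guard against empty input so we never divide by zero.
    return {name: (counts[name] / total if total else 0.0) for name in CATEGORIES}


sample = "I think the model I built works, but you should test it."
print(category_rates(sample))
```

The rates on their own don't say anything about the author. The hard "up front" part would be collecting enough labeled text to learn which rates go with which traits, and that is the piece I'd hand to a proper model.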
I think a model like his can be built quite easily in RapidMiner, especially if you build a good crawling and sentiment system to test against. All it requires is a bit of thought and the will to do it. Most likely the original is written in Python, but it would be fun to replicate it. Isn’t the data-driven world we live in cool?