Tag Machine Learning


Orange 3 is impressive

I've been keeping a lazy eye on Orange over the years, and its (fairly) recent update has made it quite an impressive contender in the data science visual platform space. While it's not RapidMiner, it has a lot of great things going for it. First, its entire core was rewritten to tightly integrate with Scikit-Learn and Python. It has a decent Time Series add-on that comes stock with ARIMA, a really good Text Processing add-on that gives the user finer control than RapidMiner's, and a great native GEO Map widget.

Sure, there is no production server or native Hadoop connectivity, but that can be worked around by creating a new Widget in Python, calling some Orange classes, and exporting Pickle files.
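To give a feel for the Pickle-export idea, here's a minimal sketch using plain scikit-learn (which Orange wraps under the hood) rather than Orange's own widget API; the dataset and filename are just stand-ins:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train any scikit-learn model; Orange's learners wrap scikit-learn internally
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=500).fit(X, y)

# Export the fitted model to a Pickle file...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and reload it elsewhere (say, on a server) to score new data
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print((restored.predict(X) == model.predict(X)).all())
```

The reloaded model scores identically to the original, which is exactly what you'd want when handing a trained flow off to another process.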

Testing out Orange3

I took a few minutes yesterday and this morning to build a simple Twitter text processing flow. Just like in RapidMiner, you connect 'Widgets' together, and each Widget does a specific task. In the process below I connect to Twitter, search for 'RapidMiner', and extract the corpus. I use the NLTK package to do my stopword filtering and convert the text via TF-IDF. Whenever you connect a widget, the process executes that widget, so you don't have to hit 'play' all the time.

From there I do two more things: I create a word cloud and run hierarchical clustering on the corpus.
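The hierarchical clustering step has a direct SciPy equivalent too. A toy sketch on four invented documents, clustering their TF-IDF vectors with Ward linkage, much like the widget builds its dendrogram:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents: two about RapidMiner, two about Orange
docs = [
    "rapidminer data science platform",
    "rapidminer visual workflows",
    "orange text mining widgets",
    "orange python scripting widgets",
]

# TF-IDF vectors as input, Ward linkage to grow the cluster tree
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(tfidf, method="ward")

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The two RapidMiner documents land in one cluster and the two Orange documents in the other, mirroring what the dendrogram in the flow shows visually.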

There is a pretty rich set of ETL Widgets too, and if you don't find what you need, you can just use the Python Script widget to write your own code.

Some of the negatives I've encountered are that it crashes when I try to install too many add-ons and it doesn't feel stable enough on a Windows machine, but overall it's quite impressive. I'm going to continue tinkering with this software and writing about it.


Machine Learning for Predicting the Unknown

Great interview with Courtenay Cotton of n-Join. Here are some key tidbits I found interesting.

  • People develop new algorithms and have breakthroughs, but it’s always that you’re optimizing algorithms, you’re solving for functions.

  • Data cleaning and data wrangling, as the first step doing any of this stuff, is a giant part of this field. There’s almost never not errors in your data.

  • In the tech community about 10 years ago, there was a cliché — not always true — that everyone was a college dropout. But it seems like machine learning is really driven by academics. (via Medium)

  • There’s always an air of mystery because, in reality, even for us researchers, a lot of these algorithms are black boxes.

  • Some AI researchers are legitimately trying to figure out how you would get a machine that learned like a human child. But in general, most of the work is “I need this very specific thing that just does this one thing, and I’m going to throw all the data in the world that I can get my hands on at it.” At the end, it will be pretty good at that one thing—if we have the right data.


Labeling Training Data Correctly

When you’re dealing with a classification problem in machine learning, good labeled data is crucial. The more time you spend labeling training data correctly, the better. This is because your model’s performance and deployment will depend on it. Always remember that garbage in means garbage out.
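A quick way to see "garbage in, garbage out" numerically is to flip a chunk of training labels and compare test accuracy. A toy sketch on synthetic data (the 30% noise rate is just an illustrative assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Mislabel 30% of the training set to simulate sloppy labeling
rng = np.random.default_rng(0)
y_bad = y_tr.copy()
flip = rng.random(len(y_bad)) < 0.30
y_bad[flip] = 1 - y_bad[flip]

# Same model, same features; only the label quality differs
clean = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
noisy = LogisticRegression().fit(X_tr, y_bad).score(X_te, y_te)
print(f"clean labels: {clean:.3f}  noisy labels: {noisy:.3f}")
```

Nothing about the model changed between the two runs; only the labels did, and the test score pays for it.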

Thoughts on labeling data

I recently listened to a great O’Reilly podcast on this subject. They interviewed Lukas Biewald, Chief Data Scientist and Founder of CrowdFlower. CrowdFlower provides its clients with top-notch labeled training data for various machine learning tasks, and they’re busy!

A few bits that caught my ear were how much labeled training data is being used for deep learning, and that they’re seeing more demand for labeled image data for self-driving cars.

The best part of the interview was Lukas’s discussion of running TensorFlow on a Raspberry Pi! How cool is that?

The Podcast

https://soundcloud.com/oreilly-radar/data-preparation-in-the-age-of-deep-learning?in=oreilly-radar/sets/the-oreilly-data-show-podcast


Originally published at Neural Market Trends.


Machine Learning on a Raspberry Pi

It looks like Google is catching up to the idea of machine learning on a Raspberry Pi! Someone put RapidMiner on a Pi back in 2013, but it was slow because the Pi was underpowered.

The Pi has been a great thin client and a small, but capable server. I’ve used it for my Personal Weather Station project and as an FTP server. Based on the news, things are about to get interesting for both Google and Raspberry Pi.

I don’t know what Google is planning to release to the Pi and Maker community, but based on the survey I filled out, they haven’t decided yet. They’re looking at C#, JavaScript, Go, Swift, Python, and all the other usual suspects.

Raspberry Pi

The problem is optimizing the machine learning libraries for the Pi and having enough of them available to make it worthwhile for the community. My guess is that they’ll go with Python, TensorFlow, and Go (Grumpy).

Whatever they decide, I consider this big news for Tinkerers and Makers everywhere. There will be an explosion of innovation if the Google toolkit is comprehensive. The startup barrier to entry has been lowered: all you need is a Pi ($40), a domain, some sweat equity, and a dream.


Data Mining Social Networks

All the stuff you post about yourself and what you like on Facebook or some other social network is a marketer’s dream. Data mining companies are now capitalizing on the free information you post about yourself, mining it, and then selling statistically significant data relationships to marketers via the social networks’ APIs.

A company called Colligent mines social networks for data that it sells to record labels to help them decide which demographics or individual fans might like a particular artist, and those are just the very first nuggets marketers pull out of profiles.

This monitoring of publicly-available data has already paid dividends. Disney's Hollywood Records label had noticed more Latin American fans at Jonas Brothers concerts than it expected to see, but until Colligent's data revealed a statistically significant correlation between that band and the Latin American community, it hadn't capitalized on that observation. Data from social networks convinced them to increase their marketing budget in Latin American communities, and when the next Jonas Brothers album came out, Nagarajan says, the label saw a significant uptick in sales to Latin Americans.

There's a lesson here: If you want to participate in social networks and interact with free content online, there's a clear privacy trade-off. In a way, it's a fair deal: we get free data in the form of social networks and free entertainment, while marketers get free data about who we are — and what we can't resist. By: Eliot Van Buskirk

The best piece of advice is NOT to use social networks if you want to maintain your privacy.
