Matrix Factorization for Missing Value Imputation

I stumbled across an interesting Reddit post about using matrix factorization (MF) to impute missing values.

The original poster was trying to model a complex time series that had missing values. The suggested solution was to use matrix factorization to impute those missing values.

Since I had never heard of that application before, I got curious and searched the web for more information. I came across this post on using matrix factorization in Python to impute missing values.

In a nutshell:

Recommendations can be generated by a wide range of algorithms. While user-based or item-based collaborative filtering methods are simple and intuitive, matrix factorization techniques are usually more effective because they allow us to discover the latent features underlying the interactions between users and items. Of course, matrix factorization is simply a mathematical tool for playing around with matrices, and is therefore applicable in many scenarios where one would like to find out something hidden under the data.

The author uses a movie rating example, where you have users and different ratings for movies. Of course, a table like this will have many missing ratings. When you look at the table, it looks just like a matrix that’s waiting to be solved!

In a recommendation system such as Netflix or MovieLens, there is a group of users and a set of items (movies for the above two systems). Given that each user has rated some items in the system, we would like to predict how the users would rate the items that they have not yet rated, so that we can make recommendations to the users. In this case, all the information we have about the existing ratings can be represented in a matrix. Assuming we have 5 users and 10 items, and that ratings are integers ranging from 1 to 5, the matrix may look something like this (a hyphen means that the user has not yet rated the movie):

Matrix Factorization of Movie Ratings

After applying MF, you get these imputed results:

Matrix Factorization of Movie Ratings Results
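To make the idea concrete, here's a minimal sketch of what MF-based imputation looks like. This is not the linked post's exact code, just plain SGD-based factorization with L2 regularization; zeros mark missing ratings, and the small matrix below is my own illustrative example:

```python
import numpy as np

def matrix_factorization(R, k=2, steps=5000, alpha=0.002, beta=0.02):
    """Factor R (0 = missing) into P @ Q.T using SGD with L2 regularization."""
    num_users, num_items = R.shape
    rng = np.random.default_rng(42)
    P = rng.random((num_users, k))
    Q = rng.random((num_items, k))
    for _ in range(steps):
        for i in range(num_users):
            for j in range(num_items):
                if R[i, j] > 0:  # train only on observed ratings
                    e = R[i, j] - P[i, :] @ Q[j, :]
                    Pi = P[i, :].copy()
                    P[i, :] += alpha * (2 * e * Q[j, :] - beta * P[i, :])
                    Q[j, :] += alpha * (2 * e * Pi - beta * Q[j, :])
    return P, Q

# A small example ratings matrix; 0 marks a missing rating
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

P, Q = matrix_factorization(R)
R_hat = P @ Q.T  # every cell now has a value, including the former gaps
```

The low-rank product `P @ Q.T` reproduces the observed ratings closely, and the cells that were missing get filled in with predictions driven by the latent features.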

Of course, I skipped over the discussion of regularization and the Python code, but you can read about that here.

Going back to the original Reddit post, I was intrigued that this imputation method is available in H2O.ai's open source offering. It's called 'Generalized Low Rank Models' and it not only helps with dimensionality reduction BUT it also imputes missing values. I must check it out more, because I know there's a better way than just replacing missing values with the average.

The Fallacy of Twitter Bots

I’m going to be the first to admit that I use Python to send out Tweets to my followers. I have a few scripts that parse RSS feeds and do retweets on an hourly basis. They work fine but they do get ‘gamed’ occasionally. That’s the problem with automation, isn’t it? Getting gamed can cause all kinds of havoc for your brand and reputation, so you have to be careful.

Has this happened to me? Not really, but there have been a few embarrassing retweets and silly parsed advertisements in lieu of good articles.

Why bother with Twitter automation in the first place? Simple: everyone wants to be an 'influencer', myself included. Yet using automated methods to gain 'eyeballs' comes with a price. You end up sacrificing quality for quantity. You end up diluting your brand and losing the signal. In the end, you get nothing but noise!

Signal vs Noise

At one time I tested/used @randal_olson's TwitterFollowBot to increase my follower count. It worked well and my follower count started growing at a fast clip. The script's logic is pretty simple: it follows people based on a hashtag (or the followers of a Twitter handle) that you supply, and follows about 100 people per run.

The goal here is to get a 'follow back' from the people you just followed, then auto-mute them. If, after a week or so, they don't follow you back, you run another routine that unfollows them and puts them on a blacklist so they aren't auto-followed again.
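The bookkeeping behind that follow/wait/unfollow cycle is straightforward. Here's a rough sketch of the idea; the `client` object is a hypothetical stand-in for real Twitter API calls, and this is not TwitterFollowBot's actual implementation:

```python
import time

class FollowBot:
    """Sketch of follow / wait / unfollow bookkeeping.

    `client` is a hypothetical stand-in for real Twitter API calls
    (search_users, follow, mute, follows_me, unfollow)."""

    def __init__(self, client, wait_seconds=7 * 24 * 3600):
        self.client = client
        self.wait_seconds = wait_seconds  # roughly a week by default
        self.pending = {}       # handle -> time we followed them
        self.blacklist = set()  # never auto-follow these again

    def follow_by_hashtag(self, hashtag, limit=100):
        """Follow (and mute) up to `limit` users found via a hashtag."""
        for handle in self.client.search_users(hashtag, limit):
            if handle in self.blacklist or handle in self.pending:
                continue
            self.client.follow(handle)
            self.client.mute(handle)  # keep the noise down
            self.pending[handle] = time.time()

    def prune(self):
        """Unfollow and blacklist anyone who hasn't followed back in time."""
        now = time.time()
        for handle, followed_at in list(self.pending.items()):
            if now - followed_at < self.wait_seconds:
                continue  # still within the grace period
            if not self.client.follows_me(handle):
                self.client.unfollow(handle)
                self.blacklist.add(handle)
            del self.pending[handle]
```

Run `follow_by_hashtag` every few hours and `prune` after the grace period, and you get exactly the explosion of follows (and noise) described below.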

You run this script every few hours for a week and MY GAWD, does your following list explode! The noise becomes unbearable, even after muting them. You end up with cranks, conspiracy theorists, cryptocurrency shills, and bots (most likely Russian bots). Yes, you do get a lot of follow backs, but the quality signal of people you should really follow and interact with gets completely lost!

I stopped that experiment a few months ago and started unfollowing the noise. My following count is now below 1,000, but I feel that's still too much; I want to get that number down to about 500. Of course, this caused my follower count to drop too. There are a lot of Twitter users that run 'you unfollow me, so I unfollow you' scripts. LOL.

Possible solutions

Just stop it. Stop all the retweeting, TwitterBot following, and parsing. Instead, do one or more of the following:

  1. Create a curated list of great links that you filter through. I know that @maoxian has done this over the years and it’s invaluable because he puts the time and effort in to filtering out the noise.
  2. Write a Python script to parse RSS feeds but write the links to a file so you can review them later and tweet accordingly (more signal, less noise).
  3. Write a Python script to find 'true' influencers on Twitter and interact with them personally. Perhaps create a ranking system.
  4. Something else that I’ll remember after I post this article

I guess the lesson here is that we can't automate the human touch. Automation can do a lot of the heavy lifting, but in the end, it's us who bring meaning and value to everything we do.

Extract Blog Post Links from RSS feeds

As part of my goal of automation here, I wrote a small Python script to extract blog post links from RSS feeds. I did this to extract the titles and links of blog posts from a particular date range in my RSS feed. In theory it should be pretty easy, but I came to find that time was not my friend.

What tripped me up was how some Python functions handle time objects. Read on to learn more!

What it does

What this script does is first scrape my RSS feed, then use a 7-day date range to extract the posted blog titles and links, and then write them to a markdown file. Super simple, and you'll need the feedparser library installed.

The real trick here is not the loop, but timetuple(). This is where I first got tripped up.

I first created a variable for today’s date and another variable for 7 days before, like so:
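A minimal version of that setup, using only the standard library (the dates shown below reflect when the post was written):

```python
import datetime

# Today's date, and the date exactly 7 days earlier
today = datetime.date.today()
week_ago = today - datetime.timedelta(days=7)
```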

The output of today becomes this: datetime.date(2018, 9, 8)
The output of week_ago becomes this: datetime.date(2018, 9, 1)

So far so good! The idea was to use a condition like if post.date >= week_ago and post.date <= today, then extract the title and link.

So I parsed my feed and then, using the built-in time parsing features of feedparser, I wrote my logic function.

BOOM, it didn’t work. After sleuthing the problem, I found that the dates extracted by feedparser were time.struct_time objects, whereas my today and week_ago variables were datetime.date objects, so the comparison failed.

Enter timetuple() to the rescue. timetuple() converts a datetime object into a time.struct_time object, like this:
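Here's a minimal sketch of that conversion using only the standard library (exact variable names may differ from the original script):

```python
import datetime
import time

today = datetime.date.today()
week_ago = today - datetime.timedelta(days=7)

# timetuple() turns a date into a time.struct_time -- the same type
# feedparser returns for parsed dates -- so the two become comparable
today_t = today.timetuple()
week_ago_t = week_ago.timetuple()
```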

After that, it was straightforward to do the loop and write out the results, see below.

Python Script
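A reconstruction along the lines described above (the feed URL and output file name are placeholders; feedparser is the only third-party dependency):

```python
import datetime

def in_range(published, start, end):
    """True if a feedparser time.struct_time falls between two dates.
    Comparing only (year, month, day) sidesteps time-of-day issues."""
    return start.timetuple()[:3] <= published[:3] <= end.timetuple()[:3]

def recent_posts(entries, start, end):
    """Extract (title, link) pairs for entries published within the range."""
    return [(e.title, e.link) for e in entries
            if in_range(e.published_parsed, start, end)]

def main():
    import feedparser  # third-party: pip install feedparser

    feed = feedparser.parse("https://example.com/feed.xml")  # placeholder URL
    today = datetime.date.today()
    week_ago = today - datetime.timedelta(days=7)

    # Write the matching posts out as a markdown list for later review
    with open("recent_posts.md", "w") as f:
        for title, link in recent_posts(feed.entries, week_ago, today):
            f.write(f"* [{title}]({link})\n")

if __name__ == "__main__":
    main()
```

The key move is the one discussed above: both sides of the comparison end up as time.struct_time values (feedparser's published_parsed on one side, timetuple() on the other), so the date-range filter just works.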