Data Cleaning with IMDb Ratings

The idea for this came after I recently watched Mr. Robot. I thought it was one of the best shows I’ve seen in a long time, but when I looked at its IMDb rating it was only an 8.5. That’s not bad, obviously, but still lower than I would have expected. I’ll start by saying that I generally like IMDb ratings and use them when deciding what to watch. IMDb usually doesn’t let me down when it comes to highly rated shows. But every once in a while there is (what I would consider) a great show that doesn’t get the level of respect on IMDb that I think it should. I wanted to figure out why that is, so I started with shows I thought were rated a little too high or too low.

Analyzing Data Without Programming

I started the simple way, the way I would have a few months ago when I didn’t know how to program at all: by looking at each show’s overall rating and its episode ratings on IMDb. I saw a slight theme. Shows that started off slowly and ramped up in excitement were rated lower, while shows that started off strong and got worse over time kept a very high score. To judge this I looked at the episode ratings over time: a show that ramped up was one where the episode ratings got consistently better, and vice versa for the shows that slumped. This made some sense to me. If a show starts slowly, some people will likely stop watching and give it a lower rating without finishing it. And if a show is great early on, people might rate it before it ends, since new seasons take a year or more to come out in most cases. But I could only see so much with just a few shows, so it was time to use the programming I’ve learned to do this for a larger data set. My goal is to see whether the overall rating correlates more strongly with the first-season episode ratings than with the average episode rating of the whole show.

Plotting Data and Generating Correlations

I was able to find 2 separate CSV files based on the top 250 shows: one with each show’s overall rating and average episode rating, and another with the average episode rating for each individual season. I started by loading both CSV files into pandas dataframes.
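
As a rough sketch of that loading step: the real file names and most column names are not shown in this post, so the inline data, `Code`, `Title`, `Rating`, `Rating Count`, and `Season` below are assumptions for illustration. Only `Rating Mean` comes from the data described here.

```python
import io

import pandas as pd

# Tiny inline stand-ins for the two CSV files (hypothetical contents).
shows_csv = io.StringIO(
    "Code,Title,Rating,Rating Count,Rating Mean\n"
    "tt0000001,Show A,8.5,120000,8.21\n"
    "tt0000002,Show B,9.0,95000,8.87\n"
)
seasons_csv = io.StringIO(
    "Code,Season,Rating Mean\n"
    "tt0000001,1,7.9\n"
    "tt0000001,2,8.5\n"
    "tt0000002,1,9.1\n"
)

# With real files, pass a path instead of a buffer,
# e.g. pd.read_csv("shows.csv") (file name is an assumption).
shows = pd.read_csv(shows_csv)
seasons = pd.read_csv(seasons_csv)

print(shows.shape, seasons.shape)
```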

From there I had to take the dataframe with the per-season episode ratings and clean it up a bit. First I filtered it down to the 1st season only, since that was what I was interested in. Then I renamed the Rating Mean column to Season 1 Rating so the two would be distinguishable when I merged the dataframes later on, since the first set of data also has a Rating Mean column.
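
A minimal sketch of the filter and rename, assuming the season number lives in a column called `Season` and the show code in `Code` (both names are guesses; `Rating Mean` and `Season 1 Rating` are the names used above):

```python
import pandas as pd

# Hypothetical sample of the per-season data.
seasons = pd.DataFrame({
    "Code": ["tt0000001", "tt0000001", "tt0000002"],
    "Season": [1, 2, 1],
    "Rating Mean": [7.9, 8.5, 9.1],
})

season1 = (
    seasons[seasons["Season"] == 1]                      # keep first seasons only
    .rename(columns={"Rating Mean": "Season 1 Rating"})  # avoid a name clash later
    .reset_index(drop=True)
)

print(season1)
```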

Next I merged the 2 dataframes to see everything together. I was able to do this because every show has its own unique code, so I could merge the data on those codes. I originally merged by name but ran into problems, since in some cases multiple shows share the same name.
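
Here is roughly what a merge on the show code could look like. The column names and sample rows are assumptions, and `validate="one_to_one"` is an extra safety check I'm adding for illustration: it makes pandas raise an error if a code appears more than once on either side.

```python
import pandas as pd

# Hypothetical data: two different shows share the title "Show A".
shows = pd.DataFrame({
    "Code": ["tt0000001", "tt0000002", "tt0000003"],
    "Title": ["Show A", "Show B", "Show A"],
    "Rating": [8.5, 9.0, 8.2],
    "Rating Mean": [8.21, 8.87, 7.95],
})
season1 = pd.DataFrame({
    "Code": ["tt0000001", "tt0000002", "tt0000003"],
    "Season 1 Rating": [7.9, 9.1, 8.0],
})

# Merging on the unique code keeps exactly one row per show,
# even though the titles are not unique.
merged = shows.merge(season1, on="Code", how="inner", validate="one_to_one")

print(merged)
```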

Now I had to clean up the newly merged dataframe, since it held a ton of information I was not interested in. I dropped the unnecessary columns and renamed the ones I was keeping, which made the data easier to read through and visualize. I also rounded the episode ratings to 2 decimals to keep them similar to the overall show rating.
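
A sketch of that cleanup, under the assumption that the merged frame has columns like the ones below. Which columns were actually dropped and what they were renamed to is not shown here, so the choices are illustrative only.

```python
import pandas as pd

# Hypothetical merged frame.
merged = pd.DataFrame({
    "Code": ["tt0000001", "tt0000002"],
    "Title": ["Show A", "Show B"],
    "Rating": [8.5, 9.0],
    "Rating Count": [120000, 95000],
    "Rating Mean": [8.2143, 8.8667],
    "Season": [1, 1],
    "Season 1 Rating": [7.9051, 9.1449],
})

clean = (
    merged.drop(columns=["Code", "Season"])  # columns not needed for the analysis
    .rename(columns={
        "Rating": "Show Rating",
        "Rating Count": "Votes",
        "Rating Mean": "Episode Avg",
        "Season 1 Rating": "Season 1 Avg",
    })
    .round({"Episode Avg": 2, "Season 1 Avg": 2})  # round only the averages
)

print(clean)
```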

Now I was ready to go. I computed the correlations to see if my initial idea was right.
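
One way to get those two numbers is `Series.corr`, which computes the Pearson correlation by default. Treat that choice as an assumption, since the exact method isn't stated here; the sample values and column names are made up too.

```python
import pandas as pd

# Hypothetical cleaned data.
clean = pd.DataFrame({
    "Show Rating": [8.5, 9.0, 8.2, 8.8, 8.4],
    "Episode Avg": [8.2, 8.9, 7.6, 8.1, 8.3],
    "Season 1 Avg": [7.9, 9.1, 7.8, 7.5, 8.4],
})

# Pearson correlation of each average against the overall show rating.
corr_season1 = clean["Show Rating"].corr(clean["Season 1 Avg"])
corr_episodes = clean["Show Rating"].corr(clean["Episode Avg"])

print(f"Season 1 vs overall:    {corr_season1:.3f}")
print(f"All episodes vs overall: {corr_episodes:.3f}")
```

The idea being tested holds if the season 1 correlation comes out higher than the all-episodes one.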

After seeing the correlations, I saw that I was technically right. But surprisingly, both the season 1 ratings and the overall episode ratings were barely correlated with the overall rating of the show. With 250 rows of data I couldn’t just look through the dataframe to figure out why, so I decided to generate scatter plots for the 2 correlations to see what the data looked like.
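
A sketch of two side-by-side scatter plots with matplotlib. The plotting library, the column names, and the sample values are all assumptions on my part.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned data.
clean = pd.DataFrame({
    "Show Rating": [8.5, 9.0, 8.2, 8.8, 8.4],
    "Episode Avg": [8.2, 8.9, 7.6, 6.1, 8.3],
    "Season 1 Avg": [7.9, 9.1, 7.8, 5.5, 8.4],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, column in zip(axes, ["Season 1 Avg", "Episode Avg"]):
    ax.scatter(clean[column], clean["Show Rating"])
    ax.set_xlabel(column)
    ax.set_title(f"Show Rating vs {column}")
axes[0].set_ylabel("Show Rating")
fig.tight_layout()

# In a notebook the figure renders inline; in a script you could call
# fig.savefig("scatter.png") to write it to disk.
```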

The scatter plots are all over the place, but one thing I noticed is that both the season 1 ratings and the overall episode ratings have some low outliers compared to the show’s rating. In some cases a show rated over 8.5 would have episode ratings in the 5’s or 6’s. One idea I had to check whether this was normal was to look at the rating count. The reasoning is that the variance should decrease as the total number of votes for a show increases: the more votes there are, the less outliers affect the overall result. So I computed the correlations only for shows with at least 80,000 votes. I based that threshold on the average rating count, with a slight buffer to account for extreme outliers like Game of Thrones.
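
The vote-count filter could look something like this. The 80,000 threshold is the one mentioned above, while the `Votes` column name and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned data with a vote-count column.
clean = pd.DataFrame({
    "Show Rating": [8.5, 9.0, 8.2, 8.8, 8.4, 8.6],
    "Votes": [120000, 950000, 15000, 30000, 210000, 85000],
    "Season 1 Avg": [8.1, 9.1, 5.8, 6.2, 8.0, 8.3],
})

MIN_VOTES = 80_000

# Keep only shows with enough votes, then recompute the correlation.
popular = clean[clean["Votes"] >= MIN_VOTES]

corr_all = clean["Show Rating"].corr(clean["Season 1 Avg"])
corr_popular = popular["Show Rating"].corr(popular["Season 1 Avg"])

print(f"All shows:       {corr_all:.3f} ({len(clean)} rows)")
print(f"80,000+ votes:   {corr_popular:.3f} ({len(popular)} rows)")
```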

After taking out the shows with a lower vote count, the correlation with the overall rating does in fact double, so that was a good start toward getting more information out of the data.

More Analysis Options

After looking at the data as a whole, I saw that my theory, while technically right for my data, was still more inconclusive than anything. There are a few other things that might have an effect and could be analyzed. One is comparing the data against the year the show came out. Another is how the seasons are grouped for some shows, especially animated shows, which very often have all their episodes grouped into 1 season; for those shows the first-season average and the all-episode average would be the same. Another is that older shows generally have many more total votes for the overall rating than votes on each episode, which results in a really high variance for the individual episodes that can throw things off. I could also try removing the excessive outliers: set an outlier limit (maybe 2 or 3 standard deviations) in either direction, assume anything beyond it is just that, an outlier, for whatever reason, and see if the correlation is stronger afterwards. The last and maybe most obvious option is to get a larger sample of data. While 250 is a good amount, there are thousands of shows out there that we could compute data for, and as I stated before, the larger the sample size, the less chance outliers have to massively alter the data and the less variance the data will have.
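
As a sketch of the outlier idea, here is one way to drop rows whose season 1 average sits more than 2 standard deviations from the mean. I haven't run this on the real data; the column names, sample values, and the choice to score the season 1 column directly are all assumptions.

```python
import pandas as pd

# Hypothetical cleaned data with one obviously low season 1 average.
clean = pd.DataFrame({
    "Show Rating": [8.5, 9.0, 8.2, 8.8, 8.4, 8.6, 8.3, 8.7],
    "Season 1 Avg": [8.1, 8.9, 7.9, 8.4, 8.2, 5.2, 8.0, 8.5],
})

Z_LIMIT = 2.0  # could also be 3.0 for a looser cutoff

# z-score: how many standard deviations each value is from the column mean.
col = clean["Season 1 Avg"]
z_scores = (col - col.mean()) / col.std()

# Keep rows within the limit in either direction.
trimmed = clean[z_scores.abs() <= Z_LIMIT]

print(f"Dropped {len(clean) - len(trimmed)} outlier(s)")
print(trimmed["Show Rating"].corr(trimmed["Season 1 Avg"]))
```

With a small sample a single extreme value inflates the standard deviation, so a limit of 3 may never trigger; on 250 rows either limit is workable.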

So even though there was only a small correlation, which surprised me, there are many ways to expand the data further and see whether a stronger correlation shows up. Or maybe more data will just confirm what I found in this smaller set. But in the end, the biggest piece of advice I can give is not to quit on a show that has a great premise just because of a slower opening season.
