Youtube
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the youtube dataset. We start by reading the data docs, which give us some preliminary information about the dataset and its features.
The data is divided into several files, one per country. For this analysis we consider the data from the US only, because:
- The smells are unlikely to differ much in the data from other countries, and
- The language is not English for most of the other countries.
2.1. Preliminary analysis
For the purpose of this analysis, the USvideos.csv file was renamed to youtube.csv.
youtube = pd.read_csv(utils.data_path('youtube.csv'))
youtube.head()
   video_id     trending_date  title
0  2kyS6SvSYSE  17.14.11       WE WANT TO TALK ABOUT OUR MARRIAGE
1  1ZAPwfrtAFY  17.14.11       The Trump Presidency: Last Week Tonight with John Oliver (HBO)
2  5qpjK5DgCt4  17.14.11       Racist Superman | Rudy Mancuso, King Bach & Lele Pons
3  puqaWrEC7tY  17.14.11       Nickelback Lyrics: Real or Fake?
4  d380meD0W0M  17.14.11       I Dare You: GOING BALD!?

   channel_title          category_id  publish_time
0  CaseyNeistat           22           2017-11-13T17:13:01.000Z
1  LastWeekTonight        24           2017-11-13T07:30:00.000Z
2  Rudy Mancuso           23           2017-11-12T19:05:24.000Z
3  Good Mythical Morning  24           2017-11-13T11:00:04.000Z
4  nigahiga               24           2017-11-12T18:01:41.000Z

   tags
0  SHANtell martin
1  last week tonight trump presidency|"last week tonight donald trump"|...
2  racist superman|"rudy"|"mancuso"|"king"|"bach"|...
3  rhett and link|"gmm"|"good mythical morning"|...
4  ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|...

   views    likes   dislikes  comment_count
0  748374   57527   2966      15954
1  2418783  97185   6146      12703
2  3191434  146033  5339      8181
3  343168   10172   666       2146
4  2095731  132235  1989      17518

   thumbnail_link                                  comments_disabled
0  https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg  False
1  https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg  False
2  https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg  False
3  https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg  False
4  https://i.ytimg.com/vi/d380meD0W0M/default.jpg  False

   ratings_disabled  video_error_or_removed
0  False             False
1  False             False
2  False             False
3  False             False
4  False             False

   description
0  SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\n...
1  One year after the presidential election, John Oliver discusses what we've learned so far and ...
2  WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confirmation=1\n\n...
3  Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\n...
4  I know it's been a while since we did this show, but we're back with what might be the best episode yet!\n...
youtube.shape
(40949, 16)
youtube.dtypes
video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object
2.1.1. Handling text features
title, channel_title, tags & description are text features. Let's investigate them further.
text_features = ['title', 'channel_title', 'tags', 'description']
text = youtube[text_features]
text
       title
0      WE WANT TO TALK ABOUT OUR MARRIAGE
1      The Trump Presidency: Last Week Tonight with John Oliver (HBO)
2      Racist Superman | Rudy Mancuso, King Bach & Lele Pons
3      Nickelback Lyrics: Real or Fake?
4      I Dare You: GOING BALD!?
...    ...
40944  The Cat Who Caught the Laser
40945  True Facts : Ant Mutualism
40946  I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER BASED ON HER FEATURES: BTS! |bradmondo
40947  How Black Panther Should Have Ended
40948  Official Call of Duty®: Black Ops 4 — Multiplayer Reveal Trailer

       channel_title
0      CaseyNeistat
1      LastWeekTonight
2      Rudy Mancuso
3      Good Mythical Morning
4      nigahiga
...    ...
40944  AaronsAnimals
40945  zefrank1
40946  Brad Mondo
40947  How It Should Have Ended
40948  Call of Duty

       tags
0      SHANtell martin
1      last week tonight trump presidency|"last week tonight donald trump"|...
2      racist superman|"rudy"|"mancuso"|"king"|"bach"|...
3      rhett and link|"gmm"|"good mythical morning"|...
4      ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|...
...    ...
40944  aarons animals|"aarons"|"animals"|"cat"|"cats"|...
40945  [none]
40946  I gave safiya nygaard a perfect hair makeover based on her features: bts|"brad mondo"|...
40947  Black Panther|"HISHE"|"Marvel"|"Infinity War"|...
40948  call of duty|"cod"|"activision"|"Black Ops 4"

       description
0      SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\n...
1      One year after the presidential election, John Oliver discusses what we've learned so far and ...
2      WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confirmation=1\n\n...
3      Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\n...
4      I know it's been a while since we did this show, but we're back with what might be the best episode yet!\n...
...    ...
40944  The Cat Who Caught the Laser - Aaron's Animals
40945  NaN
40946  I had so much fun transforming Safiyas hair in this video! She was serving major lewks!...
40947  How Black Panther Should Have EndedWatch More HISHEs: https://bit.ly/HISHEPlaylist...
40948  Call of Duty: Black Ops 4 Multiplayer raises the bar for the famed multiplayer mode ...

[40949 rows x 4 columns]
It's possible to extract some numerical features from these columns (especially from tags); however, for this analysis we don't consider them.
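To illustrate the kind of numerical feature that could be derived from tags, here is a minimal sketch that counts the tags per video. It assumes that tags are separated by '|' and that the literal '[none]' marks a video without tags, as in the rows shown above. The three-row frame, the count_tags helper and the tag_count column are hypothetical and not part of the analysis.

```python
import pandas as pd

# Hypothetical sample mimicking the tags column ('|'-separated, '[none]' when empty).
sample = pd.DataFrame({
    'tags': [
        'SHANtell martin',
        'call of duty|"cod"|"activision"|"Black Ops 4"',
        '[none]',
    ]
})


def count_tags(tags: str) -> int:
    """Return the number of '|'-separated tags, treating '[none]' as zero."""
    if tags == '[none]':
        return 0
    return len(tags.split('|'))


# Derived numerical feature (hypothetical column name).
sample['tag_count'] = sample['tags'].apply(count_tags)
print(sample['tag_count'].tolist())
```

A plain split would overcount tags that themselves contain a '|', such as "Getting My Driver's License | Lele Pons" in row 2, so a real implementation would need a quote-aware parser.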
Another observation is the presence of multiple languages (observed while browsing the data on Kaggle). This problem is only amplified when we consider data from all other countries.
youtube = youtube.drop(text_features, axis='columns')
youtube.shape
(40949, 12)
2.1.2. Handling redundant columns
thumbnail_link is redundant (in the rows shown above it is built from video_id), so we can drop it.
youtube = youtube.drop('thumbnail_link', axis='columns')
youtube.shape
(40949, 11)
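One way to check that thumbnail_link carries no extra information is to rebuild it from video_id and compare, which would have to run before the column is dropped. The sketch below assumes the URL pattern seen in the first rows of the data; whether that pattern holds for the full file is not verified here, and the two-row frame is a hypothetical stand-in for the real one.

```python
import pandas as pd

# Hypothetical two-row sample copied from the first rows of the dataset.
sample = pd.DataFrame({
    'video_id': ['2kyS6SvSYSE', '1ZAPwfrtAFY'],
    'thumbnail_link': [
        'https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg',
        'https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg',
    ],
})

# Rebuild the link from video_id (assumed pattern) and compare with the stored value.
rebuilt = 'https://i.ytimg.com/vi/' + sample['video_id'] + '/default.jpg'
is_derivable = (rebuilt == sample['thumbnail_link']).all()
print(is_derivable)
```

If the comparison fails for some rows of the real data, those rows would be worth inspecting before dropping the column.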
The video_id column is redundant if we decide to fit an ML model on the dataset (we don't want the model to learn the unique identifier of a video, but rather general attributes from each example). Since there is no ML target for this dataset, we leave it in place.
2.1.3. Handling datetime features
trending_date & publish_time contain datetime info and should be converted to the datetime dtype.
datetime_features = ['trending_date', 'publish_time']
datetime = youtube[datetime_features]
datetime
       trending_date  publish_time
0      17.14.11       2017-11-13T17:13:01.000Z
1      17.14.11       2017-11-13T07:30:00.000Z
2      17.14.11       2017-11-12T19:05:24.000Z
3      17.14.11       2017-11-13T11:00:04.000Z
4      17.14.11       2017-11-12T18:01:41.000Z
...    ...            ...
40944  18.14.06       2018-05-18T13:00:04.000Z
40945  18.14.06       2018-05-18T01:00:06.000Z
40946  18.14.06       2018-05-18T17:34:22.000Z
40947  18.14.06       2018-05-17T17:00:04.000Z
40948  18.14.06       2018-05-17T17:09:38.000Z

[40949 rows x 2 columns]
Both columns need some wrangling before they can be converted. trending_date uses a custom format; comparing it with publish_time suggests yy.dd.mm. The format of publish_time looks odd at first, but it matches ISO 8601: the 'T' separates the date from the time and the 'Z' denotes UTC.
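A minimal sketch of the conversion, assuming that trending_date is laid out as yy.dd.mm and that publish_time is an ISO 8601 timestamp in UTC. The two-row frame reuses values from the table above and stands in for the real data.

```python
import pandas as pd

# Hypothetical sample using values from the table above.
sample = pd.DataFrame({
    'trending_date': ['17.14.11', '18.14.06'],
    'publish_time': ['2017-11-13T17:13:01.000Z', '2018-05-18T13:00:04.000Z'],
})

# trending_date: two-digit year, then day, then month (assumed yy.dd.mm).
sample['trending_date'] = pd.to_datetime(sample['trending_date'], format='%y.%d.%m')

# publish_time: ISO 8601 string; utc=True keeps the result timezone-aware in UTC.
sample['publish_time'] = pd.to_datetime(sample['publish_time'], utc=True)

print(sample.dtypes)
```

Passing an explicit format for trending_date avoids relying on pandas guessing the day and month order.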
2.1.4. Handling categorical features
comments_disabled, ratings_disabled & video_error_or_removed are bool, but we can extract numerical features from them.
bool_features = ['comments_disabled', 'ratings_disabled', 'video_error_or_removed']
bools = youtube[bool_features]
bools
       comments_disabled  ratings_disabled  video_error_or_removed
0      False              False             False
1      False              False             False
2      False              False             False
3      False              False             False
4      False              False             False
...    ...                ...               ...
40944  False              False             False
40945  False              False             False
40946  False              False             False
40947  False              False             False
40948  False              False             False

[40949 rows x 3 columns]
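The simplest numerical encoding of these flags is a cast to 0/1. The sketch below shows this on a hypothetical three-row frame; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset.
sample = pd.DataFrame({
    'comments_disabled': [False, True, False],
    'ratings_disabled': [False, False, True],
    'video_error_or_removed': [False, False, False],
})

bool_features = ['comments_disabled', 'ratings_disabled', 'video_error_or_removed']

# Cast each flag to 0/1 so it can be used as a numerical feature.
encoded = sample[bool_features].astype(int)

# Column sums then give the number of rows where each flag is set.
print(encoded.sum().to_dict())
```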
My intuition tells me that category_id is also categorical; let's investigate.
youtube['category_id'].value_counts()
24    9964
10    6472
26    4146
23    3457
22    3210
25    2487
28    2401
1     2345
17    2174
27    1656
15     920
20     817
19     402
2      384
29      57
43      57
Name: category_id, dtype: int64
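The counts show only 16 distinct integer ids, which supports treating category_id as a categorical feature. A minimal sketch of one way to record that in the dtype, on a hypothetical six-row sample:

```python
import pandas as pd

# Hypothetical sample; the ids are taken from the counts above.
sample = pd.DataFrame({'category_id': [24, 10, 24, 26, 10, 24]})

# Store the ids as an unordered categorical so they are not treated as magnitudes.
sample['category_id'] = sample['category_id'].astype('category')

print(sample['category_id'].cat.categories.tolist())
print(sample['category_id'].value_counts().to_dict())
```

With this dtype, describe() reports the count, number of unique values, top value and its frequency for the column, not a mean or standard deviation that has no meaning for an id.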
2.1.5. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
youtube.describe(include='all')
        video_id     trending_date  category_id   publish_time
count   40949        40949          40949.000000  40949
unique  6351         205            NaN           6269
top     j4KvrAUjn6c  17.14.11       NaN           2018-05-18T14:00:04.000Z
freq    30           200            NaN           50
mean    NaN          NaN            19.972429     NaN
std     NaN          NaN            7.568327      NaN
min     NaN          NaN            1.000000      NaN
25%     NaN          NaN            17.000000     NaN
50%     NaN          NaN            24.000000     NaN
75%     NaN          NaN            25.000000     NaN
max     NaN          NaN            43.000000     NaN

        views         likes         dislikes      comment_count
count   4.094900e+04  4.094900e+04  4.094900e+04  4.094900e+04
unique  NaN           NaN           NaN           NaN
top     NaN           NaN           NaN           NaN
freq    NaN           NaN           NaN           NaN
mean    2.360785e+06  7.426670e+04  3.711401e+03  8.446804e+03
std     7.394114e+06  2.288853e+05  2.902971e+04  3.743049e+04
min     5.490000e+02  0.000000e+00  0.000000e+00  0.000000e+00
25%     2.423290e+05  5.424000e+03  2.020000e+02  6.140000e+02
50%     6.818610e+05  1.809100e+04  6.310000e+02  1.856000e+03
75%     1.823157e+06  5.541700e+04  1.938000e+03  5.755000e+03
max     2.252119e+08  5.613827e+06  1.674420e+06  1.361580e+06

        comments_disabled  ratings_disabled  video_error_or_removed
count   40949              40949             40949
unique  2                  2                 2
top     False              False             False
freq    40316              40780             40926
mean    NaN                NaN               NaN
std     NaN                NaN               NaN
min     NaN                NaN               NaN
25%     NaN                NaN               NaN
50%     NaN                NaN               NaN
75%     NaN                NaN               NaN
max     NaN                NaN               NaN
Let's look for missing values next.
youtube.isna().any()
video_id                  False
trending_date             False
category_id               False
publish_time              False
views                     False
likes                     False
dislikes                  False
comment_count             False
comments_disabled         False
ratings_disabled          False
video_error_or_removed    False
dtype: bool
And check for duplicates.
youtube[youtube.duplicated(keep=False)]
       video_id     trending_date  category_id  publish_time
34750  QBL8IRJ5yHU  18.15.05       26           2018-05-14T19:00:01.000Z
34751  t4pRQ0jn23Q  18.15.05       24           2018-05-14T14:00:03.000Z
34752  j4KvrAUjn6c  18.15.05       24           2018-05-13T18:03:56.000Z
34753  MAjY8mCTXWk  18.15.05       10           2018-05-14T15:59:47.000Z
34754  xhs8tf1v__w  18.15.05       24           2018-05-14T16:00:29.000Z
...    ...          ...            ...          ...
34944  iILJvqrAQ_w  18.15.05       10           2018-05-11T04:00:34.000Z
34945  zcEE8J2Bqa8  18.15.05       23           2018-05-11T18:27:01.000Z
34946  q1jzwV_s8_Y  18.15.05       10           2018-05-11T07:00:01.000Z
34947  mkz1zoo15zI  18.15.05       17           2018-05-11T19:21:53.000Z
34948  2PH7dK6SLC8  18.15.05       10           2018-05-10T17:00:01.000Z

       views    likes   dislikes  comment_count  comments_disabled
34750  1469627  188652  3124      33032          False
34751  1199587  49709   2380      7261           False
34752  3906727  77378   12160     15874          False
34753  916128   40485   1042      4746           False
34754  343967   16988   132       1308           False
...    ...      ...     ...       ...            ...
34944  2124177  81085   1321      4019           False
34945  165617   20572   140       1407           False
34946  1869585  64523   1891      5903           False
34947  472999   3505    163       1511           False
34948  1201548  51670   964       4264           False

       ratings_disabled  video_error_or_removed
34750  False             False
34751  False             False
34752  False             False
34753  False             False
34754  False             False
...    ...               ...
34944  False             False
34945  False             False
34946  False             False
34947  False             False
34948  False             False

[96 rows x 11 columns]
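The same video may trend on several days (video_id has only 6351 unique values), but such rows differ in trending_date and are not flagged here. duplicated() marks only rows that are identical in every column, so the 96 rows above are exact repeats; the ones shown all share the trending_date 18.15.05.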
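The difference between exact duplicate rows and repeated videos can be made explicit with the subset argument of duplicated(). The sketch below uses a hypothetical four-row frame in which video 'a' trends on two days and one of its rows is repeated verbatim.

```python
import pandas as pd

# Hypothetical sample: video 'a' trends on two days, one row is an exact repeat.
sample = pd.DataFrame({
    'video_id': ['a', 'a', 'a', 'b'],
    'trending_date': ['18.14.05', '18.15.05', '18.15.05', '18.15.05'],
    'views': [100, 150, 150, 70],
})

# Rows identical in every column (what duplicated() checks by default).
exact = sample.duplicated(keep=False)

# Rows that merely share a video_id, e.g. a video trending on several days.
same_video = sample.duplicated(subset='video_id', keep=False)

print(exact.tolist())
print(same_video.tolist())

# Keep the first copy of each exact repeat and drop the rest.
deduplicated = sample.drop_duplicates()
```

Whether the exact repeats should be dropped depends on the use case; drop_duplicates() is shown only as one option.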
2.1.6. Correlations
Let's check for correlations next.
name = 'heatmap@youtube--corr.png'
corr = youtube.corr()
plotter.corr(corr, name)
name
Some numerical features are positively correlated to one another.
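To see which pairs drive that observation without reading them off the heatmap, the correlation matrix can be flattened into a ranked list. This is a minimal sketch on made-up numbers (only the column names come from the dataset), so it says nothing about the actual correlation values.

```python
import numpy as np
import pandas as pd

# Hypothetical numbers; only the column names come from the dataset.
sample = pd.DataFrame({
    'video_id': ['a', 'b', 'c', 'd', 'e'],
    'views': [100, 200, 300, 400, 500],
    'likes': [10, 22, 29, 41, 50],
    'dislikes': [5, 1, 4, 2, 3],
})

# Restrict to numeric columns first; corr() is only meaningful for those.
corr = sample.select_dtypes(include='number').corr()

# Keep the upper triangle (each pair once), then rank pairs by correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Selecting the numeric columns explicitly also sidesteps differences between pandas versions in how corr() treats non-numeric columns such as video_id.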