Comic
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the comic-* datasets. We start by reading
the data docs, which provide some information on the data collection
process and the features within the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
marvel = pd.read_csv(utils.data_path('comic-marvel.csv'))
marvel.head()
   page_id                               name  \
0     1678          Spider-Man (Peter Parker)
1     7139    Captain America (Steven Rogers)
2    64786  Wolverine (James "Logan" Howlett)
3     1868    Iron Man (Anthony "Tony" Stark)
4     2460                Thor (Thor Odinson)

                                  urlslug                ID  \
0              /Spider-Man_(Peter_Parker)   Secret Identity
1        /Captain_America_(Steven_Rogers)   Public Identity
2  /Wolverine_(James_%22Logan%22_Howlett)   Public Identity
3    /Iron_Man_(Anthony_%22Tony%22_Stark)   Public Identity
4                    /Thor_(Thor_Odinson)  No Dual Identity

                ALIGN         EYE        HAIR              SEX  GSM  \
0     Good Characters  Hazel Eyes  Brown Hair  Male Characters  NaN
1     Good Characters   Blue Eyes  White Hair  Male Characters  NaN
2  Neutral Characters   Blue Eyes  Black Hair  Male Characters  NaN
3     Good Characters   Blue Eyes  Black Hair  Male Characters  NaN
4     Good Characters   Blue Eyes  Blond Hair  Male Characters  NaN

               ALIVE  APPEARANCES FIRST APPEARANCE    Year
0  Living Characters       4043.0           Aug-62  1962.0
1  Living Characters       3360.0           Mar-41  1941.0
2  Living Characters       3061.0           Oct-74  1974.0
3  Living Characters       2961.0           Mar-63  1963.0
4  Living Characters       2258.0           Nov-50  1950.0
dc = pd.read_csv(utils.data_path('comic-dc.csv'))
dc.head()
   page_id                         name                                urlslug  \
0     1422         Batman (Bruce Wayne)             /wiki/Batman_(Bruce_Wayne)
1    23387        Superman (Clark Kent)            /wiki/Superman_(Clark_Kent)
2     1458   Green Lantern (Hal Jordan)       /wiki/Green_Lantern_(Hal_Jordan)
3     1659     James Gordon (New Earth)         /wiki/James_Gordon_(New_Earth)
4     1576  Richard Grayson (New Earth)      /wiki/Richard_Grayson_(New_Earth)

                ID            ALIGN         EYE        HAIR              SEX  \
0  Secret Identity  Good Characters   Blue Eyes  Black Hair  Male Characters
1  Secret Identity  Good Characters   Blue Eyes  Black Hair  Male Characters
2  Secret Identity  Good Characters  Brown Eyes  Brown Hair  Male Characters
3  Public Identity  Good Characters  Brown Eyes  White Hair  Male Characters
4  Secret Identity  Good Characters   Blue Eyes  Black Hair  Male Characters

   GSM              ALIVE  APPEARANCES FIRST APPEARANCE    YEAR
0  NaN  Living Characters       3093.0        1939, May  1939.0
1  NaN  Living Characters       2496.0    1986, October  1986.0
2  NaN  Living Characters       1565.0    1959, October  1959.0
3  NaN  Living Characters       1316.0   1987, February  1987.0
4  NaN  Living Characters       1237.0      1940, April  1940.0
marvel.shape
16376 | 13 |
dc.shape
6896 | 13 |
marvel.columns
Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', 'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'Year'], dtype='object')
dc.columns
Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', 'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'YEAR'], dtype='object')
Both datasets contain the same number of features. However, the column
names must match before we concatenate the datasets (note Year in
marvel versus YEAR in dc), so we lower-case them. We also add a
universe column before concatenating so that we can identify which
dataset each character came from.
marvel['universe'] = 'marvel'
marvel.columns = marvel.columns.str.lower()
dc['universe'] = 'dc'
dc.columns = dc.columns.str.lower()
comic = pd.concat([marvel, dc], ignore_index=True)
comic.shape
23272 | 14 |
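The column-name mismatch described above can also be checked programmatically rather than by eye. The following is a minimal sketch on toy frames (marvel_toy and dc_toy are hypothetical stand-ins with made-up rows, not the real datasets); it compares the column sets before and after lower-casing.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values, not real rows).
marvel_toy = pd.DataFrame({"name": ["A"], "APPEARANCES": [10.0], "Year": [1962.0]})
dc_toy = pd.DataFrame({"name": ["B"], "APPEARANCES": [5.0], "YEAR": [1939.0]})

# Columns present in one frame but not in the other, before normalising.
mismatch_before = set(marvel_toy.columns) ^ set(dc_toy.columns)

# Lower-casing the names removes the Year/YEAR discrepancy.
marvel_toy.columns = marvel_toy.columns.str.lower()
dc_toy.columns = dc_toy.columns.str.lower()
mismatch_after = set(marvel_toy.columns) ^ set(dc_toy.columns)

# With matching names, concat stacks the rows without creating extra columns.
combined = pd.concat([marvel_toy, dc_toy], ignore_index=True)
print(mismatch_before, mismatch_after, list(combined.columns))
```

If the names were left as they are, pd.concat would keep Year and YEAR as two separate, half-empty columns, which is why the check is worth doing first.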
comic.dtypes
page_id               int64
name                 object
urlslug              object
id                   object
align                object
eye                  object
hair                 object
sex                  object
gsm                  object
alive                object
appearances         float64
first appearance     object
year                float64
universe             object
dtype: object
We have a mix of text and numerical features. Let's examine them closely.
2.1.1. Handling redundant columns
urlslug does not add any new information and can be dropped.
comic = comic.drop('urlslug', axis='columns')
comic.shape
23272 | 13 |
page_id provides a unique id for the characters within the wiki
(website) from which the data was scraped. Let's investigate this
feature a bit further.
page_id = comic['page_id']
page_id[page_id.duplicated(keep=False)]
3        1868
7        1833
9        1837
11       1863
13       2614
         ...
23253    1485
23257    1624
23259    1581
23260    1473
23261    1460
Name: page_id, Length: 952, dtype: int64
It looks like we have many duplicates. Let's look at one duplicated example.
comic[page_id.eq(1868)]
       page_id                             name               id  \
3         1868  Iron Man (Anthony "Tony" Stark)  Public Identity
16491     1868       James Corrigan (New Earth)  Secret Identity

                 align        eye        hair              sex  gsm  \
3      Good Characters  Blue Eyes  Black Hair  Male Characters  NaN
16491  Good Characters  Blue Eyes    Red Hair  Male Characters  NaN

                     alive  appearances first appearance    year universe
3        Living Characters       2961.0           Mar-63  1963.0   marvel
16491  Deceased Characters        214.0   1940, February  1940.0       dc
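Several characters can appear on the same page, which makes sense.
However, the example above pairs a marvel character with a dc
character, and these two most likely do not share a page: the data was
presumably scraped from a separate wiki for each universe, so the same
page_id can refer to different pages. I don't think this feature adds
any valuable information, so we drop it as well.
comic = comic.drop('page_id', axis='columns')
comic.shape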
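One way to test this reading is to check whether page_id is still duplicated once the universe is taken into account. The sketch below uses a small toy frame (illustrative rows only; the real check would run on comic before page_id is dropped).

```python
import pandas as pd

# Toy frame: the id 1868 occurs once in each universe.
toy = pd.DataFrame({
    "page_id": [1868, 2460, 1868, 1422],
    "universe": ["marvel", "marvel", "dc", "dc"],
})

# Rows flagged as duplicates when the id is considered on its own.
dup_overall = int(toy["page_id"].duplicated(keep=False).sum())

# Rows flagged as duplicates when the id is paired with the universe.
dup_within = int(toy.duplicated(subset=["universe", "page_id"], keep=False).sum())

print(dup_overall, dup_within)
```

If the second count were close to zero on the real data, that would support the idea that the ids only collide across the two wikis.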
23272 | 12 |
In a sense, name also serves as a unique identifier (although
different characters can share a name). We may choose to treat it as a
categorical feature or a text feature. Either way, it requires some
data wrangling before we can extract useful information. This is
beyond the scope of this project, so we drop name.
comic = comic.drop('name', axis='columns')
comic.shape
23272 | 11 |
2.1.2. Handling numerical features
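appearances & year are represented as floats but hold integer values.
Both columns contain missing values, so we use the nullable integer
dtype.
numerical_features = ['appearances', 'year']
for feature in numerical_features:
    # A plain 'int' cast raises on NaN; pandas' nullable 'Int64' dtype
    # keeps the missing values as <NA>.
    comic[feature] = comic[feature].astype('Int64')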
2.1.3. Handling categorical features
id, align, eye, hair, sex, gsm & alive are categorical features and
should be represented as the category dtype.
categorical_features = ['id', 'align', 'eye', 'hair', 'sex', 'gsm', 'alive']
for feature in categorical_features:
    print(comic[feature].astype('category').cat.categories)
Index(['Identity Unknown', 'Known to Authorities Identity', 'No Dual Identity', 'Public Identity', 'Secret Identity'], dtype='object')
Index(['Bad Characters', 'Good Characters', 'Neutral Characters', 'Reformed Criminals'], dtype='object')
Index(['Amber Eyes', 'Auburn Hair', 'Black Eyeballs', 'Black Eyes', 'Blue Eyes', 'Brown Eyes', 'Compound Eyes', 'Gold Eyes', 'Green Eyes', 'Grey Eyes', 'Hazel Eyes', 'Magenta Eyes', 'Multiple Eyes', 'No Eyes', 'One Eye', 'Orange Eyes', 'Photocellular Eyes', 'Pink Eyes', 'Purple Eyes', 'Red Eyes', 'Silver Eyes', 'Variable Eyes', 'Violet Eyes', 'White Eyes', 'Yellow Eyeballs', 'Yellow Eyes'], dtype='object')
Index(['Auburn Hair', 'Bald', 'Black Hair', 'Blond Hair', 'Blue Hair', 'Bronze Hair', 'Brown Hair', 'Dyed Hair', 'Gold Hair', 'Green Hair', 'Grey Hair', 'Light Brown Hair', 'Magenta Hair', 'No Hair', 'Orange Hair', 'Orange-brown Hair', 'Pink Hair', 'Platinum Blond Hair', 'Purple Hair', 'Red Hair', 'Reddish Blond Hair', 'Reddish Brown Hair', 'Silver Hair', 'Strawberry Blond Hair', 'Variable Hair', 'Violet Hair', 'White Hair', 'Yellow Hair'], dtype='object')
Index(['Agender Characters', 'Female Characters', 'Genderfluid Characters', 'Genderless Characters', 'Male Characters', 'Transgender Characters'], dtype='object')
Index(['Bisexual Characters', 'Genderfluid Characters', 'Homosexual Characters', 'Pansexual Characters', 'Transgender Characters', 'Transvestites'], dtype='object')
Index(['Deceased Characters', 'Living Characters'], dtype='object')
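The labels are consistently formatted, so no string sanitisation is
required (although eye contains an 'Auburn Hair' label, which looks
like a data-entry error in the source). Let's convert them to the
category dtype.
for feature in categorical_features:
    comic[feature] = comic[feature].astype('category')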
eye & hair contain many distinct values (26 and 28 respectively), so
we may consider binning them.
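One possible binning scheme is to collapse rarely used labels into a single catch-all category. The following is a minimal sketch on a toy series; the threshold min_count and the label 'Other Eyes' are assumptions made for illustration, not choices taken from the analysis.

```python
import pandas as pd

# Toy categorical column with two frequent labels, two rare ones and a missing value.
eye = pd.Series(
    ["Blue Eyes"] * 5 + ["Brown Eyes"] * 4 + ["Magenta Eyes", "Compound Eyes", None],
    dtype="category",
)

min_count = 2  # hypothetical threshold below which a label counts as rare
counts = eye.value_counts()
rare = list(counts[counts < min_count].index)

# Switch to object dtype so a new label can be introduced, replace the rare
# labels, then convert back to category. Missing values are left untouched.
binned = (
    eye.astype(object)
    .where(~eye.isin(rare), "Other Eyes")
    .astype("category")
)

print(binned.value_counts())
```

Grouping by frequency is only one option; grouping by colour family (for example merging 'Violet Eyes' and 'Purple Eyes') would need a hand-written mapping instead.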
2.1.4. Handling datetime features
first appearance contains a timestamp in a custom format. At first
glance the format seems to be month-year, where month is the
3-character abbreviation and year contains only the last 2 digits.
Since we already have the full year in the year column, we can
consider extracting the month into a new month column. We can then
drop the first appearance column.
comic['month'] = comic['first appearance'].str.extract(r'(\w+)', expand=False)
comic['month'].astype('category').cat.categories
Index(['1935', '1936', '1937', '1938', '1939', '1940', '1941', '1942', '1943', '1944', '1945', '1946', '1947', '1948', '1949', '1950', '1951', '1952', '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'], dtype='object')
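It appears that extracting the month requires a bit more work than anticipated: the two datasets use different formats (for example 'Aug-62' in marvel and '1939, May' in dc), so the first word is a month for some rows and a year for others. As this won't aid in our analysis, we simply drop the feature for now.
comic = comic.drop(['first appearance', 'month'], axis='columns')
comic.shape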
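Should the month be needed later, one possible approach is to capture the first run of letters instead of the first word, which covers both formats seen in the head() outputs. This is a minimal sketch on a handful of sample strings; the real column may contain other formats, so anything that is not a recognised month abbreviation is set to missing.

```python
import pandas as pd

# Sample values in the two formats seen in the marvel and dc previews.
first_appearance = pd.Series(["Aug-62", "Mar-41", "1939, May", "1986, October", None])

# Capture the first run of letters and keep its first three characters,
# so that 'October' and 'Oct' map to the same label.
month = first_appearance.str.extract(r"([A-Za-z]+)", expand=False).str[:3]

# Keep only recognised month abbreviations; anything else becomes missing.
valid_months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month = month.where(month.isin(valid_months))

print(month.tolist())
```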
23272 | 10 |
2.1.5. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
comic.describe(include='all')
                     id           align        eye        hair  \
count             17489           19859       9877       16734
unique                5               4         26          28
top     Secret Identity  Bad Characters  Blue Eyes  Black Hair
freq               8683            9615       3064        5329
mean                NaN             NaN        NaN         NaN
std                 NaN             NaN        NaN         NaN
min                 NaN             NaN        NaN         NaN
25%                 NaN             NaN        NaN         NaN
50%                 NaN             NaN        NaN         NaN
75%                 NaN             NaN        NaN         NaN
max                 NaN             NaN        NaN         NaN

                    sex                    gsm              alive  \
count             22293                    154              23266
unique                6                      6                  2
top     Male Characters  Homosexual Characters  Living Characters
freq              16421                    120              17808
mean                NaN                    NaN                NaN
std                 NaN                    NaN                NaN
min                 NaN                    NaN                NaN
25%                 NaN                    NaN                NaN
50%                 NaN                    NaN                NaN
75%                 NaN                    NaN                NaN
max                 NaN                    NaN                NaN

         appearances          year universe
count   21821.000000  22388.000000    23272
unique           NaN           NaN        2
top              NaN           NaN   marvel
freq             NaN           NaN    16376
mean       19.009303   1986.420046      NaN
std        93.814040     18.972698      NaN
min         1.000000   1935.000000      NaN
25%         1.000000   1976.000000      NaN
50%         4.000000   1990.000000      NaN
75%        10.000000   2001.000000      NaN
max      4043.000000   2013.000000      NaN
Let's check for missing values next.
comic[comic.isna().any(axis='columns')].shape
23176 | 10 |
comic.isna().sum()
id              5783
align           3413
eye            13395
hair            6538
sex              979
gsm            23118
alive              6
appearances     1451
year             884
universe           0
dtype: int64
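~99% of the rows (23176 of 23272) contain at least one missing value,
but this is primarily due to the gsm column, which is missing in 23118
rows. Even if we drop this column, other features such as eye (13395
missing) still contain too many missing values to simply drop the
affected rows, so we have to consider imputation.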
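As a rough illustration of what such an imputation could look like, the sketch below drops columns that are almost entirely empty and fills the rest with the mode (categorical) or the median (numerical). The toy frame, the 0.7 threshold and the choice of mode/median are all assumptions made for this example; other strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy frame with a mostly empty column (gsm) and a few scattered gaps.
toy = pd.DataFrame({
    "align": pd.Categorical(["Good Characters", "Bad Characters", None, "Bad Characters"]),
    "appearances": [10.0, np.nan, 4.0, 1.0],
    "gsm": pd.Categorical([None, None, None, "Bisexual Characters"]),
})

# Fraction of missing values per column.
missing_fraction = toy.isna().mean()

# Drop columns that are almost entirely empty (the threshold is an assumption).
sparse_cols = missing_fraction[missing_fraction > 0.7].index
imputed = toy.drop(columns=sparse_cols)

# Fill categorical columns with their mode and numerical ones with their median.
for col in imputed.columns:
    if isinstance(imputed[col].dtype, pd.CategoricalDtype):
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
    else:
        imputed[col] = imputed[col].fillna(imputed[col].median())

print(missing_fraction)
print(imputed)
```

Filling with the mode inflates the most common category, so for heavily incomplete columns such as eye an explicit 'Unknown' category might be a more honest alternative.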
Let's check for duplicates next.
comic[comic.duplicated(keep=False)]
                    id            align        eye         hair  \
1542   Secret Identity   Bad Characters        NaN   Brown Hair
1544   Secret Identity   Bad Characters        NaN   Brown Hair
1736   Public Identity  Good Characters  Blue Eyes  Auburn Hair
1737   Public Identity  Good Characters  Blue Eyes  Auburn Hair
1824   Secret Identity  Good Characters        NaN     Red Hair
...                ...              ...        ...          ...
23264  Public Identity  Good Characters        NaN          NaN
23266  Public Identity  Good Characters        NaN          NaN
23268  Public Identity  Good Characters        NaN          NaN
23269  Public Identity  Good Characters        NaN          NaN
23270  Public Identity  Good Characters        NaN          NaN

                     sex  gsm                alive  appearances    year  \
1542     Male Characters  NaN    Living Characters         24.0  1993.0
1544     Male Characters  NaN    Living Characters         24.0  1993.0
1736     Male Characters  NaN  Deceased Characters         21.0  1986.0
1737     Male Characters  NaN  Deceased Characters         21.0  1986.0
1824   Female Characters  NaN    Living Characters         20.0  1994.0
...                  ...  ...                  ...          ...     ...
23264    Male Characters  NaN    Living Characters          NaN     NaN
23266    Male Characters  NaN    Living Characters          NaN     NaN
23268    Male Characters  NaN    Living Characters          NaN     NaN
23269    Male Characters  NaN    Living Characters          NaN     NaN
23270    Male Characters  NaN    Living Characters          NaN     NaN

      universe
1542    marvel
1544    marvel
1736    marvel
1737    marvel
1824    marvel
...        ...
23264       dc
23266       dc
23268       dc
23269       dc
23270       dc

[4226 rows x 10 columns]
We have duplicate rows, but this is plausible since several characters may possess similar looks (eye & hair color). No steps are necessary for this dataset.
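Duplicates of this kind are also a natural side effect of having dropped the identifying columns earlier. The sketch below illustrates the effect on a toy frame (made-up rows): the rows are distinct while name is present and become duplicates once it is removed.

```python
import pandas as pd

# Toy frame: characters A and B differ only by name.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "eye": ["Blue Eyes", "Blue Eyes", "Brown Eyes"],
    "hair": ["Black Hair", "Black Hair", "Black Hair"],
})

# keep=False flags every member of a duplicate group, not just the repeats.
dups_with_name = int(toy.duplicated(keep=False).sum())
dups_without_name = int(toy.drop(columns="name").duplicated(keep=False).sum())

print(dups_with_name, dups_without_name)
```

Because keep=False marks all members of each group, the 4226 rows reported above count every row involved in a duplicate group rather than only the surplus copies.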
2.1.6. Correlations
There is little point in checking for correlations here since we only have two numerical features.