Comic

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the comic-* datasets. We start by reading the data docs, which provide some information on the data collection process and the features within the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

marvel = pd.read_csv(utils.data_path('comic-marvel.csv'))
marvel.head()
   page_id                               name  \
0     1678          Spider-Man (Peter Parker)   
1     7139    Captain America (Steven Rogers)   
2    64786  Wolverine (James "Logan" Howlett)   
3     1868    Iron Man (Anthony "Tony" Stark)   
4     2460                Thor (Thor Odinson)   

                                  urlslug                ID  \
0              /Spider-Man_(Peter_Parker)   Secret Identity   
1        /Captain_America_(Steven_Rogers)   Public Identity   
2  /Wolverine_(James_%22Logan%22_Howlett)   Public Identity   
3    /Iron_Man_(Anthony_%22Tony%22_Stark)   Public Identity   
4                    /Thor_(Thor_Odinson)  No Dual Identity   

                ALIGN         EYE        HAIR              SEX  GSM  \
0     Good Characters  Hazel Eyes  Brown Hair  Male Characters  NaN   
1     Good Characters   Blue Eyes  White Hair  Male Characters  NaN   
2  Neutral Characters   Blue Eyes  Black Hair  Male Characters  NaN   
3     Good Characters   Blue Eyes  Black Hair  Male Characters  NaN   
4     Good Characters   Blue Eyes  Blond Hair  Male Characters  NaN   

               ALIVE  APPEARANCES FIRST APPEARANCE    Year  
0  Living Characters       4043.0           Aug-62  1962.0  
1  Living Characters       3360.0           Mar-41  1941.0  
2  Living Characters       3061.0           Oct-74  1974.0  
3  Living Characters       2961.0           Mar-63  1963.0  
4  Living Characters       2258.0           Nov-50  1950.0  
dc = pd.read_csv(utils.data_path('comic-dc.csv'))
dc.head()
   page_id                         name                            urlslug  \
0     1422         Batman (Bruce Wayne)         /wiki/Batman_(Bruce_Wayne)   
1    23387        Superman (Clark Kent)        /wiki/Superman_(Clark_Kent)   
2     1458   Green Lantern (Hal Jordan)   /wiki/Green_Lantern_(Hal_Jordan)   
3     1659     James Gordon (New Earth)     /wiki/James_Gordon_(New_Earth)   
4     1576  Richard Grayson (New Earth)  /wiki/Richard_Grayson_(New_Earth)   

                ID            ALIGN         EYE        HAIR              SEX  \
0  Secret Identity  Good Characters   Blue Eyes  Black Hair  Male Characters   
1  Secret Identity  Good Characters   Blue Eyes  Black Hair  Male Characters   
2  Secret Identity  Good Characters  Brown Eyes  Brown Hair  Male Characters   
3  Public Identity  Good Characters  Brown Eyes  White Hair  Male Characters   
4  Secret Identity  Good Characters   Blue Eyes  Black Hair  Male Characters   

   GSM              ALIVE  APPEARANCES FIRST APPEARANCE    YEAR  
0  NaN  Living Characters       3093.0        1939, May  1939.0  
1  NaN  Living Characters       2496.0    1986, October  1986.0  
2  NaN  Living Characters       1565.0    1959, October  1959.0  
3  NaN  Living Characters       1316.0   1987, February  1987.0  
4  NaN  Living Characters       1237.0      1940, April  1940.0  
marvel.shape
(16376, 13)
dc.shape
(6896, 13)
marvel.columns
Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX',
       'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'Year'],
      dtype='object')
dc.columns
Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX',
       'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'YEAR'],
      dtype='object')

Both datasets contain the same number of features. However, the column names differ in case (Year in marvel, YEAR in dc), so we lowercase them to make sure they match before concatenating. We also add a universe column so that we can identify which dataset each character came from.

marvel['universe'] = 'marvel'
marvel.columns = marvel.columns.str.lower()

dc['universe'] = 'dc'
dc.columns = dc.columns.str.lower()
comic = pd.concat([marvel, dc], ignore_index=True)
comic.shape
(23272, 14)
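
To see why the renaming matters, here is a minimal sketch on two made-up frames (marvel_demo and dc_demo are hypothetical stand-ins, not the real data). Without lowercasing, pd.concat treats Year and YEAR as different columns and fills the gaps with NaN.

```python
import pandas as pd

# Toy frames standing in for the marvel and dc data; the values are made up.
marvel_demo = pd.DataFrame({'name': ['A'], 'Year': [1962.0]})
dc_demo = pd.DataFrame({'name': ['B'], 'YEAR': [1939.0]})

# Without normalising the names, 'Year' and 'YEAR' stay separate columns.
naive = pd.concat([marvel_demo, dc_demo], ignore_index=True)

# Lowercasing first makes the columns line up.
marvel_demo.columns = marvel_demo.columns.str.lower()
dc_demo.columns = dc_demo.columns.str.lower()
aligned = pd.concat([marvel_demo, dc_demo], ignore_index=True)

print(list(naive.columns))
print(list(aligned.columns))
```

The naive result carries one column per spelling, each half empty, while the aligned result has a single year column with no gaps.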
comic.dtypes
page_id               int64
name                 object
urlslug              object
id                   object
align                object
eye                  object
hair                 object
sex                  object
gsm                  object
alive                object
appearances         float64
first appearance     object
year                float64
universe             object
dtype: object

We have a mix of text and numerical features. Let's examine them closely.

2.1.1. Handling redundant columns

urlslug does not add any new information and can be dropped.

comic = comic.drop('urlslug', axis='columns')
comic.shape
(23272, 13)

page_id provides a unique id for the characters within the wiki (website) from which the data was scraped. Let's investigate this feature a bit further.

page_id = comic['page_id']
page_id[page_id.duplicated(keep=False)]
3        1868
7        1833
9        1837
11       1863
13       2614
         ... 
23253    1485
23257    1624
23259    1581
23260    1473
23261    1460
Name: page_id, Length: 952, dtype: int64

It looks like we have lots of duplicates. Let's look at one duplicated example.

comic[page_id.eq(1868)]
       page_id                             name               id  \
3         1868  Iron Man (Anthony "Tony" Stark)  Public Identity   
16491     1868       James Corrigan (New Earth)  Secret Identity   

                 align        eye        hair              sex  gsm  \
3      Good Characters  Blue Eyes  Black Hair  Male Characters  NaN   
16491  Good Characters  Blue Eyes    Red Hair  Male Characters  NaN   

                     alive  appearances first appearance    year universe  
3        Living Characters       2961.0           Mar-63  1963.0   marvel  
16491  Deceased Characters        214.0   1940, February  1940.0       dc  

Characters from the two universes can share a page_id. This makes sense: the marvel and dc data most likely come from separate wikis, each with its own numbering, so a matching page_id does not mean the characters are related. I don't think this feature adds any valuable information, so we drop it as well.
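
One way to check this reading is to compare duplicates on page_id alone with duplicates on the (universe, page_id) pair. The sketch below uses a small made-up frame (demo is hypothetical); on the real data the same two calls could be run on comic before page_id is dropped.

```python
import pandas as pd

# Made-up ids: 1868 collides across universes, 1422 repeats inside dc.
demo = pd.DataFrame({
    'page_id': [1868, 1868, 2460, 1422, 1422],
    'universe': ['marvel', 'dc', 'marvel', 'dc', 'dc'],
})

# Duplicates over page_id alone mix both wikis together.
across = demo.duplicated(subset='page_id', keep=False).sum()

# Duplicates over (universe, page_id) only count repeats inside one wiki.
within = demo.duplicated(subset=['universe', 'page_id'], keep=False).sum()

print(across, within)
```

If the second count were zero on the real data, every collision would be a cross-universe coincidence.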

comic = comic.drop('page_id', axis='columns')
comic.shape
(23272, 12)

In a sense, name also serves as a unique identifier (although different characters can share a name). We may choose to treat this as a categorical feature or a text feature. Either way, it requires some data wrangling before we can extract useful information. This is beyond the scope of this project, so we drop name.

comic = comic.drop('name', axis='columns')
comic.shape
(23272, 11)

2.1.2. Handling numerical features

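appearances & year are represented as floats but hold whole numbers. Both contain missing values, so a plain astype('int') would raise an error; pandas' nullable Int64 dtype avoids this.

numerical_features = ['appearances',
                      'year']

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
for feature in numerical_features:
    comic[feature] = comic[feature].astype('Int64')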
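
A cast to a plain integer dtype fails when a column contains NaN, which is the case for both columns here. The sketch below, on made-up values (demo is a hypothetical frame), shows the failure and the nullable Int64 alternative.

```python
import numpy as np
import pandas as pd

# Made-up values with gaps, mimicking the two numerical columns.
demo = pd.DataFrame({'appearances': [4043.0, np.nan, 24.0],
                     'year': [1962.0, 1993.0, np.nan]})

# A plain int cast cannot represent NaN and raises.
try:
    demo['year'].astype('int')
    plain_cast_failed = False
except (ValueError, TypeError):
    plain_cast_failed = True

# The nullable integer dtype keeps the gaps as <NA>.
converted = demo.astype({'appearances': 'Int64', 'year': 'Int64'})

print(plain_cast_failed)
print(converted.dtypes)
```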

2.1.3. Handling categorical features

id, align, eye, hair, sex, gsm & alive are categorical features and should be represented as the category dtype.

categorical_features = ['id',
                        'align',
                        'eye',
                        'hair',
                        'sex',
                        'gsm',
                        'alive']

for feature in categorical_features:
    print(comic[feature].astype('category').cat.categories)

Index(['Identity Unknown', 'Known to Authorities Identity', 'No Dual Identity',
       'Public Identity', 'Secret Identity'],
      dtype='object')
Index(['Bad Characters', 'Good Characters', 'Neutral Characters',
       'Reformed Criminals'],
      dtype='object')
Index(['Amber Eyes', 'Auburn Hair', 'Black Eyeballs', 'Black Eyes',
       'Blue Eyes', 'Brown Eyes', 'Compound Eyes', 'Gold Eyes', 'Green Eyes',
       'Grey Eyes', 'Hazel Eyes', 'Magenta Eyes', 'Multiple Eyes', 'No Eyes',
       'One Eye', 'Orange Eyes', 'Photocellular Eyes', 'Pink Eyes',
       'Purple Eyes', 'Red Eyes', 'Silver Eyes', 'Variable Eyes',
       'Violet Eyes', 'White Eyes', 'Yellow Eyeballs', 'Yellow Eyes'],
      dtype='object')
Index(['Auburn Hair', 'Bald', 'Black Hair', 'Blond Hair', 'Blue Hair',
       'Bronze Hair', 'Brown Hair', 'Dyed Hair', 'Gold Hair', 'Green Hair',
       'Grey Hair', 'Light Brown Hair', 'Magenta Hair', 'No Hair',
       'Orange Hair', 'Orange-brown Hair', 'Pink Hair', 'Platinum Blond Hair',
       'Purple Hair', 'Red Hair', 'Reddish Blond Hair', 'Reddish Brown Hair',
       'Silver Hair', 'Strawberry Blond Hair', 'Variable Hair', 'Violet Hair',
       'White Hair', 'Yellow Hair'],
      dtype='object')
Index(['Agender Characters', 'Female Characters', 'Genderfluid Characters',
       'Genderless Characters', 'Male Characters', 'Transgender Characters'],
      dtype='object')
Index(['Bisexual Characters', 'Genderfluid Characters',
       'Homosexual Characters', 'Pansexual Characters',
       'Transgender Characters', 'Transvestites'],
      dtype='object')
Index(['Deceased Characters', 'Living Characters'], dtype='object')

No string sanitization is required, so let's convert them to the category dtype.

for feature in categorical_features:
    comic[feature] = comic[feature].astype('category')

eye & hair contain a lot of distinct values (26 and 28 respectively), so we may consider binning them.
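
One possible way to bin is to fold rarely seen values into a single "Other" level. This is only a sketch on made-up values; the threshold min_count and the label 'Other' are assumptions, not choices made in this analysis.

```python
import pandas as pd

# Made-up eye colours; None stands for a missing value.
eye = pd.Series(['Blue Eyes', 'Blue Eyes', 'Brown Eyes', 'Brown Eyes',
                 'Magenta Eyes', 'Compound Eyes', None])

min_count = 2  # hypothetical threshold for a value to keep its own level

# Values seen fewer than min_count times are considered rare.
counts = eye.value_counts()
rare = counts[counts < min_count].index

# Keep common values, replace rare ones with 'Other', leave gaps untouched.
binned = eye.where(~eye.isin(rare), 'Other').astype('category')

print(binned.cat.categories)
```

Missing values are left as missing rather than being folded into "Other", so they can still be imputed separately.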

2.1.4. Handling datetime features

first appearance contains a timestamp in a custom format. The format seems to be month-year where month is the 3 character abbreviation and year contains only the last 2 digits. Since we already have the full year in the year column, we can consider extracting the month and creating a new month column. We can then drop the first appearance column.

# Raw string avoids an invalid-escape warning; expand=False returns a Series.
comic['month'] = comic['first appearance'].str.extract(r'(\w+)', expand=False)
comic['month'].astype('category').cat.categories
Index(['1935', '1936', '1937', '1938', '1939', '1940', '1941', '1942', '1943',
       '1944', '1945', '1946', '1947', '1948', '1949', '1950', '1951', '1952',
       '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961',
       '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970',
       '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979',
       '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988',
       '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
       '2007', '2008', '2009', '2010', '2011', '2012', '2013', 'Apr', 'Aug',
       'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'],
      dtype='object')

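Extracting the month requires a bit more work than anticipated: the marvel rows use a month-year format (Aug-62), whereas the dc rows use year, month (1939, May), so the first word captured for dc characters is a year. As the month won't aid in our analysis, we simply drop the feature for now.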
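
Should the month be needed later, a sketch like the following could handle both layouts by capturing the first run of letters instead of the first word. It assumes only the two formats seen in the head() outputs above; any other value (for example a non-month word) would need extra handling.

```python
import pandas as pd

# Sample values in the two formats seen above, plus a missing entry.
first = pd.Series(['Aug-62', 'Mar-41', '1939, May', '1986, October', None])

# Take the first run of letters wherever it sits, then keep three characters
# so that 'October' and 'Oct' end up as the same label.
month = first.str.extract(r'([A-Za-z]+)', expand=False).str[:3]

print(month.tolist())
```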

comic = comic.drop(['first appearance', 'month'], axis='columns')
comic.shape
(23272, 10)

2.1.5. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

comic.describe(include='all')
                     id           align        eye        hair  \
count             17489           19859       9877       16734   
unique                5               4         26          28   
top     Secret Identity  Bad Characters  Blue Eyes  Black Hair   
freq               8683            9615       3064        5329   
mean                NaN             NaN        NaN         NaN   
std                 NaN             NaN        NaN         NaN   
min                 NaN             NaN        NaN         NaN   
25%                 NaN             NaN        NaN         NaN   
50%                 NaN             NaN        NaN         NaN   
75%                 NaN             NaN        NaN         NaN   
max                 NaN             NaN        NaN         NaN   

                    sex                    gsm              alive  \
count             22293                    154              23266   
unique                6                      6                  2   
top     Male Characters  Homosexual Characters  Living Characters   
freq              16421                    120              17808   
mean                NaN                    NaN                NaN   
std                 NaN                    NaN                NaN   
min                 NaN                    NaN                NaN   
25%                 NaN                    NaN                NaN   
50%                 NaN                    NaN                NaN   
75%                 NaN                    NaN                NaN   
max                 NaN                    NaN                NaN   

         appearances          year universe  
count   21821.000000  22388.000000    23272  
unique           NaN           NaN        2  
top              NaN           NaN   marvel  
freq             NaN           NaN    16376  
mean       19.009303   1986.420046      NaN  
std        93.814040     18.972698      NaN  
min         1.000000   1935.000000      NaN  
25%         1.000000   1976.000000      NaN  
50%         4.000000   1990.000000      NaN  
75%        10.000000   2001.000000      NaN  
max      4043.000000   2013.000000      NaN  

Let's check for missing values next.

comic[comic.isna().any(axis='columns')].shape
(23176, 10)
comic.isna().sum()
id              5783
align           3413
eye            13395
hair            6538
sex              979
gsm            23118
alive              6
appearances     1451
year             884
universe           0
dtype: int64

~99.6% of the rows (23176 of 23272) contain at least one missing value, primarily because gsm is missing in 23118 rows. Even if we drop this column, the other features still contain many missing values (eye alone is missing in 13395 rows), so we have to consider imputation.
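
To quantify how much a single column drives the row-level figure, the share of missing values can be computed per column and the row-level share recomputed without that column. The sketch below uses a small made-up frame (demo is hypothetical); the same calls could be applied to comic.

```python
import numpy as np
import pandas as pd

# Made-up frame where gsm is mostly empty, as in the real data.
demo = pd.DataFrame({
    'gsm': [np.nan, np.nan, np.nan, 'Bisexual Characters'],
    'eye': ['Blue Eyes', np.nan, 'Brown Eyes', 'Blue Eyes'],
    'universe': ['marvel', 'marvel', 'dc', 'dc'],
})

# Share of missing values per column, worst first.
missing_share = demo.isna().mean().sort_values(ascending=False)

# Share of rows with at least one gap, with and without gsm.
rows_with_gaps = demo.isna().any(axis='columns').mean()
rows_with_gaps_no_gsm = demo.drop(columns='gsm').isna().any(axis='columns').mean()

print(missing_share)
print(rows_with_gaps, rows_with_gaps_no_gsm)
```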

Let's check for duplicates next.

comic[comic.duplicated(keep=False)]
                    id            align        eye         hair  \
1542   Secret Identity   Bad Characters        NaN   Brown Hair   
1544   Secret Identity   Bad Characters        NaN   Brown Hair   
1736   Public Identity  Good Characters  Blue Eyes  Auburn Hair   
1737   Public Identity  Good Characters  Blue Eyes  Auburn Hair   
1824   Secret Identity  Good Characters        NaN     Red Hair   
...                ...              ...        ...          ...   
23264  Public Identity  Good Characters        NaN          NaN   
23266  Public Identity  Good Characters        NaN          NaN   
23268  Public Identity  Good Characters        NaN          NaN   
23269  Public Identity  Good Characters        NaN          NaN   
23270  Public Identity  Good Characters        NaN          NaN   

                     sex  gsm                alive  appearances    year  \
1542     Male Characters  NaN    Living Characters         24.0  1993.0   
1544     Male Characters  NaN    Living Characters         24.0  1993.0   
1736     Male Characters  NaN  Deceased Characters         21.0  1986.0   
1737     Male Characters  NaN  Deceased Characters         21.0  1986.0   
1824   Female Characters  NaN    Living Characters         20.0  1994.0   
...                  ...  ...                  ...          ...     ...   
23264    Male Characters  NaN    Living Characters          NaN     NaN   
23266    Male Characters  NaN    Living Characters          NaN     NaN   
23268    Male Characters  NaN    Living Characters          NaN     NaN   
23269    Male Characters  NaN    Living Characters          NaN     NaN   
23270    Male Characters  NaN    Living Characters          NaN     NaN   

      universe  
1542    marvel  
1544    marvel  
1736    marvel  
1737    marvel  
1824    marvel  
...        ...  
23264       dc  
23266       dc  
23268       dc  
23269       dc  
23270       dc  

[4226 rows x 10 columns]

We have duplicate rows, but this is expected: with name and page_id dropped, several characters may possess the same remaining attributes (such as eye & hair color). No steps are necessary for this dataset.
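
Note that keep=False flags every copy of a repeated row, so the row count above includes the first occurrence of each group. A small sketch on made-up values (demo is hypothetical) shows the difference between counting all copies and counting only the extra ones.

```python
import pandas as pd

# Made-up rows: the Brown Hair / 1993 combination appears three times.
demo = pd.DataFrame({
    'hair': ['Brown Hair', 'Brown Hair', 'Red Hair', 'Brown Hair'],
    'year': [1993, 1993, 1994, 1993],
})

# keep=False marks every member of a duplicated group.
all_copies = demo.duplicated(keep=False).sum()

# The default keep='first' marks only the repeats after the first one.
extra_copies = demo.duplicated().sum()

# Size of each distinct row, largest group first.
group_sizes = demo.value_counts()

print(all_copies, extra_copies)
print(group_sizes)
```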

2.1.6. Correlations

There is no point checking for correlations here since we only have 2 numerical features.

Date: 2021-11-06 Sat 00:00

Created: 2022-01-25 Tue 15:47