HAM10000

Skin Cancer MNIST: HAM10000 is a large collection of multi-source dermatoscopic images of pigmented lesions.

https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000

Categories:

actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec)

basal cell carcinoma (bcc)

benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl)

dermatofibroma (df)

melanoma (mel)

melanocytic nevi (nv)

vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc)
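
For readability later (e.g. when labelling plots), a small convenience mapping from the dx codes to full diagnosis names can help; this dict is illustrative and not part of the dataset files:

# illustrative mapping from HAM10000 dx codes to full diagnosis names
lesion_type = {
    'akiec': 'actinic keratoses / intraepithelial carcinoma',
    'bcc':   'basal cell carcinoma',
    'bkl':   'benign keratosis-like lesions',
    'df':    'dermatofibroma',
    'mel':   'melanoma',
    'nv':    'melanocytic nevi',
    'vasc':  'vascular lesions',
}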

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from fastai.vision import *
from fastai.metrics import accuracy
In [3]:
PATH = "/home/katey/DeepLearning/Data/HAM10000/"
In [4]:
label_csv = f'{PATH}HAM10000_metadata.csv'
In [5]:
label_df = pd.read_csv(label_csv)
In [6]:
label_df.head()
Out[6]:
     lesion_id      image_id   dx dx_type   age   sex localization
0  HAM_0000118  ISIC_0027419  bkl   histo  80.0  male        scalp
1  HAM_0000118  ISIC_0025030  bkl   histo  80.0  male        scalp
2  HAM_0002730  ISIC_0026769  bkl   histo  80.0  male        scalp
3  HAM_0002730  ISIC_0025661  bkl   histo  80.0  male        scalp
4  HAM_0001466  ISIC_0031633  bkl   histo  75.0  male          ear
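
Before going further, a quick sanity check (a sketch; it assumes the jpgs were unpacked into a train/ folder under PATH, as the from_csv call below also does) confirms every image_id has a matching file:

import os
# every image_id in the metadata should have a corresponding jpg on disk
missing = [i for i in label_df.image_id if not os.path.exists(f'{PATH}train/{i}.jpg')]
len(missing)   # expect 0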

Number of images

In [7]:
len(label_df.image_id.unique())
Out[7]:
10015

Number of lesions

In [8]:
len(label_df.lesion_id.unique())
Out[8]:
7470

There are 10,015 images but only 7,470 unique lesions, so a sizeable fraction of lesions (perhaps a quarter) have more than one image. To avoid leakage, images of the same lesion must not appear in both the training and validation sets, so we take a random sample of the unique lesion ids rather than of the images.

In [12]:
np.random.seed(827)
val_lesions = list(np.random.choice(label_df.lesion_id.unique(), size=3000))  # note: replace defaults to True, so these 3000 draws contain duplicate lesion ids
In [13]:
val_idxs = label_df[label_df['lesion_id'].isin(val_lesions)].index
In [14]:
len(val_idxs)
Out[14]:
3303
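
To confirm the split really is leakage-free, the lesion ids on the two sides should be disjoint; a quick check:

# no lesion_id should appear in both the training and validation rows
train_lesions = set(label_df.drop(val_idxs).lesion_id)
assert train_lesions.isdisjoint(set(label_df.loc[val_idxs, 'lesion_id']))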

Make a dataframe with just the image id and diagnosis (dx)

In [15]:
reduce_label_df = label_df.drop(columns = ['lesion_id','dx_type','age','sex','localization'])
In [16]:
reduce_label_df.columns = ['filename','label']
In [17]:
reduce_label_df.head()
Out[17]:
       filename label
0  ISIC_0027419   bkl
1  ISIC_0025030   bkl
2  ISIC_0026769   bkl
3  ISIC_0025661   bkl
4  ISIC_0031633   bkl

Count how many images there are of each class (each row of the dataframe is an image, not a lesion). The classes are heavily imbalanced towards melanocytic nevi.

In [18]:
reduce_label_df.pivot_table(index="label", aggfunc=len).sort_values('filename', ascending=False)
Out[18]:
       filename
label
nv         6705
mel        1113
bkl        1099
bcc         514
akiec       327
vasc        142
df          115

How many images of each class are in the validation set

In [19]:
reduce_label_df.iloc[val_idxs].pivot_table(index="label", aggfunc=len).sort_values('filename', ascending=False)
Out[19]:
       filename
label
nv         2199
bkl         366
mel         347
bcc         194
akiec       114
vasc         54
df           29
In [20]:
label_csv = f'{PATH}labels.csv'
reduce_label_df.to_csv(label_csv, index = False)
In [21]:
tfms = get_transforms(flip_vert=True)  # dermatoscopic images have no canonical orientation, so vertical flips are safe augmentations
In [22]:
bs = 64   # batch size
sz = 64   # start training at a small 64px image size
In [23]:
src = (ImageItemList.from_csv(PATH, 'labels.csv', folder='train', suffix='.jpg')
        .split_by_idx(val_idxs)   # lesion-based validation split computed above
        .label_from_df())
In [24]:
data = (src.transform(tfms, size=sz)
        .databunch(bs=bs).normalize(imagenet_stats))
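
It is worth verifying that the DataBunch picked up all seven classes and the expected split sizes (10015 images minus the 3303 validation images leaves 6712 for training):

# expect 7 classes and a 6712/3303 train/valid split
print(data.classes, data.c)
print(len(data.train_ds), len(data.valid_ds))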
In [36]:
data.show_batch(rows=5, figsize=(12,10))
In [27]:
learn = create_cnn(data, models.resnet34, metrics=accuracy)
In [28]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [29]:
lr = 1e-2   # picked from the LR finder plot above
In [30]:
learn.fit_one_cycle(5, lr)
Total time: 02:55

epoch  train_loss  valid_loss  accuracy
    1    1.186675    0.858878  0.710263
    2    0.903502    0.738203  0.734484
    3    0.769199    0.687939  0.752346
    4    0.693994    0.651152  0.761732
    5    0.661096    0.639313  0.769301
In [31]:
learn.unfreeze()   # allow the whole network to train, not just the head
In [32]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [33]:
learn.fit_one_cycle(5, slice(1e-5, 1e-4))
Total time: 03:02

epoch  train_loss  valid_loss  accuracy
    1    0.650552    0.619801  0.779897
    2    0.631828    0.603539  0.780805
    3    0.582781    0.580451  0.793218
    4    0.560097    0.574121  0.795338
    5    0.532885    0.569416  0.801695
In [ ]:
learn.save('mod-resnet34-sz64')
In [35]:
learn.fit_one_cycle(10, slice(1e-5, 1e-4/2))
Total time: 06:30

epoch  train_loss  valid_loss  accuracy
    1    0.542999    0.571448  0.798971
    2    0.544735    0.566700  0.800484
    3    0.552432    0.563527  0.794732
    4    0.512192    0.560285  0.796549
    5    0.510227    0.558457  0.802906
    6    0.497188    0.552658  0.797154
    7    0.477804    0.555176  0.799576
    8    0.453000    0.548926  0.797457
    9    0.450656    0.551730  0.797154
   10    0.452580    0.552420  0.796549
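
The validation loss has largely plateaued over these ten epochs; fastai's recorder can plot the loss curves from the most recent fit to make this easy to see:

learn.recorder.plot_losses()   # training and validation loss for the last fit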

Resize images to 128 and continue training the same model (progressive resizing)

In [37]:
sz = 128
In [38]:
data = (src.transform(tfms, size=sz)
        .databunch(bs=bs).normalize(imagenet_stats))
In [39]:
data.show_batch(rows=5, figsize=(12,10))
In [40]:
learn.data = data   # swap the 128px DataBunch into the existing learner
data.train_ds[0][0].shape
Out[40]:
torch.Size([3, 128, 128])
In [41]:
learn.freeze()   # train only the head again at the new image size
In [42]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [43]:
lr = 1e-3   # a smaller rate for the larger images, from the LR finder
In [44]:
learn.fit_one_cycle(5, lr)
Total time: 03:56

epoch  train_loss  valid_loss  accuracy
    1    0.593146    0.598795  0.782016
    2    0.561943    0.569351  0.791099
    3    0.546292    0.552666  0.798971
    4    0.524803    0.556066  0.797760
    5    0.512741    0.550822  0.800182
In [45]:
learn.unfreeze()
In [46]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [47]:
learn.fit_one_cycle(5, slice(1e-5, 1e-4))
Total time: 04:07

epoch  train_loss  valid_loss  accuracy
    1    0.521711    0.528288  0.805631
    2    0.498406    0.503224  0.815319
    3    0.450675    0.501416  0.818347
    4    0.408610    0.499774  0.820466
    5    0.386717    0.494122  0.818347
In [48]:
learn.save('mod-resnet34-sz128')
In [49]:
learn.fit_one_cycle(10, slice(1e-5, 1e-4/2))
Total time: 08:00

epoch  train_loss  valid_loss  accuracy
    1    0.387193    0.491476  0.819558
    2    0.383850    0.497077  0.818044
    3    0.364185    0.497040  0.818347
    4    0.344495    0.497699  0.822283
    5    0.333906    0.514679  0.818347
    6    0.313578    0.509936  0.820769
    7    0.297167    0.508580  0.821980
    8    0.280208    0.506635  0.825613
    9    0.270249    0.503174  0.825310
   10    0.267939    0.504659  0.822586
In [50]:
learn.save('mod-resnet34-sz128')

Resize to 256

In [51]:
sz = 256
bs = 32   # halve the batch size so the larger images fit in GPU memory
In [52]:
data = (src.transform(tfms, size=sz)
        .databunch(bs=bs).normalize(imagenet_stats))
In [53]:
data.show_batch(rows=5, figsize=(12,10))
In [111]:
learn.data = data   # swap in the 256px DataBunch
data.train_ds[0][0].shape
Out[111]:
torch.Size([3, 256, 256])
In [54]:
learn.freeze()
In [55]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [56]:
lr = 1e-2   # from the LR finder plot above
In [57]:
learn.fit_one_cycle(5, lr)
Total time: 04:54

epoch  train_loss  valid_loss  accuracy
    1    0.442882    0.552815  0.808053
    2    0.463766    0.558989  0.808053
    3    0.406110    0.491836  0.826824
    4    0.353857    0.502712  0.829852
    5    0.319887    0.486766  0.836815
In [58]:
learn.unfreeze()
In [59]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [60]:
learn.fit_one_cycle(5, slice(1e-5, 1e-4))
Total time: 05:42

epoch  train_loss  valid_loss  accuracy
    1    0.313277    0.490889  0.827732
    2    0.306080    0.495885  0.837421
    3    0.294801    0.487527  0.841356
    4    0.247532    0.472140  0.844989
    5    0.217842    0.462418  0.847109
In [61]:
learn.save('mod-resnet34-sz256')

Try a larger architecture: ResNet-50. Note that create_cnn builds a fresh learner from ImageNet-pretrained weights, so it does not inherit the fine-tuned ResNet-34 weights from above.

In [62]:
learn = create_cnn(data, models.resnet50, metrics=accuracy)
In [63]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [64]:
lr = 1e-2
In [65]:
learn.fit_one_cycle(5, lr)
Total time: 08:21

epoch  train_loss  valid_loss  accuracy
    1    0.821564    0.747630  0.749622
    2    0.737863    0.688934  0.749319
    3    0.626133    0.548729  0.808053
    4    0.522450    0.472274  0.830154
    5    0.437214    0.432627  0.843173
In [66]:
learn.unfreeze()
In [67]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [68]:
learn.fit_one_cycle(5, slice(1e-5, 1e-4))
Total time: 10:20

epoch  train_loss  valid_loss  accuracy
    1    0.430947    0.446454  0.834393
    2    0.423921    0.456202  0.839540
    3    0.369562    0.410024  0.853467
    4    0.355850    0.392643  0.854678
    5    0.320526    0.397311  0.855889
In [69]:
learn.save('mod-resnet50-sz256')
In [70]:
learn.fit_one_cycle(10, slice(1e-5, 1e-4/2))
Total time: 20:49

epoch  train_loss  valid_loss  accuracy
    1    0.338695    0.389523  0.859522
    2    0.331810    0.402507  0.848017
    3    0.332133    0.464237  0.836210
    4    0.339843    0.409317  0.853467
    5    0.297568    0.400306  0.853769
    6    0.283094    0.399265  0.856191
    7    0.268882    0.408741  0.855283
    8    0.258930    0.391421  0.854678
    9    0.223862    0.392123  0.856494
   10    0.233325    0.393627  0.856494
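
Test-time augmentation sometimes recovers a little extra accuracy by averaging predictions over augmented copies of each validation image; a sketch using fastai v1's TTA (not run above):

# average predictions over several augmented versions of each validation image
preds, y = learn.TTA()
accuracy(preds, y)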
In [72]:
interp = ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

len(data.valid_ds)==len(losses)==len(idxs)
Out[72]:
True
In [75]:
interp.plot_top_losses(16, figsize=(12,10))
In [76]:
interp.plot_confusion_matrix(figsize=(5,5), dpi=90)
In [77]:
interp.most_confused(min_val=2)
Out[77]:
[('mel', 'nv', 121),
 ('bkl', 'nv', 62),
 ('nv', 'bkl', 42),
 ('mel', 'bkl', 38),
 ('bkl', 'mel', 32),
 ('nv', 'mel', 23),
 ('bcc', 'nv', 18),
 ('akiec', 'bkl', 16),
 ('vasc', 'nv', 12),
 ('bkl', 'akiec', 11),
 ('akiec', 'bcc', 9),
 ('akiec', 'mel', 9),
 ('bcc', 'bkl', 9),
 ('df', 'nv', 9),
 ('akiec', 'nv', 8),
 ('bcc', 'mel', 8),
 ('bkl', 'bcc', 8),
 ('bcc', 'akiec', 7),
 ('mel', 'akiec', 7),
 ('nv', 'bcc', 5),
 ('bcc', 'df', 4),
 ('bkl', 'df', 3),
 ('nv', 'akiec', 3)]
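
Given the class imbalance, overall accuracy hides per-class behaviour; per-class recall can be read off the confusion matrix (a sketch; rows of fastai's confusion matrix are actual classes):

cm = interp.confusion_matrix()           # rows = actual, columns = predicted
recall = np.diag(cm) / cm.sum(axis=1)    # per-class recall
for cls, r in zip(data.classes, recall):
    print(f'{cls:>6}: {r:.3f}')

For melanoma in particular, recall (sensitivity) matters more than overall accuracy, and the 121 mel-to-nv confusions above suggest it is this model's weakest point.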