HAM10000

Skin Cancer MNIST: HAM10000 is a large collection of multi-source dermatoscopic images of pigmented lesions.

https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000

Categories:

actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec)

basal cell carcinoma (bcc)

benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl)

dermatofibroma (df)

melanoma (mel)

melanocytic nevi (nv)

vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc)
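
If readable class names are wanted later (for plots, say), a small lookup dict built from the list above is handy. This is an addition, not part of the original notebook:

dx_names = {
    'akiec': "actinic keratoses / Bowen's disease",
    'bcc':   'basal cell carcinoma',
    'bkl':   'benign keratosis-like lesions',
    'df':    'dermatofibroma',
    'mel':   'melanoma',
    'nv':    'melanocytic nevi',
    'vasc':  'vascular lesions',
}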

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from fastai.vision import *
from fastai.metrics import accuracy
In [3]:
PATH = "/home/katey/DeepLearning/Data/HAM10000/"
In [4]:
label_csv = f'{PATH}HAM10000_metadata.csv'
In [5]:
label_df = pd.read_csv(label_csv)
In [6]:
label_df.head()
Out[6]:
lesion_id image_id dx dx_type age sex localization
0 HAM_0000118 ISIC_0027419 bkl histo 80.0 male scalp
1 HAM_0000118 ISIC_0025030 bkl histo 80.0 male scalp
2 HAM_0002730 ISIC_0026769 bkl histo 80.0 male scalp
3 HAM_0002730 ISIC_0025661 bkl histo 80.0 male scalp
4 HAM_0001466 ISIC_0031633 bkl histo 75.0 male ear

Number of images

In [7]:
len(label_df.image_id.unique())
Out[7]:
10015

Number of lesions

In [8]:
len(label_df.lesion_id.unique())
Out[8]:
7470

The 10,015 images come from only 7,470 lesions, so many lesions have more than one image. We need to ensure that images from the same lesion do not appear in both the training and validation sets, so instead of sampling images we take a random sample of the unique lesion ids and put all of their images in the validation set.
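
A quick check, not in the original notebook, of how many lesions actually have more than one image:

images_per_lesion = label_df['lesion_id'].value_counts()
(images_per_lesion > 1).sum()   # lesions with two or more images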

In [12]:
np.random.seed(827)
val_lesions = list(np.random.choice(label_df.lesion_id.unique(),size = 3000))
In [13]:
val_idxs = label_df[label_df['lesion_id'].isin(val_lesions)].index
In [14]:
len(val_idxs)
Out[14]:
3303
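
Note that np.random.choice samples with replacement by default, so the 3,000 draws cover somewhat fewer unique lesions; the split is still valid, it just ends up a little smaller than intended. A quick sanity check, not in the original notebook, that the training rows share no lesion with the validation rows:

train_lesions = set(label_df.drop(val_idxs)['lesion_id'])
assert train_lesions.isdisjoint(set(val_lesions))   # lesion-level split: no overlap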

Make a DataFrame with just the image_id and dx columns

In [15]:
reduce_label_df = label_df.drop(columns = ['lesion_id','dx_type','age','sex','localization'])
In [16]:
reduce_label_df.columns = ['filename','label']
In [17]:
reduce_label_df.head()
Out[17]:
filename label
0 ISIC_0027419 bkl
1 ISIC_0025030 bkl
2 ISIC_0026769 bkl
3 ISIC_0025661 bkl
4 ISIC_0031633 bkl

Count how many images there are of each class

In [18]:
reduce_label_df.pivot_table(index="label", aggfunc=len).sort_values('filename', ascending=False)
Out[18]:
filename
label
nv 6705
mel 1113
bkl 1099
bcc 514
akiec 327
vasc 142
df 115
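
The distribution is heavily skewed: nv (melanocytic nevi) accounts for roughly two thirds of the images, while df has only 115. A one-liner to see the class proportions, not in the original notebook:

reduce_label_df['label'].value_counts(normalize=True).round(3)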

Count how many images there are of each class in the validation set

In [19]:
reduce_label_df.iloc[val_idxs].pivot_table(index="label", aggfunc=len).sort_values('filename', ascending=False)
Out[19]:
filename
label
nv 2199
bkl 366
mel 347
bcc 194
akiec 114
vasc 54
df 29
In [20]:
label_csv = f'{PATH}labels.csv'
reduce_label_df.to_csv(label_csv, index = False)
In [21]:
tfms = get_transforms(flip_vert = True)
In [22]:
bs = 64
sz = 64
In [23]:
src = (ImageItemList.from_csv(PATH, 'labels.csv', folder = 'train', suffix = '.jpg')
        .split_by_idx(val_idxs)
        .label_from_df())
In [24]:
data = (src.transform(tfms, size=sz)
        .databunch(bs=bs).normalize(imagenet_stats))
In [36]:
data.show_batch(rows=5, figsize=(12,10))
In [27]:
learn = create_cnn(data, models.resnet34, metrics=accuracy)
In [28]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [29]:
lr = 1e-2
In [30]:
learn.fit_one_cycle(5, lr)
Total time: 02:55

epoch train_loss valid_loss accuracy
1 1.186675 0.858878 0.710263
2 0.903502 0.738203 0.734484
3 0.769199 0.687939 0.752346
4 0.693994 0.651152 0.761732
5 0.661096 0.639313 0.769301
In [31]:
learn.unfreeze()
In [32]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [33]:
learn.fit_one_cycle(5, slice(1e-5, 1e-4))
Total time: 03:02

epoch train_loss valid_loss accuracy
1 0.650552 0.619801 0.779897
2 0.631828 0.603539 0.780805
3 0.582781 0.580451 0.793218
4 0.560097 0.574121 0.795338
5 0.532885 0.569416 0.801695
In [ ]:
learn.save('mod-resnet34-sz64')
In [35]:
learn.fit_one_cycle(10, slice(1e-5, 1e-4/2))
Total time: 06:30

epoch train_loss valid_loss accuracy
1 0.542999 0.571448 0.798971
2 0.544735 0.566700 0.800484
3 0.552432 0.563527 0.794732
4 0.512192 0.560285 0.796549
5 0.510227 0.558457 0.802906
6 0.497188 0.552658 0.797154
7 0.477804 0.555176 0.799576
8 0.453000 0.548926 0.797457
9 0.450656 0.551730 0.797154
10 0.452580 0.552420 0.796549
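
With such an imbalanced validation set, overall accuracy hides per-class behaviour. A possible check at this point, not in the original notebook, is fastai's classification interpretation:

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(8, 8))   # per-class confusion matrix
interp.most_confused(min_val=10)               # most common label confusions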

Resize the images to 128 and rebuild the DataBunch (progressive resizing)

In [37]:
sz = 128
In [38]:
data = (src.transform(tfms, size=sz)
        .databunch(bs=bs).normalize(imagenet_stats))
In [39]:
data.show_batch(rows=5, figsize=(12,10))
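
The evident next step in the progressive-resizing pattern, sketched here as an assumption since it falls outside the cells shown above, is to point the existing learner at the 128px DataBunch and repeat the freeze/unfreeze cycle:

learn.data = data        # swap in the 128px DataBunch
learn.freeze()           # retrain only the head at the new size first
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(5, lr)   # re-pick the learning rate from the new plot if needed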