Histopathologic Cancer Detection (densenet169): Study Notes

1. Kaggle competition page

Histopathologic Cancer Detection | Kaggle

2. Overview

This is a binary image-classification problem. First, the structure of the official dataset:

train: the training set. Histopathology images saved as .tif files, 96 x 96 px each (each sample is called a patch).

test: the test set. Same image format as the training set.

train_labels.csv: labels for the training set. Each training sample's filename is its id, and the label is 0 or 1. 0 marks a negative sample (normal tissue, no cancer), 1 a positive sample (cancer). A sample is positive when the center 32 x 32 px region of the image contains at least one pixel of tumor tissue; tumor tissue outside that center region does not affect the label. The outer region is provided so that convolutions can be applied without padding, which makes training easier.

sample_submission.csv: lists the id (filename) of every test sample; the label column is a placeholder, all zeros for now.

Task: identify metastatic tissue in 96 x 96 px digital histopathology images. That is, for each id (patch) in the test set, predict the probability that the patch's central 32 x 32 px region contains at least one pixel of tumor tissue.
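
As a concrete illustration of the labeling rule (a minimal sketch; the zeros array is a hypothetical stand-in for a decoded patch), the label-deciding center region is just a slice of the 96 x 96 array:

import numpy as np

img = np.zeros((96, 96, 3))    # stand-in for one decoded 96 x 96 px .tif patch
center = img[32:64, 32:64, :]  # the central 32 x 32 px region that determines the label
print(center.shape)            # (32, 32, 3)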


3. Dataset background

The Kaggle competition dataset is a subset of the PCam dataset, with PCam's duplicate images removed. PCam (PatchCamelyon) is in turn derived from the Camelyon16 challenge dataset: the original whole-slide images (WSIs) are enormous, and slicing them into patches (effectively a preprocessing step) yields PatchCamelyon. The three links below introduce the Camelyon datasets and give a feel for the WSI-processing workflow.

Camelyon dataset introduction:

https://zhuanlan.zhihu.com/p/50672544

Camelyon16 winning solution:

https://zhuanlan.zhihu.com/p/51247262

Camelyon17 winning solution:

https://zhuanlan.zhihu.com/p/51735826

4. Platform

I originally planned to pull the Kaggle dataset into Colab through the API, but hit two problems: 1. Colab wipes the dataset every time the runtime restarts. 2. Memory consumption was too high to finish a run. So in the end I ran everything on the lab server.
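
For reference, the download step itself can be scripted with the Kaggle API's Python client (a minimal sketch, assuming a valid token in ~/.kaggle/kaggle.json; it does not solve the two Colab problems above):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
# downloads a zip of all competition files into the current directory
api.competition_download_files('histopathologic-cancer-detection')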

5. Reference code

The original author's notebook:

A complete ML pipeline (Fast.ai) | Kaggle

Starting from that notebook, I removed the visualization parts and adjusted the paths; as long as the right fastai version is installed (see below), it should run through.

6. Environment setup

I installed Anaconda on the server and then installed the various libraries and packages with conda, which went without problems. fastai was the exception: installing it with conda kept failing, and even when it succeeded the code threw errors at runtime. In the end,

pip install fastai==1.0.50.post1

ran without a hitch.
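
A quick way to confirm the environment picked up the right build:

import fastai
print(fastai.__version__)   # expect '1.0.50.post1'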

7. The trimmed-down, bare-bones script (paths need to be adjusted for your environment)

import numpy as np
import pandas as pd
import os
import cv2
from sklearn.utils import shuffle
from tqdm.notebook import tqdm

data = pd.read_csv('/kaggle/input/train_labels.csv')
train_path = '/kaggle/input/train/'
test_path = '/kaggle/input/test/'

# random sampling
shuffled_data = shuffle(data)

# Data augmentation
import random

ORIGINAL_SIZE = 96      # original size of the images - do not change

# AUGMENTATION VARIABLES
CROP_SIZE = 90          # final size after the crop; this is also the network input size
RANDOM_ROTATION = 3     # range (0-180); 180 allows all rotation variations, 0 = no change
RANDOM_SHIFT = 2        # center-crop shift in x and y; cannot exceed (ORIGINAL_SIZE - CROP_SIZE)//2 = 3
RANDOM_BRIGHTNESS = 7   # range (0-100), 0 = no change
RANDOM_CONTRAST = 5     # range (0-100), 0 = no change
RANDOM_90_DEG_TURN = 1  # 0 or 1; 1 additionally applies a random turn of 0 or +/-90 degrees


def readCroppedImage(path, augmentations=True):
    # The augmentations flag is here so the same function can be used when computing
    # image statistics, where we don't want any augmentation applied.

    # OpenCV reads the image in BGR format by default
    bgr_img = cv2.imread(path)
    # We flip it to rgb for visualization purposes
    b, g, r = cv2.split(bgr_img)
    rgb_img = cv2.merge([r, g, b])

    if (not augmentations):
        return rgb_img / 255

    # random rotation: +/-RANDOM_ROTATION degrees, optionally plus a 0 or +/-90 degree turn
    rotation = random.randint(-RANDOM_ROTATION, RANDOM_ROTATION)
    if (RANDOM_90_DEG_TURN == 1):
        rotation += random.randint(-1, 1) * 90
    M = cv2.getRotationMatrix2D((48, 48), rotation, 1)  # the image center is the rotation anchor
    rgb_img = cv2.warpAffine(rgb_img, M, (96, 96))

    # random x,y-shift of the crop window
    x = random.randint(-RANDOM_SHIFT, RANDOM_SHIFT)
    y = random.randint(-RANDOM_SHIFT, RANDOM_SHIFT)

    # crop to center and normalize to the 0-1 range
    start_crop = (ORIGINAL_SIZE - CROP_SIZE) // 2  # (96 - 90) // 2 = 3
    end_crop = start_crop + CROP_SIZE              # 3 + 90 = 93
    rgb_img = rgb_img[(start_crop + x):(end_crop + x), (start_crop + y):(end_crop + y)] / 255

    # Random flips
    flip_hor = bool(random.getrandbits(1))
    flip_ver = bool(random.getrandbits(1))
    if (flip_hor):
        rgb_img = rgb_img[:, ::-1]
    if (flip_ver):
        rgb_img = rgb_img[::-1, :]

    # Random brightness
    br = random.randint(-RANDOM_BRIGHTNESS, RANDOM_BRIGHTNESS) / 100.
    rgb_img = rgb_img + br

    # Random contrast
    cr = 1.0 + random.randint(-RANDOM_CONTRAST, RANDOM_CONTRAST) / 100.
    rgb_img = rgb_img * cr

    # clip values to the 0-1 range
    rgb_img = np.clip(rgb_img, 0, 1.0)

    return rgb_img  # the processed image
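
# Optional sanity check of the custom loader (assumes the training images are in place):
# one augmented read should yield a 90 x 90 x 3 array with values in [0, 1]
sample_path = train_path + shuffled_data['id'].iloc[0] + '.tif'
sample_img = readCroppedImage(sample_path)
print(sample_img.shape, sample_img.min(), sample_img.max())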


# Compute image statistics (no augmentations here).
# Counting the statistics gives the channel means [0.702447, 0.546243, 0.696453]
# and standard deviations [0.238893, 0.282094, 0.216251].

# While counting the statistics, we can also check for completely black or white images.

dark_th = 10 / 255     # if no pixel reaches this threshold, the image is considered too dark
bright_th = 245 / 255  # if no pixel falls below this threshold, the image is considered too bright
too_dark_idx = []
too_bright_idx = []

x_tot = np.zeros(3)
x2_tot = np.zeros(3)
counted_ones = 0
for i, idx in tqdm(enumerate(shuffled_data['id']), 'computing statistics... (220025 in total)'):
    path = os.path.join(train_path, idx)
    imagearray = readCroppedImage(path + '.tif', augmentations=False).reshape(-1, 3)
    # is the image too dark?
    if (imagearray.max() < dark_th):
        too_dark_idx.append(idx)
        continue  # do not include in the statistics
    # is the image too bright?
    if (imagearray.min() > bright_th):
        too_bright_idx.append(idx)
        continue  # do not include in the statistics
    x_tot += imagearray.mean(axis=0)
    x2_tot += (imagearray ** 2).mean(axis=0)
    counted_ones += 1

# per-channel mean and standard deviation
channel_avr = x_tot / counted_ones
channel_std = np.sqrt(x2_tot / counted_ones - channel_avr ** 2)
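
# Optional sanity check: the running-moment identity Var[x] = E[x^2] - (E[x])^2 is used above
print('channel means:', channel_avr)  # expect ~[0.702447, 0.546243, 0.696453]
print('channel stds :', channel_std)  # expect ~[0.238893, 0.282094, 0.216251]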

# Expected outcome: 1 image too dark and 6 too bright
print('There was {0} extremely dark image'.format(len(too_dark_idx)))
print('and {0} extremely bright images'.format(len(too_bright_idx)))
print('Dark one:')
print(too_dark_idx)
print('Bright ones:')
print(too_bright_idx)

#  Splitting the data
#  Split the training data into a 90% training part and a 10% validation part, keeping the
#  negative/positive ratio (60/40) the same in both parts.

from sklearn.model_selection import train_test_split

# we read the csv file into a pandas dataframe earlier; now we set the index to id so we
# can drop the outlier rows found above by id
train_df = data.set_index('id')

# Remove the outliers (the too-dark and too-bright images found above).
# How many samples before and after the removal:
print('Before removing outliers we had {0} training samples.'.format(train_df.shape[0]))
train_df = train_df.drop(labels=too_dark_idx, axis=0)
train_df = train_df.drop(labels=too_bright_idx, axis=0)
print('After removing outliers we have {0} training samples.'.format(train_df.shape[0]))

train_names = train_df.index.values
train_labels = np.asarray(train_df['label'].values)

# split; train_test_split returns more than we need, since for fastai we only need the
# validation indices (sklearn's return order is: names_train, names_val, idx_train, idx_val)
tr_n, val_n, tr_idx, val_idx = train_test_split(train_names, range(len(train_names)), test_size=0.1,
                                                stratify=train_labels, random_state=123)
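
# Optional check that stratify kept the ~60/40 class balance in both parts:
print('positive rate overall:    {:.3f}'.format(train_labels.mean()))
print('positive rate validation: {:.3f}'.format(train_labels[val_idx].mean()))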
# Using the fastai library (fastai 1.0)
from fastai import *
from fastai.vision import *
from torchvision.models import *    # import *=all the models from torchvision

# Hyperparameters
arch = densenet169                  # model architecture; densenet169 seems to perform well on this data, but feel free to experiment
BATCH_SIZE = 128                    # batch size; hardware restricts this one - too large and you run out of GPU memory
sz = CROP_SIZE                      # the network input size is the crop size
MODEL_PATH = str(arch).split()[1]   # extract the model name to use as the model file name, e.g. 'densenet169'

# We load the images into an ImageDataBunch for training. This fastai data object is easy to
# customize so that it loads images with our own readCroppedImage function; we only need to
# subclass ImageList.

# create the dataframe for the fastai loader: training file names + labels
train_dict = {'name': train_path + train_names, 'label': train_labels}
df = pd.DataFrame(data=train_dict)
# create the test dataframe: test_names holds the path of each test file
test_names = []
for f in os.listdir(test_path):
    test_names.append(test_path + f)
df_test = pd.DataFrame(np.asarray(test_names), columns=['name'])

# Subclass ImageList so fastai opens images with our own function
class MyImageItemList(ImageList):
    def open(self, fn:PathOrStr)->Image:
        img = readCroppedImage(fn.replace('/./','').replace('//','/'))
        # This ndarray image has to be converted to a tensor before being passed on as a
        # fastai Image; pil2tensor does that
        return vision.Image(px=pil2tensor(img, np.float32))

# Create the ImageDataBunch using the fastai data block API
imgDataBunch = (MyImageItemList.from_df(path='/', df=df, suffix='.tif')
        # Where to find the data? The files named in df, under path, with the .tif suffix
        .split_by_idx(val_idx)
        #How to split in train/valid?
        .label_from_df(cols='label')
        #Where are the labels?
        .add_test(MyImageItemList.from_df(path='/', df=df_test))
        #dataframe pointing to the test set?
        .transform(tfms=[[],[]], size=sz)
        # We have our custom transformations implemented in the image loader but we could apply transformations also here
        # Even though we don't apply transformations here, we set two empty lists to tfms. Train and Validation augmentations
        .databunch(bs=BATCH_SIZE)
        # convert to databunch
        .normalize([tensor([0.702447, 0.546243, 0.696453]), tensor([0.238893, 0.282094, 0.216251])])
        # Normalize with the training-set stats: the per-channel means and stds we computed in the statistics step above.
       )
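
# Quick size check of what the data block produced (the original notebook's visualizations
# were removed from this trimmed script):
print(len(imgDataBunch.train_ds), len(imgDataBunch.valid_ds), len(imgDataBunch.test_ds))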

# Training
# Next, we create a convnet learner object (fastai's create_cnn), tying together the model
# architecture and our databunch.
# ps = dropout probability (0-1) in the final layer

def getLearner():
    return create_cnn(imgDataBunch, arch, pretrained=True, path='.', metrics=accuracy, ps=0.5, callback_fns=ShowGraph)

# construct the learner
learner = getLearner()
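
# Optional: inspect the classification head fastai attached (verbose output;
# learner.model is the underlying PyTorch module):
# print(learner.model)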


# The 1cycle policy
# We can use lr_find with different weight decays and record all losses so that we can plot
# them on the same graph. The number of iterations defaults to 100, but at such a low count
# there may be too much variance from random sampling to compare weight decays reliably;
# an iteration count of at least 300 gives more consistent results.

lrs = []
losses = []
wds = [] #weight decay
iter_count = 600

# WEIGHT DECAY = 1e-6
learner.lr_find(wd=1e-6, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-6')
learner = getLearner()  # reset the learner - this gives more consistent starting conditions

# WEIGHT DECAY = 1e-4
learner.lr_find(wd=1e-4, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-4')
learner = getLearner() #reset learner - this gets more consistent starting conditions

# WEIGHT DECAY = 1e-2
learner.lr_find(wd=1e-2, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-2')
learner = getLearner() #reset learner
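
# The recorded curves can now be compared on a single log-x plot, as the original notebook
# did (its plotting cells were removed here) - a minimal matplotlib sketch:
import matplotlib.pyplot as plt
for lr_curve, loss_curve, wd_label in zip(lrs, losses, wds):
    plt.plot(lr_curve, [float(l) for l in loss_curve], label='wd=' + wd_label)
plt.xscale('log')
plt.xlabel('learning rate')
plt.ylabel('loss')
plt.legend()
plt.show()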

max_lr = 2e-2
wd = 1e-4

# Train the head with the 1cycle policy. (This fit call is absent from the trimmed script;
# without it, stage 1 would be saved untrained. It is restored here with cyc_len=8,
# following the original notebook.)
learner.fit_one_cycle(cyc_len=8, max_lr=max_lr, wd=wd)

interp = ClassificationInterpretation.from_learner(learner)

learner.save(MODEL_PATH + '_stage1')


# Fine-tune the baseline model (with lower learning rates)
# Next, we can unfreeze all trainable parameters of the model and continue training.
# The model performs well already, and the unfrozen lower layers were pre-trained on a large
# set of generic images to detect common shapes and patterns, so the weights mostly need only
# small adjustments. We should therefore train with much lower learning rates.

# load the baseline model
learner.load(MODEL_PATH + '_stage1')

# unfreeze the trainable parameters and run the learning rate finder again
learner.unfreeze()
learner.lr_find(wd=wd)

# Now, smaller learning rates. This time we define the min and max lr of the cycle
learner.fit_one_cycle(cyc_len=12, max_lr=slice(4e-5,4e-4))


interp = ClassificationInterpretation.from_learner(learner)

# save as the stage-2 model
learner.save(MODEL_PATH + '_stage2')



# Validation and analysis
preds,y, loss = learner.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The accuracy is {0} %.'.format(acc * 100))


# ROC and AUC: remember that AUC is the metric used to score submissions. We can compute it
# on the validation set here, but it will most likely differ from the final score.

from sklearn.metrics import roc_curve, auc
# class-1 probabilities from the log predictions (exponentiate the log output of get_preds above)
probs = np.exp(preds[:,1])
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y, probs, pos_label=1)

# Compute the ROC area (AUC)
roc_auc = auc(fpr, tpr)
print('ROC area is {0}'.format(roc_auc))

# expect roughly 0.99 here


# TTA (test-time augmentation)
# To evaluate the model, we run inference on all test images. Since our loader augments at
# test time as well, predicting each image several times and averaging the results may
# improve the score.

# make sure the best-performing model stage is loaded
learner.load(MODEL_PATH + '_stage2')

# Fastai has a TTA function, but we don't want the additional augmentations it applies on top
# (our image loader already augments), so we just use get_preds:
#preds_test,y_test=learner.TTA(ds_type=DatasetType.Test)

# We do a fair number of iterations to cover different combinations of flips and rotations,
# then average the predictions.

n_aug = 12
preds_n_avg = np.zeros((len(learner.data.test_ds.items),2))
for n in tqdm(range(n_aug), 'Running TTA...'):
    preds,y = learner.get_preds(ds_type=DatasetType.Test, with_loss=False)
    preds_n_avg = np.sum([preds_n_avg, preds.numpy()], axis=0)
preds_n_avg = preds_n_avg / n_aug

# Next, reduce the class probabilities to just the tumor-class probability
print('Negative and Tumor Probabilities: ' + str(preds_n_avg[0]))
tumor_preds = preds_n_avg[:, 1]
print('Tumor probability: ' + str(tumor_preds[0]))
# If we wanted the predicted class instead, argmax would give the index of the max
class_preds = np.argmax(preds_n_avg, axis=1)
classes = ['Negative','Tumor']
print('Class prediction: ' + classes[class_preds[0]])

# Submission: for each test sample, output a probability (0-1) that it contains cancer

# get the test ids from sample_submission.csv and keep their original order
SAMPLE_SUB = '/kaggle/input/sample_submission.csv'  # adjust to your environment
sample_df = pd.read_csv(SAMPLE_SUB)
sample_list = list(sample_df.id)

# List of tumor preds.
# These are in the order of our test dataset and not necessarily in the same order as in sample_submission
pred_list = [p for p in tumor_preds]

# To know the id's, we create a dict of id:pred
pred_dic = dict((key, value) for (key, value) in zip(learner.data.test_ds.items, pred_list))

# Now, we can create a new list with the same order as in sample_submission
pred_list_cor = [pred_dic[test_path + id + '.tif'] for id in sample_list]  # keys must match test_ds.items; adjust if your paths differ

# Next, a Pandas dataframe with id and label columns.
df_sub = pd.DataFrame({'id':sample_list,'label':pred_list_cor})

# Export to csv
df_sub.to_csv('{0}_submission.csv'.format(MODEL_PATH), header=True, index=False)
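
# A quick look at the submission file contents before uploading:
print(df_sub.head())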

8. Results

Source: https://blog.csdn.net/odssodssey/article/details/120581303
