Attentional Factorization Machine（AFM）复现笔记

2021-12-30 22:32:14 阅读：231 来源： 互联网

标签：__ AFM nn self Factorization Machine att sparse

声明：本模型复现笔记记录自己学习过程，如果有错误请各位老师批评指正。

之前学习了很多关于特征交叉的模型比如Wide&Deep、Deep&Cross、DeepFM、NFM。
对于特征工程的特征交叉操作，这些模型已经做的非常好了，模型进一步提升的空间已经很小了，所以很多研究者继续探索更多"结构"上的尝试，比如注意力机制、序列模型、强化学习等在其他领域表现得很好的模型结构也逐渐进入推荐系统。

本周就学习了注意力机制在推荐系统中的应用。

注意力机制不懂的同学学一下李沐老师的注意力机制这一节课

AFM的贡献在于，将注意力机制引入到特征交叉的模块。回顾NFM，NFM在Bi-Interaction Layer中所有交叉项的权重都是一样的，而AFM在对二阶交叉特征进行加和池化（Sum Pooling）操作时，是使用注意力机制得到的注意力权重为所有交叉项分配权重。

NFM相当于“一视同仁”地对待所有交叉特征，不考虑不同特征对结果的影响程度，其实NFM浪费了很多有价值的信息。
注意力机制就可以解决这个“一视同仁”的问题。

举个例子：如果我们要预测一位男性用户是否会购买篮球鞋的可能性，那么“性别为男性 & 购买历史买过篮球”这一交叉特征，很可能比“性别为男性 & 用户年龄为25”这一交叉特征更重要，所以我们有必要在模型中投入更多的“注意力”，为那些比较重要的交叉特征分配更多的权重，为那些不是很重要的交叉特征分配更少的权重。

AFM公式：

在这里插入图片描述

乍一看很像FM模型，多了P向量和权重a i,j , 其实他就是对二阶交叉特征做了一个重要性选择。

另一个角度也是NFM的升级版，优化为把NFM的Bi-Interaction Pooling Layer的Sum Pooling（图1）修改为相应位置注意力权重（分数）的加权和（图2）。
在这里插入图片描述
图1

在这里插入图片描述
图2

两两特征交叉计算公式
注意力机制加权公式

设计模型

模型我还是设计的双塔模型，做了两个，一个是没有DNN的AFM，另一个是加了DNN的AFM，按照理论来说，后者要比前者的效果要好很多，但我用小数据训练出来后者要比前者的效果要差。（我不用大的数据集原因是GPU只有4G，容易卡掉）。

左塔为线性模型
在这里插入图片描述
右塔为带注意力的多层结构

不带DNN的AFM

在这里插入图片描述

#线性模型
class Linear(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Linear,self).__init__()
        self.Linear = nn.Linear(input_dim, output_dim)
    def forward(self, x):
        output = self.Linear(x)
        return output
#注意力模型
class Attention_layer(nn.Module):
    def __init__(self, att_units):
        #att_units[embed_dim, att_vector]
        super(Attention_layer, self).__init__()
        
        self.att_w = nn.Linear(att_units[0], att_units[1])#W和b
        self.att_dense = nn.Linear(att_units[1], 1)#h
        
    def forward(self, bi_interaction):
        a = self.att_w(bi_interaction)
        a = F.relu(a)
        att_scores = self.att_dense(a)
        att_weight = F.softmax(att_scores, dim = 1)
        
        att_out = torch.sum(att_weight * bi_interaction ,dim = 1)
        return att_out
#AFM模型
class AFM(nn.Module):
    def __init__(self, feature_columns, att_vector = 8, dropout = 0.):
        super(AFM, self).__init__()

        self.dense_feature_cols, self.sparse_feature_cols = feature_columns

        #Embedding
        self.embed_layers = nn.ModuleDict({
            'embed_'+str(i):nn.Embedding(num_embeddings=feat['feat_num'],embedding_dim=feat['embed_dim'])
            for i, feat in enumerate(self.sparse_feature_cols)
        })

        self.attention = Attention_layer([self.sparse_feature_cols[0]['embed_dim'], att_vector])
        self.fea_num = att_vector
        self.linear = Linear(len(self.dense_feature_cols) , self.fea_num )
        self.final_linear = nn.Linear(self.fea_num, 1)

    def forward(self, x):
        dense_inputs, sparse_inputs = x[:,:len(self.dense_feature_cols)], x[:,len(self.dense_feature_cols):]
        sparse_inputs = sparse_inputs.long()
        sparse_embeds = [self.embed_layers['embed_'+str(i)](sparse_inputs[:,i]) for i in range(sparse_inputs.shape[1])]
        sparse_embeds = torch.stack(sparse_embeds)
        sparse_embeds = sparse_embeds.permute((1, 0, 2))

        first = []
        second = []
        for f,s in itertools.combinations(range(sparse_embeds.shape[1]), 2): 
            first.append(f)
            second.append(s)
        
        p = sparse_embeds[:,first,:]
        q = sparse_embeds[:,second,:]
        #两两交叉池化层
        bi_interaction = p * q
        #注意力机制池化层                 
        att_out = self.attention(bi_interaction)
        #线性层
        li_output = self.linear(dense_inputs)
        #注意力机制池化层+线性层送入最终线性层
        outputs = torch.add(att_out, li_output)
        
        outputs =  F.sigmoid(self.final_linear(outputs))
        
        return outputs

效果：batch_size= 256，加入了L2正则化（0.0001），dropout（0.1）
在这里插入图片描述

加了DNN的AFM

加了DNN的AFM跑的效果不好，个人感觉参数没调好。
在这里插入图片描述

#线性模型
class Linear(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Linear,self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
    def forward(self, x):
        output = self.linear(x)
        return output
#DNN模型
class Dnn(nn.Module):
    def __init__(self, hidden_units ,dropout=0.):
        super(Dnn, self).__init__()
        self.dnn_network = nn.ModuleList([nn.Linear(layer[0],layer[1]) for layer in list(zip(hidden_unit[:-1],hidden_units[1:]))])
        self.dropout = nn.Dropout(p = dropout)
        
    def forward(self , x):
        
        for layer in self.dnn_network:
            x = layer(x)
            x = F.relu(x)
        x = self.dropout(x)
        
        return x
#注意力模型
class Attention_layer(nn.Module):
    def __init__(self, att_units):
        
        #att_units:[embed_dim, att_vector]
        
        super(Attention_layer, self).__init__()
        
        self.att_w = nn.Linear(att_units[0], att_units[1])#W和b
        self.att_dense = nn.Linear(att_units[1], 1)#h
        
    def forward(self, x):
        a = self.att_w(x)
        a = F.relu(a)
        att_scores = self.att_dense(a)
        att_weight = F.softmax(att_scores, dim = 1)
        att_out = torch.sum(att_weight * x ,dim = 1)
        return att_out
#AFM模型
class AFM(nn.Module):
    def __init__(self, feature_columns, hidden_unit, att_vector = 8, dropout = 0.):
        super(AFM, self).__init__()

        self.dense_feature_cols, self.sparse_feature_cols = feature_columns

        #Embedding
        self.embed_layers = nn.ModuleDict({
            'embed_'+str(i):nn.Embedding(num_embeddings=feat['feat_num'],embedding_dim=feat['embed_dim'])
            for i, feat in enumerate(self.sparse_feature_cols)
        })

        self.attention = Attention_layer([self.sparse_feature_cols[0]['embed_dim'], att_vector])
        self.fea_num = self.sparse_feature_cols[0]['embed_dim']
        hidden_unit.insert(0, self.fea_num)
        self.dnn_network = Dnn(hidden_unit, dropout)
        self.Linear = Linear(len(self.dense_feature_cols), att_vector)
        self.final_linear = nn.Linear(hidden_unit[-1], 1)

    def forward(self, x):
        dense_inputs, sparse_inputs = x[:,:len(self.dense_feature_cols)], x[:,len(self.dense_feature_cols):]
        sparse_inputs = sparse_inputs.long()
        sparse_embeds = [self.embed_layers['embed_'+str(i)](sparse_inputs[:,i]) for i in range(sparse_inputs.shape[1])]
        sparse_embeds = torch.stack(sparse_embeds)
        sparse_embeds = sparse_embeds.permute((1, 0, 2))

        first = []
        second = []
        for f,s in itertools.combinations(range(sparse_embeds.shape[1]), 2): 
            first.append(f)
            second.append(s)
        
        p = sparse_embeds[:,first,:]
        q = sparse_embeds[:,second,:]
        
        #两两特征交叉层[256,325,8] 
        bi_interaction = p * q
        #注意力机制池化层 [256,8]                
        att_out = self.attention(bi_interaction)
        #DNN层
        dnn_output = self.dnn_network(att_out)
        #线性层
        li_output  = self.Linear(dense_inputs)
        #输出层
        outputs =  F.sigmoid(self.final_linear(torch.add(dnn_output,li_output)))
        
        return outputs

效果：batch_size= 256，加入了L2正则化（0.00013），dropout（0.23）

在这里插入图片描述

模型对比：

对比Wide&Deep、DeepFM、NFM、AFM、AFM_Deep

小的cretio数据集（选取了train 20000， test 5000）
在这里插入图片描述
较大的cretio数据集（选取了train 480000， test 120000）

同系列模型NFM与AFM模型，AFM确实比NFM效果要好。
对于AFM与AFM_Deep来说，带有深度神经网络的AFM（AFM_Deep）效果要比不带DNN的AFM效果要好，但是该实验中AFM_Deep跑的并不好，如果模型没问题的话，个人感受是参数调的不好。

标签：__,AFM,nn,self,Factorization,Machine,att,sparse
来源： https://blog.csdn.net/weixin_43667611/article/details/122238446

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9

Attentional Factorization Machine（AFM）复现笔记

AFM公式：

设计模型

不带DNN的AFM

加了DNN的AFM

模型对比：