LSTM Text Prediction

2022-04-29 15:32:38

Tags: prediction, int, char, predict, train, print, LSTM, text, data


LSTM

A plain RNN tends to forget information from earlier in a sequence; an LSTM can retain the relevant information.
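
To make the point about retention concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration only, not the Keras implementation used below: the helper names (`sigmoid`, `lstm_step`), the weight layout, and the gate ordering (input, forget, candidate, output) are assumptions chosen for this example. The key line is the cell-state update, where the forget gate decides how much of the previous cell state is carried forward.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout, not Keras internals).

    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,) bias.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate values
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c_t = f * c_prev + i * g     # cell state carries information across steps
    h_t = o * np.tanh(c_t)       # hidden state exposed to the next layer
    return h_t, c_t

# Run a few steps on random inputs (sizes are arbitrary for the demo).
rng = np.random.default_rng(0)
D, H = 5, 3
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(4, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```

When the forget gate is close to 1 and the input gate close to 0, `c_t` is almost identical to `c_prev`, which is how the cell can hold on to earlier information over many steps.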

code

# Load the data
with open("LSTM_text.txt") as f:
    data = f.read()
# Remove line breaks
data = data.replace("\n","").replace("\r","")
print(data)

# Collect the distinct characters (sorted so the index mapping is the same on every run)
letters = sorted(set(data))
print(letters)
num_letters = len(letters)
print(num_letters)


# Build the lookup dictionaries
int_to_char = {index: char for index, char in enumerate(letters)}
print(int_to_char)
char_to_int = {char: index for index, char in enumerate(letters)}
print(char_to_int)
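
The two dictionaries are inverses of each other, so encoding a string and decoding the result should return the original text. The sketch below checks this on a toy string; all `demo_*` names are hypothetical and exist only for this example. It builds the vocabulary with `sorted(set(...))`, because the iteration order of a set of strings can differ between Python runs (string hashing is randomized by default), which would silently change the index assigned to each character.

```python
demo_text = "hello world"

# Sorted vocabulary: the same text always yields the same mapping.
demo_letters = sorted(set(demo_text))
demo_char_to_int = {char: index for index, char in enumerate(demo_letters)}
demo_int_to_char = {index: char for index, char in enumerate(demo_letters)}

# Encode to integers, then decode back to text.
demo_encoded = [demo_char_to_int[char] for char in demo_text]
demo_decoded = "".join(demo_int_to_char[index] for index in demo_encoded)

print(demo_letters)   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(demo_encoded)   # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(demo_decoded)   # hello world
```

A stable mapping matters as soon as a trained model is reused in a later session, since the output units are tied to these indices.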

# Set the time step (length of the sliding window)
time_step  = 20


# Batch preprocessing of the character data
import numpy as np
from tensorflow.keras.utils import to_categorical
# Extract samples with a sliding window
def extract_data(data,slide):
    x = []
    y = []
    for i in range(len(data) - slide):
        x.append([a for a in data[i:i+slide]])
        y.append(data[i+slide])
    return x,y
# Convert characters to integers in bulk
def char_to_int_Data(x,y,char_to_int):
    x_to_int = []
    y_to_int = []
    for i in range(len(x)):
        x_to_int.append([char_to_int[char] for char in x[i]])
        # y[i] is a single character, so map it directly
        y_to_int.append(char_to_int[y[i]])
    return x_to_int,y_to_int

# Preprocess a whole text: takes the full string, the sliding-window size,
# the number of distinct characters, and the character-to-integer dictionary
def data_preprocessing(data,slide,num_letters,char_to_int):
    char_data = extract_data(data,slide)
    int_data = char_to_int_Data(char_data[0],char_data[1],char_to_int)
    Input = int_data[0]
    Output = int_data[1]
    Input_RESHAPED = np.array(Input).reshape(len(Input),slide)
    # One-hot encode every character: shape (samples, slide, num_letters)
    new = np.zeros((Input_RESHAPED.shape[0],Input_RESHAPED.shape[1],num_letters))
    for i in range(Input_RESHAPED.shape[0]):
        for j in range(Input_RESHAPED.shape[1]):
            new[i,j,:] = to_categorical(Input_RESHAPED[i,j],num_classes=num_letters)
    return new,Output
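
The preprocessing above is easier to follow on a tiny input. The sketch below applies the same sliding-window idea to a six-character string with a window of 3. It is a simplified stand-in rather than the functions above: `sliding_windows` and the `demo_*` names are hypothetical, and the one-hot step uses row indexing into `np.eye` in place of `to_categorical` so that the example needs only NumPy.

```python
import numpy as np

def sliding_windows(text, slide):
    """Return (windows, targets): each window of `slide` chars and the char after it."""
    windows = [text[i:i + slide] for i in range(len(text) - slide)]
    targets = [text[i + slide] for i in range(len(text) - slide)]
    return windows, targets

demo_text = "abcabc"
demo_vocab = sorted(set(demo_text))
demo_index = {char: i for i, char in enumerate(demo_vocab)}

demo_windows, demo_targets = sliding_windows(demo_text, 3)
print(demo_windows)  # ['abc', 'bca', 'cab']
print(demo_targets)  # ['a', 'b', 'c']

# Integer-encode each window, then one-hot encode by picking rows of the identity matrix.
demo_ids = np.array([[demo_index[char] for char in w] for w in demo_windows])
demo_one_hot = np.eye(len(demo_vocab), dtype=int)[demo_ids]
print(demo_one_hot.shape)  # (3, 3, 3) = (samples, window length, vocabulary size)
```

A text of length n yields n - slide samples, because the last window must still have a following character to serve as its target.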

# Extract X and y
X,y = data_preprocessing(data,time_step,num_letters,char_to_int)


print(X)


print(X.shape)
print(len(y))


from sklearn.model_selection import  train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=10)
print(X_train.shape,X_test.shape,X.shape)


y_train_category = to_categorical(y_train,num_letters)
print(y_train_category)
print(y)


# set up the model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,LSTM

model = Sequential()
# input_shape follows the samples: (time steps, number of distinct characters)
model.add(LSTM(units=20,input_shape=(X_train.shape[1],X_train.shape[2]),activation="relu"))

# Output layer: one unit per class, i.e. per distinct character
model.add(Dense(units=num_letters ,activation="softmax"))
model.compile(optimizer="adam",loss="categorical_crossentropy",metrics=["accuracy"])
model.summary()

# Train the model
model.fit(X_train,y_train_category,batch_size=1000,epochs=50)


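# Predict (predict_classes was removed in TensorFlow 2.6; take the argmax of the predicted probabilities instead)
y_train_predict = np.argmax(model.predict(X_train),axis=-1)
# Convert to text
y_train_predict_char = [int_to_char[i] for i in y_train_predict ]
print(y_train_predict_char)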
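
The softmax output layer returns one probability per character for every sample, so turning predictions into text means picking the most probable index in each row and looking it up in the dictionary. The sketch below shows that step on a hand-written probability matrix; the values and the `demo_*` names are made up for illustration and do not come from the trained model.

```python
import numpy as np

# Two samples, three classes: each row is a probability distribution.
demo_probs = np.array([[0.1, 0.7, 0.2],
                       [0.6, 0.3, 0.1]])
demo_int_to_char = {0: "a", 1: "b", 2: "c"}

# Index of the largest probability in each row.
demo_pred_ids = np.argmax(demo_probs, axis=-1)
demo_pred_chars = [demo_int_to_char[int(i)] for i in demo_pred_ids]

print(demo_pred_ids)    # [1 0]
print(demo_pred_chars)  # ['b', 'a']
```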


# Training accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_train,y_train_predict)
print(accuracy)


# Test-set accuracy
y_test_predict = np.argmax(model.predict(X_test),axis=-1)
accuracy_test = accuracy_score(y_test,y_test_predict)
print(accuracy_test)
y_test_predict_char = [int_to_char[i] for i in y_test_predict ]


new_letters = 'The United States continues to lead the world with more than '
X_new,y_new = data_preprocessing(new_letters,time_step,num_letters,char_to_int)
y_new_predict = np.argmax(model.predict(X_new),axis=-1)
print(y_new_predict)


y_new_predict_char = [int_to_char[i] for i in y_new_predict ]
print(y_new_predict_char)


# Print every window next to the character predicted to follow it
for i in range(X_new.shape[0]):
    print(new_letters[i:i+time_step],'--predict next letter is --',y_new_predict_char[i])

Reference link

https://gitee.com/nickdlk/python_machine_learning

Source: https://www.cnblogs.com/eat-too-much/p/16206827.html