
TensorFlow 2.1.0: Saving and Reading Text Data with TFRecord

July 3, 2020 | 移动技术网 IT编程

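Preface:

Last time I wrote down how to use TFRecord to store images and labels. Since then I have worked on some NLP tasks and tried TF 2.1.0, so this post records how to save and read text data with TFRecord.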

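Preparation:

The raw text is not written to the TFRecord as-is (a TFRecord could hold it only as bytes, which a model cannot consume directly), so it has to be preprocessed first: word segmentation, stop-word removal, building a vocabulary, and converting the text into vocabulary indices. The index values are then written to the TFRecord.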
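The preprocessing steps listed above can be sketched in plain Python. This is only a minimal illustration, not the pipeline used in this post: the stop-word set, the sample sentences, and the helper names `preprocess`, `build_vocab`, and `to_indices` are all hypothetical, and splitting on whitespace assumes the text has already been segmented.

```python
from collections import Counter

# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"the", "a", "is"}


def preprocess(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each token to an index, most frequent first; 0 stays free for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}


def to_indices(tokens, vocab):
    """Convert tokens to vocabulary indices, skipping unknown tokens."""
    return [vocab[tok] for tok in tokens if tok in vocab]


corpus = ["The model is small", "A small model trains fast"]  # made-up sample texts
tokenized = [preprocess(text) for text in corpus]
vocab = build_vocab(tokenized)
indexed = [to_indices(tokens, vocab) for tokens in tokenized]
```

Index 0 is left unused so that it can serve as the padding value when sequences are batched later.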

TFRecord

The training set and the validation set are first built as two DataFrames, and each text has two labels. The Keras Tokenizer maps words to vocabulary indices, and each label is converted to its label index.
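The post does not show how the label dictionaries (`industry_dict` and `use_dict` in the code) are built. One plausible way, shown here as a sketch under that assumption, is to enumerate the distinct label values. The helper `build_label_dict` and the sample labels are hypothetical.

```python
def build_label_dict(labels):
    """Assign a stable integer index to every distinct label (sorted for determinism)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


# Made-up label column; in the post this would come from a DataFrame column.
industry_labels = ["finance", "retail", "finance", "energy"]
industry_dict = build_label_dict(industry_labels)
label_ids = [industry_dict[label] for label in industry_labels]
```

Sorting the labels keeps the mapping identical between runs, which matters because the same dictionary has to be used for both the training and the validation set.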

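Unlike saving images, saving text indices requires tf.train.FeatureLists() inside a tf.train.SequenceExample. Each index in the text is first wrapped in an Int64List Feature, and the whole document is then stored as one FeatureList inside FeatureLists.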
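To make the nesting concrete, here is a small sketch that builds one SequenceExample in memory, serializes it, and reads the protobuf back. The token ids, the label value, and the feature names `label` and `text` are made up for illustration.

```python
import tensorflow as tf

token_ids = [12, 7, 305]  # made-up vocabulary indices for one document

# One Feature per token, each holding a single-element Int64List.
token_features = [
    tf.train.Feature(int64_list=tf.train.Int64List(value=[idx]))
    for idx in token_ids
]

example = tf.train.SequenceExample(
    # Fixed-size data (here a single label) lives in the context.
    context=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
    }),
    # Variable-length data lives in a FeatureList inside FeatureLists.
    feature_lists=tf.train.FeatureLists(feature_list={
        "text": tf.train.FeatureList(feature=token_features),
    }),
)

# Round trip through the serialized bytes to inspect the structure.
restored = tf.train.SequenceExample.FromString(example.SerializeToString())
restored_ids = [
    feature.int64_list.value[0]
    for feature in restored.feature_lists.feature_list["text"].feature
]
restored_label = restored.context.feature["label"].int64_list.value[0]
```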

import tensorflow as tf
from tqdm import tqdm

# train_pd, valid_pd, tokenizer, industry_dict and use_dict are assumed to
# exist already (built as described above).


def int64_feature(value):
    """Wrap a single integer in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def write_tfrecord(df, path):
    """Serialize every row of df as a SequenceExample and write it to path."""
    with tf.io.TFRecordWriter(path) as writer:
        for _, data in tqdm(df.iterrows(), total=len(df)):
            # The text is already segmented and space-separated; map tokens to indices.
            token_ids = tokenizer.texts_to_sequences([data['content with title'].split(' ')])[0]

            exam = tf.train.SequenceExample(
                # The two fixed-size labels go into the context.
                context=tf.train.Features(feature={
                    'industry_label': int64_feature(industry_dict[data['industry_label']]),
                    'use_label': int64_feature(use_dict[data['use_label']]),
                }),
                # The variable-length token sequence goes into the feature lists.
                feature_lists=tf.train.FeatureLists(feature_list={
                    'text': tf.train.FeatureList(feature=[int64_feature(idx) for idx in token_ids]),
                }),
            )
            writer.write(exam.SerializeToString())


write_tfrecord(train_pd, './train_data_content_with_title')
write_tfrecord(valid_pd, './valid_data_content_with_title')

Reading:

When reading, the labels and the text indices are likewise parsed as two separate parts.
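The two-part parsing can be tried on a single in-memory record before touching any file. The sketch below assumes eager execution (the TF 2.x default). The feature names `label` and `text` and the values are made up, and `int64_feature` is a hypothetical helper.

```python
import tensorflow as tf


def int64_feature(value):
    """Hypothetical helper: wrap one integer in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# Build one serialized record with a label (context) and three token ids (sequence).
serialized = tf.train.SequenceExample(
    context=tf.train.Features(feature={"label": int64_feature(1)}),
    feature_lists=tf.train.FeatureLists(feature_list={
        "text": tf.train.FeatureList(feature=[int64_feature(i) for i in [5, 8, 2]]),
    }),
).SerializeToString()

# The context and the sequence each get their own feature specification.
context, sequence = tf.io.parse_single_sequence_example(
    serialized=serialized,
    context_features={"label": tf.io.FixedLenFeature([], dtype=tf.int64)},
    sequence_features={"text": tf.io.FixedLenSequenceFeature([], dtype=tf.int64)},
)

label = int(context["label"].numpy())
tokens = sequence["text"].numpy().tolist()
```

The call returns two dictionaries: one for the context features (scalars here) and one for the sequence features (a 1-D tensor whose length equals the number of tokens).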

import tensorflow as tf

MAX_LEN = 110  # every text is padded to this many tokens

train_reader = tf.data.TFRecordDataset('./train_data_content_with_title')
valid_reader = tf.data.TFRecordDataset('./valid_data_content_with_title')

# The labels were written to the context, the token indices to the feature lists.
context_features = {
    "industry_label": tf.io.FixedLenFeature([], dtype=tf.int64),
    "use_label": tf.io.FixedLenFeature([], dtype=tf.int64)
}
sequence_features = {
    "text": tf.io.FixedLenSequenceFeature([], dtype=tf.int64),
}

def parse_function(serialized_example):
    context_parsed, sequence_parsed = tf.io.parse_single_sequence_example(
        serialized=serialized_example,
        context_features=context_features,
        sequence_features=sequence_features
    )
    industry_label = context_parsed['industry_label']
    use_label = context_parsed['use_label']
    # Truncate to MAX_LEN: padded_batch raises an error if a sequence is longer
    # than its padded shape. (Added as a safeguard; the original code assumed
    # that no text exceeds 110 tokens.)
    text = sequence_parsed['text'][:MAX_LEN]
    return text, industry_label, use_label


def make_dataset(reader):
    # Shuffle the serialized records, parse them, then pad each batch to MAX_LEN.
    return (
        reader
        .shuffle(1280, reshuffle_each_iteration=True)
        .map(parse_function)
        .padded_batch(256, padded_shapes=([MAX_LEN], [], []))
    )

train_dataset = make_dataset(train_reader)
valid_dataset = make_dataset(valid_reader)
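What padded_batch does to variable-length sequences is easier to see on a toy dataset. The sketch below is not tied to the files above: it generates three sequences of lengths 1, 2, and 3 and pads them to a fixed length of 4, with eager execution assumed.

```python
import tensorflow as tf

# Three variable-length sequences: [0], [0, 1], [0, 1, 2].
sequences = tf.data.Dataset.range(1, 4).map(lambda n: tf.range(n))

# Pad every sequence with zeros (the default padding value) up to length 4.
batch = next(iter(sequences.padded_batch(3, padded_shapes=[4])))
padded = batch.numpy().tolist()
```

Because the default padding value is 0, it helps if index 0 is not assigned to a real word in the vocabulary. The Keras Tokenizer starts its word indices at 1, which fits this convention.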

Original article: https://blog.csdn.net/ZJRN1027/article/details/107079071
