维基百科随机采样句子数据集-jeffheaton
数据来源:互联网公开数据
标签:维基百科,句子采样,句子对比,文本分析,自然语言处理,教育,研究
数据概述:
本数据集包含从维基百科随机采样的句子,旨在用于句子对比分析。数据集中的句子来自于不同的文章,部分句子来自同一篇文章,而另一些则来自不同的文章。数据集通过对比句子间的主题相关性,帮助用户识别句子是否来自同一篇文章。
数据用途概述:
该数据集适用于自然语言处理研究、文本分析、教育训练等场景。研究人员可以利用此数据集进行句子相似性分析,训练文本分类和对比模型;教育工作者可将其用于语言学教学,帮助学生理解句子结构和主题相关性。此外,该数据集也可用于开发和评估文本处理算法,如句子匹配、主题识别等。
举例1:
这些两个句子来自同一文章:
There were 2,788 housing units at an average density of 4 per squaremile (2/km).
It is also home to the Oklahoma State Reformatory, located in Granite.
这些两个句子也是来自同一文章:
Monument of the Judiciary Citadel of Salerno, near the Colle Bellara.
The La Carnale Castle got his name from a medieval battle against the Arabs and is part of a sport complex (with pool, tennis courts and hockey).
这些两个句子同样来自同一文章:
The idea of Haar measure is to take a sort of limit of as becomes smaller to make it additive on all pairs of disjoint compact sets, though it first has to be normalized so that the limit is not just infinity.
When left and right Haar measures differ, the right measure is usually preferred as a prior distribution.
举例2:
这些两个句子来自不同文章:
US Open womens doubles champion France Ranked world No.
The average household size was 2.72 and the average family size was 3.19.
这些两个句子也是来自不同文章:
The initial goal of the WordNet project was to build a lexical database that would be consistent with theories of human semantic memory developed in the late 1960s.
Males had a median income of $25,625 versus $20,515 for females.
这些两个句子同样来自不同文章:
Meanwhile, Western foods which are rich in fat, salt, sugar, and refined starches are also imported into countries.
According to the United States Census Bureau, the CDP has a total area of , of which is land and (3.61%) is water.
免责声明:
请注意,我们尝试移除包含争议性词汇或仇恨言论的数据。然而,由于维基百科中存在此类内容,因此在本数据集中可能仍存在一些争议性材料。由于数据集规模庞大,无法对每个句子进行人工审核。