Research on Structure Function Recognition of Academic Text Based on Multi-level Fusion

  • Wang Jiamin,
  • Lu Wei,
  • Liu Jiawei,
  • Cheng Qikai
  • 1. School of Information Management, Wuhan University, Wuhan 430072;
    2. Information Retrieval and Knowledge Mining Laboratory, Wuhan University, Wuhan 430072

Received date: 2018-09-13

Revised date: 2019-01-25

Online published: 2019-07-05

Abstract

[Purpose/significance] The structure function of an academic text summarizes the structure of the text and the function of each of its sections. Few existing studies consider the fusion of the multi-level structure of academic text, and traditional methods usually rely on manual experience to build rules or features. After analyzing the multi-level structure of academic text, we construct a structure function recognition model based on multi-level fusion. [Method/process] We use an academic text dataset from ScienceDirect for the experiments. First, we apply deep learning algorithms to identify the structure function of academic text at different levels. Then we employ a voting method to fuse the results from the different levels and models. [Result/conclusion] The results show that performance improves to varying degrees after fusion. The precision, recall and F1 value of the fused results reach 86%, 84% and 84%, respectively. Compared with the traditional machine learning algorithm SVM, the deep learning algorithms perform better on the task of academic text classification. Finally, we analyze the misclassifications of the structure function of academic text and point out potential application fields and future research directions.
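The voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting scheme, so this assumes simple plurality voting with ties broken by source order; the source names, the labels, and the functions `majority_vote` and `fuse` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the votes for one section.

    Assumption: ties go to the label that appears earliest in
    `predictions`, i.e. to the source listed first.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def fuse(per_source_predictions):
    """Fuse label sequences from several levels/models, section by section.

    `per_source_predictions` maps a (hypothetical) source name to a list of
    predicted labels, one per section, all lists having the same length.
    """
    sources = list(per_source_predictions.values())
    return [majority_vote(list(votes)) for votes in zip(*sources)]


# Hypothetical predictions for three sections from three sources.
example = {
    "title_level": ["introduction", "method", "result"],
    "content_level": ["introduction", "result", "result"],
    "paragraph_level": ["related work", "method", "result"],
}
fused = fuse(example)
```

Here each section receives the label that most sources agree on; a weighted vote (for example, weighting each source by its validation F1) would be a natural variation, but whether the authors used one is not stated in the abstract.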

Cite this article

Wang Jiamin, Lu Wei, Liu Jiawei, Cheng Qikai. Research on Structure Function Recognition of Academic Text Based on Multi-level Fusion[J]. Library and Information Service, 2019, 63(13): 95-104. DOI: 10.13266/j.issn.0252-3116.2019.13.010
