Review and Commentary

The Technical Framework and Research Progress of Knowledge Discovery in Academic Figures and Tables

  • Ding Pei
  • Shenzhen University Library, Shenzhen 518060
Ding Pei, librarian, PhD candidate, E-mail: peid@szu.edu.cn.

Received date: 2021-06-08

  Revised date: 2021-09-12

  Online published: 2021-12-18

Funding

This paper is one of the research outcomes of the Guangdong Province Philosophy and Social Sciences Planning discipline co-construction project "Research on Linking In-Text Data and Literature to Support Deep Knowledge Discovery" (project No. GD18XTS07).

Abstract

[Purpose/significance] Against the background of deep integration of scientific and technological resources, knowledge discovery in academic figures and tables offers a new mode of knowledge discovery beyond text-based knowledge discovery. It is an important part of completing literature knowledge discovery: it can improve researchers' efficiency in scientific discovery and knowledge creation and promote the upgrading of knowledge services in digital libraries. [Method/process] This paper traces the evolution of knowledge discovery in academic figures and tables, examines its technical framework in detail, and shows that the underlying technology is gradually maturing. Drawing on existing application services for academic figures and tables, it argues that this form of knowledge discovery has broad room for application in many aspects of scientific and technological innovation. [Result/conclusion] Looking ahead, we need to: attach importance to knowledge discovery in academic figures and tables and integrate it into the literature knowledge discovery system; improve the semantic knowledge organization system for academic figures and tables and build a dedicated semantic knowledge base for them; and develop new knowledge discovery applications for academic figures and tables.

Cite this article

Ding Pei. The Technical Framework and Research Progress of Knowledge Discovery in Academic Figures and Tables[J]. Library and Information Service (图书情报工作), 2021, 65(23): 136-148. DOI: 10.13266/j.issn.0252-3116.2021.23.015
