On December 18, 2019, the Institute for Global Public Policy (IGPP) held the first Workshop on Global Public Policy. Mr. Seokkyun Woo, a PhD student in the School of Public Policy at the Georgia Institute of Technology, delivered a lecture entitled "Using Machine Learning and Natural Language Processing for Science Policy Studies." The lecture was moderated by Dr. Yin Li of the School of International Relations & Public Affairs.
Mr. Seokkyun Woo began by explaining why machine learning (ML) and natural language processing (NLP) should be used in social science research. In his view, the volume of human interaction, communication, and culture expressed through digital texts has grown so much that these texts have become an important source of research data. For example, we can use the texts of product reviews to study consumer behavior, use textual information to predict macroeconomic trends, or investigate knowledge production using the text of a large number of academic publications. However, unlike the numerical variables we are used to working with, text is high-dimensional. To use texts for this kind of research, we need appropriate methods for handling text data. Fortunately, NLP can help us measure or classify text data quantitatively, thereby providing new data sources for subsequent causal inference analyses.
Subsequently, Mr. Seokkyun Woo introduced ML and NLP. He presented a recent study entitled "On the Shoulders of Fallen Giants: Understanding Scientists’ Behaviors Through Post-retraction Citations." The study starts from the observation that even after scientific papers are retracted, hundreds of studies still cite them as evidence. Drawing on the existing literature, Mr. Seokkyun Woo offered possible reasons for this and reviewed how citation behavior can be understood from the perspectives of scientific normative structure, social construction, and perfunctory citations. He proposed two research questions: first, do post-retraction citations come from domains that are at a greater cognitive distance from the domain to which the retracted paper belongs; and second, if cognitive distance matters, can this sloppy citation behavior be explained by scientists’ ignorance?
Based on these two research questions, Mr. Seokkyun Woo proposed his hypotheses. The novelty of this research is its use of knowledge distance to measure the relevance between the cited reference and the citing paper. The knowledge distance is measured with NLP, specifically with word embedding methods. A key concept in NLP, word embedding transforms a word into a vector for mathematical processing and can be used to generate a vector representing a whole text. Cosine similarity between these vectors is used to measure the cognitive distance between the cited text and the citing text. The size of the domains, calculated using MeSH methods, is included as a control. The results suggest that cognitive distance has a significantly positive effect on post-retraction citations, and that this effect is significant for publications in journals with high impact factors. This finding indicates that the greater the cognitive distance between the cited reference and the citing paper, the higher the probability that the retracted paper is cited after retraction.
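The idea of measuring distance between two texts with word embeddings and cosine similarity can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the three-dimensional word vectors below are invented for illustration, averaging word vectors is only one common way to obtain a text vector, and the function names are hypothetical. A real analysis would use vectors from a trained embedding model.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# A trained embedding model would supply vectors with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "gene": [0.9, 0.1, 0.0],
    "protein": [0.8, 0.2, 0.1],
    "cell": [0.7, 0.3, 0.0],
    "market": [0.0, 0.2, 0.9],
    "price": [0.1, 0.1, 0.8],
}


def text_vector(text, embeddings):
    """Represent a text as the average of the vectors of its known words."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("text contains no words with a known embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One simple distance derived from cosine similarity (an assumption here)."""
    return 1.0 - cosine_similarity(a, b)


cited = text_vector("gene protein", TOY_EMBEDDINGS)
near_citing = text_vector("protein cell", TOY_EMBEDDINGS)
far_citing = text_vector("market price", TOY_EMBEDDINGS)

print("distance to a citing text from a nearby domain:", cosine_distance(cited, near_citing))
print("distance to a citing text from a distant domain:", cosine_distance(cited, far_citing))
```

In this toy setting the citing text that shares vocabulary with the cited text ends up at a smaller distance than the one drawn from an unrelated vocabulary, which is the intuition behind using such a measure as a proxy for cognitive distance.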
At the end of the lecture, Mr. Seokkyun Woo summarized the study and discussed a range of topics with the audience.