Approaches to Text Mining that Preserve Semantic Content
Yasmin H. Said
George Mason University, Fairfax, United States

Text mining can be thought of as a synthesis of information retrieval, natural language processing, and statistical data mining. The set of documents being considered can scale to hundreds of thousands, and the associated lexicon can contain a million or more words. Information retrieval often focuses on the so-called vector space model, which can therefore involve very high-dimensional vector spaces. Analysis using the vector space model is done by consideration of a term-document matrix. However, the term-document matrix is essentially a bag-of-words approach: it records how often each word occurs in each document but discards word order, and so captures little semantic content. We have been exploring bigrams and trigrams as a method for capturing some semantic content, generalizing the term-document matrix to a bigram-document matrix. The cardinality of the set of bigrams is in general not as large as n^2; it is nonetheless usually considerably larger than n, where n is the number of words in the lexicon. I describe in this talk some of our recent work in this area.
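
To illustrate the contrast between the two matrices, here is a minimal sketch in Python. It is not the method presented in the talk: the whitespace tokenizer, the two-sentence toy corpus, and the helper names tokenize, ngrams, and ngram_document_matrix are all assumptions made for this example. It builds a term-document matrix (n-gram length 1) and a bigram-document matrix (n-gram length 2) from the same documents.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_document_matrix(docs, n):
    """Build a dense n-gram-by-document count matrix.

    Returns (vocab, matrix): vocab is a sorted list of n-grams and
    matrix[i][j] is the number of times vocab[i] occurs in docs[j].
    """
    counts = [Counter(ngrams(tokenize(d), n)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[g] for c in counts] for g in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the count vector for document j."""
    return [row[j] for row in matrix]


# Hypothetical toy corpus: same words, different word order.
docs = ["the dog bit the man", "the man bit the dog"]

terms, term_doc = ngram_document_matrix(docs, 1)
bigrams, bigram_doc = ngram_document_matrix(docs, 2)

# Bag of words cannot tell the two documents apart ...
same_under_terms = column(term_doc, 0) == column(term_doc, 1)
# ... while the bigram representation can.
same_under_bigrams = column(bigram_doc, 0) == column(bigram_doc, 1)

print(len(terms), len(bigrams), same_under_terms, same_under_bigrams)
```

In this toy corpus the lexicon has n = 4 words and the documents contain 5 distinct bigrams, which is more than n but well below n^2 = 16. The two documents have identical columns in the term-document matrix and different columns in the bigram-document matrix, which is the sense in which bigrams retain some of the content that a bag of words loses. A realistic corpus would call for a sparse matrix representation rather than the dense nested lists used here.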

Biography: Dr. Yasmin H. Said was a Ruth L. Kirschstein Research Professor from the National Institutes of Health at George Mason University and was a Visiting Fellow at the Isaac Newton Institute for Mathematical Sciences at the University of Cambridge in England. She earned her A.B. in pure mathematics, her M.S. in computer science and information systems, and her Ph.D. in computational statistics. Her work covers alcohol modeling, agent-based simulation modeling, social network analysis, and text, image, and data mining, as well as major public policy work aimed at minimizing negative acute outcomes, including HIV/AIDS, related to alcohol consumption. Dr. Said is also the Statistical Methodology Director of the Innovative Medical Institute, LLC, and Co-Director of the Center for Computational Data Sciences in the College of Science at George Mason University. She is the editor-in-chief of Wiley Interdisciplinary Reviews: Computational Statistics and editor of Computing Science and Statistics, and was an associate editor of the journal Computational Statistics and Data Analysis. She serves as the President of the Interface Foundation of North America and on its executive board of directors. She also serves on the executive board of the U.S. Army Conference on Applied Statistics and on its Scientific Steering Committee, on the executive board of the Washington Statistical Society, on the American Statistical Association Presidential Task Force on Science Policy, on the American Statistical Association Presidential Advisory Committee on Scientific and Public Affairs, and on the Executive Committee of the American Statistical Association Section on Defense and National Security.

She has published a book, Intervention to Prevention: A Policy Tool for Alcohol Studies. She is currently one of the editors-in-chief of the multivolume Wiley Encyclopedia of Computational. Dr. Said holds a patent on Policy Analysis and Action Decision Tool and has two patents pending: Automated Generation of Metadata and A Multimodal System Tool for Aiding Autonomous Discovery. Dr. Said is an elected member of the International Statistical Institute, the Research Society on Alcoholism, Sigma Xi, the Scientific Research Society, and the Washington Academy of Science.