Analytical Text Programs

Genre	Program Name	Developer	Citation	Description from website	Languages	URL
Semantic, Syntactic	Coh-Metrix	Graesser et al.	McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge, M.A.: Cambridge University Press.	Coh-Metrix is a system for computing computational cohesion and coherence metrics for written and spoken texts. Coh-Metrix allows readers, writers, educators, and researchers to instantly gauge the difficulty of written text for the target audience.	English, Chinese	http://cohmetrix.com/
Sentiment	LIWC	Pennebaker et al.	Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates.	LIWC2015 is the gold standard in computerized text analysis. Learn how the words we use in everyday language reveal our thoughts, feelings, personality, and motivations. Based on years of scientific research, LIWC2015 is more accurate, easier to use, and provides a broader range of social and psychological insights compared to earlier LIWC versions.	English, Chinese, Arabic, Spanish, Dutch, French, German, Italian, Russian and Turkish	https://liwc.wpengine.com/
Personality	IBM Watson	IBM	High, R. (2012). The era of cognitive systems: An inside look at IBM Watson and how it works. IBM Corporation, Redbooks.	Go beyond artificial intelligence with Watson. With Watson APIs and solutions, businesses are already achieving outcomes – from improving customer engagement, to scaling expertise, to driving innovation and growth.	Arabic, Catalan, Chinese, Danish, Dutch, English, Farsi, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish	https://www.ibm.com/watson/developercloud/personality-insights.html
Visual	Gephi	Bastian et al.	Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. ICWSM, 8, 361-362.	Gephi is the leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free. Runs on Windows, Mac OS X and Linux.	N/A	https://gephi.org/
Semantic, Syntactic	Semilar	Vasile Rus	Rus, V., Lintean, M., Banjade, R., Niraula, N., and Stefanescu, D. (2013). SEMILAR: The Semantic Similarity Toolkit. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, Sofia, Bulgaria.	The SEMILAR software environment offers users, researchers, and developers, easy access to fully-implemented semantic similarity methods in one place through both a GUI-based interface and a library.	English	http://www.semanticsimilarity.org/
Sentiment	SentiStrength	Thelwall et al.	Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558.	SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths: -1 (not negative) to -5 (extremely negative) and 1 (not positive) to 5 (extremely positive).	English, Finnish, German, Dutch, Spanish, Russian, Portuguese, French, Arabic, Polish, Persian, Swedish, Greek, Welsh, Italian, Turkish	http://sentistrength.wlv.ac.uk/
Personality	Profiler Plus	Michael Young	Young, M. D. (2001). Building worldview (s) with Profiler+. Progress in communication sciences, 17-32.	Social Science Automation works with clients in government, business, and academia who benefit from the speed and accuracy of automated text analysis. By using the Profiler Plus Text Coding Platform and applying various coding schemes, they are able to answer real world questions for their business or industry. Profiler Plus can help you: Track sentiment in traditional/social media for brands, companies, people, or nations. Make more fully informed decisions about new athletic talent for your team. Use language analysis to evaluate threats. Mitigate risks by tracking and evaluating event data. Understand different cultures and make multi-lingual comparisons about them. Profile leaders of countries, corporations or any other individuals in positions of power.	English, Arabic, Spanish, Russian, Chinese	https://profilerplus.org/
LDA	Stanford	Daniel Ramage & Evan Rosen	Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (pp. 248-256). Association for Computational Linguistics.	The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features the ability to: Import and manipulate text from cells in Excel and other spreadsheets. Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text. Select parameters (such as the number of topics) via a data-driven process. Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data.	All	https://nlp.stanford.edu/software/tmt/tmt-0.4/
LDA	R Packages	Jonathan Chang	Chang, J., & Chang, M. J. (2010). Package ‘lda’.	Implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel. Inference for all of these models is implemented via a fast collapsed Gibbs sampler written in C. Utility functions for reading/writing data typically used in topic models, as well as tools for examining posterior distributions are also included.	All	https://cran.r-project.org/web/packages/lda/lda.pdf
LDA	Mallet	Andrew McCallum	McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002.	Topic models provide a simple way to analyze large volumes of unlabeled text. A “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see for example Probabilistic Topic Models by Steyvers and Griffiths (2007).	All	http://mallet.cs.umass.edu/topics.php
LDA	Sprout	Zhiqiang Cai	Cai, Z., Graesser, A.C., Windsor, L., & Hu, X. (2017). Sprout: A semantic analysis tool for small corpora. Paper presented at the Forty-seventh Annual Meeting of the Society for Computers in Psychology, Vancouver; Cai, Z., Pennebaker, J. W., Eagan, B., Shaffer, D. W., Dowell, N. M., & Graesser, A. C. (2017). Epistemic network analysis and topic modeling for chat data from a collaborative learning environment. In X. Hu, T. Barnes, A. Hershkovitz, L. Paquette (Eds), Proceedings of the 10th International Conference on Educational Data Mining (pp. 104-111). Wuhan, China: EDM Society.	Topic modeling seed method	English, Chinese	N/A
Epistemic Network Analysis	ENA	David Shaffer	Shaffer, D.W., Collier, W., & Ruis, A.R. (2016). A tutorial on epistemic network analysis: Analyzing the structure of connections in cognitive, social, and interaction data. Journal of Learning Analytics, 3(3), 9–45.		English	http://www.epistemicnetwork.org/
Text-as-Data tools	Plot Mapper	Beauchamp	N/A	This maps the key words and the progression of the text (its “plot”) in a two-dimensional space.	N/A	http://www.nickbeauchamp.com/projects/plotmapper.php
Text-as-Data tools	AutoScale	Beauchamp	N/A	This automatically scales a set of documents and their words according to the major dimensions of variation, some of which are often ideological.	N/A	http://www.nickbeauchamp.com/projects/pcascale.php
Text-as-Data tools	QuickTopics	Beauchamp	N/A	This characterizes a set of documents by finding a latent set of topics, along with the words associated with those topics.	N/A	http://www.nickbeauchamp.com/projects/topicmodel.php
Text-as-Data tools	Gscale	Beauchamp	N/A	This uses Google search results to scale anything. You choose the search terms to scale (Senator names, e.g.) and two reference terms to scale them between (democrat, republican, e.g.), and gscale uses the Google text results from the latter two to scale the former.	N/A	http://www.nickbeauchamp.com/projects/gscale/gscale.php
Text-as-Data tools	Summarizer	Beauchamp	N/A	This “summarizes” a text by finding the N most representative sentences.	N/A	http://www.nickbeauchamp.com/projects/summarizer.php
Text-as-Data tools	Motifator	Beauchamp	N/A	This finds small clusters of related words (motifs) in the text.	N/A	http://www.nickbeauchamp.com/projects/motifator2.php
Text-as-Data tools	Text Predictor	Beauchamp	N/A	This generates more text in the style of the original text — although the syntax is a bit rough.	N/A	http://www.nickbeauchamp.com/projects/textpredictor3.php
Text-as-Data tools	Choose Your Own	Beauchamp	N/A	This is like Text Predictor, except you can choose which of the probable next words you want.	N/A	http://www.nickbeauchamp.com/projects/chooseyourown.php
Text-as-Data tools	Text Compare	Beauchamp	N/A	This allows you to compare a set of different texts, ranking their overall similarities in various ways.	N/A	http://www.nickbeauchamp.com/projects/textcompare.php
Text-as-Data tools	Cliché Score	Beauchamp	N/A	This counts “cliches,” loosely speaking. It tracks the appearance of common 4-word sequences in your text.	English	http://www.nickbeauchamp.com/projects/clichescore.php
Text-as-Data tools	Word Counter	Beauchamp	N/A	Counts words.	N/A	http://www.nickbeauchamp.com/projects/wordcounter.php
Text-as-Data tools	Democratic Writing	Beauchamp	N/A	This allows a group of participants to collectively write a text using a nomination and voting system that ensures strictly democratic equality. (Due to a change in PHP since this was written in 2002, existing projects may be viewed, but new ones cannot be created or modified.)	N/A	http://www.nickbeauchamp.com/DW_index.php
Text-as-Data tools	Events	Will Lowe	Lowe W. (2012) ‘events: Store and manipulate event data’. R package version 0.5, URL http://cran.r-project.org/web/packages/events/	Events is an R package to make life a bit easier for people who analyse event data (that’s the kind of thing that KEDS/TABARI generates as output e.g. here). There’s no fancy stuff, just a logical interface to all the data massaging we do to event data before any actual analysis. The package also bundles the CAMEO and WEIS event code schemes, plus their Goldstein numerical scalings, and some event data from the Bosnian conflict.	N/A	http://conjugateprior.org/software/events/
Text-as-Data tools	Austin	Will Lowe	Will Lowe. 2015. Austin: Do things with words. Version 0.2.2 URL http://github.org/conjugateprior/austin	Austin is an R package for doing things with words. Right now that means scaling them in the style of Wordscores and Wordfish.	N/A	http://conjugateprior.org/software/austin/
Text-as-Data tools	Yoshikoder	Will Lowe	Lowe W. (2015) ‘Yoshikoder: Cross-platform multilingual content analysis’. Java software version 0.6.5, URL http://www.yoshikoder.org	The Yoshikoder is a cross-platform multilingual content analysis program developed as part of the Identity Project at Harvard’s Weatherhead Center for International Affairs. You can load documents, construct and apply content analysis dictionaries, examine keywords-in-context, and perform basic content analyses, in any language.	N/A	http://conjugateprior.org/software/yoshikoder/
Text-as-Data tools	Jfreq	Will Lowe	Lowe W. (2011) ‘JFreq: Count words, quickly’. Java software version 0.5.4, URL http://www.conjugateprior.org/software/jfreq/	JFreq takes plain text documents and turns them into a word frequency matrix. It tries hard to be a) quick, and b) not take up much memory. It could be better at both, but it’s quite usable. The graphical version looks like this on a Mac with the online help open.	N/A	http://conjugateprior.org/software/jfreq/
Text-as-Data tools	YKConverter	Will Lowe	Lowe W. (2010) ‘YKConverter: Turn documents into texts’. Java software version 0.5, URL http://www.conjugateprior.org/software/ykconverter/	The YKConverter is a utility that tries to extract the text from documents in various formats (HTML, Word, PDF, Powerpoint, Excel) and save it as UTF-8 encoded text. You might do this to prepare for a subsequent content analysis.	N/A	http://conjugateprior.org/software/ykconverter/
Text-as-Data tools	Re-encoder	Will Lowe	Lowe W. (2012) ‘Re-encoder: Switch encodings’. Java software version 0.2, URL http://conjugateprior.org/software/reencoder/	Re-encoder takes a folder full of files, assumes that they are encoded plain text, and saves each one into a different encoding. You specify the ‘from’ and ‘to’ encodings. If you’re not sure, there’s a preview function where you can experiment with different encodings until your file looks right.	N/A	http://conjugateprior.org/software/reencoder/
Text-as-Data resource	Content Analysis in Python	Will Lowe	N/A	This page is currently not much more than an extended advertisement for doing content analysis in Python. In time it might expand to a full tutorial, should anyone express interest in reading one. In the meantime it’ll hopefully just whet your appetite. The scripts presented here are not intended to teach programming; I assume you have at least a vague idea about that already. Nor are they intended to exemplify fine coding style. The point is to show how easy things can be, if you pick the right tools. Now, to business…	N/A	http://conjugateprior.org/software/ca-in-python/
Text-as-Data tools	VBPro	Mark Miller	Miller, M. (1997). VBPro [computer software].	VBPro is Mark Miller’s bundle of content analysis programs. Although development of VBPro has ceased, the most important algorithmic details of each program are in the process of being open sourced. This should help researchers replicating and extending analyses that use the software. In the meantime, the original VBPro distribution will be available from here. Thanks to Mark Miller for making this possible.	N/A	http://www.mariapinto.es/ciberabstracts/Articulos/VBPro.htm
Text-as-Data tools	Nvivo	Ashley Castleberry	Castleberry, A. (2014). NVivo 10 [software program]. Version 10. QSR International; 2012.	If you want to get an edge by better understanding the explosion of unstructured data in the world today, you need NVivo – powerful software for qualitative data analysis. Whether you are working individually or in a team, on Windows or Mac, are new to research or have years of experience, there’s an NVivo option to suit you. Can you afford to miss the insights your data is trying to show you?	English, French, German, Japanese, Portuguese, Simplified Chinese, Simplified Spanish	http://www.qsrinternational.com/product
Sentiment	Tone Analysis	IBM	N/A	The IBM Watson™ Tone Analyzer service uses linguistic analysis to detect three types of tones from written text: emotions, social tendencies, and writing style. Users can use the Tone Analyzer service to analyze conversations and communications. Use the output to respond to customers appropriately and craft the perfect message.	English	https://www.ibm.com/watson/developercloud/tone-analyzer.html
NLP	Natural Language Classifier	IBM	N/A	The Natural Language Classifier service understands the intent behind text and returns a corresponding classification, complete with a confidence score. For example “What is the weather like today? or “Is it hot out?” or “Is it going to be nice today?” are all ways of asking about “temperature”. Use NLC to answer questions in a contact center, create chatbots, categorize volumes of written content and more.	English, Arabic, French, German, Japanese, Italian, Portuguese, and Spanish	https://www.ibm.com/watson/developercloud/nl-classifier.html
NLP	WordNet	The Global WordNet Association	Fellbaum, C. (1998). WordNet. Blackwell Publishing Ltd.	A free, public and non-commercial organization that provides a platform for discussing, sharing and connecting wordnets for all languages in the world.	Afrikaans, Albanian, Arabic, Hindi, Indonesian, Japanese, Lao, Mongolian, Burmese, Nepali, Sinhala, Thai, Vietnamese, Malaysian, Bantu languages, English, Spanish, Catalan, Basque, Italian, Bengali, Bulgarian, Czech, Greek, Romanian, Serbian, Turkish, Chinese (Traditional), Chinese (Simplified), Croatian, Danish, Dutch, Estonian, Finnish, French, German, Hebrew, Hungarian, Icelandic, Latin, Irish, Assamese, Bodo, Gujarati, Kannada, Kashmiri, Konkani, Malayalam, Meitei, Marathi, Sanskrit, Tamil, Telugu, Punjabi, Urdu, Oriya, Korean, Kurdish, Latvian, Macedonian, Maltese, Moldavian, Norwegian, Persian, Polish, Russian, Slovenian, Swedish	http://globalwordnet.org/wordnets-in-the-world/
Visual	Trelliscope	Ryan Hafen	Hafen, R. Interface, design, and computational considerations for D&R.	Trelliscope is a powerful visualization tool for large data that allows users to rapidly render customized graphics that are both detailed (i.e., drilled-down) and interpretable across large datasets — and then filter and sample the results to focus on examples that share common traits or characteristics (e.g., display only plots with correlation above 95% or display plots where regression model fit was poor). Trelliscope is also very useful for visualizing small datasets. It is unique in that it is highly customizable and is based on trellis displays which guide users to focus on relatively small, rational subsets of data. This approach helps data scientists develop machine learning algorithms and predictive analytics that account for the scientific phenomenology displayed in the graphics — an essential requirement in many environments.	Language Agnostic	https://signatures.pnnl.gov/software/trelliscope.stm
Semantic, Syntactic	Snowball	M.F. Porter	Porter, M. F. (2001). Snowball: A language for stemming algorithms.	Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. This site describes Snowball, and presents several useful stemmers which have been implemented using it. The Snowball compiler translates a Snowball script into another language – currently ISO C, Java and Python are supported.	English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish	http://snowball.tartarus.org/texts/introduction.html
Visual	Motion Chart	Battista & Cheng	Battista, V., & Cheng, E. (2011). Motion charts: Telling stories with statistics. In American Statistical Association Joint Statistical Meetings (Vol. 4473).	A dynamic chart to explore several indicators over time. The chart is rendered within the browser using Flash.	Language Agnostic	https://developers.google.com/chart/interactive/docs/gallery/motionchart?csw=1
Translation	Google Translate	Google	N/A	No official description on the website; translates text between languages.	Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Catalan, Cebuano, Chichewa, Chinese, Corsican, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Frisian, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scots Gaelic, Serbian, Sesotho, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Xhosa, Yiddish, Yoruba, Zulu	https://translate.google.com/
Personality	AnalyzeWords	Roger Booth & Jamie Pennebaker	Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates.	AnalyzeWords helps reveal your personality by looking at how you use words. It is based on good scientific research connecting word use to who people are. So go to town – enter your Twitter name or the handles of friends, lovers, or Hollywood celebrities to learn about their emotions, social styles, and the ways they think.	English, Chinese, Arabic, Spanish, Dutch, French, German, Italian, Russian and Turkish	http://www.analyzewords.com/
NLP	nltk	Steven Bird	Bird, S. (2006, July). NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69-72). Association for Computational Linguistics.	NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.	50+ languages (Project Gutenberg sample)	http://www.nltk.org/
NLP	Stanford Core NLP	Christopher Manning	Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.		English, Arabic, Chinese, French, German, Spanish	http://stanfordnlp.github.io/CoreNLP/
NLP	spaCy	Matthew Honnibal	Choi, J. D., Tetreault, J., & Stent, A. (2015). It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 387-396).	spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. I like to think of spaCy as the Ruby on Rails of Natural Language Processing. spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, Keras, Scikit-Learn, Gensim and the rest of Python’s awesome AI ecosystem. spaCy helps you connect the statistical models trained by these libraries to the rest of your application.	English, German, Chinese, Spanish, Italian, French, Portuguese, Dutch, Swedish, Finnish, Hungarian, Bengali, Hebrew	https://spacy.io/
Event Data	ICEWS	Phil Schrodt	Boschee, E., Lautenschlager, J., O’Brien, S., Shellman, S., Starz, J., & Ward, M. (2015). ICEWS coded event data. Harvard Dataverse, 5.	As one of the Assistant Secretary of Defense for Research and Engineering (ASD(R&E))’s Human Social, Culture and Behavior (HSCB) flagship programs, Integrated Crisis Early Warning System (ICEWS) has developed and is deploying a comprehensive, integrated, automated, generalizable, and validated system to monitor, assess, and forecast national and international crises in a way that supports decision-making on how to mitigate them. ICEWS provides Combatant Commanders (COCOMs), the IC, and various government agencies with a powerful, systematic capability to anticipate, track, and respond to stability challenges around the world.	English	http://www.lockheedmartin.com/us/products/W-ICEWS/W-ICEWS_Publications.html
Event Data	GDELT	Kalev Leetaru	Leetaru, K., & Schrodt, P. A. (2013). GDELT: Global data on events, location, and tone, 1979–2012. Paper presented at the International Studies Association Annual Convention.	Supported by Google Jigsaw, the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.	N/A	http://www.gdeltproject.org/
Event Data	CAMEO	Phil Schrodt	Gerner, D. J., Schrodt, P. A., Yilmaz, O., & Abu-Jabr, R. (2002). Conflict and Mediation Event Observations (CAMEO): A new event data framework for the analysis of foreign policy interactions. International Studies Association, New Orleans.	The Computational Event Data System is the current name for a series of projects beginning around 1998 that have focused on the machine coding of international event data using pattern recognition and simple grammatical parsing. These systems are designed to work with short news articles such as those found in wire service reports or chronologies. To date, the software has primarily been used to code events from Reuters and Agence France Presse wire service lead sentences but in principle it can be used for other event coding schemes.	N/A	http://eventdata.parusanalytics.com/data.dir/cameo.html
Event Data	WEIS	Phil Schrodt	Schrodt, P. A., & Leibsohn, D. (1985). An Algorithm for the Classification of WEIS Event Code from WEIS Textual Descriptions. International Studies Association, Washington.	Most of the codes that are used in the data sets produced by TABARI are the standard WEIS codes originally developed by Charles McClelland (see “World Event/Interaction Survey (WEIS) Project, 1966-1978”, ICPSR Study No. 5211) However, at various points we have experimented with introducing new codes into WEIS, borrowing most of these from the PANDA project. We assigned weights to the new codes that are comparable to the weights used in the Goldstein scale, and those weights are used in the aggregated data. The full list of these codes can be found at the following two links.	N/A	http://eventdata.parusanalytics.com/data.dir/weis.html
Event Data	KEDS	Phil Schrodt	Schrodt, P. A., Davis, S. G., & Weddle, J. L. (1994). Political science: KEDS—a program for the machine coding of event data. Social Science Computer Review, 12(4), 561-587.	While not as current as TABARI, the KEDS (Kansas Event Data System) program was our first major foray into the automation of events data. The level of documentation and development of the program surpasses that of TABARI, even though the latter is, in most respects, a more capable program. The KEDS program runs natively on Mac OS 6.0 or later and uses a limited amount of disk space. Additionally, KEDS supports a nifty GUI interface that is to date absent in TABARI.	N/A	http://eventdata.parusanalytics.com/software.dir/keds.html
Event Data	Open Event Data Alliance	Phil Schrodt, John Beieler, Patrick Brandt, Erin Simpson, & Andy Halterman	N/A	The prime objective of the OEDA is to provide reliable, open access, multi-sourced political event datasets that are updated at least weekly, are transparent and have documented source texts, and use one or more of the open coding ontologies supported by the organization. As an organization, OEDA will aggregate, rather than generate, such data—in particular we expect to be linking to multiple data sets. These datasets will share a common format or be supported by open software that will translate them into a common format. Data generated with open source coding engines and dictionaries are preferred, but the organization is open to proprietary coding methods, provided the resulting data are open access, documented, and clear of intellectual property issues. The OEDA does not seek to establish any definitive imprimatur but rather to provide guidance for voluntary solutions to coordination problems on issues and resources of common concerns.	N/A	http://openeventdata.org/#about
Event Data	Petrarch	C. Norris	Norris, C. (2016). Petrarch 2: Petrarcher. arXiv preprint arXiv:1602.07236.	Code for the new Python Engine for Text Resolution And Related Coding Hierarchy (PETRARCH) event data coder. The coder now has all of the functions from the older TABARI coder and the new CAMEO.verbpatterns.140609.txt dictionary incorporates both parser-based matching and extensive synonym sets. The program coded 60,000 AFP sentences from the GigaWord corpus without crashing, using the included dictionaries.	N/A	http://openeventdata.org/
Event Data	Phoenix	Phil Schrodt	Schrodt, P. A. (2014). Phoenix Event Data Set Documentation.	The Phoenix dataset is a new, near real-time event dataset created using the next-generation event data coding software, PETRARCH. The data is generated using news content scraped from over 400 sources. This scraped content is run through a processing pipeline that produces coded event data as a final output. Our current settings produce roughly 3,000 coded events per day. These coded events are in the standard who-did-what-to-whom format typically associated with event data. Each event is coded along multiple dimensions, specifically source and target actors and event type. These dimensions are described in greater detail below.	N/A	http://phoenixdata.org/
Event Data	Plover	Phil Schrodt	N/A	PLOVER–Political Language Ontology for Verifiable Event Records–is a next generation political event coding specification under development by the Open Event Data Alliance (http://openeventdata.org/) which is intended to replace the earlier CAMEO (http://eventdata.parusanalytics.com/data.dir/cameo.html) system.	N/A	https://github.com/openeventdata/PLOVER
NLP	Trint	Adam Trent, Peter J. Kofman, & Puneet Kukkal	Trent, A., Kaufman, P. J., & Kukkal, P. (1999). U.S. Patent No. 5,961,620. Washington, DC: U.S. Patent and Trademark Office.		North American English, British English, Australian English, French, Spanish, German, Italian, Portuguese, Russian, Polish, Finnish, Hungarian, Dutch, Romanian, Swedish	https://trint.com/
Mixed Method Data Analysis	QDA Miner	Provalis Research	Lewis, R. B., & Maas, S. M. (2007). QDA Miner 2.0: Mixed-model qualitative data analysis software. Field methods, 19(1), 87-108.	QDA Miner is an easy-to-use qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner qualitative data analysis tool may be used to analyze interview or focus group transcripts, legal documents, journal articles, speeches, even entire books, as well as drawings, photographs, paintings, and other types of visual documents. Its seamless integration with SimStat, a statistical data analysis tool, and WordStat, a quantitative content analysis and text mining module, gives you unprecedented flexibility for analyzing text and relating its content to structured information including numerical and categorical data.	English, French, Spanish	https://provalisresearch.com/products/qualitative-data-analysis-software/
Text-as-Data tools	WordStat	Provalis Research	Péladeau, N. (2003). WordStat: Content analysis module for SIMSTAT. Montréal: Provalis Research.	WordStat is a flexible and easy-to-use text analysis software – whether you need text mining tools for fast extraction of themes and trends, or careful and precise measurement with state-of-the-art quantitative content analysis tools. WordStat’s seamless integration with SimStat – our statistical data analysis tool – QDA Miner – our qualitative data analysis software – and Stata – the comprehensive statistical software from StataCorp, gives you unprecedented flexibility for analyzing text and relating its content to structured information, including numerical and categorical data.	English, French, Italian, German, and Spanish	https://provalisresearch.com/products/content-analysis-software/
Statistical Analysis	SimStat	Provalis Research	Péladeau, N. (1996). SIMSTAT for Windows. Provalis Research, Montreal.	Simstat goes beyond mere statistical analysis. It offers output management features not found in any other program, as well as its own scripting language to automate statistical analysis and to write small applications, interactive tutorials with multimedia capabilities, as well as computer assisted interviewing systems.	English	https://provalisresearch.com/products/simstat/
Mixed Method Data Analysis	ProSuite	Provalis Research	Provalis Research (2014). ProSuite, Montréal. http://provalisresearch.com/products/prosuite/ William S. Hein & Company World constitutions. HeinOnline.	ProSuite is an integrated collection of Provalis Research text analytics tools that allow one to explore, analyze and relate both structured and unstructured data. Provalis Research Text Analytics Tools allows one to perform advanced computer assisted qualitative coding on documents and images using QDA Miner, to apply the powerful content analysis and text mining features of WordStat on textual data, and to perform advanced statistical analysis on numerical and categorical data using SimStat.		https://provalisresearch.com/products/prosuite-text-analytics-tools/
Text as Data Tools	Seance	Kristopher Kyle	Crossley, S. A., Kyle, K., & McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): An automatic tool for sentiment, social cognition, and social order analysis. Behavior Research Methods 49(3), pp. 803-821. doi:10.3758/s13428-016-0743-z.	SEANCE is an easy to use tool that includes 254 core indices and 20 component indices based on recent advances in sentiment analysis. In addition to the core indices, SEANCE allows for a number of customized indices including filtering for particular parts of speech and controlling for instances of negation. SEANCE takes plain text files as input (it will process all plain text files in a particular folder) and produces a comma separated values (.csv) spreadsheet that is easily read by any spreadsheet software.	English	http://www.kristopherkyle.com/seance.html
Text as Data Tools	CLA	Kristopher Kyle	Kyle, K., Crossley, S.A., & Kim, Y. J. (2015). Native Language Identification and Writing Proficiency. International Journal of Learner Corpus Research 1(2), pp. 187-209. doi: 10.1075/ijlcr.1.2.01kyl.	CLA is a simple but powerful text analysis tool. One can use CLA to analyze texts using very large custom dictionaries. In addition to words, custom dictionaries can include n-grams and wildcards.	English	http://www.kristopherkyle.com/cla.html
Text as Data Tools	CRAT	Kristopher Kyle	Crossley, S. A, Kyle, K., Davenport, J., & McNamara, D. S. (2016). Automatic assessment of constructed response data in a chemistry tutor. In T. Barnes, M. Chi, & M. Feng (Eds.), Proceedings of the 9th International Educational Data Mining (EDM) Society Conference (pp. 336-340).	CRAT is an easy to use tool that includes over 700 indices related to lexical sophistication, cohesion and source text/summary text overlap. CRAT is particularly well suited for the exploration of writing quality as it relates to summary writing.	English	http://www.kristopherkyle.com/crat.html
Text as Data Tools	SiNLP	Kristopher Kyle	Crossley, S. A., Allen, L. K., Kyle, K., & McNamara, D.S. (2014). Analyzing discourse processing using a simple natural language processing tool (SiNLP). Discourse Processes, 51(5-6), pp. 511-534, DOI: 10.1080/0163853X.2014.910723	SiNLP is a simple tool that allows users to analyze texts with regard to the number of words, number of types, TTR, letters per word, number of paragraphs, number of sentences, and number of words per sentence for each text. In addition, users can analyze texts with regard to their own custom dictionaries.	English	http://www.kristopherkyle.com/sinlp.html
Text as Data Tools	TAACO	Kristopher Kyle	Crossley, S. A., Kyle, K., & McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods 48(4), pp. 1227-1237. doi:10.3758/s13428-015-0651-7	TAACO is an easy to use tool that calculates 150 indices of both local and global cohesion, including a number of type-token ratio indices, adjacent overlap indices, and connectives indices.	English	http://www.kristopherkyle.com/taaco.html
Text as Data Tools	TAALES	Kristopher Kyle	Kyle, K. & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly 49(4), pp. 757-786. doi: 10.1002/tesq.194; Kyle, K., Crossley, S. A., & Berger, C. (in press). The tool for the analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods.	TAALES is a tool that measures over 400 classic and new indices of lexical sophistication, and includes indices related to a wide range of sub-constructs. Included are indices for both single words and n-grams. Starting with version 2.2, TAALES also provides comprehensive index diagnostics.	English	http://www.kristopherkyle.com/taales.html
Text as Data Tools	TAASSC	Kristopher Kyle	Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Doctoral Dissertation). Retrieved from http://scholarworks.gsu.edu/alesl_diss/35.; Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474-496.	TAASSC is an advanced syntactic analysis tool that measures fine-grained indices of clausal and phrasal complexity, classic indices of syntactic complexity, and frequency-based verb argument construction indices.	English	http://www.kristopherkyle.com/taassc.html
Audio	OpenSmile	Florian Eyben, Martin Wöllmer, & Björn Schuller	Eyben, F., Wöllmer, M., & Schuller, B. (2010, October). Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia (pp. 1459-1462). ACM.	We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.		https://github.com/naxingyu/opensmile
Text-as-Data resource	Wordfish	Sven-Oliver Proksch & Jonathan B. Slapin	Proksch, S. O., & Slapin, J. B. (2008). WORDFISH: Scaling software for estimating political positions from texts. Version, 1, 323-344.	Wordfish is a computer program written in the R statistical language to extract political positions from text documents. Word frequencies are used to place documents onto a single dimension. Wordfish is a scaling technique and does not need any anchoring documents to perform the analysis. Instead, it relies on a statistical model of word counts. The current implementation assumes a Poisson distribution of word frequencies. Positions are estimated using an expectation-maximization algorithm. Confidence intervals for estimated positions can be generated from a parametric bootstrap. The name Wordfish pays tribute to the French meaning of the word “poisson”.	N/A	http://www.wordfish.org/
Text-as-Data resource	Wordshoal	Benjamin E. Lauderdale & Alexander Herzog	Lauderdale, B. E., & Herzog, A. (2016). Measuring political positions from legislative speech. Political Analysis, 24(3), 374-394.	Existing approaches to measuring political disagreement from text data perform poorly except when applied to narrowly selected texts discussing the same issues and written in the same style. We demonstrate the first viable approach for estimating legislator-specific scores from the entire speech corpus of a legislature, while also producing extensive information about the evolution of speech polarization and politically loaded language. In the Irish Dáil, we show that the dominant dimension of speech variation is government-opposition, with ministers more extreme on this dimension than backbenchers, and a second dimension distinguishing between the establishment and anti-establishment opposition parties. In the U.S. Senate, we estimate a dimension that has moderate within-party correlations with scales based on roll-call votes and campaign donation patterns; however, we observe greater overlap across parties in speech positions than roll-call positions and partisan polarization in speeches varies more clearly in response to major political events.	N/A	https://github.com/kbenoit/wordshoal
NLP	Apache OpenNLP	Apache Software Foundation	Apache Software Foundation	OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.	Afrikaans, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Catalan, Cebuano, Czech, Chechen, Mandarin Chinese, Welsh, Danish, German, Standard Estonian, Greek (Modern), English, Esperanto, Estonian, Basque, Faroese, Persian, Finnish, French, Western Frisian, Irish, Galician, Swiss German, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian, Javanese, Japanese, Kannada, Georgian, Kazakh, Kirghiz, Korean, Latin, Latvian, Limburgan, Lithuanian, Luxembourgish, Standard Latvian, Malayalam, Marathi, Minangkabau, Macedonian, Maltese, Mongolian, Maori, Malay, Min Nan Chinese, Low German, Nepali, Dutch, Norwegian Nynorsk, Norwegian Bokmål, Occitan, Panjabi, Iranian Persian, Plateau Malagasy, Western Panjabi, Polish, Portuguese, Pushto, Romanian, Russian, Sanskrit, Sinhala, Slovak, Slovenian, Somali, Spanish, Albanian, Serbian, Sundanese, Swahili, Swedish, Tamil, Tatar, Telugu, Tajik, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray, Zulu	https://opennlp.apache.org/