How the quotations were identified
This project uses techniques from a field called machine learning to identify the quotations or verbal allusions in the newspaper pages. Below is a brief, mostly non-technical explanation of how this works.
Labeling data and training the model
After measuring the potential matches, we need a means of distinguishing between accurate matches and false positives. This is a difficult problem because of the way that the Bible was quoted in newspapers (or indeed, used more generally). If we were looking for complete quotations, then we would look for candidates where there were many matching tokens, or where a high proportion of the matching verse is present on the page. But often quotations can be highly compressed. A single unusual phrase (“Quench not the Spirit” or “Remember Lot’s wife” or “The Lord called Samuel”) may be enough to identify a quotation, whereas even a half dozen commonplace matching phrases might not indicate a quotation at all. Then too, sometimes allusions function by changing the actual words while retaining the syntax or cadence, as in this joke:
“Jug not, lest ye be jugged,” alluding to the verse “Judge not, that ye be not judged” (Matthew 7:1).
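To make the difficulty concrete, here is a minimal sketch of the two naive measures just described: the number of matching tokens and the proportion of the verse present on the page. It is written in Python purely for illustration (the project itself was written in R), it treats single words as tokens, and the function names and the "unrelated" sentence are invented for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def match_features(verse, page):
    """Count verse tokens that also occur on the page, and their share of the verse."""
    verse_tokens = tokenize(verse)
    page_tokens = set(tokenize(page))
    matched = [t for t in verse_tokens if t in page_tokens]
    return {
        "token_count": len(matched),
        "proportion": len(matched) / len(verse_tokens),
    }

VERSE = "Judge not, that ye be not judged"        # Matthew 7:1
JOKE = "Jug not, lest ye be jugged"               # the allusion quoted above
UNRELATED = "Tell me that ye will not be late"    # invented sentence, not a quotation

joke_features = match_features(VERSE, JOKE)
unrelated_features = match_features(VERSE, UNRELATED)
print(joke_features)
print(unrelated_features)
```

In this toy case the invented sentence shares five of the verse's seven tokens, while the genuine allusion shares only four, so a threshold on either measure alone would rank them the wrong way round.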
Rather than specify arbitrary thresholds, a more accurate approach is to teach an algorithm to distinguish between quotations and noise by showing it what many genuine matches and false positives look like. After taking a sample of potential matches, I labeled some 1,700 of them as either genuine or not. (You can see the labeled data here.) This makes it possible to observe patterns in the features that have been measured. The charts below, for instance, show that genuine matches tend to have a much higher token count, a much higher TF-IDF score, and a very low p-value for the runs test. But it is not possible to draw a single line on either chart which cleanly distinguishes between all genuine matches and all false positives.
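The point about a single line can be illustrated with a small sketch. The labeled values below are invented toy data, not the project's labeled sample, and Python is used only for illustration. The code tries every possible cutoff on one feature (token count) and reports the best accuracy that any single cutoff can achieve.

```python
# Invented toy data: (token_count, is_genuine). Not the project's labeled sample.
LABELED = [
    (3, True), (6, True), (8, True), (9, True), (12, True),
    (1, False), (2, False), (3, False), (4, False), (7, False),
]

def accuracy_at(threshold, labeled):
    """Accuracy of the rule 'call it a quotation if token_count >= threshold'."""
    correct = sum((count >= threshold) == genuine for count, genuine in labeled)
    return correct / len(labeled)

def best_single_threshold(labeled):
    """Try every cutoff that could change the outcome; return (threshold, accuracy)."""
    counts = sorted({count for count, _ in labeled})
    candidates = counts + [counts[-1] + 1]  # include 'nothing is a quotation'
    return max(((t, accuracy_at(t, labeled)) for t in candidates),
               key=lambda pair: pair[1])

best_threshold, best_accuracy = best_single_threshold(LABELED)
print(best_threshold, best_accuracy)
```

Because the two classes overlap on this feature, even the best cutoff misclassifies some of the toy examples. That overlap is the motivation for combining several features in a trained model.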
I then used that data to train and test a machine learning model. This model takes the predictors mentioned above and assigns each potential match a class (“quotation” or not) and a probability that the classification is correct. While I evaluated a number of models, including random forests, support vector machines, and ensembles of other models, a neural network classifier had the best performance. I measured performance using the area under the receiver operating characteristic curve, which plots the rate of true positives against the rate of false positives as the decision threshold varies. The idea there is simple: the best classifier is the one that maximizes the number of genuine matches while minimizing the number of false positives.
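For readers who want to see what that measure computes, the following is a minimal sketch in Python (for illustration only; the project used R). It relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen genuine match receives a higher score than a randomly chosen false positive, with ties counted as half. The function name and the toy labels and scores are invented for this example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # the genuine match is ranked above the false positive
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Invented toy data: 1 = genuine quotation, 0 = false positive.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]  # hypothetical predicted probabilities

auc = roc_auc(labels, scores)
print(auc)
```

A value of 1.0 means every genuine match is scored above every false positive, and 0.5 is what random scoring would achieve on average. In the toy data one genuine match (0.4) is scored below one false positive (0.5), so the result falls just short of 1.0.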
The following is a brief list of secondary sources on the history of the Bible in America:
Byrd, James P. Sacred Scripture, Sacred War: The Bible and the American Revolution. New York: Oxford University Press, 2013.
Callahan, Allen Dwight. The Talking Book: African Americans and the Bible. New Haven: Yale University Press, 2006.
Fea, John. The Bible Cause: A History of the American Bible Society. New York: Oxford University Press, 2016.
Gutjahr, Paul C. An American Bible: A History of the Good Book in the United States, 1777-1880. Stanford, CA: Stanford University Press, 1999.
Hatch, Nathan O., and Mark A. Noll, eds. The Bible in America: Essays in Cultural History. New York: Oxford University Press, 1982.
McDannell, Colleen. Material Christianity: Religion and Popular Culture in America. New Haven: Yale University Press, 1995.
Noll, Mark A. In the Beginning Was the Word: The Bible in American Public Life, 1492-1783. Oxford; New York: Oxford University Press, 2016.
Nord, David Paul. Faith in Reading: Religious Publishing and the Birth of Mass Media in America. New York: Oxford University Press, 2004.
Sarna, Jonathan D., and Nahum M. Sarna. “Jewish Bible Scholarship and Translations in the United States.” In The Bible and Bibles in America, edited by Ernest S. Frerichs, 83–116. Atlanta: Scholars Press, 1988.
Stein, Stephen J. “America’s Bibles: Canon, Commentary, and Community.” Church History 64, no. 2 (June 1, 1995): 169–84. doi:10.2307/3167903.
Thuesen, Peter J. In Discordance with the Scriptures: American Protestant Battles Over Translating the Bible. New York: Oxford University Press, 1999.
I have also made use of the following software or works on machine learning in particular:
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. New York: Springer, 2013.
Kuhn, Max. Applied Predictive Modeling. New York: Springer, 2013.
All of the code for this project was written in R, using the following packages:
Bates, Douglas, and Martin Maechler. “Matrix: Sparse and Dense Matrix Classes and Methods.” R package version 1.2-6. 2016. https://CRAN.R-project.org/package=Matrix
Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, Jonathan McPherson, et al. “shiny: Web Application Framework for R.” R package version 0.13.2.9004. http://shiny.rstudio.com
Grolemund, Garrett, Vitalie Spinu, Hadley Wickham, et al. “lubridate: Make Dealing with Dates a Little Easier.” R package version 1.5.6. 2016. https://CRAN.R-project.org/package=lubridate
Kuhn, Max, et al. “caret: Classification and Regression Training.” R package version 6.0-68. 2016. https://CRAN.R-project.org/package=caret
Mullen, Lincoln. “tokenizers: Tokenize Text.” R package version 0.1.2. https://CRAN.R-project.org/package=tokenizers
R Core Team. “R: A language and environment for statistical computing.” R Foundation for Statistical Computing, Vienna, Austria. 2016. https://www.R-project.org/
Ripley, Brian, and William Venables. “nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models.” R package version 7.3-12. 2016. https://CRAN.R-project.org/package=nnet
Robinson, David, et al. “broom: Convert Statistical Analysis Objects into Tidy Data Frames.” R package version 0.4.0. 2016. https://CRAN.R-project.org/package=broom
Ryan, Jeffrey A., and Joshua M. Ulrich. “xts: eXtensible Time Series.” R package version 0.9-7. 2014. https://CRAN.R-project.org/package=xts
Selivanov, Dmitriy. “text2vec: Fast Text Mining Framework for Vectorization and Word Embeddings.” R package version 0.3.0. 2016. https://CRAN.R-project.org/package=text2vec
Trapletti, Adrian, and Kurt Hornik. “tseries: Time Series Analysis and Computational Finance.” R package version 0.10-35. 2016. https://CRAN.R-project.org/package=tseries
Vanderkam, Dan, JJ Allaire, et al. “dygraphs: Interface to ‘Dygraphs’ Interactive Time Series Charting Library.” R package version 0.9. 2016. https://CRAN.R-project.org/package=dygraphs
Wickham, Hadley, and Romain Francois. “dplyr: A Grammar of Data Manipulation.” R package version 0.4.3. 2016. https://CRAN.R-project.org/package=dplyr
Wickham, Hadley, and Winston Chang. “ggplot2: An Implementation of the Grammar of Graphics.” R package version 2.1.0. 2016. https://CRAN.R-project.org/package=ggplot2
Wickham, Hadley, et al. “purrr: Functional Programming Tools.” R package version 0.2.1. 2016. https://CRAN.R-project.org/package=purrr
Wickham, Hadley. “stringr: Simple, Consistent Wrappers for Common String Operations.” R package version 1.0.0. 2016. https://CRAN.R-project.org/package=stringr
Wickham, Hadley. “tidyr: Easily Tidy Data.” R package version 0.4.1. 2016. https://CRAN.R-project.org/package=tidyr