setwd("../data")
load("datasets_L06.Rda")

# This lecture uses the following packages:

# install.packages("stringr")
library(stringr)
# install.packages("qdapRegex")
library(qdapRegex)


# Intro -------------------------------------------------------------------

# A 'regular expression' is a pattern that describes a set of strings. 
# Examples:

# - all 5-digit numbers in a document 
# - all 5-digit numbers ending in 00
# - words spelled in ALL CAPS
# - words in brackets or delimiters [],<>,(),{}
# - words at the end of a sentence
# - all email addresses
# - dates in a certain format

# These are examples of string patterns. Regular expressions are the language we
# use to describe such patterns. Be aware, however, that regular expressions are
# a language unto themselves. There are entire books devoted to regular expressions.

# Quote floating around internet: "Some people, when confronted with a problem, 
# think 'I know, I'll use regular expressions.' Now they have two problems." 
# Regular expressions can be tricky to get right, especially for complex
# patterns.

# We will only dabble in regular expressions. Key lesson: recognize when you 
# need a regular expression and know enough to cobble one together using your
# knowledge, wits and Google.

# Two PDF files you may want to download and save for reference:
# http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
# http://gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf

# Good pages to print off/bookmark:
# http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
# http://regexlib.com/CheatSheet.aspx
# or just Google "regex cheatsheet"

# Good library book:
# Go to virgo, search for "Regular expressions cookbook"

# RegEx tutorials: 
# http://www.rexegg.com/
# http://www.regular-expressions.info/

# Regular Expression Basics -----------------------------------------------

# Regular expressions are composed of three components:

# (1) literal characters
# (2) modifiers (or metacharacters)
# (3) character classes

# (1) LITERAL CHARACTERS 

# These are the literal characters you want to match. If you want to find the
# word "factor", you search for "factor", the literal characters.

# (2) MODIFIERS

# Modifiers (metacharacters) define patterns.
# Meet the modifiers:
# $ * + . ? [ ] ^ { } | ( ) \

# Precede these with a double backslash (in R!) if you want to treat them as
# literal characters. (See the short example after the list below.)

# ^  start of string
# $  end of string
# .  any character except new line
# *  0 or more
# +  1 or more
# ?  0 or 1
# |  or (alternative patterns)
# {} quantifier brackets: exactly {n}; at least {n,}; between {n,m}
# () group patterns together
# \  escape character (needs to be escaped itself in R: \\)
# [] character class brackets (not to be confused with R's subsetting brackets!)
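
# A quick illustration of escaping (my own toy example, not from the lecture
# data): "." is a metacharacter that matches any character, while "\\." matches
# a literal period.
grepl(".", c("abc", "a.c"))    # TRUE TRUE  - "." matches any character
grepl("\\.", c("abc", "a.c"))  # FALSE TRUE - "\\." matches a literal period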


# (3) CHARACTER CLASSES
# a range of characters to be matched;
# placed in brackets: []
# For example: [a-q] means all letters from a - q;
# [a-zA-Z] means all alphabetic characters;
# [0-9A-Za-z] means all alphanumeric characters;
# The ^ symbol means "not" when used in brackets, so [^abc] means "Not (a or b
# or c)"

# From R documentation: "Because their interpretation is locale- and 
# implementation-dependent, character ranges are best avoided." Good advice if
# you're sharing R code. Otherwise, fine to use on your own.
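
# A quick illustration of character classes (my own toy example):
grepl("[0-9]", c("abc", "a1c"))      # FALSE TRUE - contains a digit
grepl("^[^0-9]+$", c("abc", "a1c"))  # TRUE FALSE - made up entirely of non-digits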

# PREDEFINED CHARACTER CLASSES

# [:lower:] - Lower-case letters in the current locale. [a-z]
# 
# [:upper:] - Upper-case letters in the current locale. [A-Z]
# 
# [:alpha:] - Alphabetic characters: [:lower:] and [:upper:]. [a-zA-Z]
# 
# [:digit:] - Digits: 0 1 2 3 4 5 6 7 8 9. [0-9]
# 
# [:alnum:] - Alphanumeric characters: [:alpha:] and [:digit:]. [0-9A-Za-z]
# 
# [:punct:] - Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ?
# @ [ \ ] ^ _ ` { | } ~.
# 
# [:graph:] - Graphical characters: [:alnum:] and [:punct:].
# 
# [:blank:] - Blank characters: space and tab, and possibly other
# locale-dependent characters such as non-breaking space.
# 
# [:space:] - Space characters: tab, newline, vertical tab, form feed, carriage
# return, space and possibly other locale-dependent characters.
# 
# [:print:] - Printable characters: [:alnum:], [:punct:] and space.

# Note that the brackets in these class names are part of the symbolic names, 
# and must be included in addition to the brackets delimiting the bracket list!
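
# For example, note the double brackets (a quick sketch of mine):
grepl("[[:upper:]]", c("abc", "Abc"))  # FALSE TRUE - contains an upper-case letter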

# More regex codes! (Yay! More stuff!) Be sure to escape that backslash!

# \b - word boundary
# \d - any decimal digit
# \w - any word character
# \s - any white-space character
# \n - a new line
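
# In R these are written with a doubled backslash, e.g. "\\d". A quick sketch
# (my own example): "\\d" is equivalent to [0-9] or [[:digit:]].
grepl("\\d", c("no digits here", "born in 1995"))  # FALSE TRUE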

# see ?regex for an in-depth overview of regular expressions.

# RegEx examples ----------------------------------------------------------

# Let's create some sample text to demonstrate regular expressions:

someText <- c("  here's a sentence", 
              "This is me typing at 2:02 in the morning",
              "Back in 1995 I was only 22.",
              "You saw 4 monkeys?", 
              "There are 10 kinds of people,    those that understand binary       
              and the other 9 that don't care",
              "Who likes pancakes? I do. I really really like pancakes!", 
              "     <strong>Bolded text is bold and daring</strong>",
              "figure1.jpg", "cover.jpg", "report.pdf", "log.txt",
              "I'm a one man wolfpack and I weigh 222",
              "OMG, a 3-eyed cyclops!!!",
              "2112 is an awesome album.",
              "2222 is my PIN")
someText
##  [1] "  here's a sentence"                                                                                                
##  [2] "This is me typing at 2:02 in the morning"                                                                           
##  [3] "Back in 1995 I was only 22."                                                                                        
##  [4] "You saw 4 monkeys?"                                                                                                 
##  [5] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                                           
##  [7] "     <strong>Bolded text is bold and daring</strong>"                                                               
##  [8] "figure1.jpg"                                                                                                        
##  [9] "cover.jpg"                                                                                                          
## [10] "report.pdf"                                                                                                         
## [11] "log.txt"                                                                                                            
## [12] "I'm a one man wolfpack and I weigh 222"                                                                             
## [13] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [14] "2112 is an awesome album."                                                                                          
## [15] "2222 is my PIN"
# Examples of SUPER BASIC regex patterns:

# find elements in vector beginning with 1 or more spaces
grep("^ +", someText, value=T) 
## [1] "  here's a sentence"                                 
## [2] "     <strong>Bolded text is bold and daring</strong>"
grep("^[[:blank:]]+", someText, value=T) 
## [1] "  here's a sentence"                                 
## [2] "     <strong>Bolded text is bold and daring</strong>"
# find elements containing a question mark; need to "escape" the "?"
grep("\\?", someText, value=T) 
## [1] "You saw 4 monkeys?"                                      
## [2] "Who likes pancakes? I do. I really really like pancakes!"
# find elements ending with a question mark
grep("\\?$", someText, value=T) 
## [1] "You saw 4 monkeys?"
# find elements containing one or more numbers
grep("[0-9]+", someText, value=T) 
## [1] "This is me typing at 2:02 in the morning"                                                                           
## [2] "Back in 1995 I was only 22."                                                                                        
## [3] "You saw 4 monkeys?"                                                                                                 
## [4] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
## [5] "figure1.jpg"                                                                                                        
## [6] "I'm a one man wolfpack and I weigh 222"                                                                             
## [7] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [8] "2112 is an awesome album."                                                                                          
## [9] "2222 is my PIN"
grep("[[:digit:]]+", someText, value=T) 
## [1] "This is me typing at 2:02 in the morning"                                                                           
## [2] "Back in 1995 I was only 22."                                                                                        
## [3] "You saw 4 monkeys?"                                                                                                 
## [4] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
## [5] "figure1.jpg"                                                                                                        
## [6] "I'm a one man wolfpack and I weigh 222"                                                                             
## [7] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [8] "2112 is an awesome album."                                                                                          
## [9] "2222 is my PIN"
# find elements containing 2 or more consecutive digits
grep("[0-9]{2}", someText, value=T)
## [1] "This is me typing at 2:02 in the morning"                                                                           
## [2] "Back in 1995 I was only 22."                                                                                        
## [3] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
## [4] "I'm a one man wolfpack and I weigh 222"                                                                             
## [5] "2112 is an awesome album."                                                                                          
## [6] "2222 is my PIN"
grep("[[:digit:]]{2}", someText, value=T) 
## [1] "This is me typing at 2:02 in the morning"                                                                           
## [2] "Back in 1995 I was only 22."                                                                                        
## [3] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
## [4] "I'm a one man wolfpack and I weigh 222"                                                                             
## [5] "2112 is an awesome album."                                                                                          
## [6] "2222 is my PIN"
# text ending with .jpg; need to escape the "."
grep("\\.jpg$", someText, value=T) 
## [1] "figure1.jpg" "cover.jpg"
# text ending with a 3-character file extension
grep("\\.[[:alpha:]]{3}$", someText, value=T) 
## [1] "figure1.jpg" "cover.jpg"   "report.pdf"  "log.txt"
grep("\\.\\w{3}$", someText, value=T) 
## [1] "figure1.jpg" "cover.jpg"   "report.pdf"  "log.txt"
# file names made up entirely of letters, ending in .jpg (this excludes figure1.jpg)
grep("^[a-zA-Z]+\\.jpg", someText, value=T)
## [1] "cover.jpg"
grep("^[[:alpha:]]+\\.jpg", someText, value=T)
## [1] "cover.jpg"
# text containing two consecutive "really "
grep("(really ){2}",someText, value=T) 
## [1] "Who likes pancakes? I do. I really really like pancakes!"
# text containing two or more !
grep("!{2,}",someText, value=T) 
## [1] "OMG, a 3-eyed cyclops!!!"
# find contractions with 3 letters before the apostrophe (e.g., "don't")
grep(" [[:alpha:]]{3}'", someText, value = T)
## [1] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
grep("\\b[[:alpha:]]{3}'", someText, value = T)
## [1] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
# text with 3-character words
grep("\\b\\w{3}\\b", someText, value = T)
##  [1] "This is me typing at 2:02 in the morning"                                                                           
##  [2] "Back in 1995 I was only 22."                                                                                        
##  [3] "You saw 4 monkeys?"                                                                                                 
##  [4] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
##  [5] "Who likes pancakes? I do. I really really like pancakes!"                                                           
##  [6] "     <strong>Bolded text is bold and daring</strong>"                                                               
##  [7] "figure1.jpg"                                                                                                        
##  [8] "cover.jpg"                                                                                                          
##  [9] "report.pdf"                                                                                                         
## [10] "log.txt"                                                                                                            
## [11] "I'm a one man wolfpack and I weigh 222"                                                                             
## [12] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [13] "2222 is my PIN"
# text with 3-character words followed by a non-punctuation character
# (a crude way to exclude the file names)
grep("\\b\\w{3}\\b[^[:punct:]]", someText, value = T)
## [1] "This is me typing at 2:02 in the morning"                                                                           
## [2] "Back in 1995 I was only 22."                                                                                        
## [3] "You saw 4 monkeys?"                                                                                                 
## [4] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
## [5] "Who likes pancakes? I do. I really really like pancakes!"                                                           
## [6] "     <strong>Bolded text is bold and daring</strong>"                                                               
## [7] "I'm a one man wolfpack and I weigh 222"
# text with ALL CAPS (two or more CAPS)
grep("\\b[[:upper:]]{2,}\\b", someText, value = T)
## [1] "OMG, a 3-eyed cyclops!!!" "2222 is my PIN"
# text with a new line
grep("\\n", someText, value = T)
## [1] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
# match "2" zero or more times, followed by "2" (i.e., any element containing a 2)
grep("2*2", someText, value = T)
## [1] "This is me typing at 2:02 in the morning"
## [2] "Back in 1995 I was only 22."             
## [3] "I'm a one man wolfpack and I weigh 222"  
## [4] "2112 is an awesome album."               
## [5] "2222 is my PIN"
# match "2" one or more times, followed by "2" (i.e., at least two consecutive 2s)
grep("2+2", someText, value = T)
## [1] "Back in 1995 I was only 22."           
## [2] "I'm a one man wolfpack and I weigh 222"
## [3] "2222 is my PIN"
# Search/Replace with RegEx -----------------------------------------------

# Recall the sub() and gsub() functions, which replace the first match and all
# matches, respectively. In a previous lecture we used them to search/replace
# literal strings. Now let's use them with regular expressions. A few examples:
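
# First, a quick reminder of the difference between sub() and gsub() (a toy
# example of mine, not from the lecture data):
sub(" +", " ", "too   many   spaces")   # "too many   spaces" - first match only
gsub(" +", " ", "too   many   spaces")  # "too many spaces"   - every match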

# Replace repeated spaces with a single space
gsub(" +"," ", someText)
##  [1] " here's a sentence"                                                                           
##  [2] "This is me typing at 2:02 in the morning"                                                     
##  [3] "Back in 1995 I was only 22."                                                                  
##  [4] "You saw 4 monkeys?"                                                                           
##  [5] "There are 10 kinds of people, those that understand binary \n and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                     
##  [7] " <strong>Bolded text is bold and daring</strong>"                                             
##  [8] "figure1.jpg"                                                                                  
##  [9] "cover.jpg"                                                                                    
## [10] "report.pdf"                                                                                   
## [11] "log.txt"                                                                                      
## [12] "I'm a one man wolfpack and I weigh 222"                                                       
## [13] "OMG, a 3-eyed cyclops!!!"                                                                     
## [14] "2112 is an awesome album."                                                                    
## [15] "2222 is my PIN"
gsub("\\s+"," ",someText) # removes \n!
##  [1] " here's a sentence"                                                                        
##  [2] "This is me typing at 2:02 in the morning"                                                  
##  [3] "Back in 1995 I was only 22."                                                               
##  [4] "You saw 4 monkeys?"                                                                        
##  [5] "There are 10 kinds of people, those that understand binary and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                  
##  [7] " <strong>Bolded text is bold and daring</strong>"                                          
##  [8] "figure1.jpg"                                                                               
##  [9] "cover.jpg"                                                                                 
## [10] "report.pdf"                                                                                
## [11] "log.txt"                                                                                   
## [12] "I'm a one man wolfpack and I weigh 222"                                                    
## [13] "OMG, a 3-eyed cyclops!!!"                                                                  
## [14] "2112 is an awesome album."                                                                 
## [15] "2222 is my PIN"
# Trim leading and trailing spaces:
gsub("^ +| +$","", someText)
##  [1] "here's a sentence"                                                                                                  
##  [2] "This is me typing at 2:02 in the morning"                                                                           
##  [3] "Back in 1995 I was only 22."                                                                                        
##  [4] "You saw 4 monkeys?"                                                                                                 
##  [5] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                                           
##  [7] "<strong>Bolded text is bold and daring</strong>"                                                                    
##  [8] "figure1.jpg"                                                                                                        
##  [9] "cover.jpg"                                                                                                          
## [10] "report.pdf"                                                                                                         
## [11] "log.txt"                                                                                                            
## [12] "I'm a one man wolfpack and I weigh 222"                                                                             
## [13] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [14] "2112 is an awesome album."                                                                                          
## [15] "2222 is my PIN"
# Or better yet, just use the built-in function
trimws(someText)
##  [1] "here's a sentence"                                                                                                  
##  [2] "This is me typing at 2:02 in the morning"                                                                           
##  [3] "Back in 1995 I was only 22."                                                                                        
##  [4] "You saw 4 monkeys?"                                                                                                 
##  [5] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                                           
##  [7] "<strong>Bolded text is bold and daring</strong>"                                                                    
##  [8] "figure1.jpg"                                                                                                        
##  [9] "cover.jpg"                                                                                                          
## [10] "report.pdf"                                                                                                         
## [11] "log.txt"                                                                                                            
## [12] "I'm a one man wolfpack and I weigh 222"                                                                             
## [13] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [14] "2112 is an awesome album."                                                                                          
## [15] "2222 is my PIN"
# Replace a new line with a space
gsub("\\n"," ",someText)
##  [1] "  here's a sentence"                                                                                               
##  [2] "This is me typing at 2:02 in the morning"                                                                          
##  [3] "Back in 1995 I was only 22."                                                                                       
##  [4] "You saw 4 monkeys?"                                                                                                
##  [5] "There are 10 kinds of people,    those that understand binary                      and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                                          
##  [7] "     <strong>Bolded text is bold and daring</strong>"                                                              
##  [8] "figure1.jpg"                                                                                                       
##  [9] "cover.jpg"                                                                                                         
## [10] "report.pdf"                                                                                                        
## [11] "log.txt"                                                                                                           
## [12] "I'm a one man wolfpack and I weigh 222"                                                                            
## [13] "OMG, a 3-eyed cyclops!!!"                                                                                          
## [14] "2112 is an awesome album."                                                                                         
## [15] "2222 is my PIN"
# Remove HTML/XML tags (basic)
# "<" followed by anything but ">" and ending with ">" 
gsub("<[^>]*>","",someText)
##  [1] "  here's a sentence"                                                                                                
##  [2] "This is me typing at 2:02 in the morning"                                                                           
##  [3] "Back in 1995 I was only 22."                                                                                        
##  [4] "You saw 4 monkeys?"                                                                                                 
##  [5] "There are 10 kinds of people,    those that understand binary       \n              and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                                           
##  [7] "     Bolded text is bold and daring"                                                                                
##  [8] "figure1.jpg"                                                                                                        
##  [9] "cover.jpg"                                                                                                          
## [10] "report.pdf"                                                                                                         
## [11] "log.txt"                                                                                                            
## [12] "I'm a one man wolfpack and I weigh 222"                                                                             
## [13] "OMG, a 3-eyed cyclops!!!"                                                                                           
## [14] "2112 is an awesome album."                                                                                          
## [15] "2222 is my PIN"
# Or better yet, just use the qdapRegex function rm_angle()
rm_angle(someText)
##  [1] "here's a sentence"                                                                         
##  [2] "This is me typing at 2:02 in the morning"                                                  
##  [3] "Back in 1995 I was only 22."                                                               
##  [4] "You saw 4 monkeys?"                                                                        
##  [5] "There are 10 kinds of people, those that understand binary and the other 9 that don't care"
##  [6] "Who likes pancakes? I do. I really really like pancakes!"                                  
##  [7] "Bolded text is bold and daring"                                                            
##  [8] "figure1.jpg"                                                                               
##  [9] "cover.jpg"                                                                                 
## [10] "report.pdf"                                                                                
## [11] "log.txt"                                                                                   
## [12] "I'm a one man wolfpack and I weigh 222"                                                    
## [13] "OMG, a 3-eyed cyclops!!!"                                                                  
## [14] "2112 is an awesome album."                                                                 
## [15] "2222 is my PIN"
# Extract with RegEx ------------------------------------------------------

# The base R functions regexpr() and gregexpr() along with regmatches() can be
# used to extract character matches, but I find str_extract() and
# str_extract_all() in the stringr package easier and faster to use.
# str_extract() extracts the first piece of a string that matches a pattern,
# while str_extract_all() extracts all matches. A few examples:
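
# For reference, a base R version of the first extraction below (a brief
# sketch): regexpr() locates the first match in each element, and regmatches()
# extracts the matched text, silently dropping elements with no match.
regmatches(someText, regexpr("[0-9]{1,2}", someText))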

# Extract one- or two-digit numbers:
# first match
str_extract(someText, "[0-9]{1,2}") 
##  [1] NA   "2"  "19" "4"  "10" NA   NA   "1"  NA   NA   NA   "22" "3"  "21"
## [15] "22"
# all matches; returns a list
str_extract_all(someText, "[0-9]{1,2}") 
## [[1]]
## character(0)
## 
## [[2]]
## [1] "2"  "02"
## 
## [[3]]
## [1] "19" "95" "22"
## 
## [[4]]
## [1] "4"
## 
## [[5]]
## [1] "10" "9" 
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## [1] "1"
## 
## [[9]]
## character(0)
## 
## [[10]]
## character(0)
## 
## [[11]]
## character(0)
## 
## [[12]]
## [1] "22" "2" 
## 
## [[13]]
## [1] "3"
## 
## [[14]]
## [1] "21" "12"
## 
## [[15]]
## [1] "22" "22"
# We can use the base R function unlist() to get just the numbers in a single
# vector:
unlist(str_extract_all(someText, "[0-9]{1,2}"))
##  [1] "2"  "02" "19" "95" "22" "4"  "10" "9"  "1"  "22" "2"  "3"  "21" "12"
## [15] "22" "22"
# Extract a string that contains a . followed by 3 lower-case letters (file
# extensions)
str_extract(someText,"\\.[a-z]{3}")
##  [1] NA     NA     NA     NA     NA     NA     NA     ".jpg" ".jpg" ".pdf"
## [11] ".txt" NA     NA     NA     NA
# just the file extensions without a period (not very elegant but works)
str_extract(someText,"(jpg|tif|pdf|txt)$")
##  [1] NA    NA    NA    NA    NA    NA    NA    "jpg" "jpg" "pdf" "txt"
## [12] NA    NA    NA    NA
# Extract file names made up entirely of letters, ending in .jpg
str_extract(someText, "^[a-z]+\\.jpg")
##  [1] NA          NA          NA          NA          NA         
##  [6] NA          NA          NA          "cover.jpg" NA         
## [11] NA          NA          NA          NA          NA
# to get just the text
tmp <- str_extract(someText, "^[a-z]+\\.jpg")
tmp[!is.na(tmp)]
## [1] "cover.jpg"
# Web scraping ------------------------------------------------------------

# Regular expressions can be very helpful when doing web scraping. Let's scrape
# some data to demonstrate. Simply give a URL as the argument to readLines(),
# which reads in lines of text. The following reads the HTML source of a web
# page into a character vector, one element per line.

# 113th Congress Senate Bills: first 100 results.
senate_bills <- readLines("http://thomas.loc.gov/cgi-bin/bdquery/d?d113:0:./list/bss/d113SN.lst:")

# Notice senate_bills is a vector, not a data frame. Each element of text
# corresponds to one line of HTML code:
senate_bills[1:10]
##  [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
##  [2] "<html xmlns=\"http://www.w3.org/1999/xhtml\">"                                                                                
##  [3] "<head profile=\"http://dublincore.org/documents/2008/08/04/dc-html/\">"                                                       
##  [4] "<title>Bill Summary & Status Search Results - THOMAS (Library of Congress)</title>"                                           
##  [5] "  <meta http-equiv=\"Content-Language\" content=\"en-us\" />"                                                                 
##  [6] "  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\" />"                                             
##  [7] "  <meta http-equiv=\"X-UA-Compatible\" content=\"IE=EmulateIE7\"/>"                                                           
##  [8] "  <link rel=\"stylesheet\" media=\"print\" type=\"text/css\" href=\"/css/loc_print_ss.css\" />"                               
##  [9] "  <style type=\"text/css\">"                                                                                                  
## [10] "    @import url(/css/loc_thomas_results100.css);"
# We'd like to create a data frame that includes bill number, bill title,
# sponsor, and number of co-sponsors.

# In the HTML we see that bill number, title, and sponsor are in lines that
# begin like this: "<p><b>15.</b>". We can use regular expressions to find all
# 1-3 digit numbers followed by a period and </b>.


# grep() can find the indices of such patterns,
k <- grep("[0-9]{1,3}\\.</b>", senate_bills)
k[1:4]
## [1] 30 35 40 45
# Use k to subset the data
temp <- senate_bills[k]
head(temp)
## [1] "<hr/>Items <b>1</b> through <b>100</b> of <b>3020</b><p><b> 1.</b> <a href=\"/cgi-bin/bdquery/D?d113:1:./list/bss/d113SN.lst::\">S.1 </a>:  Immigration Reform that Works for America's Future Act<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Reid++Harry))+00952))\">Sen Reid, Harry</a> [NV]"
## [2] "<p><b> 2.</b> <a href=\"/cgi-bin/bdquery/D?d113:2:./list/bss/d113SN.lst::\">S.2 </a>:  Sandy Hook Elementary School Violence Reduction Act<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Reid++Harry))+00952))\">Sen Reid, Harry</a> [NV]"                                                        
## [3] "<p><b> 3.</b> <a href=\"/cgi-bin/bdquery/D?d113:3:./list/bss/d113SN.lst::\">S.3 </a>:  Strengthen our Schools and Students Act<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Reid++Harry))+00952))\">Sen Reid, Harry</a> [NV]"                                                                    
## [4] "<p><b> 4.</b> <a href=\"/cgi-bin/bdquery/D?d113:4:./list/bss/d113SN.lst::\">S.4 </a>:  Rebuild America Act<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Reid++Harry))+00952))\">Sen Reid, Harry</a> [NV]"                                                                                        
## [5] "<p><b> 5.</b> <a href=\"/cgi-bin/bdquery/D?d113:5:./list/bss/d113SN.lst::\">S.5 </a>:  A bill to reauthorize the Violence Against Women Act of 1994.<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Reid++Harry))+00952))\">Sen Reid, Harry</a> [NV]"                                              
## [6] "<p><b> 6.</b> <a href=\"/cgi-bin/bdquery/D?d113:6:./list/bss/d113SN.lst::\">S.6 </a>:  Putting Our Veterans Back to Work Act of 2013<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Reid++Harry))+00952))\">Sen Reid, Harry</a> [NV]"
tail(temp)
## [1] "<p><b>95.</b> <a href=\"/cgi-bin/bdquery/D?d113:95:./list/bss/d113SN.lst::\">S.95 </a>:  A bill to withhold United States contributions to the United Nations until the United Nations formally retracts the final report of the \"United Nations Fact Finding Mission on the Gaza Conflict\".<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Vitter++David))+01609))\">Sen Vitter, David</a> [LA]"
## [2] "<p><b>96.</b> <a href=\"/cgi-bin/bdquery/D?d113:96:./list/bss/d113SN.lst::\">S.96 </a>:  Rigs to Reefs Habitat Protection Act<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Vitter++David))+01609))\">Sen Vitter, David</a> [LA]"                                                                                                                                                                 
## [3] "<p><b>97.</b> <a href=\"/cgi-bin/bdquery/D?d113:97:./list/bss/d113SN.lst::\">S.97 </a>:  Small Business Paperwork Relief Act of 2013<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Vitter++David))+01609))\">Sen Vitter, David</a> [LA]"                                                                                                                                                          
## [4] "<p><b>98.</b> <a href=\"/cgi-bin/bdquery/D?d113:98:./list/bss/d113SN.lst::\">S.98 </a>:  Local Disaster Contracting Fairness Act of 2013<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Vitter++David))+01609))\">Sen Vitter, David</a> [LA]"                                                                                                                                                      
## [5] "<p><b>99.</b> <a href=\"/cgi-bin/bdquery/D?d113:99:./list/bss/d113SN.lst::\">S.99 </a>:  Natural Disaster Fairness in Contracting Act of 2013<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Vitter++David))+01609))\">Sen Vitter, David</a> [LA]"                                                                                                                                                 
## [6] "<p><b>100.</b> <a href=\"/cgi-bin/bdquery/D?d113:100:./list/bss/d113SN.lst::\">S.100 </a>:  Terminating the Expansion of Too-Big-To-Fail Act of 2013<br /><b>Sponsor:</b> <a href=\"/cgi-bin/bdquery/?&amp;Db=d113&amp;querybd=@FIELD(FLD003+@4((@1(Sen+Vitter++David))+01609))\">Sen Vitter, David</a> [LA]"
# Now replace the HTML tags with a space
temp <- gsub("<[^>]*>", " ",temp)
head(temp)
## [1] " Items  1  through  100  of  3020    1.   S.1  :  Immigration Reform that Works for America's Future Act  Sponsor:   Sen Reid, Harry  [NV]"
## [2] "   2.   S.2  :  Sandy Hook Elementary School Violence Reduction Act  Sponsor:   Sen Reid, Harry  [NV]"                                     
## [3] "   3.   S.3  :  Strengthen our Schools and Students Act  Sponsor:   Sen Reid, Harry  [NV]"                                                 
## [4] "   4.   S.4  :  Rebuild America Act  Sponsor:   Sen Reid, Harry  [NV]"                                                                     
## [5] "   5.   S.5  :  A bill to reauthorize the Violence Against Women Act of 1994.  Sponsor:   Sen Reid, Harry  [NV]"                           
## [6] "   6.   S.6  :  Putting Our Veterans Back to Work Act of 2013  Sponsor:   Sen Reid, Harry  [NV]"
tail(temp)
## [1] "  95.   S.95  :  A bill to withhold United States contributions to the United Nations until the United Nations formally retracts the final report of the \"United Nations Fact Finding Mission on the Gaza Conflict\".  Sponsor:   Sen Vitter, David  [LA]"
## [2] "  96.   S.96  :  Rigs to Reefs Habitat Protection Act  Sponsor:   Sen Vitter, David  [LA]"                                                                                                                                                                 
## [3] "  97.   S.97  :  Small Business Paperwork Relief Act of 2013  Sponsor:   Sen Vitter, David  [LA]"                                                                                                                                                          
## [4] "  98.   S.98  :  Local Disaster Contracting Fairness Act of 2013  Sponsor:   Sen Vitter, David  [LA]"                                                                                                                                                      
## [5] "  99.   S.99  :  Natural Disaster Fairness in Contracting Act of 2013  Sponsor:   Sen Vitter, David  [LA]"                                                                                                                                                 
## [6] "  100.   S.100  :  Terminating the Expansion of Too-Big-To-Fail Act of 2013  Sponsor:   Sen Vitter, David  [LA]"
# split each element on ":"
temp <- strsplit(temp,":")

# Let's see what we have so far:
head(temp)
## [[1]]
## [1] " Items  1  through  100  of  3020    1.   S.1  "                  
## [2] "  Immigration Reform that Works for America's Future Act  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"                                         
## 
## [[2]]
## [1] "   2.   S.2  "                                                 
## [2] "  Sandy Hook Elementary School Violence Reduction Act  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"                                      
## 
## [[3]]
## [1] "   3.   S.3  "                                     
## [2] "  Strengthen our Schools and Students Act  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"                          
## 
## [[4]]
## [1] "   4.   S.4  "                  "  Rebuild America Act  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"      
## 
## [[5]]
## [1] "   5.   S.5  "                                                           
## [2] "  A bill to reauthorize the Violence Against Women Act of 1994.  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"                                                
## 
## [[6]]
## [1] "   6.   S.6  "                                           
## [2] "  Putting Our Veterans Back to Work Act of 2013  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"
# To get the bill numbers we can pull out the first element of each list
# component as follows:
bill <- sapply(temp,function(x)x[1])
head(bill)
## [1] " Items  1  through  100  of  3020    1.   S.1  "
## [2] "   2.   S.2  "                                  
## [3] "   3.   S.3  "                                  
## [4] "   4.   S.4  "                                  
## [5] "   5.   S.5  "                                  
## [6] "   6.   S.6  "
# Now we can use str_extract() to pull out the bill numbers. I've decided to
# keep the "S":
bill <- str_extract(bill, "S\\.[0-9]{1,3}")
head(bill)
## [1] "S.1" "S.2" "S.3" "S.4" "S.5" "S.6"
# Now let's get the bill title. It's in the second element.
temp[[1]]
## [1] " Items  1  through  100  of  3020    1.   S.1  "                  
## [2] "  Immigration Reform that Works for America's Future Act  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"
# pull out second element of each list component
title <- sapply(temp,function(x)x[2])
title[1:4]
## [1] "  Immigration Reform that Works for America's Future Act  Sponsor"
## [2] "  Sandy Hook Elementary School Violence Reduction Act  Sponsor"   
## [3] "  Strengthen our Schools and Students Act  Sponsor"               
## [4] "  Rebuild America Act  Sponsor"
# get rid of " Sponsor" at end
title <- gsub(" Sponsor$","",title)
# get rid of leading and trailing spaces
title <- trimws(title) 
head(title)
## [1] "Immigration Reform that Works for America's Future Act"       
## [2] "Sandy Hook Elementary School Violence Reduction Act"          
## [3] "Strengthen our Schools and Students Act"                      
## [4] "Rebuild America Act"                                          
## [5] "A bill to reauthorize the Violence Against Women Act of 1994."
## [6] "Putting Our Veterans Back to Work Act of 2013"
# Now get the bill sponsor. It's in the third element.
temp[[1]]
## [1] " Items  1  through  100  of  3020    1.   S.1  "                  
## [2] "  Immigration Reform that Works for America's Future Act  Sponsor"
## [3] "   Sen Reid, Harry  [NV]"
sponsor <- sapply(temp,function(x)x[3])
sponsor <- trimws(sponsor) # get rid of leading and trailing spaces
head(sponsor)
## [1] "Sen Reid, Harry  [NV]" "Sen Reid, Harry  [NV]" "Sen Reid, Harry  [NV]"
## [4] "Sen Reid, Harry  [NV]" "Sen Reid, Harry  [NV]" "Sen Reid, Harry  [NV]"
# Get number of cosponsors by first finding those vector elements that contain 
# the string "Cosponsors". We have to be careful: not all bills have cosponsors
# (i.e., "Cosponsors (None)"), but all have the word "Cosponsors".

k <- grep("Cosponsors",senate_bills)
# subset vector to contain only those matching elements
temp <- senate_bills[k]
head(temp)
## [1] "(introduced 1/22/2013) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"/cgi-bin/bdquery/D?d113:1:./list/bss/d113SN.lst:@@@P\">Cosponsors</a> (15)"
## [2] "(introduced 1/22/2013) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"/cgi-bin/bdquery/D?d113:2:./list/bss/d113SN.lst:@@@P\">Cosponsors</a> (16)"
## [3] "(introduced 1/22/2013) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"/cgi-bin/bdquery/D?d113:3:./list/bss/d113SN.lst:@@@P\">Cosponsors</a> (16)"
## [4] "(introduced 1/22/2013) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"/cgi-bin/bdquery/D?d113:4:./list/bss/d113SN.lst:@@@P\">Cosponsors</a> (14)"
## [5] "(introduced 1/22/2013) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"/cgi-bin/bdquery/D?d113:5:./list/bss/d113SN.lst:@@@P\">Cosponsors</a> (31)"
## [6] "(introduced 1/22/2013) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"/cgi-bin/bdquery/D?d113:6:./list/bss/d113SN.lst:@@@P\">Cosponsors</a> (25)"
# Now extract number of cosponsors; either None or a 1-2 digit number.
cosponsors <- str_extract(temp, pattern = "\\([[:alnum:]]{1,4}\\)")
# Get rid of parentheses
cosponsors <- gsub(pattern = "[\\(|\\)]", replacement = "", cosponsors)
# Replace "None" with 0 and convert to numeric
cosponsors <- as.numeric(gsub("None",0,cosponsors))
summary(cosponsors)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    6.91    8.50   61.00
# And finally create data frame
senate_bills <- data.frame(bill, title, sponsor, cosponsors, 
                           stringsAsFactors = FALSE)
head(senate_bills)
##   bill                                                         title
## 1  S.1        Immigration Reform that Works for America's Future Act
## 2  S.2           Sandy Hook Elementary School Violence Reduction Act
## 3  S.3                       Strengthen our Schools and Students Act
## 4  S.4                                           Rebuild America Act
## 5  S.5 A bill to reauthorize the Violence Against Women Act of 1994.
## 6  S.6                 Putting Our Veterans Back to Work Act of 2013
##                 sponsor cosponsors
## 1 Sen Reid, Harry  [NV]         15
## 2 Sen Reid, Harry  [NV]         16
## 3 Sen Reid, Harry  [NV]         16
## 4 Sen Reid, Harry  [NV]         14
## 5 Sen Reid, Harry  [NV]         31
## 6 Sen Reid, Harry  [NV]         25
# What if we wanted to do this for all results? We have to iterate through the URLs.

# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:0:./list/bss/d113SN.lst:[[o]]&items=100&
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:100:./list/bss/d113SN.lst:[[o]]&items=100&
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:200:./list/bss/d113SN.lst:[[o]]&items=100&
# ...  
# http://thomas.loc.gov/cgi-bin/bdquery/d?d113:3000:./list/bss/d113SN.lst:[[o]]&items=100&

# We may also want to create a data frame in advance to store the data.
  
SenateBills <- data.frame(bill=character(3020), title=character(3020), 
                          sponsor=character(3020), 
                          cosponsors=numeric(3020), 
                          stringsAsFactors = FALSE)

# Now cycle through the URLs using the code from above. I suppose ideally I
# would determine the upper bound of my sequence (3000) programmatically, but
# this is a one-off for the 113th Congress so I'm cutting myself some slack.

for(i in seq(0,3000,100)){
  senate_bills <- readLines(paste0("http://thomas.loc.gov/cgi-bin/bdquery/d?d113:",i,":./list/bss/d113SN.lst:"))
  # bill number
  k <- grep("[0-9]{1,3}\\.</b>", senate_bills)
  temp <- senate_bills[k]
  temp <- gsub("<[^>]*>", " ",temp)
  temp <- strsplit(temp,":")
  bill <- sapply(temp,function(x)x[1])
  bill <- str_extract(bill, "S\\.[0-9]{1,4}") # need to increase to 4 digits
  # title
  title <- sapply(temp,function(x)x[2])
  title <- gsub(" Sponsor$","",title)
  title <- trimws(title) 
  # sponsor
  sponsor <- sapply(temp,function(x)x[3])
  sponsor <- trimws(sponsor) 
  # cosponsors
  k <- grep("Cosponsors",senate_bills)
  temp <- senate_bills[k]
  cosponsors <- str_extract(temp, pattern = "\\([[:alnum:]]{1,4}\\)")
  cosponsors <- gsub(pattern = "[\\(|\\)]", replacement = "", cosponsors)
  cosponsors <- as.numeric(gsub("None",0,cosponsors))
  # add to data frame
  rows <- (i+1):(i+length(k))
  SenateBills[rows,] <- data.frame(bill, title, sponsor, cosponsors, stringsAsFactors = FALSE)
}


# For another web scraping tutorial of mine, see:
# https://github.com/UVa-R-Users-Group/meetup/tree/master/2014-10-07-web-scraping

# The rvest package by Hadley Wickham allows you to "Easily Harvest (Scrape) Web
# Pages":
# http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

# The XML package also has some functions for converting HTML tables to data
# frames.
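
# A minimal rvest sketch (my own aside, not part of this lecture; assumes rvest
# is installed). Here it parses a small HTML string rather than a live page:
# install.packages("rvest")
library(rvest)
doc <- read_html("<table><tr><th>x</th><th>y</th></tr><tr><td>1</td><td>2</td></tr></table>")
html_table(doc)  # a list containing one 2-column data frame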


# RegEx within data frames ------------------------------------------------

# Recall our allStocks data. We wanted to add a column indicating which stock 
# each row belongs to. We can use gsub() and regular expressions to easily
# do this.
head(allStocks)
##                 Date  Open  High   Low Close  Volume
## bbby.csv.1 26-Mar-14 67.76 68.05 67.18 67.25 1785164
## bbby.csv.2 25-Mar-14 67.61 67.93 67.34 67.73 1571625
## bbby.csv.3 24-Mar-14 67.73 68.00 66.99 67.26 1742341
## bbby.csv.4 21-Mar-14 68.41 68.41 67.29 67.55 3639114
## bbby.csv.5 20-Mar-14 67.58 68.12 67.52 67.82 1328860
## bbby.csv.6 19-Mar-14 68.40 68.61 67.43 67.89 2116779
# Notice the row name contains the name of the stock. We can extract the row
# names and formally add them to the data frame using the rownames() function.

# first three row names:
rownames(allStocks)[1:3]
## [1] "bbby.csv.1" "bbby.csv.2" "bbby.csv.3"
# extract all row names and add to data frame:
allStocks$Stock <- rownames(allStocks)
head(allStocks)
##                 Date  Open  High   Low Close  Volume      Stock
## bbby.csv.1 26-Mar-14 67.76 68.05 67.18 67.25 1785164 bbby.csv.1
## bbby.csv.2 25-Mar-14 67.61 67.93 67.34 67.73 1571625 bbby.csv.2
## bbby.csv.3 24-Mar-14 67.73 68.00 66.99 67.26 1742341 bbby.csv.3
## bbby.csv.4 21-Mar-14 68.41 68.41 67.29 67.55 3639114 bbby.csv.4
## bbby.csv.5 20-Mar-14 67.58 68.12 67.52 67.82 1328860 bbby.csv.5
## bbby.csv.6 19-Mar-14 68.40 68.61 67.43 67.89 2116779 bbby.csv.6
# Let's reset the row names:
rownames(allStocks) <- NULL
head(allStocks)
##        Date  Open  High   Low Close  Volume      Stock
## 1 26-Mar-14 67.76 68.05 67.18 67.25 1785164 bbby.csv.1
## 2 25-Mar-14 67.61 67.93 67.34 67.73 1571625 bbby.csv.2
## 3 24-Mar-14 67.73 68.00 66.99 67.26 1742341 bbby.csv.3
## 4 21-Mar-14 68.41 68.41 67.29 67.55 3639114 bbby.csv.4
## 5 20-Mar-14 67.58 68.12 67.52 67.82 1328860 bbby.csv.5
## 6 19-Mar-14 68.40 68.61 67.43 67.89 2116779 bbby.csv.6
# Now we find the pattern "\\.csv\\.[0-9]{1,3}" and replace it with nothing.
# Recall that "." is a metacharacter that has to be escaped. [0-9]{1,3} matches
# any 1- to 3-digit number (0-999).
allStocks$Stock <- gsub(pattern = "\\.csv\\.[0-9]{1,3}", 
                        replacement = "", 
                        allStocks$Stock)

# and let's make our new variable a factor:
allStocks$Stock <- factor(allStocks$Stock)
head(allStocks)
##        Date  Open  High   Low Close  Volume Stock
## 1 26-Mar-14 67.76 68.05 67.18 67.25 1785164  bbby
## 2 25-Mar-14 67.61 67.93 67.34 67.73 1571625  bbby
## 3 24-Mar-14 67.73 68.00 66.99 67.26 1742341  bbby
## 4 21-Mar-14 68.41 68.41 67.29 67.55 3639114  bbby
## 5 20-Mar-14 67.58 68.12 67.52 67.82 1328860  bbby
## 6 19-Mar-14 68.40 68.61 67.43 67.89 2116779  bbby
tail(allStocks)
##           Date  Open  High   Low Close  Volume Stock
## 1616 16-Apr-13 64.42 66.23 64.36 66.19 4038612  viab
## 1617 15-Apr-13 66.04 66.23 63.99 64.02 3078540  viab
## 1618 12-Apr-13 66.30 66.63 65.62 66.50 2029401  viab
## 1619 11-Apr-13 65.69 66.76 65.69 66.15 2784995  viab
## 1620 10-Apr-13 63.76 65.77 63.76 65.70 2259979  viab
## 1621  9-Apr-13 65.78 66.09 64.57 64.60 4605824  viab
summary(allStocks$Stock)
## bbby flws foxa  ftd  tfm  twx viab 
##  251  251  251  115  251  251  251
# While we're at it, let's fix the Date. (Currently a factor.)
allStocks$Date <- as.Date(allStocks$Date, format="%d-%b-%y")

# And just for fun, graph closing price over time for all stocks on one graph:
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
## 
##     %+%
ggplot(allStocks, aes(x=Date, y = Close, color=Stock)) + geom_line()

# Now let's finish cleaning up the 2012 election data!
names(electionData)
##  [1] "State"                            "Total Elec Vote"                 
##  [3] "Total.1 Popular Vote"             "Elec Vote D"                     
##  [5] "NA R"                             "NA.1 O"                          
##  [7] "Pop Vote D"                       "NA.2 R"                          
##  [9] "NA.3 I"                           "Margin of Victory Votes"         
## [11] "NA.4 % Total Vote"                "Obama Democratic"                
## [13] "NA.5 NA"                          "Romney Republican"               
## [15] "NA.6 NA"                          "0 Independent"                   
## [17] "NA.7 NA"                          "Johnson Libertarian"             
## [19] "NA.8 NA"                          "Stein Green"                     
## [21] "NA.9 NA"                          "Goode Constitution"              
## [23] "NA.10 NA"                         "Harris Socialist Workers"        
## [25] "NA.11 NA"                         "Alexander Socialist"             
## [27] "NA.12 NA"                         "Lindsay Socialism and Liberation"
## [29] "NA.13 NA"                         "Write-ins -"                     
## [31] "NA.14 NA"                         "Anderson Justice"                
## [33] "NA.15 NA"                         "Hoefling American Ind."          
## [35] "NA.16 NA"                         "Barr Peace & Freedom"            
## [37] "NA.17 NA"                         "None -"                          
## [39] "NA.18 NA"                         "Carlson Grassroots"              
## [41] "NA.19 NA"                         "Morstad Const. Government"       
## [43] "NA.20 NA"                         "Miller American Third Position"  
## [45] "NA.21 NA"                         "Fellure Prohibition"             
## [47] "NA.22 NA"                         "Stevens Objectivist"             
## [49] "NA.23 NA"                         "White Socialist Equality"        
## [51] "NA.24 NA"                         "Barnett Reform"                  
## [53] "NA.25 NA"                         "Terry Independent"               
## [55] "NA.26 NA"                         "Reed Independent"                
## [57] "NA.27 NA"                         "Litzel Independent"              
## [59] "NA.28 NA"                         "Tittle We the People"            
## [61] "NA.29 NA"                         "Duncan Independent"              
## [63] "NA.30 NA"                         "Boss NSA Did 911"                
## [65] "NA.31 NA"                         "Washer Reform"                   
## [67] "NA.32 NA"                         "Baldwin Reform"                  
## [69] "NA.33 NA"                         "Christensen Constitution"        
## [71] "NA.34 NA"                         "NA.35 State"                     
## [73] " NA"                              ".1 EV"                           
## [75] "J NA"                             "S NA"                            
## [77] "H NA"                             "G NA"                            
## [79] ".2 State Code"                    ".3 Blanks"                       
## [81] ".4 EV"                            ".5 Meth"
# I want to drop the "NA.34 NA" column and everything to its right. Frankly I'm
# not sure what those columns contain.

# Get the column number of the column with header "NA.34 NA"
fir <- grep("NA.34 NA", names(electionData))
fir
## [1] 71
# get the column number of the last column in the data frame
las <- ncol(electionData)
las
## [1] 82
# Now subset the data frame; keep all columns except 71-82
electionData <- electionData[,-c(fir:las)]

# drop columns with names of "NA.1, NA.2, etc"; these are proportions. I can
# always derive them later if I want them.
ind <- grep("NA\\.[0-9]{1,2}", names(electionData))
ind
##  [1]  6  8  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [24] 51 53 55 57 59 61 63 65 67 69
electionData <- electionData[,-ind]

# and some final clean up
names(electionData)[3] <- "Total.Popular.Vote" 
names(electionData)[5] <- "Elec Vote R" 
electionData$"Pop Vote D" <- NULL
rownames(electionData) <- NULL

# still some lingering character columns
which(sapply(electionData, is.character))
##                   State         Total Elec Vote      Total.Popular.Vote 
##                       1                       2                       3 
##             Elec Vote D             Elec Vote R Margin of Victory Votes 
##                       4                       5                       6
# convert to numeric
electionData[,2:6] <- sapply(electionData[,2:6], as.numeric)
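# A quick sanity check (just a sketch): as.numeric() turns anything it can't
# parse into NA (with a warning), so it's worth counting NAs in each converted
# column; any nonzero counts here deserve a closer look
sapply(electionData[, 2:6], function(x) sum(is.na(x)))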

# Now our election data contains only vote counts.


# Another extended example ------------------------------------------------

# Let's add occupation names to the arrests data. Recall that the Occup column
# contains numeric codes for occupations. I'd like to make that column a factor
# whose levels attach a descriptive label to each code number.
arrests$Occup[1:5]
## [1] 172  92  43  70  24
# First we read in a document that contains the occupation code numbers and the
# corresponding occupation names. I created it from the codebook that
# accompanied this data.
oc <- readLines("../data/00049-Occupation-codes.txt", warn=FALSE)

# trim whitespace
oc <- trimws(oc)

# Notice all code numbers are in the first three positions. Let's use stringr's
# str_extract() function. Note that we need to convert to integer to match the
# integer codes in the arrests data frame.
codeNums <- as.integer(str_extract(string = oc, pattern = "^[0-9]{3}"))
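# (Since the codes always occupy the first three positions, plain substr() is a
# base R alternative; assuming every code really is three characters wide, this
# should produce the same values as codeNums:)
as.integer(substr(oc, 1, 3))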

# Getting the code names is a little harder. There are probably a dozen
# different ways to proceed from here, but here's how I decided to do it:
# basically, extract everything that isn't a digit.
codeNames <- trimws(str_extract(string = oc, pattern = "[^[:digit:]]+"))
head(codeNames)
## [1] "Alimentation (food) unspecified"     
## [2] "Boucher (Butcher)"                   
## [3] "Boulanger (Baker)"                   
## [4] "Garcons de cafe (Waiters)"           
## [5] "Batiment (Construction), unspecified"
## [6] "Charpentier (Carpenter)"
tail(codeNames)
## [1] "Fumiste (Chimney and stove cleaner)"     
## [2] "Brocanteur (Dealer in second hand goods)"
## [3] "Chiffonier (Rag picker)"                 
## [4] "Cannot read film"                        
## [5] "None listed"                             
## [6] "No information given"
# Now I can make Occup a factor with levels equal to codeNums and labels equal 
# to codeNames. I'm going to make a new column so we can compare to the original
# column.
arrests$Occup2 <- factor(arrests$Occup, levels = codeNums, labels = codeNames)
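# A quick spot check (just for inspection): pair the first few numeric codes
# with their new labels
head(arrests[, c("Occup", "Occup2")])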

# some quick counts; they seem to match our source file
head(summary(arrests$Occup2))
##      Alimentation (food) unspecified                    Boucher (Butcher) 
##                                  287                                   40 
##                    Boulanger (Baker)            Garcons de cafe (Waiters) 
##                                  120                                   14 
## Batiment (Construction), unspecified              Charpentier (Carpenter) 
##                                  652                                  144
tail(summary(arrests$Occup2))
## Brocanteur (Dealer in second hand goods) 
##                                       10 
##                  Chiffonier (Rag picker) 
##                                       16 
##                         Cannot read film 
##                                        0 
##                              None listed 
##                                        0 
##                     No information given 
##                                       23 
##                                     NA's 
##                                      403
# Apparently there are no codes in the data for Cannot Read Film (997) or None
# listed (998), even though they are listed in the codebook.
nrow(subset(arrests, Occup %in% c(997,998)))
## [1] 0
# Which codes are we using that don't have matches in the data?
setdiff(codeNums,arrests$Occup)
## [1] 174 997 998
# 174 = secret society; codebook reports 0 so that makes sense.

# Which codes are in the data that we don't have matches for in codeNums?
setdiff(arrests$Occup, codeNums)
## [1]   1   2 178
# 1, 2, and 178 are not listed in the codebook!

k <- setdiff(arrests$Occup, codeNums)
head(subset(arrests, Occup %in% k, select = c("Occup","Occup2")))
##     Occup Occup2
## 36      1   <NA>
## 116     2   <NA>
## 117     2   <NA>
## 225     2   <NA>
## 346     2   <NA>
## 419     2   <NA>
nrow(subset(arrests, Occup %in% k))
## [1] 403
# 403 records have an Occup code that doesn't match the codebook

# Bottom line: this data, as provided by ICPSR, is a bit dirty.
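# If you wanted those 403 unmatched records to appear as an explicit factor
# level instead of NA, one option is base R's addNA(). A minimal sketch, using
# a throwaway copy (occ2) so we don't alter the column we save below:
occ2 <- addNA(arrests$Occup2)                       # make NA an explicit level
levels(occ2)[is.na(levels(occ2))] <- "Code not in codebook"
table(occ2)["Code not in codebook"]                 # should count the 403 records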

# qdapRegex package -------------------------------------------------------

# The qdapRegex package has some pre-defined functions for Regular Expression 
# Removal, Extraction, and Replacement. Let's explore some of them by way of an
# example.

# I have a text file of studio albums by the Canadian rock band, Rush. Read it
# in:
rushSA <- readLines("rush_albums.txt")
rushSA
##  [1] "    Rush (1974)"                  "    Fly by Night (1975)"         
##  [3] "    Caress of Steel (1975)"       "    2112 (1976)"                 
##  [5] "    A Farewell to Kings (1977)"   "    Hemispheres (1978)"          
##  [7] "    Permanent Waves (1980)"       "    Moving Pictures (1981)"      
##  [9] "    Signals (1982)"               "    Grace Under Pressure (1984)" 
## [11] "    Power Windows (1985)"         "    Hold Your Fire (1987)"       
## [13] "    Presto (1989)"                "    Roll the Bones (1991)"       
## [15] "    Counterparts (1993)"          "    Test for Echo (1996)"        
## [17] "    Vapor Trails (2002)"          "    Feedback (2004, cover album)"
## [19] "    Snakes & Arrows (2007)"       "    Clockwork Angels (2012)"
# I'd like to make a data frame with two columns: album title and year released.
# We'll use qdapRegex functions to do this.

# First let's trim the white space:
rushSA <- trimws(rushSA)

# The qdapRegex package has a function called rm_between() that will
# Remove/Replace/Extract Strings Between 2 Markers. I want to use it to extract
# the album release year from between the parentheses. Note that I have to use
# the extract=TRUE argument:
year <- rm_between(rushSA, left = "(", right=")", extract=TRUE)
# That returns a list; I can use unlist() to make it a vector:
year <- unlist(year)
year
##  [1] "1974"              "1975"              "1975"             
##  [4] "1976"              "1977"              "1978"             
##  [7] "1980"              "1981"              "1982"             
## [10] "1984"              "1985"              "1987"             
## [13] "1989"              "1991"              "1993"             
## [16] "1996"              "2002"              "2004, cover album"
## [19] "2007"              "2012"
# I need to remove the string ", cover album". I could use gsub() to replace it
# with an empty string, but a more general approach is to extract just the
# numbers. Remember the tidyr helper function, extract_numeric?

year <- tidyr::extract_numeric(year)
year
##  [1] 1974 1975 1975 1976 1977 1978 1980 1981 1982 1984 1985 1987 1989 1991
## [15] 1993 1996 2002 2004 2007 2012
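# Note: extract_numeric() has since been deprecated in tidyr; its help page
# points to readr::parse_number(), which does the same job here. An equivalent
# call, assuming the readr package is installed, would have been:
# year <- readr::parse_number(year)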
# Now get the album titles; this time use rm_between() without the extract=TRUE
# argument. This removes everything between the parentheses, including the
# parentheses themselves.
album <- rm_between(rushSA, left = "(", right=")")
album
##  [1] "Rush"                 "Fly by Night"         "Caress of Steel"     
##  [4] "2112"                 "A Farewell to Kings"  "Hemispheres"         
##  [7] "Permanent Waves"      "Moving Pictures"      "Signals"             
## [10] "Grace Under Pressure" "Power Windows"        "Hold Your Fire"      
## [13] "Presto"               "Roll the Bones"       "Counterparts"        
## [16] "Test for Echo"        "Vapor Trails"         "Feedback"            
## [19] "Snakes & Arrows"      "Clockwork Angels"
# And now our data frame: 
rushStudioAlbums <- data.frame(year, album)
head(rushStudioAlbums)
##   year               album
## 1 1974                Rush
## 2 1975        Fly by Night
## 3 1975     Caress of Steel
## 4 1976                2112
## 5 1977 A Farewell to Kings
## 6 1978         Hemispheres
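# One caution: in R versions before 4.0, data.frame() converts character
# columns to factors by default. If you want the album column to stay
# character, you could instead build the data frame like this:
# rushStudioAlbums <- data.frame(year, album, stringsAsFactors = FALSE)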
# There is also a package called stringi that bills itself as "THE string
# processing package for R". As I understand it, stringr is a wrapper for
# stringi. Learn more: http://www.rexamine.com/resources/stringi/

# save data for next set of lecture notes
save(list=c("electionData", "weather", "arrests", "allStocks", "popVa","airplane",
            "SenateBills"), file="../data/datasets_L07.Rda")