Brief assessment of Canadian Hansards on http://parl.canadiana.ca/

I worked for 2.5 days on processing the OCRed PDF Canadian Hansards using the software we created for PoliticalMashup. It is not that easy!
Here I report on my findings.

Download and preprocess

This is quite easy. The PDFs are very large and, more problematically, bound into volumes, but the text can be extracted with standard software.

           curl "http://cos.swiss.canadiana.ca/tdr/oop/508/oop.debates_SDC3501_02/data/sip/data/files/debates_SDC3501_02.pdf?sessioncount=17&signature=9c14a3e9c8a4a329af308d2ab501228da707a8ba&file=tdr%2Foop%2F508%2Foop.debates_SDC3501_02%2Fdata%2Fsip%2Fdata%2Ffiles%2Fdebates_SDC3501_02.pdf&sessionid=9fb7e9255fa60b4cfdaa8265ed0207427a0ee031&portalid=parl&key=1&expires=1397001600" > SenateFrench1995.pdf
           pdftohtml -xml -hidden SenateFrench1995.pdf
           # The output is not well-formed XML. Repair it by stripping all <i> and </i> tags
           # (shown here for the file SenateENG27.1.1.xml).
           sed 's%<[/]*i>%%g' SenateENG27.1.1.xml > ef; mv ef SenateENG27.1.1.xml
           # We have a preprocessor for OCRed PDFs, but its performance is not that good on these files.
           # It seems to get confused by the centered layout of the topic headers. It also does not label headers correctly.
           # The result can be downloaded from http://staff.science.uva.nl/~marx/pub/SenateFrench1995.xml.gz
           # Another example: http://staff.science.uva.nl/~marx/pub/SenateENG27.1.1.xml.gz
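
To see what the sed repair above does, here is a minimal sketch on a single made-up line. The sample string is hypothetical and not taken from the Hansard files. It assumes that the ill-formedness comes from italic tags that overlap with other tags, which is a guess and has not been checked against the actual pdftohtml output.

```shell
# Hypothetical sample line: the <i> and <b> tags overlap, so it is not well-formed XML.
sample='<text top="10" left="20"><i><b>Hon. Member</i></b></text>'

# Same sed expression as in the repair step: delete every <i> and </i> tag.
repaired=$(printf '%s\n' "$sample" | sed 's%<[/]*i>%%g')

printf '%s\n' "$repaired"
```

The price of this repair is that all italic markup is lost, which may matter if italics turn out to be useful for recognising speaker names or headers later on.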
        

pdftohtml has a weird bug on the file CommonsENG24.4.5.pdf; see below.

Backlinks to PDFs on the web

This is easy, as every page has its own URL: http://parl.canadiana.ca/view/oop.debates_HOC2404_05/7 is PDF page 7 of the file http://parl.canadiana.ca/view/oop.debates_HOC2404_05 (House of Commons, 24th parliament, 4th session, volume 5).
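
The URL scheme can be sketched as two small shell functions. The names page_url and describe_id are made up for this illustration. The decoding assumes that every identifier follows the fixed-width layout of the one example above (three letters for the chamber, two digits for the parliament, two for the session, then an underscore and the volume); this has not been verified for other files.

```shell
# Build the public URL of one PDF page from a document id and a page number.
page_url() {
    doc_id=$1
    page=$2
    printf 'http://parl.canadiana.ca/view/%s/%s\n' "$doc_id" "$page"
}

# Decode an id such as oop.debates_HOC2404_05.
# ASSUMPTION: the fixed-width layout is inferred from a single example only.
describe_id() {
    code=${1#oop.debates_}                          # e.g. HOC2404_05
    chamber=$(printf '%s' "$code" | cut -c1-3)      # HOC
    parliament=$(printf '%s' "$code" | cut -c4-5)   # 24
    session=$(printf '%s' "$code" | cut -c6-7)      # 04
    volume=${code#*_}                               # 05
    # Strip one leading zero from each numeric field for display.
    printf 'chamber=%s parliament=%s session=%s volume=%s\n' \
        "$chamber" "${parliament#0}" "${session#0}" "${volume#0}"
}

page_url oop.debates_HOC2404_05 7
describe_id oop.debates_HOC2404_05
```

With these, each page element in the extracted XML could carry a link back to its scanned original, provided the PDF page numbers in the XML match the page numbers used on the site.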

Structuring the text

Test on English Commons from 1961