Brief assessment of Canadian Hansards on

I worked for 2.5 days on processing the OCRed PDF Canadian Hansards using the software we created for PoliticalMashup. It is not that easy!
Here I report on my findings.

Download and preprocess

This is quite easy. It is a bit problematic that the PDFs are very large, and more so that they are bound in volumes. But we can extract the text with standard software.

           curl "" > SenateFrench1995.pdf
           pdftohtml -xml -hidden SenateFrench1995.pdf
           # The output is not well-formed XML; repair it by stripping the <i> and </i> tags:
           sed 's%<[/]*i>%%g' SenateENG27.1.1.xml > ef; mv ef SenateENG27.1.1.xml
           # We have a preprocessor for OCRed PDFs, but its performance is not that good on these files.
           # It seems to get confused by the centered layout of the topic headers. It also does not label headers correctly.
           # The result can be downloaded from
           # Another example:

pdftohtml has a weird bug on the file CommonsENG24.4.5.pdf, see below.
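
Returning to the repair step in the commands above: it can be wrapped in a small filter, so that it is easy to try on a sample line before running it on a whole volume. This is only a minimal sketch. The function name strip_italics and the sample input are made up for illustration, and the pattern is a slightly simplified form of the one used above.

```shell
# Remove opening and closing <i> tags from the XML on stdin.
# All other tags and the text content are left untouched.
strip_italics() {
  sed 's%</*i>%%g'
}

# Try it on one made-up line in the style of pdftohtml -xml output.
printf '%s\n' '<text><i>Senate</i> <b>Debates</b></text>' | strip_italics
```

On a real file this would be used as strip_italics < in.xml > out.xml. If xmllint is installed, running xmllint --noout on the result is a quick way to check whether it is well formed now.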

Backlinks to PDFs on the web

This is easy, as every page has its own URL: is PDF page 7 from file (which is House of Commons, 24th Parliament, 4th session, volume 5).
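
To build such backlinks automatically, the chamber, parliament, session and volume first have to be read off the file name. The following is a rough sketch of that step only. It assumes that all volume files are named like CommonsENG24.4.5.pdf and SenateENG27.1.1.xml, that is, chamber, language code, then parliament.session.volume. That pattern is inferred from the two names mentioned in this note, and the function name parse_volume_name is made up. Turning the fields into the actual page URL is left out here.

```shell
# Split a volume file name into: chamber language parliament session volume.
# Assumed pattern (inferred, not confirmed): <Chamber><LANG><parl>.<sess>.<vol>
parse_volume_name() {
  name=$(basename "$1")
  # Drop a trailing .pdf or .xml extension, if present.
  name=${name%.pdf}
  name=${name%.xml}
  fields=$(printf '%s\n' "$name" |
    sed -n -E 's/^([A-Z][a-z]+)([A-Z]+)([0-9]+)\.([0-9]+)\.([0-9]+)$/\1 \2 \3 \4 \5/p')
  # Fail when the name does not follow the assumed pattern.
  [ -n "$fields" ] || return 1
  printf '%s\n' "$fields"
}

parse_volume_name CommonsENG24.4.5.pdf   # prints: Commons ENG 24 4 5
```

Names that do not follow the pattern, such as SenateFrench1995.pdf, make the function return a non-zero status, so they can be listed and handled by hand.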

Structuring the text

Test on English Commons from 1961