I am doing a content analysis of a series of PDF magazines, asking questions such as: does a given story specifically mention any of the following: X, Y, Z?
I usually convert the PDFs to TXT, but I wonder whether there is a better format that would preserve more of the structure and metadata. HTML? XML?
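
For illustration, the kind of check described above could be sketched roughly as follows, assuming the story has already been extracted to plain text. The function name `mentions_any` and the sample terms are hypothetical, and whole-word, case-insensitive matching is just one possible way to define "mention".

```python
import re


def mentions_any(story_text, terms):
    """Return the terms that occur in story_text as whole words (case-insensitive).

    Assumes each term starts and ends with a letter or digit, so that
    the \\b word-boundary anchors behave as expected.
    """
    found = []
    for term in terms:
        # re.escape guards against regex metacharacters inside the term
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, story_text, flags=re.IGNORECASE):
            found.append(term)
    return found


if __name__ == "__main__":
    sample = "The mayor visited the new library."  # hypothetical story text
    print(mentions_any(sample, ["mayor", "school", "library"]))
```

A check like this only sees the text itself, which is why the question of a format that keeps headings, story boundaries, and metadata matters.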