Here we again use the function glob() from the module glob. Remember, glob() has the advantage over other methods (e.g. the function os.listdir()) that you can filter for files matching a certain pattern. In our case, we will attempt to get a list of all documents ending with .pdf.
1.1 First, we need to specify the path where we want to look for PDF files. Next, we need to define the search pattern that matches all PDF documents. We will use the file ending .pdf.
What is new?
import glob
pdfpath = 'data/pdf-exercise/pdflib'
pattern = '*.pdf' ## our search pattern: any filename ending with .pdf
combined = pdfpath + '/' + pattern ## concatenate the path with the search pattern into one string
print ("I will look for files ending with", pattern, "in the directory", pdfpath, "\n")
1.2 In this step we extract the filenames of all files ending with .pdf in our directory path.
What is new?
dircont = glob.glob(combined)
print(dircont)
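To see the filtering at work without touching the exercise data, here is a minimal sketch that compares glob.glob() with os.listdir(). It assumes a temporary directory filled with a few made-up files; the file names are hypothetical and not part of the exercise.

```python
import glob
import os
import tempfile

# Create a throwaway directory with hypothetical file names for illustration
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ["paper1.pdf", "paper2.pdf", "notes.txt"]:
        open(os.path.join(tmpdir, name), "w").close()

    # os.listdir() returns every entry, without the directory prefix
    everything = sorted(os.listdir(tmpdir))

    # glob.glob() returns only the matches, including the directory prefix
    pdf_only = sorted(glob.glob(os.path.join(tmpdir, "*.pdf")))
    pdf_names = [os.path.basename(p) for p in pdf_only]

print(everything)
print(pdf_names)
```

Both functions return their entries in arbitrary order, which is why the sketch sorts the results before printing them.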
1.3 Next, we want to reformat the list dircont such that we only retain the filenames. We will use a split command here, putting the leading path components into a list called dirpath and the filename into a string variable called file. Note the use of the asterisk in front of the variable dirpath: it collects all elements except the last one.
What is new?
filelist = [] ## initialize an empty list
for file in dircont:
    if "/" in file: ## execute the block only if the filename is preceded by a path
        *dirpath, file = file.split("/")
    else:
        pass
    print (file)
    filelist.append(file)
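The effect of the asterisk is easiest to see on a single string. The following sketch uses a made-up example path (an assumption for illustration only) and also shows os.path.basename() from the standard library, which is one possible alternative for extracting the filename.

```python
import os

# A hypothetical path, used only for illustration
example = "data/pdf-exercise/pdflib/paper1.pdf"

# The starred name collects all leading parts in a list,
# the last part goes into the plain variable
*dirpath, filename = example.split("/")
print(dirpath)
print(filename)

# os.path.basename() extracts the last path component directly
basename = os.path.basename(example)
print(basename)
```

Without the asterisk, the assignment would fail as soon as the path has more than two components, because the number of variables would no longer match the number of list elements.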
To achieve this we need to first see how to execute an external program from within our Python code. We will use the module subprocess for this purpose.
2.1 Let's first import the module into our script.
import subprocess
2.2 Next, we specify the program that we want to call from our script. Here we use ls as an example to list the contents of a directory. Note that this is another way to access directory contents.
command = 'ls'
print ("The command you want to execute reads: ", command + ' ' + pdfpath, '\n')
2.3 Now that we have specified the program that we want to execute, we can run it by making a call to the operating system.
What is new?
# just take the syntax for granted for now
pobj = subprocess.Popen([command, pdfpath], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, error = pobj.communicate() # the function communicate() returns a tuple; error is None here because stderr is merged into stdout
# we want to learn something about the data type of the variable output using the function type()
ot = type(output)
print ("The data type of my variable is: ", ot, '\n')
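The same pattern can be tried on any system, also where ls is not available. As a sketch, the block below starts the current Python interpreter (via sys.executable) as a stand-in for an external program; the one-line program it runs is an assumption made for illustration.

```python
import subprocess
import sys

# Run the current Python interpreter as a stand-in for an external program
pobj = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # merge error messages into stdout
)
output, error = pobj.communicate()

print(type(output))  # bytes, because no text mode was requested
print(error)         # None, because stderr was redirected into stdout
```

The first element of the tuple holds everything the program wrote to the screen; the second would only be filled if stderr had its own pipe.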
2.4 We have successfully executed the command and captured its output via pobj.communicate() in the variable output. This, however, is of type bytes. We will now decode it into a variable of type str.
What is new?
## type conversion using the method decode()
output = output.decode()
print ("The data type of my variable is now:", type(output), '\n')
print(output)
output = output.strip() ## remember, this gets rid of the last '\n' in the string.
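The conversion can also be followed step by step on a small, hand-written bytes object. The listing below is a sketch; the two file names are made up for illustration.

```python
# A bytes object as it might be returned by communicate(); the file names are made up
raw = b"paper1.pdf\npaper2.pdf\n"

text = raw.decode("utf-8")  # decode() defaults to UTF-8; stated here explicitly
text = text.strip()         # remove the trailing newline

print(type(raw).__name__, "->", type(text).__name__)
print(repr(text))
```

Printing with repr() makes the remaining '\n' between the two names visible, which is what the next step splits on.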
2.5 Although the output already looks like a list, it is in fact just a string. You can check this with print(output[0]), which prints only the first character. We will therefore now convert it into a list using the function split().
What is new?
pdflist = output.split("\n")
print ("The directory", pdfpath, "has the following content:")
## now print the output to the screen (stdout)
for entry in pdflist:
    print ('filename is:', entry)
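The difference between indexing the string and indexing the list can be illustrated with a short sketch. The listing is hypothetical; splitlines() is shown as a possible alternative to split("\n").

```python
text = "paper1.pdf\npaper2.pdf"  # hypothetical listing

first_char = text[0]          # indexing a string yields one character
entries = text.split("\n")    # split() yields a list of lines
first_entry = entries[0]      # indexing the list yields a whole file name

# splitlines() gives the same result here and also copes with '\r\n'
same = text.splitlines()

print(first_char)
print(first_entry)
```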
2.6 In the next step we apply the subprocess call to convert a PDF into a plain text file. This is necessary as there is no straightforward method in Python to extract contents directly from a PDF. To do so, we first need to specify the filename for the output file. We want to name the outfile just like the infile but with a different ending. So we use the function split() and define the period '.' as the position at which to split the string. As a filename can contain more than one '.', we will store everything up to the last '.' in a list filecomp (using the asterisk as in step 1.3) and the file ending in the variable ending. Subsequently, we rejoin all list elements with '.' and append the file ending 'txt', so that the name ends in .txt.
What is new?
## determine the output path
txtpath = 'data/pdf-exercise/txtlib'
filelist = {} ## maps each pdf filename to the name of its txt file
for pdf in pdflist:
    ## split at every '.'; the last element is the file ending
    *filecomp, ending = pdf.split('.')
    filename = ''
    for i in range(len(filecomp)):
        filename = filename + filecomp[i] + '.'
    ## now we append the ending 'txt' to the filename, which already ends with '.'
    txtfile = filename + 'txt'
    filelist[pdf] = txtfile
    print (pdf)
    print (txtfile)
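As an alternative to splitting by hand, the standard library function os.path.splitext() separates the last file ending from the rest of the name. The helper txt_name below is a hypothetical name introduced only for this sketch, and the file names are made up.

```python
import os

def txt_name(pdfname):
    """Return pdfname with its last file ending replaced by '.txt' (hypothetical helper)."""
    stem, ending = os.path.splitext(pdfname)
    return stem + ".txt"

for name in ["paper.pdf", "smith.et.al.2010.pdf"]:
    print(name, "->", txt_name(name))
```

One difference to the split() approach: for a name without any '.', splitext() returns the whole name as stem, so the result is still a sensible filename.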
2.7 We will now proceed to the actual file conversion. For this purpose we will call the external program pdf2txt.py. The syntax for using this Python script in the shell is pdf2txt.py -o paper.txt paper.pdf. We will put together this program call in the script and then execute it via subprocess. Note that, as file conversion takes some time, this routine checks whether the converted txt file already exists.
What is new?
import os
## compile the program call
command = 'pdf2txt.py'
for currpdf, currtxt in filelist.items():
    if os.path.exists(txtpath + '/' + currtxt):
        pass ## the pdf has already been converted and we do nothing
    else:
        outfilecomp = '-o ' + txtpath + '/' + currtxt
        toexecute = command + ' ' + outfilecomp + ' ' + pdfpath + '/' + currpdf
        print ("We will execute the following command: ", toexecute)
        ## execute the program call. Note the shell=True part, that tells Python to execute the command via a shell
        p = subprocess.Popen(toexecute, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out, error = p.communicate() ## wait until the conversion has finished and collect its output
print ("Done with conversion")
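A command assembled as one string breaks when a filename contains a space, because the shell then sees two arguments. One possible way around this is to pass the program and its arguments as a list, as in step 2.3. The sketch below only builds such a list; build_call is a hypothetical helper, the directory and file names are made up, and the argument order follows the pdf2txt.py syntax quoted above.

```python
import os

def build_call(pdfdir, txtdir, pdfname, txtname, program="pdf2txt.py"):
    """Assemble the argument list for one conversion (hypothetical helper)."""
    return [program, "-o", os.path.join(txtdir, txtname), os.path.join(pdfdir, pdfname)]

args = build_call("pdflib", "txtlib", "my paper.pdf", "my paper.txt")
print(args)

# The list could then be handed to subprocess without shell=True, e.g.
# subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
```

Each list element reaches the program as exactly one argument, so no quoting is needed. Whether pdf2txt.py can be started this way depends on how it is installed on your system.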
2.8 Now that we have generated the txt version of each PDF, we want to analyse the contents of the publications. To do so, we first have to compile a list of keywords that we want to search for. We will write a custom function createDict to extract the keywords from a comma-separated string entered by the user. These keywords then serve as keys in a dictionary.
What is new?
######## a small function to turn elements of a comma separated string into a dictionary
## Note the default value for the parameter keystring
def createDict(keystring='bioinformatics, phylogeny, tree, alignment, evolution'):
    keywordlist = keystring.split(',')
    keyworddict = {}
    for entry in keywordlist:
        entry = entry.strip()
        keyworddict[entry] = 0
    return keyworddict
## function resetCounts resets the values of a dictionary to 0 for all keys
def resetCounts(keyworddict):
    for key in keyworddict:
        keyworddict[key] = 0
    return keyworddict
######## End of the custom function definitions
## Prompt the user for input
keys = input("Please provide a comma separated list of keywords or leave empty for the default keywords:")
if len(keys) > 0:
    keywords = createDict(keys)
else:
    # stick with the default list specified in the definition of createDict
    keywords = createDict()
keylist = list(keywords.keys()) ## a real list, so that it can be reused for the plot labels later
for item in keylist:
    print (item)
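User input is often untidy, for example with extra spaces or a trailing comma. The following sketch shows one possible way to handle this with a dictionary comprehension; parse_keywords is a hypothetical name and not part of the exercise code.

```python
def parse_keywords(keystring):
    """Turn 'a, b ,c' into {'a': 0, 'b': 0, 'c': 0}; empty entries are skipped."""
    return {entry.strip(): 0 for entry in keystring.split(",") if entry.strip()}

counts = parse_keywords("tree, alignment , evolution,")
print(counts)
```

Skipping empty entries matters here because an empty string is contained in every line of text and would therefore be counted for every line in a later search.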
2.9 Next, we load the contents of the individual converted PDF files into memory and search for the keywords.
What is new?
import codecs
count = 0
result = []
for currpdf in pdflist:
    print ("\t", currpdf)
    currtxt = filelist[currpdf]
    ## the with-statement closes the file again once its contents have been read
    with codecs.open(txtpath + '/' + currtxt, 'r', encoding='UTF-8') as filehandle:
        content = filehandle.readlines()
    for lines in content:
        for entry in keywords.keys():
            if entry in lines: ## each line containing the keyword is counted once
                keywords[entry] += 1
    intlist = []
    for entry in keylist:
        print ("\t\t", entry, ':', keywords[entry], "times")
        intlist.append(keywords[entry])
    result.append(intlist)
    count += 1
    keywords = resetCounts(keywords)
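Testing with the in operator records how many lines contain a keyword, not how often the keyword occurs. The sketch below contrasts the two on two made-up lines of text; str.count() is used here as one possible way to count every occurrence.

```python
# Two hypothetical lines of text, used only for illustration
lines = [
    "A tree is a tree.\n",
    "The alignment was trimmed.\n",
]
keywords = {"tree": 0, "alignment": 0}
line_hits = dict(keywords)    # number of lines that contain the keyword
occurrences = dict(keywords)  # number of times the keyword occurs

for line in lines:
    for key in keywords:
        if key in line:
            line_hits[key] += 1
        occurrences[key] += line.count(key)

print(line_hits)
print(occurrences)
```

Both variants are case-sensitive and match substrings, so "tree" is also found inside "trees" but not in "Tree".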
2.10 Finally, we visualize our results in a heatmap using numpy and matplotlib.pyplot.
What is new?
import numpy as np
import matplotlib.pyplot as plt
## IPython/Jupyter magic command; omit it when running the code as a plain Python script
%matplotlib inline
## we will have to transform the nested list _result_ from the previous routine
## into an array
data = np.array(result)
## we hand now the data over to matplotlib.pyplot to generate an image object
image = plt.imshow(data)
## Now we modify the tick labels of the x- and the y-axis
## to match the pdfnames and the keywords
plt.xticks(np.arange(len(keylist)), list(keylist), rotation=-90)
plt.yticks(np.arange(len(pdflist)), pdflist)
## and we plot the heatmap
plt.show()