Sometimes you will want to read files without relying on external libraries. This can be achieved using the standard Python function open(), which (surprise) opens a file so that you can either read or write it.
file_to_read = open("data/inflammation-01.csv","r") # this opens it with read-access
file_to_write = open("data/inflammation-new.csv","w") # this opens it with write-access
If you open a file with write-access, it will overwrite any existing file of the same name! If you want to append new content to an existing file, use the mode "a" rather than "w".
Since Python 3.3 there is also "x" as a possible mode. It creates a new file for writing and raises a FileExistsError if the file already exists. This should prevent many tears that would otherwise have been shed over lost data.
file_to_append = open("data/inflammation-new.csv","a") # append
file_to_write = open("data/inflammation-new.csv","x") # write if not existing
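To see how the "x" mode protects an existing file, here is a minimal sketch. It works in a temporary directory (the file name example-new.csv is made up for illustration) so that no real data is touched.

```python
import os
import tempfile

# Hypothetical scratch location so the example does not touch real data.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "example-new.csv")

# First attempt: the file does not exist yet, so "x" creates it.
first = open(path, "x")
first.write("0,1,2\n")
first.close()

# Second attempt: the file exists now, so "x" refuses to open it.
try:
    second = open(path, "x")
    second.close()
    overwritten = True
except FileExistsError:
    overwritten = False

print(overwritten)
```

Catching FileExistsError lets your program decide what to do instead, for example asking for a different file name.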
Now that we've opened our file, what can we do to read its contents? Python offers two easy ways to read a file line by line. Let's discuss the easiest way first:
file_to_read = open("data/inflammation-01.csv","r")
file_content = file_to_read.readlines()
print(file_content[:2])
Using the readlines() method of an open file you can read in the whole content of the file in one go. The output of this method is a list in which each item represents a single line of your file.
print([file_content[0]])
If you look at a single line in the resulting list, you will notice that each one ends with a \n. This is a special character denoting the end of a line, also called a line break. (Depending on the operating system the file was written on, the line break can also be \r\n, as on Windows, or a lone \r, as on very old versions of Mac OS.)
Python strings come with a method called strip() that removes such trailing special characters, along with any other whitespace at the start or end of the string:
print([file_content[0]])
print([file_content[0].strip()])
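Because strip() removes whitespace on both sides, it can be worth comparing it with rstrip("\n"), which only removes line breaks at the end. The following small sketch uses a made-up line with leading spaces to show the difference.

```python
# Made-up example line with leading spaces and a trailing line break.
line = "  0,0,1,3\n"

# strip() removes whitespace (spaces, tabs, line breaks) on both ends.
stripped = line.strip()

# rstrip("\n") only removes line breaks at the end of the string.
right_stripped = line.rstrip("\n")

print([stripped])
print([right_stripped])
```

For the comma-separated files used here either variant works, since the lines do not start with whitespace.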
Unlike when using numpy, we now have the problem that the measurements for each patient are still a single string instead of separate, numeric entries. Python strings offer the method split() to deal with this.
line = file_content[0].strip()
print(line)
splitted_line = line.split(",")
print(splitted_line)
split() accepts a parameter that lets you specify the separator on which the string should be split. (If you need to split at a TAB, use \t as the separator.)
Unfortunately this still doesn't give us what we were looking for, as the individual measurements are still strings and not numbers. We could now loop over each item in splitted_line and change the type, but that would be cumbersome. Instead we can use a trick called a list comprehension, which allows you to perform an operation on each element of a list:
line = file_content[0].strip()
converted_line = [float(element) for element in line.split(",")]
print(converted_line)
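The steps so far (strip, split, convert) can be bundled into a small helper and applied to every line. This is only a sketch: the function name parse_line and the sample lines are made up, the latter standing in for the output of readlines().

```python
def parse_line(line):
    """Turn one comma-separated line into a list of floats."""
    return [float(element) for element in line.strip().split(",")]


# Made-up lines standing in for the list returned by readlines().
sample_lines = ["0,0,1,3\n", "0,1,2,1\n"]

# A second list comprehension applies the helper to each line.
parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

Note that float() raises a ValueError if an entry is not a number, so a header line would have to be skipped first.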
Earlier we said that there are two ways to read in a file. Using .readlines() will read in the complete file content at once. Depending on the file size this can be a problem for your computer's memory, and you would rather go line by line. To achieve this you can simply create a for loop over your opened file.
for line in open("data/small-01.csv","r"):
    print(line.strip())
Together with open(filename,"w") this will allow you to manipulate the content of a file and write out the result without having to hold everything in memory at once.
This will also allow you to perform operations on each line in a single go.
for line in open("data/small-01.csv","r"):
    processed_line = [float(element) for element in line.strip().split(",")]
    print(processed_line)
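Combining line-by-line reading with open(filename,"w"), as described above, could look like the following sketch. It converts a comma-separated file into a tab-separated one, handling one line at a time. The file names and contents are made up and live in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch files so the example is self-contained.
scratch_dir = tempfile.mkdtemp()
in_path = os.path.join(scratch_dir, "input.csv")
out_path = os.path.join(scratch_dir, "output.tsv")

# Create a small made-up input file.
source = open(in_path, "w")
source.write("0,0,1\n0,1,2\n")
source.close()

# Read one line at a time and write the changed line straight away,
# so only a single line is held in memory at any moment.
output = open(out_path, "w")
for line in open(in_path, "r"):
    output.write(line.replace(",", "\t"))
output.close()

# Read the result back in to check it.
result = open(out_path, "r")
converted = result.readlines()
result.close()
print(converted)
```

Since each line keeps its own line break, nothing has to be added when writing it back out.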
If you didn't use the for line in open() structure but assigned your open file to a variable (e.g. my_file = open("data/small-01.csv")), you will have to close the file yourself once you are done with it.
For writing files this is also the point in time when your output is actually written to the file.
my_file = open("data/small-01.csv","r")
lines = my_file.readlines()
my_file.close()
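Python also offers the with statement, which closes the file for you as soon as the indented block ends. It is not used elsewhere in this text, so treat the following as an optional alternative. The file name notes.txt is made up and the file is created in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch file so the example is self-contained.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "notes.txt")

# The file is closed automatically when the indented block ends,
# which is also when the written text is guaranteed to be on disk.
with open(path, "w") as my_file:
    my_file.write("first line\nsecond line\n")

print(my_file.closed)

# The same pattern works for reading.
with open(path, "r") as my_file:
    lines = my_file.readlines()

print(lines)
```

The file is closed even if an error occurs inside the block, which is easy to get wrong when calling close() by hand.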
Now let's try writing something to a file and read it back in.
file_to_write = open("data/inflammation-new.csv","w")
file_to_write.write("this is some text.\n and here is some more.")
file_to_write.close()
file_to_read = open("data/inflammation-new.csv","r")
lines = file_to_read.readlines()
print(lines)
Ok, this worked just fine. But what happens if we try to write one of our processed lines to a new file?
processed_line = [0.0, 0.0, 1.0]
file_to_write = open("data/inflammation-new.csv","w")
file_to_write.write(processed_line)
We can only write strings to a file, which is why this does not work. To write our processed data we have to convert each element of the list back into a string. We can again use a list comprehension to do this.
To turn the resulting list into a comma-separated string we can use the join() method.
processed_line = [0.0, 0.0, 1.0]
processed_line = [str(element) for element in processed_line]
print(processed_line)
processed_line = ",".join(processed_line)
print(processed_line)
# the separator does not have to be ","; any string works
print(" NEXTELEMENT ".join(["a","b","c"]))
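Putting str(), join() and write() together, several processed lines could be written to a file as sketched below. The data and the file name are made up, and the line break is added by hand because join() does not add one.

```python
import os
import tempfile

# Hypothetical scratch file so the example is self-contained.
scratch_dir = tempfile.mkdtemp()
out_path = os.path.join(scratch_dir, "processed.csv")

# Made-up processed data: one list of numbers per line.
processed_lines = [[0.0, 0.0, 1.0], [0.0, 1.0, 2.0]]

file_to_write = open(out_path, "w")
for processed_line in processed_lines:
    # Convert every number to a string, then join with commas.
    text = ",".join([str(element) for element in processed_line])
    # write() does not add a line break on its own.
    file_to_write.write(text + "\n")
file_to_write.close()

# Read the file back in to check the result.
file_to_read = open(out_path, "r")
written = file_to_read.readlines()
file_to_read.close()
print(written)
```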
for line in open("data/small-01.csv","r"):
    processed_line = [float(element) for element in line.strip().split(",")]
my_file = open("data/small-01.csv","r")
file_content = my_file.readlines()
print(file_content[0])
for line in my_file:
    print(line)
You are given two files, data/merge-1.csv and data/merge-2.csv. Each line in these files looks like this:
File 1
patient_a,1,0,2,3,4
patient_b,2,3,1,0,23
…
File 2
patient_c,3,0,1,1,2
patient_a,4,5,3,0,2
…
Each patient is represented in both files, but the files are not in the same order. You want to merge the data for each patient from the two files into a single file and save the output.
The result should contain a single line per patient, with 10 values per patient.
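One possible approach to this task, not the only one, is to read each file into a dictionary keyed by patient name and then combine the entries. In the sketch below the function names, the temporary file names, the output order (sorted by name, values of file 1 first) and the second value row for patient_b are all assumptions made for illustration.

```python
import os
import tempfile


def read_patients(path):
    """Return a dict mapping each patient name to its list of values."""
    patients = {}
    for line in open(path, "r"):
        fields = line.strip().split(",")
        if fields == [""]:
            continue  # skip empty lines
        patients[fields[0]] = fields[1:]
    return patients


def merge_files(path_1, path_2, out_path):
    """Write one line per patient with the values from both files."""
    first = read_patients(path_1)
    second = read_patients(path_2)
    output = open(out_path, "w")
    # Assumes every patient in file 1 also appears in file 2.
    for name in sorted(first):
        values = first[name] + second[name]
        output.write(",".join([name] + values) + "\n")
    output.close()


# Small made-up files standing in for the two merge files.
scratch_dir = tempfile.mkdtemp()
path_1 = os.path.join(scratch_dir, "part-1.csv")
path_2 = os.path.join(scratch_dir, "part-2.csv")
out_path = os.path.join(scratch_dir, "merged.csv")

handle = open(path_1, "w")
handle.write("patient_a,1,0,2,3,4\npatient_b,2,3,1,0,23\n")
handle.close()
handle = open(path_2, "w")
handle.write("patient_b,9,8,7,6,5\npatient_a,4,5,3,0,2\n")
handle.close()

merge_files(path_1, path_2, out_path)

handle = open(out_path, "r")
merged = handle.readlines()
handle.close()
print(merged)
```

Using the patient name as the dictionary key is what makes the differing line order of the two files irrelevant.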