Read a random line in a (large) file in Python

I've been working on a project which requires that I read random lines from a set of files. Unfortunately these files range between 200MB and 2GB so reading the entire file into memory is incredibly slow if possible at all. I looked around for a simple way to do this with large files and I was unable to find a good one, so here you go:

#!/usr/bin/python

import os,random

filename="averylargefile"
file = open(filename,'r')

#Get the total file size
file_size = os.stat(filename)[6]

while 1:
      #Seek to a place in the file which is a random distance away
      #Mod by file size so that it wraps around to the beginning
      file.seek((file.tell()+random.randint(0,file_size-1))%file_size)

      #dont use the first readline since it may fall in the middle of a line
      file.readline()
      #this will return the next (complete) line from the file
      line = file.readline()

      #here is your random line in the file
      print line


Using this method has shown to be MUCH faster then doing something along the lines of:
for line in file:
     #do something
and furthermore, this method does not make it easy to get a random entry in the file.


Enjoy!

Update: Just a note on the "file.seek" line above, the "file.tell()" call is not strictly necessary since what line you are on currently has no bearing on the next line you will be going to. That would also remove the need for the mod operator since the randint would never be larger than the file. Furthermore, since you are advancing the file pointer to a random spot in the file, if all lines are not an equal length then you will not get true randomness, longer lines will in fact be more likely to be hit. Since, however, the algorithm takes the line after random spot seeked to in the file, this issue is lessened but not resolved. I'm looking for a better and more efficient solution in the meantime.

3 comments:

Gr33n3gg said...

Hmm nice, I'll try it next time when I have some free time.

Bryce Boe said...

Jon,

Try the following a la a question Adam had from a google interview:

----
#!/usr/bin/env python
import random, sys

file = open(sys.argv[1])
count = 0
to_print = cur = file.readline()
while cur:
cur = file.readline()
count += 1
if random.randint(0,count) == 0: # 1/count chance to replace
to_print = cur

print to_print[:-1] # Remove new line at end
---

This guarantees each line has equal probability. For instance the probability that the first line is chosen is:
P(0) = 1/1 * 1/2 * 2/3 * 3/4 * ... * (n-1)/n
which if I'm not mistaken results in 1/n.

If you wanted to do random lines many times you could either repeat this multiple times, or alternatively keep x separate random lines each of which you would calculate a different random value for. To slightly optimize that rather than storing the line in memory you could just store the seek value of the start of the line.

lrq3000 said...

Thank's a lot for this clever code!

Here's a little modification to enhance it a bit, because it will never read the first line of the file since you always skip the first read line:

#dont use the first readline since it may fall in the middle of a line UNLESS we are at the beginning of the file
if file.tell() > 0:
file.readline()

Post a Comment