
Read a random line in a (large) file in Python

I've been working on a project which requires that I read random lines from a set of files. Unfortunately these files range between 200MB and 2GB, so reading an entire file into memory is incredibly slow, if possible at all. I looked around for a simple way to do this with large files and was unable to find a good one, so here you go:

#!/usr/bin/python

import os
import random

filename = "averylargefile"
file = open(filename, 'r')

# Get the total file size
file_size = os.stat(filename).st_size

while True:
    # Seek to a place in the file which is a random distance away;
    # mod by the file size so that it wraps around to the beginning
    file.seek((file.tell() + random.randint(0, file_size - 1)) % file_size)

    # Don't use the first readline since it may fall in the middle of a line
    file.readline()
    # This returns the next (complete) line from the file
    line = file.readline()
    # If the seek landed inside the last line, readline just hit EOF;
    # wrap around and take the first line instead
    if not line:
        file.seek(0)
        line = file.readline()

    # Here is your random line in the file
    print(line)


Using this method has shown to be MUCH faster than doing something along the lines of:
for line in file:
     #do something
and furthermore, iterating over every line like that does not give you an easy way to pick a random entry from the file.
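For comparison, the full-scan approach alluded to above might look something like the following sketch (the function name is my own; it is not from the original post). It has to read the whole file once just to count the lines, and then scan again to the chosen index:

```python
import random

def random_line_scan(path):
    """Pick a uniformly random line by scanning the whole file twice."""
    # First pass: count the lines
    with open(path) as f:
        count = sum(1 for _ in f)
    # Pick a target line index, then second pass: scan to it
    target = random.randrange(count)
    with open(path) as f:
        for i, line in enumerate(f):
            if i == target:
                return line
```

On a multi-gigabyte file those two passes dominate the cost, which is exactly what the seek-based method avoids.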


Enjoy!

Update: Just a note on the "file.seek" line above: the "file.tell()" call is not strictly necessary, since the line you are currently on has no bearing on the next line you will jump to. Dropping it would also remove the need for the mod operator, since the randint can never be larger than the file size. Furthermore, because you are advancing the file pointer to a random byte offset, you will not get true uniformity unless all lines are the same length: longer lines are more likely to be hit. Since the algorithm takes the line after the random spot it seeked to, the issue is lessened but not resolved. I'm looking for a better and more efficient solution in the meantime.
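One standard way to eliminate that bias entirely, at the cost of reading the file once per pick, is reservoir sampling: keep a running candidate and replace it with the n-th line with probability 1/n. A sketch (again, my own illustration, not from the post above):

```python
import random

def random_line_reservoir(path):
    """Return a uniformly random line in a single pass (reservoir sampling)."""
    chosen = None
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            # Replace the current choice with probability 1/n; after the
            # full pass every line is equally likely to be the survivor.
            if random.randrange(n) == 0:
                chosen = line
    return chosen
```

This trades the seek method's speed for exact uniformity, so it is only worth it when the bias actually matters and a full sequential pass is affordable.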