We spend a lot of time waiting for some data preparation task to finish (the destiny of data scientists, you might say). Well, we can speed things up. Here are two techniques that will come in handy: memory-mapped files and multithreading.
The data
I recently had to extract terms and term frequencies from the Google Books Ngram corpus and found myself wondering if there are ways to speed up the task. The corpus consists of twenty-six files totalling 24GB of data. Each of the files I was interested in contains a term and other metadata, tab separated. The brute-force approach of reading these files as pandas data frames was … slow. Since we wanted only the unique terms and their match counts, I thought I would try to make it faster :-)
Memory-mapped files
This technique is not new. It has been around for a long time and originated in Unix (before Linux!). Briefly, mmap bypasses the usual I/O buffering by mapping the contents of a file into pages of memory. This works very well on computers with a large memory footprint, which is mostly fine with today's desktops and laptops, where 32GB of RAM is no longer esoteric. Python's mmap library mimics most of the Unix functionality and offers a handy readline() function to extract the bytes one line at a time.
import mmap

# map the entire file into memory
mm = mmap.mmap(fp.fileno(), 0)

# iterate over the block, one line at a time, until EOF
for line in iter(mm.readline, b""):
    # convert the bytes to a utf-8 string and split the tab-separated fields
    term = line.decode("utf-8").split("\t")
The fp is a file pointer that was previously opened with the r+b access attribute. There you go: with this simple tweak you have made file reading twice as fast (well, the exact improvement will depend on many things, such as the disk hardware).
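Putting the pieces together, here is a minimal sketch of what a per-file counting helper could look like; the function name count_terms, the column layout, and the use of collections.Counter are illustrative assumptions, not the actual code from the repository.

import mmap
from collections import Counter

def count_terms(path):
    # hypothetical helper: sum the match counts per unique term in one file
    # assumed layout: term \t year \t match_count \t ... (adjust to the real schema)
    counts = Counter()
    with open(path, "r+b") as fp:
        mm = mmap.mmap(fp.fileno(), 0)
        for line in iter(mm.readline, b""):
            fields = line.decode("utf-8").rstrip("\n").split("\t")
            counts[fields[0]] += int(fields[2])
        mm.close()
    return counts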
Multithreading
The next technique that helps make things faster is adding parallelism. In our case, the task was I/O bound, which is a good fit for scaling up, i.e. adding threads. You will find good discussions on the web about when it is better to scale out instead (i.e. multi-processing).
Python 3 has a great standard library, concurrent.futures, for managing a pool of threads and dynamically assigning tasks to them, all with an incredibly simple API.
from concurrent.futures import ThreadPoolExecutor

# use as many threads as reasonable; default: min(32, os.cpu_count() + 4)
with ThreadPoolExecutor() as threads:
    t_res = threads.map(process_file, files)
The default value of max_workers for ThreadPoolExecutor is min(32, os.cpu_count() + 4) as of Python 3.8 (earlier versions used five threads per CPU core). The map() API takes a function and applies it to each member of a list, running the calls as threads become available. Wow. That simple. In less than fifty minutes I had converted the 24GB input into a handy 75MB dataset to be analysed with pandas. Voilà.
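To see how the two techniques combine, here is a sketch that feeds the hypothetical count_terms helper from the mmap example into the thread pool and merges the per-file results; the names aggregate and totals are assumptions for illustration.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def aggregate(files):
    # hypothetical driver: process the files in a thread pool and merge
    # the per-file Counters as the workers complete
    totals = Counter()
    with ThreadPoolExecutor() as threads:
        for partial in threads.map(count_terms, files):
            totals.update(partial)  # Counter.update adds counts together
    return totals

Since the workers spend most of their time waiting on disk I/O, the GIL is not a bottleneck here, which is exactly why threads (rather than processes) pay off.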
The complete code is on GitHub. Comments and remarks are always welcome.
PS: I added a progress bar with tqdm for each thread. I really don't know how they manage to avoid scrambling the lines on the screen … it works like a charm.
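For what it's worth, this is roughly how a per-thread bar can be set up (a sketch with an assumed line-counting worker, not the actual code): tqdm pins each bar to its own terminal row via the position argument and guards its output with an internal lock, which keeps concurrent bars from overwriting each other.

from tqdm import tqdm

def count_lines_with_progress(path, position=0):
    # hypothetical worker: each thread draws its own bar on its own row
    n = 0
    with open(path, "rb") as fp:
        for _ in tqdm(fp, desc=path, position=position, unit=" lines"):
            n += 1
    return n

Each worker can then be given a distinct row, e.g. threads.map(count_lines_with_progress, files, range(len(files))).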
UPDATE: Two years later, this came up :-)