Background: we want to process a big dataset at high speed. Besides the calculations themselves, I/O is a bottleneck, so what we want to find out is how to optimize these I/O operations down to a tolerable time. Since there are many numpy and pandas calculations involved, the basic data structure is the pandas DataFrame. This article is organized according to Peter's material.
Environment:

- python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
- pandas: 0.19.1
- numpy: 1.11.2
- Processor: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
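For reproducibility, the versions listed above can be printed like this:

```python
import sys

import numpy as np
import pandas as pd

# print the interpreter and library versions used for the benchmarks
print('python :', sys.version)
print('pandas :', pd.__version__)
print('numpy  :', np.__version__)
```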
import pandas as pd
# make the test data: a numpy array of doubles
with timethis('time of pandas.to_csv'):
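The `timethis` helper isn't defined in the excerpt; a minimal sketch of such a timing context manager, together with illustrative test data (the real array shape used in the benchmarks isn't shown), might look like:

```python
import time
from contextlib import contextmanager

import numpy as np
import pandas as pd

@contextmanager
def timethis(label):
    # simple wall-clock timer around a block of code
    start = time.time()
    yield
    print('%s: %.2f s' % (label, time.time() - start))

# test data: a numpy array of doubles wrapped in a DataFrame
# (this shape is illustrative, not the one used in the benchmarks)
test_data = np.random.randn(100000, 10)
test_data_df = pd.DataFrame(test_data)

with timethis('time of pandas.to_csv'):
    test_data_df.to_csv('test.csv')
```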
For Peter it takes 7 min 28 s on an SSD disk; for me, I didn't know for a while since it was still running - -. OK, it's done!
| operation | Peter (SSD) | mine (VDI) |
| --- | --- | --- |
| write | 448 s | 677 s |
Tip: this needs to be run in a Cython environment (e.g. compiled in a `%%cython` notebook cell).
cimport libc.stdio as stdio
| operation | Peter (SSD) | mine (VDI) |
| --- | --- | --- |
| write | 55.3 s | 89.8 s |
No surprise here.
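The excerpt only shows the `cimport` line; the idea behind the Cython version is to format and write each row directly with C stdio calls instead of going through the generic `pandas.to_csv` code path. As a rough illustration of the same hand-rolled row formatting in plain Python (without the Cython/C speedup, and with hypothetical names):

```python
import numpy as np
import pandas as pd

def fast_to_csv(df, filename):
    # hand-rolled CSV writer: format each row ourselves instead of
    # using the generic pandas.to_csv machinery; the Cython version
    # does the same with libc.stdio fprintf for extra speed
    values = df.values
    with open(filename, 'w') as f:
        for row in values:
            f.write(','.join('%.18e' % x for x in row))
            f.write('\n')

df = pd.DataFrame(np.random.randn(1000, 4))
fast_to_csv(df, 'fast.csv')
```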
I have no comparison for this one: it takes 12 min on Peter's machine, and I did not run it on mine!
# put the two-step process into a single function
Save `test_data` directly to a .npy file with `np.save`:
| operation | Peter (SSD) | mine (VDI) |
| --- | --- | --- |
| write | 1.64 s | 8.07 s |
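The `np.save` call itself is a one-liner; a small sketch (the array shape is illustrative) of saving the raw array and reading it back:

```python
import numpy as np

test_data = np.random.randn(1000, 10)

# save the raw array to a binary .npy file; np.load reads it back
np.save('test_data.npy', test_data)
loaded = np.load('test_data.npy')
assert np.array_equal(loaded, test_data)
```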
Save `test_data_df` with the function we wrote:
def hdf5_save_df(filename, dfname, df):
| operation | Peter (SSD) | mine (VDI) |
| --- | --- | --- |
| write | 3.92 s | 22.58 s |
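The body of `hdf5_save_df` isn't shown in the excerpt. A minimal sketch using h5py, assuming the DataFrame's raw values are stored under a single dataset name (pandas' own `HDFStore` would be another way to do this); `hdf5_load_df` is a hypothetical companion for reading back:

```python
import h5py
import numpy as np
import pandas as pd

def hdf5_save_df(filename, dfname, df):
    # store the DataFrame's underlying array as one HDF5 dataset
    with h5py.File(filename, 'w') as f:
        f.create_dataset(dfname, data=df.values)

def hdf5_load_df(filename, dfname):
    # read the dataset back and re-wrap it in a DataFrame
    with h5py.File(filename, 'r') as f:
        return pd.DataFrame(f[dfname][:])

test_data_df = pd.DataFrame(np.random.randn(1000, 10))
hdf5_save_df('test.h5', 'test_data', test_data_df)
```

Note that this round-trips only the values, not the index or column labels, which is part of why it can be so fast.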
On the write side, dumping the raw array to a .npy file is the fastest of these methods, and for saving a full DataFrame to a file, HDF5 is very efficient.
Another conclusion is that my VDI is not as fast as Peter's machine! - -!