@@ -3279,88 +3279,89 @@ External Compatibility
``HDFStore`` writes ``table`` format objects in specific formats suitable for
producing loss-less round trips to pandas objects. For external
compatibility, ``HDFStore`` can read native ``PyTables`` format
-tables. It is possible to write an ``HDFStore`` object that can easily
-be imported into ``R`` using the
+tables.
+
+It is possible to write an ``HDFStore`` object that can easily be imported into ``R`` using the
``rhdf5`` library (`Package website`_). Create a table format store like this:

.. _package website: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html

-.. ipython:: python
+.. ipython:: python
+
+    np.random.seed(1)
+    df_for_r = pd.DataFrame({"first": np.random.rand(100),
+                             "second": np.random.rand(100),
+                             "class": np.random.randint(0, 2, (100,))},
+                            index=range(100))
+    df_for_r.head()

-   np.random.seed(1)
-   df_for_r = pd.DataFrame({"first": np.random.rand(100),
-                            "second": np.random.rand(100),
-                            "class": np.random.randint(0, 2, (100,))},
-                           index=range(100))
-   df_for_r.head()
+    store_export = HDFStore('export.h5')
+    store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)
+    store_export

-   store_export = HDFStore('export.h5')
-   store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)
-   store_export
+.. ipython:: python
+    :suppress:

-.. ipython:: python
-   :suppress:
+    store_export.close()
+    import os
+    os.remove('export.h5')

-   store_export.close()
-   import os
-   os.remove('export.h5')
-
In R this file can be read into a ``data.frame`` object using the ``rhdf5``
library. The following example function reads the corresponding column names
and data values from the value nodes and assembles them into a ``data.frame``:

-.. code-block:: R
-
-   # Load values and column names for all datasets from corresponding nodes and
-   # insert them into one data.frame object.
-
-   library(rhdf5)
-
-   loadhdf5data <- function(h5File) {
-     listing <- h5ls(h5File)
-     # Find all data nodes, values are stored in *_values and corresponding column
-     # titles in *_items
-     data_nodes <- grep("_values", listing$name)
-     name_nodes <- grep("_items", listing$name)
-     data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
-     name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
-     columns <- list()
-     for (idx in seq(data_paths)) {
-       # NOTE: matrices returned by h5read have to be transposed to obtain
-       # required Fortran order!
-       data <- data.frame(t(h5read(h5File, data_paths[idx])))
-       names <- t(h5read(h5File, name_paths[idx]))
-       entry <- data.frame(data)
-       colnames(entry) <- names
-       columns <- append(columns, entry)
-     }
-
-     data <- data.frame(columns)
-
-     return(data)
-   }
+.. code-block:: R
+
+    # Load values and column names for all datasets from corresponding nodes and
+    # insert them into one data.frame object.
+
+    library(rhdf5)
+
+    loadhdf5data <- function(h5File) {
+      listing <- h5ls(h5File)
+      # Find all data nodes, values are stored in *_values and corresponding column
+      # titles in *_items
+      data_nodes <- grep("_values", listing$name)
+      name_nodes <- grep("_items", listing$name)
+      data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
+      name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
+      columns <- list()
+      for (idx in seq(data_paths)) {
+        # NOTE: matrices returned by h5read have to be transposed to obtain
+        # required Fortran order!
+        data <- data.frame(t(h5read(h5File, data_paths[idx])))
+        names <- t(h5read(h5File, name_paths[idx]))
+        entry <- data.frame(data)
+        colnames(entry) <- names
+        columns <- append(columns, entry)
+      }
+
+      data <- data.frame(columns)
+
+      return(data)
+    }

Now you can import the ``DataFrame`` into R:

-.. code-block:: R
-
-   > data = loadhdf5data("export.h5")
-   > head(data)
-            first    second class
-   1 0.4170220047 0.3266449     0
-   2 0.7203244934 0.5270581     0
-   3 0.0001143748 0.8859421     1
-   4 0.3023325726 0.3572698     1
-   5 0.1467558908 0.9085352     1
-   6 0.0923385948 0.6233601     1
-
+.. code-block:: R
+
+    > data = loadhdf5data("export.h5")
+    > head(data)
+             first    second class
+    1 0.4170220047 0.3266449     0
+    2 0.7203244934 0.5270581     0
+    3 0.0001143748 0.8859421     1
+    4 0.3023325726 0.3572698     1
+    5 0.1467558908 0.9085352     1
+    6 0.0923385948 0.6233601     1
+
.. note::
   The R function lists the entire HDF5 file's contents and assembles the
   ``data.frame`` object from all matching nodes, so use this only as a
   starting point if you have stored multiple ``DataFrame`` objects to a
   single HDF5 file.
-
+
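The caveat in the note above can be seen from the Python side: every ``DataFrame`` appended to the same HDF5 file gets its own group of nodes, so a loader that scans the whole file picks up all of them. A minimal sketch (file and key names here are illustrative, and PyTables must be installed):

```python
import os
import numpy as np
import pandas as pd

# Two unrelated frames written to one file (hypothetical names).
df_a = pd.DataFrame({"x": np.arange(3)})
df_b = pd.DataFrame({"y": np.arange(3.0, 6.0)})

with pd.HDFStore("multi_sketch.h5", mode="w") as store:
    store.append("df_a", df_a)
    store.append("df_b", df_b)
    # Each key is a separate group with its own *_values / *_items nodes,
    # so a whole-file scan would mix columns from both frames.
    keys = store.keys()

os.remove("multi_sketch.h5")
```

A loader that should return only one frame therefore needs to restrict itself to a single group (e.g. ``/df_a``) rather than matching node names file-wide.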
Backwards Compatibility
~~~~~~~~~~~~~~~~~~~~~~~

@@ -3374,53 +3375,53 @@ method ``copy`` to take advantage of the updates. The group attribute
number of options, please see the docstring.


-.. ipython:: python
-   :suppress:
+.. ipython:: python
+    :suppress:

-   import os
-   legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')
+    import os
+    legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')

-.. ipython:: python
+.. ipython:: python

-   # a legacy store
-   legacy_store = HDFStore(legacy_file_path, 'r')
-   legacy_store
+    # a legacy store
+    legacy_store = HDFStore(legacy_file_path, 'r')
+    legacy_store

-   # copy (and return the new handle)
-   new_store = legacy_store.copy('store_new.h5')
-   new_store
-   new_store.close()
+    # copy (and return the new handle)
+    new_store = legacy_store.copy('store_new.h5')
+    new_store
+    new_store.close()

-.. ipython:: python
-   :suppress:
+.. ipython:: python
+    :suppress:

-   legacy_store.close()
-   import os
-   os.remove('store_new.h5')
+    legacy_store.close()
+    import os
+    os.remove('store_new.h5')


Performance
~~~~~~~~~~~

-- ``Tables`` come with a writing performance penalty as compared to
-  regular stores. The benefit is the ability to append/delete and
-  query (potentially very large amounts of data). Write times are
-  generally longer as compared with regular stores. Query times can
-  be quite fast, especially on an indexed axis.
-- You can pass ``chunksize=<int>`` to ``append``, specifying the
-  write chunksize (default is 50000). This will significantly lower
-  your memory usage on writing.
-- You can pass ``expectedrows=<int>`` to the first ``append``,
-  to set the TOTAL number of rows that ``PyTables`` will
-  expect. This will optimize read/write performance.
-- Duplicate rows can be written to tables, but are filtered out in
-  selection (with the last items being selected; thus a table is
-  unique on major, minor pairs).
-- A ``PerformanceWarning`` will be raised if you are attempting to
-  store types that will be pickled by PyTables (rather than stored as
-  endemic types). See
-  `Here <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190>`__
-  for more information and some solutions.
+- ``table`` format stores come with a writing performance penalty as compared to
+  ``fixed`` stores. The benefit is the ability to append/delete and
+  query (potentially very large amounts of data). Write times are
+  generally longer as compared with regular stores. Query times can
+  be quite fast, especially on an indexed axis.
+- You can pass ``chunksize=<int>`` to ``append``, specifying the
+  write chunksize (default is 50000). This will significantly lower
+  your memory usage on writing.
+- You can pass ``expectedrows=<int>`` to the first ``append``,
+  to set the TOTAL number of rows that ``PyTables`` will
+  expect. This will optimize read/write performance.
+- Duplicate rows can be written to tables, but are filtered out in
+  selection (with the last items being selected; thus a table is
+  unique on major, minor pairs).
+- A ``PerformanceWarning`` will be raised if you are attempting to
+  store types that will be pickled by PyTables (rather than stored as
+  endemic types). See
+  `Here <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190>`__
+  for more information and some solutions.
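The ``chunksize`` and ``expectedrows`` options described above can be sketched as follows. File and key names are illustrative, PyTables must be installed, and this is an editor's sketch rather than code from the original docs:

```python
import os
import numpy as np
import pandas as pd

# A frame large enough that chunked writing is meaningful.
df_big = pd.DataFrame(np.random.randn(10000, 2), columns=["A", "B"])

with pd.HDFStore("perf_sketch.h5", mode="w") as store:
    # expectedrows on the *first* append lets PyTables size the table up
    # front; chunksize controls how many rows are written per batch,
    # bounding memory use during the write.
    store.append("df_big", df_big, chunksize=2000, expectedrows=10000)
    nrows = store.get_storer("df_big").nrows

os.remove("perf_sketch.h5")
```

Subsequent ``append`` calls to the same key do not need ``expectedrows`` again; only the first append uses it to pre-size the table.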

Experimental
~~~~~~~~~~~~