
read-csv-turbo

Ever deal with multiple huge CSV files where pandas' read_csv/skiprows is slowing you down? You are not alone.

read-csv-turbo reads the first and last lines of a CSV using Unix-style head and tail commands, getting the data you want into a DataFrame as fast as possible. Both Windows and Unix are supported.

Reading a large CSV once is "fine", but I often find myself looping through many files, and that process is painfully slow; the usual StackOverflow suggestions didn't cut it. There may be a newer or smarter way of approaching this, but this method should be about as fast as you can get.

At the moment the use case is quite limited: it just provides a fast way to read the first, last, or nth row of a CSV into a DataFrame.

Approach

Uses the standard subprocess Python module:

Unix: subprocess.check_output(f"head -1 {path}", shell=True).decode("utf-8").strip()

Windows: subprocess.check_output(f"powershell gc {path} -head 1", shell=True).decode("utf-8").strip()
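
The same commands extend to an arbitrary nth row. A minimal sketch, assuming a Unix shell or PowerShell is available (read_nth_line is a hypothetical name, not part of the package API):

import platform
import subprocess

def read_nth_line(path, n):
    # Hypothetical helper: fetch line n (1-indexed) without reading the whole file
    if platform.system() == "Windows":
        # gc -head n returns the first n lines; keep the last of them
        out = subprocess.check_output(f"powershell gc {path} -head {n}", shell=True)
        return out.decode("utf-8").strip().splitlines()[-1]
    # head streams only the first n lines; tail keeps line n
    out = subprocess.check_output(f"head -{n} {path} | tail -1", shell=True)
    return out.decode("utf-8").strip()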

The string output is then read in using StringIO:

from io import StringIO
import pandas as pd

string_data = StringIO(f'{csv_header}\n{head}')
df = pd.read_csv(string_data, sep=",")

Because pandas does the parsing, this preserves the expected column dtypes, meaning the columns aren't just plain strings.
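
Putting the pieces together, a rough end-to-end sketch of the approach (a sketch under the assumptions above, not the package's actual internals; read_first_row is a hypothetical name):

import platform
import subprocess
from io import StringIO

import pandas as pd

def read_first_row(path):
    # Line 1 is the CSV header, line 2 is the first data row
    if platform.system() == "Windows":
        cmd = f"powershell gc {path} -head 2"
    else:
        cmd = f"head -2 {path}"
    text = subprocess.check_output(cmd, shell=True).decode("utf-8").strip()
    return pd.read_csv(StringIO(text), sep=",")

df = read_first_row("big_csv.csv")
print(df.dtypes)  # numeric columns come back as int64/float64, not object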

Installation

pip install readcsvturbo

Usage

from pathlib import Path

import pandas as pd
import readcsvturbo as rct

csv_file_path = Path("./big_csv.csv")

df_head = rct.read_csv_head(csv_file_path)
df_tail = rct.read_csv_tail(csv_file_path)
df_headtail = rct.read_csv_headtail(csv_file_path)
df_specific_line = rct.read_csv_line(csv_file_path)

# Can also specify if there are no headers in the csv file

df_head = rct.read_csv_head(csv_file_path, headers=False)
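
The pattern that motivated the package is looping over many files. A sketch using only the functions above ("./data" is a placeholder directory, and the files are assumed to share a header):

from pathlib import Path

import pandas as pd
import readcsvturbo as rct

# Grab the last row of every CSV in a directory without reading any file in full
frames = [rct.read_csv_tail(p) for p in Path("./data").glob("*.csv")]
latest_rows = pd.concat(frames, ignore_index=True)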

Speed Test Results

RAW PANDAS TIME: 5.47s
SKIPROWS TIME:   3.04s (1.7x faster)
TURBO TIME:      0.25s (21.8x faster)

Speed Test

The test used a 609 MB CSV file with 5,000,000 lines, downloaded from https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

The speedtest.py script was run to produce this. The test reads the first and last lines of a CSV file, including the header.
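
For reference, a minimal timing harness in the spirit of speedtest.py (a sketch, not the actual script; it assumes big_csv.csv is the sample file described above):

import time
from pathlib import Path

import pandas as pd
import readcsvturbo as rct

path = Path("big_csv.csv")

start = time.perf_counter()
df = pd.read_csv(path).iloc[[0, -1]]  # baseline: parse all 5,000,000 rows
print(f"RAW PANDAS TIME: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
df = rct.read_csv_headtail(path)      # shells out to head/tail instead
print(f"TURBO TIME: {time.perf_counter() - start:.2f}s")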

############## RAW PANDAS ###############
# Read the whole file in using read_csv #
#########################################

                        Region Country  ...  Total Cost Total Profit
0        Australia and Oceania   Palau  ...  1260428.96    303126.25
4999999                 Europe  Greece  ...  2272551.84    546536.25

[2 rows x 14 columns]
RAW PANDAS TIME: 5.47s


################### SKIPROWS ##################
# Get the length of the file from readlines() #
# Read big_csv using skiprows                 #
###############################################

                  Region Country  ...  Total Cost Total Profit
0  Australia and Oceania   Palau  ...  1260428.96    303126.25
1                 Europe  Greece  ...  2272551.84    546536.25

[2 rows x 14 columns]
SKIPROWS TIME: 3.04s


##################### TURBO #####################
# Read first and last line using read-csv-turbo #
#################################################

                  Region Country  ...  Total Cost Total Profit
0  Australia and Oceania   Palau  ...  1260428.96    303126.25
1                 Europe  Greece  ...  2272551.84    546536.25

[2 rows x 14 columns]
TURBO TIME: 0.25s
