Common issues with CSV parsing and how to solve them
Although CSV is a straightforward format for data that can later be used for import/export, you may sometimes run into parsing issues. This article is a handy first-aid kit when it comes to Ruby and CSV.
Fixing performance issues
Problem: you are parsing a large CSV file, and it takes a lot of time to process the whole file, or your server runs out of memory in the middle of the process.
Solution: when processing CSV files with Ruby, always use the CSV.foreach method instead of CSV.read or CSV.parse. The reason is simple: while the read and parse methods load the whole file into memory and then process it, the foreach method processes the file line by line without loading all of its contents at once. By iterating over the file line by line, you won't use much memory, and you get access to the rows immediately.
Imagine that you are looking for one row in a file that consists of millions of rows. With CSV.foreach, you won't load all rows at once, and you can stop processing as soon as you find the row you are looking for.
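A minimal sketch of this approach; the file contents, column names, and the row being searched for are made up for illustration:

```ruby
require "csv"
require "tempfile"

# Build a small sample file; in practice this would be your large CSV.
file = Tempfile.new(["users", ".csv"])
file.write("first_name,age\nJane,25\nJohn,30\nAnna,28\n")
file.close

found = nil
# CSV.foreach streams the file, holding only one row in memory at a time.
CSV.foreach(file.path, headers: true) do |row|
  if row["first_name"] == "John"
    found = row.to_h
    break # stop as soon as the target row is found
  end
end
found
# => {"first_name"=>"John", "age"=>"30"}
```

Note that without converters all values, including age, come back as strings — which brings us to the next section.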
Fixing formatting issues
Problem: you are parsing a CSV file where some of the values are integers, dates, or other formats that require special parsing. By default, you would receive all values as strings, and you would have to perform additional parsing on your own.
Solution: to speed up the parsing process, you can use preprocessors. A preprocessor is a simple anonymous function that formats the data when parsing a CSV file.
By default, we have the following preprocessors available in the standard library:
:integer - parses values that are integers
:float - parses values that are floats
:numeric - parses values that are numbers (integers or floats)
:date - parses values that are dates
:date_time - parses values that are dates with time
:all - uses all available preprocessors
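For example, :numeric covers both integer and float values in one go (the column names here are made up):

```ruby
require "csv"

# :numeric applies both the :integer and :float preprocessors.
row = CSV.parse("qty,price\n3,9.99", headers: true, converters: [:numeric]).first.to_h
row
# => {"qty"=>3, "price"=>9.99}
```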
As I mentioned before, a preprocessor is just a simple anonymous function, so you can call it directly using the call method:
CSV::Converters[:float].call("2.1")
# => 2.1
When parsing a CSV, you can define which preprocessors you would like to use by passing the converters option:
csv = "first_name,age,birth_date\nJohn,30,2000-01-15"
CSV.parse(csv, headers: true, converters: [:integer, :date_time]).first
# => #<CSV::Row "first_name":"John" "age":30 "birth_date":#<DateTime: 2000-01-15T00:00:00+00:00 ((2451559j,0s,0n),+0s,2299161j)>>
As you can see in the example above, both the age and birth_date attributes were appropriately formatted.
Custom parsing
If you want to parse something more than just a simple number or date, you can write your own converter in seconds. For example, if you would like to parse every value that is a URL, you can write the following code:
require "uri"

url_converter = ->(val) {
  val.to_s.match?(%r{\Ahttps?://}) ? URI.parse(val) : val
}
CSV.parse("first_name,url\nJohn,https://google.com", converters: [url_converter])
# => [["first_name", "url"], ["John", #<URI::HTTPS https://google.com>]]
Fixing encoding issues
Problem: you would like to parse a CSV file that contains special characters, but you get a CSV::MalformedCSVError at every attempt to parse the file's contents.
Solution: specify the file's encoding when parsing the file. You can do this by passing the encoding option with the name of the encoding:
CSV.read("./users.csv", encoding: "ISO-8859-1", headers: true)
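If you also want the parsed values transcoded to UTF-8 rather than left in the source encoding, you can pass the encoding in "SRC:DST" form, just as with File.open. A sketch with a made-up Latin-1 encoded file:

```ruby
require "csv"
require "tempfile"

# Hypothetical file saved in ISO-8859-1 ("\xE9" is "é" in Latin-1).
file = Tempfile.new(["users", ".csv"])
file.binmode
file.write("first_name\nRen\xE9\n".b)
file.close

# "ISO-8859-1:UTF-8" reads the file as Latin-1 and transcodes values to UTF-8.
rows = CSV.read(file.path, encoding: "ISO-8859-1:UTF-8", headers: true)
rows.first["first_name"]
# => "René"
```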
Fixing duplication issues
Problem: you have to process a file where one header is duplicated, but you would like to keep all the values because they are different.
Solution: pass the headers option and iterate over each row without transforming it into a hash, because the hash conversion would remove the duplicates automatically:
CSV.parse("first_name,role,role\nJohn,admin,user", headers: true).first.to_h
# => {"first_name"=>"John", "role"=>"admin"}
To keep all values from the duplicated column, you can use the following approach:
data = []
CSV.parse("first_name,role,role\nJohn,admin,user", headers: true).each do |row|
  attributes = {}
  row.each_pair do |column, value|
    attributes[column] ||= []
    attributes[column] << value
  end
  data << attributes
end
data.map { |d| d.transform_values { |val| val.size == 1 ? val.first : val } }
# => [{"first_name"=>"John", "role"=>["admin", "user"]}]
It is helpful to know how the CSV library behaves when it comes to parsing duplicated values.