TextParse
TextParse uses Julia's generated functions to build efficient, specialized parsers for text files. It minimizes allocations while parsing, which keeps pressure on the garbage collector (GC) low.
Installation
Pkg.add("TextParse")
Reading CSV
The most useful API is probably csvread, which reads a CSV file:
TextParse.csvread - Function
csvread(file::Union{String,IO}, delim=','; <arguments>...)
Read CSV from file. Returns a tuple of 2 elements:
- A tuple of columns, each either a Vector or a StringArray
- column names if header_exists=true, empty array otherwise
Arguments:
- file: either an IO object or file name string
- delim: the delimiter character
- spacedelim: (Bool) parse space-delimited files. delim has no effect if true.
- quotechar: character used to quote strings, defaults to "
- escapechar: character used to escape quotechar in strings (could be the same as quotechar)
- commentchar: ignore lines that begin with commentchar
- row_estimate: estimated number of rows in the file. Defaults to 0, in which case we try to estimate this.
- skiplines_begin: skips the specified number of lines at the beginning of the file
- header_exists: boolean specifying whether the CSV file contains a header
- nastrings: strings that are to be considered NA. Defaults to TextParse.NA_STRINGS
- colnames: manually specified column names. Could be a vector or a dictionary from Int index (the column) to String column name.
- colparsers: parsers to use for specified columns. This can be a vector or a dictionary from column name / column index (Int) to a "parser". The simplest parser is a type such as Int or Float64. It can also be a dateformat"...". See CustomParser if you want to plug in custom parsing behavior. If you pass nothing as the parser for a given column, that column will be skipped.
- type_detect_rows: number of rows to use to infer the initial colparsers. Defaults to 20.
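As a rough illustration of the signature and arguments above, the sketch below reads a small in-memory table. The sample data and the variable names are hypothetical, and an IOBuffer is used in place of a file name only so that the snippet is self-contained.

```julia
using TextParse

# Hypothetical sample data; an IOBuffer stands in for a file on disk.
data = """
id,price,label
1,2.5,a
2,4.0,b
3,5.25,a
"""

# Default call: column types are inferred from the first rows.
# The result is (tuple of columns, column names).
cols, colnames = csvread(IOBuffer(data))
ids, prices, labels = cols

# Same data, but with the parser for the `price` column fixed explicitly
# through the colparsers keyword (keyed here by column name).
cols2, colnames2 = csvread(IOBuffer(data);
                           header_exists = true,
                           colparsers = Dict("price" => Float64))
```

Passing colparsers is mainly useful when the first type_detect_rows rows are not representative of the rest of the column.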
Some notable features of the CSV parser are:
- CSV parsing kernel generated by TextParse is type-inferable.
- Uses PooledArrays for strings by default, promoting to an Array only if the number of unique elements goes over 5% (after 400 rows have been read).
- Avoids allocating the string in PooledArray unless the string is not in the pool
- Doesn't assume all columns are nullable by default, switches column to DataValueArray if an NA value is found
- Flexible about predicted column types, can convert the column mid-way if the type changes and switch to a new fast generated method
- Fast date time parsing even on Julia 0.5
Extensible parsing framework
TextParse operates by defining small parsers which are specialized to parse one kind of text very efficiently. Each such parser is described by a subtype of AbstractToken{T}. An AbstractToken{T} type should implement a tryparsenext method:
Dates.tryparsenext - Function
tryparsenext{T}(tok::AbstractToken{T}, str, i, till, localopts)
Parses the string str starting at position i and ending at or before position till. localopts is a LocalOpts object which contains contextual options for quoting and NA parsing (see the LocalOpts documentation).
tryparsenext returns a tuple (result, nextpos) where result is of type Nullable{T}: it is Nullable{T}() if parsing failed, and a non-null value containing the parsed value if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which the parsing failed.
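The (result, nextpos) contract can be illustrated without TextParse at all. The following is a standalone sketch, not the library's implementation: parse_uint is a hypothetical helper, and nothing stands in for the null Nullable{T}() so that the snippet runs on a plain Julia 1.x installation.

```julia
# Standalone sketch of the (result, nextpos) contract; it does not use TextParse.
# `nothing` plays the role of the null result described above.
function parse_uint(str::AbstractString, i::Int, till::Int)
    value = 0
    pos = i
    # Consume decimal digits, never reading past `till`.
    while pos <= till && isdigit(str[pos])
        value = 10 * value + (str[pos] - '0')
        pos = nextind(str, pos)
    end
    # No digit consumed: parsing failed at position i.
    pos == i && return (nothing, i)
    # Success: `pos` is where the next token would start.
    return (value, pos)
end

parse_uint("123,45", 1, 6)   # (123, 4): stops at the comma
parse_uint("123,45", 5, 6)   # (45, 7): ran to the end of the range
parse_uint("abc", 1, 3)      # (nothing, 1): failed at the first character
```

Returning the failure position rather than throwing lets a caller try a different parser at the same position.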
Available AbstractToken types
TextParse.Numeric - Type
Parses numbers of type T.
TextParse.StringToken - Type
Parses a string to the AbstractString type T. If T is StrRange, returns a StrRange with the start position (offset) and length of the substring. It is used internally by csvparse for avoiding allocating strings.
TextParse.DateTimeToken - Type
DateTimeToken(T, fmt::DateFormat)
Parse a date time string of format fmt into type T, which is either Date, Time or DateTime.
TextParse.NAToken - Type
NAToken(inner::AbstractToken; options...)
Parses a Nullable item.
Arguments:
- inner: the token to parse if non-null.
- emptyisna: should an empty item be considered NA? Defaults to true
- nastrings: strings that are to be considered NA. Defaults to ["#N/A", "#N/A N/A", "#NA", "#n/a", "#n/a n/a", "#na", "-1.#IND", "-1.#QNAN", "-1.#ind", "-1.#qnan", "-NaN", "-nan", "-nan", "-nan", "1.#IND", "1.#QNAN", "1.#ind", "1.#qnan", "N/A", "N/A", "NA", "NA", "NULL", "NaN", "n/a", "n/a", "na", "na", "nan", "nan", "nan", "null"]
TextParse.Quoted - Type
Quoted(inner::AbstractToken; <kwargs>...)
Arguments:
- inner: the token inside quotes to parse
- required: are quotes required for parsing to succeed? Defaults to false
- includequotes: include the quotes in the output. Defaults to false
- includenewlines: include newlines that appear within quotes. Defaults to true
- quotechar: character to use to quote (default decided by LocalOpts)
- escapechar: character that escapes the quote char (default set by LocalOpts)
TextParse.CustomParser - Type
CustomParser(f, T)
Provide a custom parsing mechanism.
Arguments:
- f: the parser function
- T: the type of the parsed value
The parser function must take the following arguments:
- str: the entire string being parsed
- pos: the position in the string at which to start parsing
- len: the length of the string, i.e. the maximum position to parse till
- opts: a LocalOpts object with options local to the current field.
The parser function must return a tuple of two values:
- result: a Nullable{T}. Set to Nullable{T}() if parsing must fail, containing the value otherwise.
- nextpos: if parsing succeeded this must be the next position after parsing finished; if it failed this must be the position at which parsing failed.
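A parser function following this contract might look like the sketch below. It is an illustration under stated assumptions rather than code from the library: parse_percent is a hypothetical function for values such as "12%", and on Julia 1.x the Nullable type is assumed to come from the Nullables.jl package, since it is no longer part of Base.

```julia
using TextParse
using Nullables  # assumption: supplies Nullable on Julia 1.x

# Hypothetical parser for values such as "12%": reads digits, then a '%' sign.
# It ignores `opts`, so it can also be called directly for experimentation.
function parse_percent(str, pos, len, opts)
    value = 0
    i = pos
    while i <= len
        c, next = iterate(str, i)
        isdigit(c) || break
        value = 10 * value + (c - '0')
        i = next
    end
    # Fail if no digit was read or the input ended before a '%' sign.
    (i == pos || i > len) && return (Nullable{Float64}(), i)
    c, next = iterate(str, i)
    c == '%' || return (Nullable{Float64}(), i)
    # Success: return the value and the position just after the '%' sign.
    return (Nullable{Float64}(value / 100), next)
end

# Wrap the function so it can be used wherever a parser is expected,
# for example as an entry in the colparsers argument of csvread.
percent = CustomParser(parse_percent, Float64)
```

Keeping the function free of any dependence on opts is a simplification made here; a parser that must respect quoting or the field's end character would need to consult the LocalOpts fields.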
LocalOpts
TextParse.LocalOpts - Type
LocalOpts
Options local to the token currently being parsed.
- endchar: till where to parse (e.g. delimiter or quote ending character)
- spacedelim: treat spaces as delimiters
- quotechar: the quote character
- escapechar: char that escapes the quote
- includequotes: whether to include quotes while parsing
- includenewlines: whether to include newlines while parsing