TextParse
TextParse uses Julia's generated functions to generate efficient, specialized parsers for text files. TextParse minimizes allocations and so largely avoids involving the garbage collector (GC).
Installation
Pkg.add("TextParse")
Reading CSV
The most useful API is probably csvread, which reads a CSV file:
TextParse.csvread - Function.
csvread(file::Union{String,IO}, delim=','; <arguments>...)
Read CSV from file. Returns a tuple of 2 elements:
- A tuple of columns, each either a Vector or a StringArray
- Column names if header_exists=true, an empty array otherwise
Arguments:
- file: either an IO object or a file name string
- delim: the delimiter character
- spacedelim: (Bool) parse space-delimited files. delim has no effect if true.
- quotechar: character used to quote strings, defaults to "
- escapechar: character used to escape quotechar in strings (could be the same as quotechar)
- commentchar: ignore lines that begin with commentchar
- row_estimate: estimated number of rows in the file. Defaults to 0, in which case the number of rows is estimated automatically.
- skiplines_begin: number of lines to skip at the beginning of the file
- header_exists: boolean specifying whether the CSV file contains a header
- nastrings: strings that are to be considered NA. Defaults to TextParse.NA_STRINGS
- colnames: manually specified column names. Can be a vector or a dictionary from Int index (the column) to String column name.
- colparsers: parsers to use for specified columns. Can be a vector or a dictionary from column name / column index (Int) to a "parser". The simplest parser is a type such as Int or Float64. It can also be a dateformat"..."; see CustomParser if you want to plug in custom parsing behavior. If you pass nothing as the parser for a given column, that column will be skipped.
- type_detect_rows: number of rows used to infer the initial colparsers. Defaults to 20.
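As a minimal sketch of how these arguments fit together, the snippet below writes a tiny CSV to a temporary file and reads it back. The file contents and the variable names (path, columns, colnames, ids, prices) are made up for illustration; the keyword names are taken from the argument list above, and the exact column container types may differ between TextParse versions.

```julia
using TextParse

# Write a small CSV to a temporary file so the example is self-contained.
path, io = mktemp()
write(io, "id,price,label\n1,2.5,a\n2,3.0,b\n3,4.25,a\n")
close(io)

# csvread returns a 2-tuple: (tuple of columns, column names).
columns, colnames = csvread(path; header_exists=true)

ids    = columns[1]   # first column, inferred as integers
prices = columns[2]   # second column, inferred as floating-point numbers

# Same file, but ask for the first column (index 1) to be parsed as Float64.
columns_f, _ = csvread(path; colparsers=Dict(1 => Float64))
```

Passing a column index as the dictionary key avoids depending on the header text; a column name should work the same way according to the description of colparsers.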
Some notable features of the CSV parser are:
- CSV parsing kernel generated by TextParse is type-inferable.
- Uses PooledArrays for strings by default, promoting to an Array only if the number of unique elements goes over 5% (after 400 rows have been read).
- Avoids allocating the string in PooledArray unless the string is not in the pool
- Doesn't assume all columns are nullable by default, switches column to DataValueArray if an NA value is found
- Flexible about predicted column types, can convert the column mid-way if the type changes and switch to a new fast generated method
- Fast date time parsing even on Julia 0.5
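The pooling behaviour in the list above can be illustrated with a toy sketch in plain Julia. This is not the PooledArrays implementation: ToyPool, push_value!, should_unpool and materialize are hypothetical names, and the thresholds are simply the 5% and 400-row figures quoted above, assuming the 5% is measured against the number of rows read so far.

```julia
# Toy pooled string column: each distinct string is stored once in `pool`,
# and `refs` holds one integer index into `pool` per row.
struct ToyPool
    pool::Vector{String}
    index::Dict{String,Int}
    refs::Vector{Int}
end

ToyPool() = ToyPool(String[], Dict{String,Int}(), Int[])

function push_value!(p::ToyPool, s::AbstractString)
    # Look the value up first; a new String is allocated only on a pool miss.
    i = get(p.index, s, 0)
    if i == 0
        str = String(s)
        push!(p.pool, str)
        i = length(p.pool)
        p.index[str] = i
    end
    push!(p.refs, i)
    return p
end

# Assumed heuristic: once more than `min_rows` rows have been read, stop
# pooling if more than `max_unique_frac` of the rows are distinct values.
should_unpool(p::ToyPool; min_rows=400, max_unique_frac=0.05) =
    length(p.refs) > min_rows && length(p.pool) > max_unique_frac * length(p.refs)

# Expand the pooled column into an ordinary Vector{String}.
materialize(p::ToyPool) = p.pool[p.refs]

col = ToyPool()
for s in ["NY", "SF", "NY", "NY", "SF"]
    push_value!(col, s)
end
```

Here the five rows are stored as two pooled strings plus five integer references, which is where the memory saving for repetitive columns comes from.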
Extensible parsing framework
TextParse operates by defining small parsers which are specialized to parse one kind of text very efficiently. Each such parser is described by a subtype of AbstractToken{T}. An AbstractToken{T} type should implement a tryparsenext method:
Dates.tryparsenext - Function.
tryparsenext{T}(tok::AbstractToken{T}, str, i, till, localopts)
Parses the string str starting at position i and ending at or before position till. localopts is a LocalOpts object which contains contextual options for quoting and NA parsing. (see LocalOpts documentation)
tryparsenext returns a tuple (result, nextpos), where result is of type Nullable{T}: it is Nullable{T}() if parsing failed, and a non-null value containing the parsed result if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which parsing failed.
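The (result, nextpos) convention can be sketched with a standalone toy parser in plain Julia. This is not TextParse's code: toy_parse_int is a hypothetical name, nothing stands in for a null Nullable{Int}, and the toy leaves the delimiter for the caller to skip.

```julia
# Toy integer parser following the (result, nextpos) convention.
# `nothing` stands in for a null Nullable{Int}; this is an illustration only.
function toy_parse_int(str::String, i::Int, till::Int)
    value = 0
    pos = i
    while pos <= till
        c = str[pos]
        isdigit(c) || break
        value = 10 * value + (c - '0')
        pos = nextind(str, pos)
    end
    # No digit consumed: report failure at the position where parsing stopped.
    pos == i && return (nothing, pos)
    return (value, pos)
end

line = "123,45"
r1, p1 = toy_parse_int(line, 1, lastindex(line))        # parses 123, stops at ','
r2, p2 = toy_parse_int(line, p1 + 1, lastindex(line))   # skips ',' and parses 45
r3, p3 = toy_parse_int(line, p1, lastindex(line))       # fails: ',' is not a digit
```

Because every call reports where it stopped, parsers of this shape can be chained across a line without re-scanning or allocating substrings.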
Available AbstractToken types
TextParse.Numeric - Type.
Parse numbers of type T.
TextParse.StringToken - Type.
Parses a string to the AbstractString type T. If T is StrRange, returns a StrRange with the start position (offset) and length of the substring. It is used internally by csvparse to avoid allocating strings.
TextParse.DateTimeToken - Type.
DateTimeToken(T, fmt::DateFormat)
Parse a date time string of format fmt into type T, which is either Date, Time or DateTime.
TextParse.NAToken - Type.
NAToken(inner::AbstractToken; options...)
Parses a Nullable item.
Arguments:
- inner: the token to parse if non-null.
- emptyisna: should an empty item be considered NA? Defaults to true
- nastrings: strings that are to be considered NA. Defaults to ["#N/A", "#N/A N/A", "#NA", "#n/a", "#n/a n/a", "#na", "-1.#IND", "-1.#QNAN", "-1.#ind", "-1.#qnan", "-NaN", "-nan", "-nan", "-nan", "1.#IND", "1.#QNAN", "1.#ind", "1.#qnan", "N/A", "N/A", "NA", "NA", "NULL", "NaN", "n/a", "n/a", "na", "na", "nan", "nan", "nan", "null"]
TextParse.Quoted - Type.
Quoted(inner::AbstractToken; <kwargs>...)
Arguments:
- inner: the token inside quotes to parse
- required: are quotes required for parsing to succeed? Defaults to false
- includequotes: include the quotes in the output. Defaults to false
- includenewlines: include newlines that appear within quotes. Defaults to true
- quotechar: character to use to quote (default decided by LocalOpts)
- escapechar: character that escapes the quote char (default set by LocalOpts)
TextParse.CustomParser - Type.
CustomParser(f, T)
Provide a custom parsing mechanism.
Arguments:
- f: the parser function
- T: the type of the parsed value
The parser function must take the following arguments:
- str: the entire string being parsed
- pos: the position in the string at which to start parsing
- len: the length of the string, i.e. the maximum position to parse till
- opts: a LocalOpts object with options local to the current field.
The parser function must return a tuple of two values:
- result: a Nullable{T}. Set to Nullable{T}() if parsing must fail, containing the value otherwise.
- nextpos: if parsing succeeded, this must be the next position after parsing finished; if it failed, this must be the position at which parsing failed.
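As a rough illustration of the required function shape, the sketch below defines a hypothetical parse_yesno that reads "yes" or "no" as a Bool. To stay runnable without TextParse it returns plain values, with nothing standing in for a null result; under the contract described above, a real parser function would wrap its result in a Nullable{Bool} before being handed to CustomParser(parse_yesno, Bool). The opts argument is accepted but ignored here.

```julia
# Hypothetical parser function with the (str, pos, len, opts) shape.
# Returns (value, nextpos) on success and (nothing, pos) on failure;
# `nothing` is a stand-in for a null Nullable{Bool}.
function parse_yesno(str, pos, len, opts)
    rest = SubString(str, pos, len)   # view of the text still to be parsed
    startswith(rest, "yes") && return (true, pos + 3)
    startswith(rest, "no")  && return (false, pos + 2)
    return (nothing, pos)             # failure: report where parsing stopped
end

field = "yes,no"
first_result  = parse_yesno(field, 1, lastindex(field), nothing)
second_result = parse_yesno(field, 5, lastindex(field), nothing)
failed_result = parse_yesno("maybe", 1, 5, nothing)
```

The position arithmetic assumes ASCII input, where each character occupies one index.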
LocalOpts
TextParse.LocalOpts - Type.
LocalOpts
Options local to the token currently being parsed.
- endchar: till where to parse (e.g. delimiter or quote ending character)
- spacedelim: treat spaces as delimiters
- quotechar: the quote character
- escapechar: char that escapes the quote
- includequotes: whether to include quotes while parsing
- includenewlines: whether to include newlines while parsing