TextParse

TextParse uses Julia's generated functions to generate efficient, specialized parsers for text files. It minimizes allocations and so keeps garbage collector (GC) involvement low.

Installation
Reading CSV
Extensible parsing framework

Installation

Pkg.add("TextParse")

Reading CSV

The most useful API is probably csvread, which reads a CSV file:

TextParse.csvread — Function.

csvread(file::Union{String,IO}, delim=','; <arguments>...)

Read CSV from file. Returns a tuple of 2 elements:

A tuple of columns, each either a Vector or a StringArray
column names if header_exists=true, empty array otherwise

Arguments:

file: either an IO object or file name string
delim: the delimiter character
spacedelim: (Bool) parse space-delimited files. delim has no effect if true.
quotechar: character used to quote strings, defaults to "
escapechar: character used to escape quotechar in strings. (could be the same as quotechar)
commentchar: ignore lines that begin with commentchar
row_estimate: estimated number of rows in the file. Defaults to 0, in which case the number of rows is estimated automatically.
skiplines_begin: skips specified number of lines at the beginning of the file
header_exists: boolean specifying whether CSV file contains a header
nastrings: strings that are to be considered NA. Defaults to TextParse.NA_STRINGS
colnames: manually specified column names. Could be a vector or a dictionary from Int index (the column) to String column name.
colparsers: parsers to use for specified columns. This can be a vector or a dictionary from column name / column index (Int) to a "parser". The simplest parser is a type such as Int or Float64. It can also be a dateformat"..."; see CustomParser if you want to plug in custom parsing behavior. If you pass nothing as the parser for a given column, that column will be skipped.
type_detect_rows: number of rows used to infer the initial colparsers. Defaults to 20.
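
As a minimal sketch of the call described above, the following reads a small in-memory table. The data and the variable names (data, header, ids, prices, labels) are made up for illustration; only csvread, its positional delim argument and the header_exists keyword come from the documentation above.

```julia
using TextParse

# Hypothetical in-memory data, used here instead of a file on disk.
data = "id,price,label\n1,2.5,a\n2,3.0,b\n"

# csvread accepts an IO object or a file name string.
# header_exists=true treats the first line as column names.
columns, header = csvread(IOBuffer(data); header_exists=true)

# `columns` is a tuple holding one array per column, in file order.
ids, prices, labels = columns

# The delimiter is the second positional argument.
semi_columns, semi_header = csvread(IOBuffer("x;y\n1;2\n3;4\n"), ';'; header_exists=true)
```

The column element types are inferred from the first rows of the input, so the exact array types returned may differ between TextParse versions.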

Some notable features of the CSV parser are:

CSV parsing kernel generated by TextParse is type-inferable.
Uses PooledArrays for strings by default, promoting to an Array only if the number of unique elements goes over 5% (after 400 rows have been read).
Allocates a string for the PooledArray only when that string is not already in the pool.
Doesn't assume all columns are nullable by default; it switches a column to a DataValueArray only if an NA value is found.
Flexible about predicted column types: it can convert a column midway if the type changes and switch to a new fast generated method.
Fast date time parsing even on Julia 0.5
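
The string pooling mentioned in the list above can be illustrated in plain Julia. This is a hypothetical sketch of the general idea, not the PooledArrays implementation; StringPool and code! are names invented for the example.

```julia
# Hypothetical illustration of string pooling; not the PooledArrays implementation.
struct StringPool
    index::Dict{String,Int}   # string => integer code
    values::Vector{String}    # integer code => string
end

StringPool() = StringPool(Dict{String,Int}(), String[])

# Return the code for `s`. A new String is allocated only the first time
# a given value is seen; later lookups reuse the pooled copy.
function code!(pool::StringPool, s::AbstractString)
    c = get(pool.index, s, 0)      # lookup works for SubString keys too
    if c == 0
        owned = String(s)          # the only allocation of the string itself
        push!(pool.values, owned)
        c = length(pool.values)
        pool.index[owned] = c
    end
    return c
end

# Encode a column of repeated values as small integer codes.
pool = StringPool()
codes = [code!(pool, field) for field in split("a,b,a,a", ',')]
```

A column stored this way keeps one integer per row and one String per distinct value, which is why pooling pays off only while the number of distinct values stays small.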

Extensible parsing framework

TextParse operates by defining small parsers, each specialized to parse one kind of text very efficiently. Each such parser is described by a subtype of AbstractToken{T}. An AbstractToken{T} type should implement a tryparsenext method:

Dates.tryparsenext — Function.

tryparsenext{T}(tok::AbstractToken{T}, str, i, till, localopts)

Parses the string str starting at position i and ending at or before position till. localopts is a LocalOpts object which contains contextual options for quoting and NA parsing. (see LocalOpts documentation)

tryparsenext returns a tuple (result, nextpos), where result is of type Nullable{T}: a null Nullable{T}() if parsing failed, or a non-null value containing the parsed value if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which the parsing failed.

Available AbstractToken types

TextParse.Numeric — Type.

Parses numbers of type T.

TextParse.StringToken — Type.

Parses a string into the AbstractString type T. If T is StrRange, returns a StrRange holding the start position (offset) and length of the substring. This is used internally by csvparse to avoid allocating strings.

TextParse.DateTimeToken — Type.

DateTimeToken(T, fmt::DateFormat)

Parse a date time string of format fmt into type T which is either Date, Time or DateTime.

TextParse.NAToken — Type.

NAToken(inner::AbstractToken; options...)

Parses a Nullable item.

Arguments

inner: the token to parse if non-null.
emptyisna: should an empty item be considered NA? defaults to true
nastrings: strings that are to be considered NA. Defaults to ["#N/A", "#N/A N/A", "#NA", "#n/a", "#n/a n/a", "#na", "-1.#IND", "-1.#QNAN", "-1.#ind", "-1.#qnan", "-NaN", "-nan", "-nan", "-nan", "1.#IND", "1.#QNAN", "1.#ind", "1.#qnan", "N/A", "N/A", "NA", "NA", "NULL", "NaN", "n/a", "n/a", "na", "na", "nan", "nan", "nan", "null"]

TextParse.Quoted — Type.

Quoted(inner::AbstractToken; <kwargs>...)

Arguments:

inner: The token inside quotes to parse
required: are quotes required for parsing to succeed? defaults to false
includequotes: include the quotes in the output. Defaults to false
includenewlines: include newlines that appear within quotes. Defaults to true
quotechar: character to use to quote (default decided by LocalOpts)
escapechar: character that escapes the quote char (default set by LocalOpts)

TextParse.CustomParser — Type.

CustomParser(f, T)

Provide a custom parsing mechanism.

Arguments:

f: the parser function
T: The type of the parsed value

The parser function must take the following arguments:

str: the entire string being parsed
pos: the position in the string at which to start parsing
len: the length of the string, i.e. the maximum position up to which to parse
opts: a LocalOpts object with options local to the current field.

The parser function must return a tuple of two values:

result: A Nullable{T}. Set to Nullable{T}() if parsing failed, containing the parsed value otherwise.
nextpos: If parsing succeeded, this must be the next position after parsing finished; if it failed, this must be the position at which parsing failed.
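
The argument and return conventions above can be sketched with a small standalone function. To keep the sketch runnable without the package, it returns nothing on failure as a stand-in; a function actually passed to CustomParser must return a Nullable{T} as described above. The name parse_percent, the percentage format and the unused opts default are assumptions made for this example.

```julia
# Hypothetical parser: reads an integer percentage such as "45%" and
# returns it as a fraction (0.45).
# Stand-in convention: `nothing` marks failure here, where TextParse
# expects a null Nullable{Float64}.
function parse_percent(str::AbstractString, pos::Int, len::Int, opts=nothing)
    value = 0
    seen = 0      # number of digits consumed so far
    i = pos
    # Consume consecutive ASCII digits, never reading past `len`.
    while i <= len && isdigit(str[i])
        value = 10 * value + (str[i] - '0')
        i += 1    # digits are single-byte, so the next index is valid
        seen += 1
    end
    # Fail at the current position if there were no digits or no '%'.
    if seen == 0 || i > len || str[i] != '%'
        return nothing, i
    end
    # Success: report the position just after the '%'.
    return value / 100, i + 1
end

result, nextpos = parse_percent("45%,x", 1, 5)
```

On success, nextpos points just past the consumed text, so the caller can continue with the next field from there. On failure, it points at the character that could not be parsed.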

LocalOpts

TextParse.LocalOpts — Type.

LocalOpts

Options local to the token currently being parsed.

endchar: Till where to parse. (e.g. delimiter or quote ending character)
spacedelim: Treat spaces as delimiters
quotechar: the quote character
escapechar: char that escapes the quote
includequotes: whether to include quotes while parsing
includenewlines: whether to include newlines while parsing
