Difference between revisions of "ParseCSV"

m
m
 
(3 intermediate revisions by the same user not shown)
Line 8: Line 8:
 
== ParseCSV( csvText'', columnIndex, rowIndex, separator, firstLineIsHeader, trimCells, columnsToKeep, rowIndexLabelColumn, parseFlags, dateTemplate, decimal, quote '' ) ==
 
== ParseCSV( csvText'', columnIndex, rowIndex, separator, firstLineIsHeader, trimCells, columnsToKeep, rowIndexLabelColumn, parseFlags, dateTemplate, decimal, quote '' ) ==
  
Converts text «csvText» from a CSV (Comma Separated Values) file into a two-dimensional array. Usaally, you use this with [[ReadTextFile]]() to read the file:
+
Converts text «csvText» from a CSV (Comma-Separated Values) format into a two-dimensional array. Usually, you use it with [[ReadTextFile]]() to read the file:
  
 
  <code>ParseCSV(ReadTextFile( filename ))</code>
 
  <code>ParseCSV(ReadTextFile( filename ))</code>
  
If you give it no other parameters, it assumes the usual CSV conventions (like Excel) as defaults:  commas to separate columns, newline to separate rows, and double quotes to enclose elements that may contain commas or newlines.  The resulting array has local indexes, <code>.Column</code> and <code>.Row</code>.  Index <code>.Column</code> gets the column headers from the first row of the CSV. code>.Row</code> is numbered 1 to n-1, where n is the number of rows. You can override any of these defaults with optional parameters described below.
+
By default, it assumes the usual CSV conventions (like Excel) as defaults:  comma to separate fields, newline to separate rows, and double quotes to enclose elements that may contain commas or newlines.  The resulting array has local indexes, <code>.Column</code> and <code>.Row</code>.  Index <code>.Column</code> gets the column headers from the first row. <code>.Row</code> is numbered 1 to n-1, where n is the number of rows. You can override any of these defaults with optional parameters described below.
  
=== Indexes «columnIndex» and «rowIndex» ===
+
=== «columnIndex» and «rowIndex» ===
  
You can specify an existing index as a «columnIndex» or «rowIndex» to use instead of the default local indexes <code>.Column</code> or <code>.Row</code>. If you specify a «columnIndex» it treats the first row as data. If in fact, the first row contains headers, but you want to ignore them, perhaps because you are renaming some or all with «columnIndex», you should set «firstLineIsHeader» to <code>true</code>. It assumes the «columnIndex» contains the headers in the correct order.  If «columnIndex»  (or «rowIndex») is shorter than the number of columns (or rows) in the original, it ignores the extra columns or rows. If it is longer than expected, it pads the result with NULL for the extra columns (or rows.)
+
You can set «columnIndex» or «rowIndex» to use an existing index  instead of the default local indexes <code>.Column</code> or <code>.Row</code>. If you specify a «columnIndex» it treats the first row as data. If the first row contains headers, but you want to ignore them, perhaps because you are renaming some or all with «columnIndex», you should set «firstLineIsHeader» to <code>true</code>. It assumes the «columnIndex» contains the headers in the correct order.  If «columnIndex»  (or «rowIndex») is shorter than the number of columns (rows) in the original, it ignores the extra columns or rows. If an index is longer than expected, it pads the result with NULL for the extra columns (rows.)
  
=== If the header row is data: «firstLineIsHeader»a===
+
=== If the header row is data: «firstLineIsHeader» ===
  
If the CSV has no column headers in the first row, you should set «firstLineIsHeader» to <code>false</code>, and it will treat the first row as data.  If you don't specify a «columnIndex», the local index <code>.Column</code> will then be numbers 1 to M (the number of columns).   
+
If the CSV has no column headers in the first row, you should set «firstLineIsHeader» to <code>False</code>, and it will treat the first row as data.  If you don't specify a «columnIndex», it will use local index <code>.Column</code> containing numbers 1 to M (the number of columns).   
  
=== Select or reorder columns with «columnsToKeep» ===
+
=== Select, reorder, or rename columns with «columnsToKeep» ===
  
You can select just a single column from the CSV text by specifying «columnsToKeep» as a column name from the first line of the csv file, or as number of the column that you want . You should then omit «columnIndex».   
+
You can select just a single column from the CSV text by specifying «columnsToKeep» as a column name from the first line of the csv file, or as number of the column that you want. You should then omit «columnIndex».   
  
To extract several columns or to reorder columns, you should specify an existing index as the «columnIndex» with the headers you want. If it contains column names from the CSV (in any order), you can specify it also to «columnsToKeep», e.g.,  
+
You can also select several columns and/or reorder columns by setting «columnIndex» to an existing index containing only the headers you want in the order you want. If it contains column names from the CSV (in any order), you can specify it also to «columnsToKeep», e.g.,  
<code>[[ParseCSV]](Csv, Col, columnsToKeep: Col)</code>
+
Or if you want to rename the columns, you can pass «columnsToKeep» a 1D array indexed by «columnIndex» containing the column names or numbers in the CSV text.
+
  
If you pass a list or array to «columnsToKeep», without specifying «columnIndex», it will still extract the specified columns, but less efficiently because array abstraction repeats the call for each element in the array passed to «columnsToKeep», and it parses the «csvText» anew each time.  
+
code>[[ParseCSV]](Csv, Col, columnsToKeep: Col)</code>
 +
If you want to rename the columns, you can pass «columnsToKeep» a 1D array containing the column names or numbers from the CSV text indexed by «columnIndex» that contains the new names. (Some could be the same and some different.)
 +
 
 +
If you pass a list or array to «columnsToKeep» it's best to also specify «columnIndex». Without a «columnIndex» it will still extract the specified columns, but less efficiently because array abstraction repeats the call for each element in the array passed to «columnsToKeep», and it parses the «csvText» anew each time.
  
 
===  «rowIndexLabelsColumn» for row labels ===
 
===  «rowIndexLabelsColumn» for row labels ===
  
By default, the local index <code>.Row</code> contains numbers. But, you can use «rowIndexLabelsColumn» to specify a column number to use for the labels for the local row index.  For example, <code>ParseCSV(CsvText, rowIndexLabelsColumn: 1)</code> treats the first column as row index labels rather than as array cells. It removes the selected column from the result array unless you  include that column in «columnIndex» or «columnsToKeep».
+
By default, the local index <code>.Row</code> contains numbers 1 to the number of rows. But, you can also use a column to use for the labels for the local row index by setting «rowIndexLabelsColumn» to the number or label of that column.  For example,  
 +
 
 +
<code>ParseCSV(CsvText, rowIndexLabelsColumn: 1)</code>  
 +
 
 +
treats the first column as row index labels rather than as array cells. It removes the selected column from the result array unless you  include that column in «columnIndex» or «columnsToKeep».
  
 
=== Non-comma separator: «separator» ===
 
=== Non-comma separator: «separator» ===
  
As you would expect in a "comma-separated" format,, it assumes that cells (columns) are separated by commas! But, sometimes "CSV" files use a different separator, which you can specify with the «separator» parameter. Common separators include <code>'|'</code>, <code>[[Chr]](9)</code> (tab), <code>';'</code>
+
As you would expect in a "comma-separated" format, it assumes that cells (columns) are separated by commas! But, sometimes "CSV" files use a separator character, which you can set with the «separator» parameter. Common separators include <code>'|'</code>, <code>[[Chr]](9)</code> (tab), and <code>';'</code>
  
 
=== Spaces around values: «trimCells» ===
 
=== Spaces around values: «trimCells» ===
Line 44: Line 49:
 
By default it trims away any leading or trailing spaces around each value. Set «trimCells» to False if you want to retain those spaces. It never trims spaces inside quotes.
 
By default it trims away any leading or trailing spaces around each value. Set «trimCells» to False if you want to retain those spaces. It never trims spaces inside quotes.
  
=== «parseFlags» ===
+
=== «parseFlags» for interpreting numbers and dates ===
  
 
By default ParseCSV converts each number, whether enclosed in quotes or not,  into a number or date, following Excel conventions.  
 
By default ParseCSV converts each number, whether enclosed in quotes or not,  into a number or date, following Excel conventions.  
«parseFlags» is a bit-field of flags to control whether it should try to convert cells into a number or date.  You can add flags together to combine their effects:   
+
«parseFlags» is a bit-field of flags that lets you change this behavior.  You can add flags together to combine their effects:   
 
* 0 = Parse both quoted ("52") and unquoted (52) cells as numbers.
 
* 0 = Parse both quoted ("52") and unquoted (52) cells as numbers.
* 1 = Don't parse any cells. E.g., return <code>="52"</code>, <code>"52"</code> and <code>52</code> as text.
+
* 1 = Don't parse any cells -- e.g. return <code>="52"</code>, <code>"52"</code> and <code>52</code> as text.
* 2 = Don't parse quoted cells. E.g., <code>"52"</code> is text, but <code>52</code> is numeric.
+
* 2 = Don't parse quoted cells -- e.g. treat <code>"52"</code> as text, but <code>52</code> as a number.
 
* 4 = Disable the '=' prefix (<code>="0042"</code> normally suppresses parsing).
 
* 4 = Disable the '=' prefix (<code>="0042"</code> normally suppresses parsing).
* 8 = Recognize backslash-escaped quotes to allow quotes as valid text inside a quote (in addition to the doubling of quotes, which is the usual CSV standard).
+
* 8 = Recognize backslash-escaped quotes to allow quotes inside a quote (in addition to the doubling of quotes, which is the usual CSV standard). For example "He said \"Hello\"." would return the text "He said "Hello"."
 
* 16 = Disable generalized date parsing.  When you do this, it still recognizes format <code>6-May-2016</code> as a date but not other formats. This flag dramatically speeds up parsing times for data with no other date formats.
 
* 16 = Disable generalized date parsing.  When you do this, it still recognizes format <code>6-May-2016</code> as a date but not other formats. This flag dramatically speeds up parsing times for data with no other date formats.
  
Line 63: Line 68:
 
=== «decimal» point and thousands separators ===
 
=== «decimal» point and thousands separators ===
 
   
 
   
By default, it assumes that <code>'.'</code> (dot or period) separates the whole number from the fractional part of a decimal number -- i.e. the convention in English-speaking countries. You can tell it to use <code>','</code> (comma) by setting the «decimal» parameter. That parameter accepts only <code>'.'</code> (dot) or <code>','</code> (comma).
+
By default, it formats large numbers with decimals as <code>"525,948.77"</code> using  <code>','</code>comma to separate thousands and <code>'.'</code> (dot) to separate the whole and decimal part -- i.e. the convention in English-speaking countries. You can get to use the opposite convention used in the rest of the World by setting the «decimal» parameter to  <code>','</code> (comma) instead of the default <code>'.'</code>.  
 
+
When comma is used for «decimal», the «separator» is usually  something other than comma -- commonly semi-colon <code>';'</code>. If the «decimal» and «separator» are both commas, there must be quotes around any numbers containing decimals.
+
  
When «decimal» is '.', you can use commas to group digits, such as <code>"525,948.77"</code>. Conversely, when «decimal» is comma, you can use '.' to group digits, such as <code>"525.948,77"</code> (for five hundred twenty-five thousand, nine hundred eight and 77 hundredths).  
+
When using comma for «decimal», it's usual to specify «separator» as semi-colon <code>';'</code> or something else other than comma. If «decimal» and «separator» are both commas, you need to use quotes around any numbers containing decimals.
  
 
=== «quote» character ===
 
=== «quote» character ===

Latest revision as of 02:37, 7 November 2019


new to Analytica 5.0

Converts a text value read in from a file with CSV (Comma-Separated Values) format into a 2D array. By default, it converts values that appear to be numbers and dates into Analytica numbers and dates. It offers lots of optional parameters to control the column and row indexes, and handle the many variants of the CSV format -- for example, with nonstandard characters that separate columns, rows, and for the quotes around text.

ParseCSV( csvText, columnIndex, rowIndex, separator, firstLineIsHeader, trimCells, columnsToKeep, rowIndexLabelColumn, parseFlags, dateTemplate, decimal, quote )

Converts text «csvText» from a CSV (Comma-Separated Values) format into a two-dimensional array. Usually, you use it with ReadTextFile() to read the file:

ParseCSV(ReadTextFile( filename ))

By default, it assumes the usual CSV conventions (like Excel) as defaults: comma to separate fields, newline to separate rows, and double quotes to enclose elements that may contain commas or newlines. The resulting array has local indexes, .Column and .Row. Index .Column gets the column headers from the first row. .Row is numbered 1 to n-1, where n is the number of rows. You can override any of these defaults with optional parameters described below.

«columnIndex» and «rowIndex»

You can set «columnIndex» or «rowIndex» to use an existing index instead of the default local indexes .Column or .Row. If you specify a «columnIndex» it treats the first row as data. If the first row contains headers, but you want to ignore them, perhaps because you are renaming some or all with «columnIndex», you should set «firstLineIsHeader» to true. It assumes the «columnIndex» contains the headers in the correct order. If «columnIndex» (or «rowIndex») is shorter than the number of columns (rows) in the original, it ignores the extra columns or rows. If an index is longer than expected, it pads the result with NULL for the extra columns (rows.)

If the header row is data: «firstLineIsHeader»

If the CSV has no column headers in the first row, you should set «firstLineIsHeader» to False, and it will treat the first row as data. If you don't specify a «columnIndex», it will use local index .Column containing numbers 1 to M (the number of columns).

Select, reorder, or rename columns with «columnsToKeep»

You can select just a single column from the CSV text by specifying «columnsToKeep» as a column name from the first line of the csv file, or as number of the column that you want. You should then omit «columnIndex».

You can also select several columns and/or reorder columns by setting «columnIndex» to an existing index containing only the headers you want in the order you want. If it contains column names from the CSV (in any order), you can specify it also to «columnsToKeep», e.g.,

code>ParseCSV(Csv, Col, columnsToKeep: Col)</code> If you want to rename the columns, you can pass «columnsToKeep» a 1D array containing the column names or numbers from the CSV text indexed by «columnIndex» that contains the new names. (Some could be the same and some different.)

If you pass a list or array to «columnsToKeep» it's best to also specify «columnIndex». Without a «columnIndex» it will still extract the specified columns, but less efficiently because array abstraction repeats the call for each element in the array passed to «columnsToKeep», and it parses the «csvText» anew each time.

«rowIndexLabelsColumn» for row labels

By default, the local index .Row contains numbers 1 to the number of rows. But, you can also use a column to use for the labels for the local row index by setting «rowIndexLabelsColumn» to the number or label of that column. For example,

ParseCSV(CsvText, rowIndexLabelsColumn: 1)

treats the first column as row index labels rather than as array cells. It removes the selected column from the result array unless you include that column in «columnIndex» or «columnsToKeep».

Non-comma separator: «separator»

As you would expect in a "comma-separated" format, it assumes that cells (columns) are separated by commas! But, sometimes "CSV" files use a separator character, which you can set with the «separator» parameter. Common separators include '|', Chr(9) (tab), and ';'

Spaces around values: «trimCells»

By default it trims away any leading or trailing spaces around each value. Set «trimCells» to False if you want to retain those spaces. It never trims spaces inside quotes.

«parseFlags» for interpreting numbers and dates

By default ParseCSV converts each number, whether enclosed in quotes or not, into a number or date, following Excel conventions. «parseFlags» is a bit-field of flags that lets you change this behavior. You can add flags together to combine their effects:

  • 0 = Parse both quoted ("52") and unquoted (52) cells as numbers.
  • 1 = Don't parse any cells -- e.g. return ="52", "52" and 52 as text.
  • 2 = Don't parse quoted cells -- e.g. treat "52" as text, but 52 as a number.
  • 4 = Disable the '=' prefix (="0042" normally suppresses parsing).
  • 8 = Recognize backslash-escaped quotes to allow quotes inside a quote (in addition to the doubling of quotes, which is the usual CSV standard). For example "He said \"Hello\"." would return the text "He said "Hello"."
  • 16 = Disable generalized date parsing. When you do this, it still recognizes format 6-May-2016 as a date but not other formats. This flag dramatically speeds up parsing times for data with no other date formats.

«dateTemplate» Date Formats

Use the «dateTemplate» parameter to specify the ordering for international dates. Use the letters "d", "M" and "y" to specify the ordering of these components. For example, with «dateTemplate» of "d/M/y", parses "11/10/9" as 11-Oct-2009, but with «dateTemplate» of "y/M/d" it parses it as 9-Oct-2011, and with "M/d/y" it would be 10-Nov-2009.

This parameter uses the same conventions used in the Custom date formats template in the Number Format Dialog, but only the relative ordering of the day, month and year patterns matter. Hence, all these work equivalently: "dMy", "dd/MM/yyyy", "d-M-yy", "d M yyyy". Also, upper/lower case matters! "M" must be uppercase (because "m" means minutes").

«decimal» point and thousands separators

By default, it formats large numbers with decimals as "525,948.77" using ','comma to separate thousands and '.' (dot) to separate the whole and decimal part -- i.e. the convention in English-speaking countries. You can get to use the opposite convention used in the rest of the World by setting the «decimal» parameter to ',' (comma) instead of the default '.'.

When using comma for «decimal», it's usual to specify «separator» as semi-colon ';' or something else other than comma. If «decimal» and «separator» are both commas, you need to use quotes around any numbers containing decimals.

«quote» character

The quote character is used in CSV to enclose numbers or text that may contain commas, new lines, or other separators. The default quote character is double quote ("). You can specify a different quote character using the «quote» parameter. The most common other quote character is the single quote ('). To type a single quote in Analytica, you can type doubleQuote SingleQuote doubleQuote ("'"). Analytica lets you use either matching single quotes or matching double quotes to enclose a text string.

The CSV format

There is no single well-defined CSV standard. Undocumented conventions used by Excel are often seen as the most definitive "specification". The article at Comma Separated Values (CSV) Standard File Format is relatively easy to read and covers the basics well, without being the most comprehensive reference.

Each "record" (or row) is a CSV file is delineated by a newline, either CR, LF or CRLF. Each value within a row is separated from the next field by a separator character, usually comma (hence the name "Comma-separated values"), although other characters or character sequences are also sometimes used, including commonly TAB (Chr(9)), bar ('|'), semi-colon (';') and even space.

A row of data might look like this:

California, CA, "39,144,818", 163696, Sacramento, "Edmund Gerald ""Jerry"" Brown, Jr.", ="95814", 9/9/1850

Both textual and numeric values may or may not be surrounded by quotes. Any value containing the separator character or a newline must be surrounded by quotes. Quotes are always the double-quote character ("). A value may also be a date (e.g., 9/9/1850). Spaces to the left or right of a value are normally trimmed (unless you set «textTrim» to false).

When a value is read, if it parses as a valid date, it will be parsed as a date, otherwise if it parses as a valid number, it will parse as a number, and otherwise it will be read as text. The «parseFlags» parameter can be used to alter this behavior somewhat. To force a value to be read as text, even when it would be a valid number, use the equal-quote form, ="95814".

Note that a new lines can appear within a quoted value, so simply splitting CSV data on newlines does not reliably separate the data into rows.

The first line of a CSV file often consists of columns names, with data starting on the second row. But this is not always the case, some CSV files have no column headers with data starting on the first line.

Examples

The simplest usage is:

ParseCSV( csvText )

which follows the most common format and Excel standards.

To read in a CSV file:

ParseCSV( ReadTextFile( filename ) )


History

This built-in function was introduced in Analytica 5.0. For earlier releases, the Flat-file library contains ReadCsvFile and ParseCsvText functions. ParseCSV is a substantial improvement on these.


See Also

Comments


You are not allowed to post comments.