File Based Data

Overview

Knowi enables analysis, visualization, warehousing and reporting automation from file based data across the following file formats & Channels:

  • CSV
  • JSON
  • Logs
  • XML
  • Excel
  • Files over S3, FTP, HTTP, Email

Note, that data from these sources will be created as widgets directly.

Use our agent inside your network to connect and extract data from various file based data sources securely. See Cloud9Agent to download your agent along with instructions to run it. Under the examples folder, the agent contains datasource_example_csv.json and query_example_csv.json to get you started.

  • Edit those to point to your directory and modify the queries to pull your data.
  • Move it into the config directory (datasource_XXX.json files first if the Agent is running).

CSV/Delimited files

Highlights:

  • Execute queries on the file when the file is modified, one time or on a schedule.
  • Extract/Cleanse and query file based data using Cloud9QL.

Datasource Configuration:

Parameter Comments
name Unique Datasource Name.
datasource Set value to csv
url File path. Example: /usr/local/examples/demoFile.txt
quoteChar Quote Character for String values, if applicable. Defaults to double quotes if not specified.
delimChar Delimiter character. Defaults to comma if not specified
nullFindString Optional, searches for fields to replace with nullReplaceWith. Useful to replace values such as 'NA', 'None' etc in a number field
nullReplaceWith Optional, Replaces the string specified in nullFindString with this
columnTypes Nested JSON that defines datatypes for fields. Defaults to String if a type is not specified. Datatypes supported: String, Date(), Number, double. See example and the Date Format section.

Date Formats for Date Fields in columnTypes:

Letter Date/Time Component
y Year
M Month
w Week of Year
W Week in month
D Day in Year
d Day in Month
F Day of Week in Month
a AM/PM marker
H Hour in day (0-23)
h Hour in am/pm (1-12)
m Minute in hour
s Second in minute
S Millisecond
z Time zone
Z Time zone

Example:

  ...
  "columnTypes":{
     "Week":"Date(MM/dd/yy)",
     "Date":"Date(MM/dd/yy HH:mm)"
  }
  ...
 

Query Configuration:

Query Config Params Comments
entityName Dataset Name Identifier
identifier A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName Name of the datasource name configured in the datasource_XXX.json file to execute the query against. Required.
queryStr Cloud9QL query to extract and manipulate data using. Use select * to select all. See Cloud9QL docs
runOnSourceChange The query is run when the file has been modified, if set to true. Possible values: true or false.
frequencyType Alternative to runOnSourceChange, pick up the file on a schedule. One of minutes, hours, days,weeks,months. If this is not specified, this is treated as a one time query, executed upon Cloud9Agent startup (or when the query is first saved)
frequency Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes
startTime Optional, can be used to specify when the query should be run for the first time. If set, the the frequency will be determined from that time onwards. For example, is a weekly run is scheduled to start at 07/01/2014 13:30, the first run will run on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported Date Formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss,HH:mm,mm
overrideVals This enables data storage strategies to be specified. If this is not defined, the results of the query is added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data specify "replaceValuesForKey":["fieldA","fieldB"]. This will replace all existing records in Knowi with the same fieldA and fieldB with the the current data and insert records where they are not present.

Datasource Example:

[
  {
    "name":"demoCSV",
    "url":"/usr/local/examples/demo-data.csv",
    "datasource":"csv",
    "quoteChar":"\"",
    "delimChar":",",
    "eolSymbols":"\n",
    "nullFindString":"None",
    "nullReplaceWith":"0",
    "columnTypes":{
      "Week":"Date(MM/dd/yy)",
      "Date":"Date(MM/dd/yy HH:mm)"
    }
  }
]

Query Example:

[
  {
    "entityName":"CSV Demo",
    "dsName":"demoCSVFile",
    "queryStr":"select count(*) where someField like abc",
    "runOnSourceChange":true,
    "overrideVals":{
      "replaceAll":true
    }
  }
]

JSON

Highlights:

  • Extract data from JSON documents
  • Extract/Cleanse Query data using Cloud9QL, including nested documents.
  • Trigger query execution once, on a schedule or when a file is modified.

Datasource Configuration:

Parameter Comments
name Unique Datasource Name.
datasource Set value to json
url File path. Example: /usr/local/examples/demoFile.json

Query Configuration:

.
Query Config Params Comments
entityName Dataset Name Identifier
identifier A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName Name of the datasource name configured in the datasource_XXX.json file to execute the query against. Required.
queryStr Cloud9QL query to extract and manipulate data using. Use select * to select all. See Cloud9QL docs. Nested queries and objects are supported.
runOnSourceChange Optional. The query is run when the file has been modified, if set to true. Possible values: true or false.
frequencyType Alternative to runOnSourceChange, pick up the file on a schedule. One of minutes, hours, days,weeks,months. If this is not specified, this is treated as a one time query, executed upon Cloud9Agent startup (or when the query is first saved)
frequency Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes
startTime Optional, can be used to specify when the query should be run for the first time. If set, the the frequency will be determined from that time onwards. For example, is a weekly run is scheduled to start at 07/01/2014 13:30, the first run will run on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported Date Formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss,HH:mm,mm
overrideVals This enables data storage strategies to be specified. If this is not defined, the results of the query is added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data specify "replaceValuesForKey":["fieldA","fieldB"]. This will replace all existing records in Knowi with the same fieldA and fieldB with the the current data and insert records where they are not present.

Datasource Example:

[
  {
    "name":"demoJSON",
    "url":"examples/demo-data.json",
    "datasource":"json"
  },
  {
    "name":"demoJSON-Nested",
    "url":"examples/demo-nested.json",
    "datasource":"json"
  }

]

Query Example:

[
  {
    "entityName":"JSON Demo - 1",
    "dsName":"demoJSON",
    /* Using Cloud9SQL to query JSON file */
    "queryStr":"select sum(sent) as Sent, customer as Customer group by Customer",
    "runOnSourceChange":true,
    "overrideVals":{
      "replaceAll":true
    }
  },
  {
    "entityName":"JSON Nested Demo",
    "dsName":"demoJSON-Nested",
    "queryStr":"select nestedObj.a, nestedArr[0], sent where nestedArr[0]=150",
    "runOnSourceChange":true,
    "overrideVals":{
      "replaceAll":true
    }
  }
]

Log Files

Highlights:

  • Extract data from Logs using a regex based parser
  • Extract/Cleanse Query data using Cloud9QL
  • Trigger query execution once, on a schedule or when a file is modified.

Note: The examples folder of the agent contains datasourceexampleregex.json and queryexampleregex.json sample configurations.

Datasource Configuration:

Parameter Comments
name Unique Datasource Name.
datasource Set value to regex
url File path. Example: /usr/local/examples/demoFile.json
logFormat Format of the log entries to process.

 Example: ${date:DATE(MMM d HH:mm:ss)}${host:WORD()}\\S+:${msg} 
 Given the following log entries
 Jan 25 04:02:02 abhf sudo: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/dk
 Jan 25 04:02:02 ldklf sshd[27099]: pddam_unix(sshd:session): session closed for user admin

 The results will be three fields:
    Date: parsed in the date format
    host: a host field, parsed after the date
    msg: Rest of the log entry.

Query Configuration:

.
Query Config Params Comments
entityName Dataset Name Identifier
identifier A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName Name of the datasource name configured in the datasource_XXX.json file to execute the query against. Required.
queryStr Cloud9QL query to extract and manipulate data using. Use select * to select all. See Cloud9QL docs. Nested queries and objects are supported.
runOnSourceChange Optional. The query is run when the file has been modified, if set to true. Possible values: true or false.
frequencyType Alternative to runOnSourceChange, pick up the file on a schedule. One of minutes, hours, days,weeks,months. If this is not specified, this is treated as a one time query, executed upon Cloud9Agent startup (or when the query is first saved)
frequency Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes
startTime Optional, can be used to specify when the query should be run for the first time. If set, the the frequency will be determined from that time onwards. For example, is a weekly run is scheduled to start at 07/01/2014 13:30, the first run will run on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported Date Formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss,HH:mm,mm
overrideVals This enables data storage strategies to be specified. If this is not defined, the results of the query is added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data specify "replaceValuesForKey":["fieldA","fieldB"]. This will replace all existing records in Knowi with the same fieldA and fieldB with the the current data and insert records where they are not present.

Datasource Example:

[
    {
        "name":"demoRegex",
        "url":"examples/demo-regex.log",
        "datasource":"regex",
        "logFormat":"${date:DATE(MMM d HH:mm:ss)}${host:WORD()}\\S+:${msg}"
    }
]

Query Example:

[
    {
        "entityName":"Sample Regex Query",
        "dsName":"demoRegex",
        "queryStr":"select date(date) as date, count(*) as count group by date(date)",
        "runOnSourceChange":true,
        "overrideVals":{
            "replaceAll":true
        }
    }
]

XML Files

Highlights:

  • Extract data from XML files
  • Traverse/Extract/Cleanse/Aggregate data using Cloud9QL
  • Trigger query execution once, on a schedule or when a file is modified.

Note: The examples folder of the agent contains datasourceexamplexml.json and queryexamplexml.json sample configurations.

Datasource Configuration:

Parameter Comments
name Unique Datasource Name.
datasource Set value to xml
url File path. Example: /usr/local/examples/demoFile.txt
dataNode The base node to traverse from
columnTypes Nested JSON that defines datatypes for fields. For nested elements, use the dot notation. Example: Count.Sent
Defaults to String if a type is not specified. Datatypes supported: String, Date(), Number, double. See example and the Date Format section.

Date Formats for Date Fields in columnTypes:

Letter Date/Time Component
y Year
M Month
w Week of Year
W Week in month
D Day in Year
d Day in Month
F Day of Week in Month
a AM/PM marker
H Hour in day (0-23)
h Hour in am/pm (1-12)
m Minute in hour
s Second in minute
S Millisecond
z Time zone
Z Time zone

Example:

  ...
  "columnTypes":{
     "Week":"Date(MM/dd/yy)",
     "Date":"Date(MM/dd/yy HH:mm)",
     "Count.Sent":"Integer"
  }
  ...
 

Query Configuration:

Query Config Params Comments
entityName Dataset Name Identifier
identifier A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName Name of the datasource name configured in the datasource_XXX.json file to execute the query against. Required.
queryStr Cloud9QL query to extract and manipulate data using. Use select * to select all. See Cloud9QL docs
runOnSourceChange The query is run when the file has been modified, if set to true. Possible values: true or false.
frequencyType Alternative to runOnSourceChange, pick up the file on a schedule. One of minutes, hours, days,weeks,months. If this is not specified, this is treated as a one time query, executed upon Cloud9Agent startup (or when the query is first saved)
frequency Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes
startTime Optional, can be used to specify when the query should be run for the first time. If set, the the frequency will be determined from that time onwards. For example, is a weekly run is scheduled to start at 07/01/2014 13:30, the first run will run on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported Date Formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss,HH:mm,mm
overrideVals This enables data storage strategies to be specified. If this is not defined, the results of the query is added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data specify "replaceValuesForKey":["fieldA","fieldB"]. This will replace all existing records in Knowi with the same fieldA and fieldB with the the current data and insert records where they are not present.

Datasource Example:

[
    {
        "name":"demoXML",
        "url":"examples/demo-data.xml",
        "datasource":"xml",
        "dataNode":"data.item",
        "columnTypes":{
            "id":"Integer",
            "Week":"Date(MM/dd/yy)",
            "Date":"Date(MM/dd/yy HH:mm)",
            "Count.Sent":"Integer"
        }
    }
]

Query Example:

[
    {
        "entityName":"XML Demo - 1",
        "dsName":"demoXML",
        "queryStr":"select *",
        "runOnSourceChange":true,
        "overrideVals":{
            "replaceAll":true
        }
    },
    {
        "entityName":"XML Demo - 2",
        "dsName":"demoXML",
        "queryStr":"select sum(Count.Sent) as Sent, Customer where recipients.recipient[0] = c001@cloud9charts.com",
        "runOnSourceChange":true,
        "overrideVals":{
            "replaceAll":true
        }
    }
]

The first query returns the entire document as is with nested elements intact. The second traverses nested relationships to build a simple aggregation, based on specific conditions.


FTP

Highlights:

  • Extends the CSV/Delimiter functionality to connect and pull files from an FTP site.
  • Can be set to only process if the file is different than the last pulled file.

Datasource and Query configurations are the same as the CSV/JSON processing, except for a customProcessors attribute that enables FTP.

Example CSV Datasource using FTP to pull file:

[
    {
   "name":"pageViewsLogs",
   "url":"/export/home/akjd/file.csv",
   "datasource":"csv",
   "quoteChar":"\"",
   "delimChar":"|",
   "eolSymbols":"\n",
   "nullFindString":"None",
   "nullReplaceWith":"0",
   "columnTypes":{
     "startCount":"int",
     "endCount":"int"
   },
   "customProcessors":[
     {
       "className":"com.cloud9.connector.processors.FTPProcessor",
       "processOrder":"timed",
       "frequencyType":"days",
       "frequency":1,
       "startTime":"03:15",
       "params":{
         "host":"ftp.box.com",
         "user":"box@cloud9charts.com",
         "password":"somePass",
         "remoteFile":"/BoxDir/\${date:yyyyMMdd}/page_views_\${date:yyyyMMdd-1d}000000_\${date:yyyyMMdd}000000.csv",
         "localFile":"/export/home/akjd/file.csv",
         "overwriteOnlyWhenChanged":true
       }
     }
   ]
  }
]

FTP Processor params configuration:

FTP Processor Parameter Comments
host FTP Host to connect to
port Port number, optional
user user to connect to the FTP site. Connects as anonymous if not specified
password Password to connect. Connects as anonymous if user is not specified
mode Optional security mode. Default is empty. Values: ftps or ftpes
remoteFile Remote File Name. For files with dates in them, use \${date:dateFormat} to for current date. You can also add or subtract days or weeks. Example: \${date:yyyyMMdd-1d}
localFile Where to put the file pulled from the FTP site. Typically, this is set to the same as url parameter in the main datasource configuration
overwriteOnlyWhenChanged If set to true, overwrites local copy only if the remote file is different than the local copy.


Email Attachments

This can be used to turn existing text reports sent via email into dashboards widgets and dashboards, with storage and trends.

Highlights:

  • Visualize, transform and track your existing CSV/delimited email reports.
  • CC existing email report to us.

Contact us for the mailbox to CC your report into and we'll take care of the rest.


Excel Files

Highlights:

  • Process Excel files when the file is modified, on a schedule, or one-time.
  • Extract/Cleanse data from a Worksheet using Cloud9QL.

Datasource Configuration:

Parameter Comments
name Unique Datasource Name.
datasource Set value to excel
url File path. Example: /usr/local/examples/demoFile.txt

Query Configuration:

Query Config Params Comments
entityName Dataset Name Identifier
identifier A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName Name of the datasource name configured in the datasource_XXX.json file to execute the query against. Required.
queryStr Cloud9QL query to extract and manipulate data from the worksheet. Use select * to select everything from the worksheet. See Cloud9QL docs
excelWorksheetName Worksheet name to process data from
runOnSourceChange The query is run when the file has been modified, if set to true. Possible values: true or false.
frequencyType Alternative to runOnSourceChange, pick up the file on a schedule. One of minutes, hours, days,weeks,months. If this is not specified, this is treated as a one time query, executed upon Cloud9Agent startup (or when the query is first saved)
frequency Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes
startTime Optional, can be used to specify when the query should be run for the first time. If set, the the frequency will be determined from that time onwards. For example, is a weekly run is scheduled to start at 07/01/2014 13:30, the first run will run on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported Date Formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss,HH:mm,mm
overrideVals This enables data storage strategies to be specified. If this is not defined, the results of the query is added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data specify "replaceValuesForKey":["fieldA","fieldB"]. This will replace all existing records in Knowi with the same fieldA and fieldB with the the current data and insert records where they are not present.

Datasource Example:

[
  {
    "name":"demoExcel",
    "url":"/usr/local/abc/files/demo-data.xlsx",
    "datasource":"excel"
  }
]

Query Example:

[
  {
    "entityName":"Excel Data - Demo Sent",
    "dsName":"demoExcel",
    "queryStr":"select sum(sent) as Sent, Customer group by customer",
    "runOnSourceChange":true,
    "excelWorksheetName":"demo-data",
    "overrideVals":{
      "replaceAll":true
    }
  }
]

HTTP File Downloads

Highlights:

  • Download and process CSV/JSON files via HTTP.
  • Extract/manipulate data using Cloud9QL.
  • Pull on a schedule, or once, or process if the file has been modified since last run.

Datasource Example:

[
  {
    "name":"demoHttpPublic",
    "url":"https://s3.amazonaws.com/datatesting/demo-data-public.csv",
    "datasource":"http",
    "dataType":"csv",
    "localFile":"./downloadedFile.csv"
  }
]

Query Example:

[
  {
    "entityName":"HTTP File Demo",
    "dsName":"demoHttpPublic",
    "queryStr":"select date(date) as Date, sum(sent) as Total Opened, sum(opened) as Opened group by date(date)",
    "overrideVals":{
      "replaceAll":true
    }
  }
]

Amazon S3

Highlights:

  • Process CSV/JSON files stored in S3.
  • Extract/manipulate data using Cloud9QL.
  • Pull on a schedule, once, or process if the file has been modified after last run.
  • Process File names with dates dynamically based on the current date.

Authorization

For files that require authentication/authorization, create credentials from AWS:

  1. Login to the AWS IAM console:
  2. Click on 'Create a New group of Users', enter a group name. Select 'Read Only Access'
  3. Enable "s3:Get*" in the Policy Document, remove all others:
    S3 Permissions
  4. Click on 'Create New Users'. Create a user name with 'Generate an access key for each user' checked.
  5. Once created, use the show credentials to note down the credentials for the user.

Datasource Configuration:

Parameter Comments
name Unique Datasource Name.
datasource Set value to s3
s3RegionName Region associated to the bucket. Example: us-east-1
s3BucketName S3 Bucket Name/Identifier
s3FileKey S3 File Name
datatype File format. Supported: json/csv
s3FileKey S3 File Name
datatype File format. Supported: json/csv. Defaults to csv if not specified.
localFile Location and name of the local file upon download. Optional - if not specified, it'll pull same name as the S3 key into the base agent directory.
authAccessToken S3 Credentials key obtained for the user during the auth step
authAccessTokenSecret S3 Credentials secret obtained for the user using the auth step

Query Configuration:

Query Config Params Comments
entityName Dataset Name Identifier
identifier A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName Name of the datasource name configured in the datasource_XXX.json file to execute the query against. Required.
queryStr Cloud9QL query to extract and manipulate data from the worksheet. Use select * to select everything from the worksheet. See Cloud9QL docs
runOnSourceChange The query is run when the file has been modified, if set to true. Possible values: true or false.
frequencyType Alternative to runOnSourceChange, pick up the file on a schedule. One of minutes, hours, days,weeks,months. If this is not specified, this is treated as a one time query, executed upon Cloud9Agent startup (or when the query is first saved).
frequency Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes.
startTime Optional, can be used to specify when the query should be run for the first time. If set, the the frequency will be determined from that time onwards. For example, is a weekly run is scheduled to start at 07/01/2014 13:30, the first run will run on 07/01 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported Date Formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss,HH:mm,mm
overrideVals This enables data storage strategies to be specified. If this is not defined, the results of the query is added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data specify "replaceValuesForKey":["fieldA","fieldB"]. This will replace all existing records in Knowi with the same fieldA and fieldB with the the current data and insert records where they are not present.

Datasource Example:

[
  {
    "name":"demoS3CSV",
    "datasource":"s3",
    "s3RegionName":"us-east-1",
    "s3BucketName":"datatesting",
    "s3FileKey":"demo-data.csv",
    "dataType":"csv",
    "authAccessToken":"someToken",
    "authAccessTokenSecret":"someSecret",
    "localFile":"./downloadedFile.csv"
  },
  {
    "name":"demoS3JSON",
    "datasource":"s3",
    "s3RegionName":"us-east-1",
    "s3BucketName":"datatesting",
    "s3FileKey":"demo-data.json",
    "dataType":"json",
    "authAccessToken":"someToken",
    "authAccessTokenSecret":"someSecret"
  }
]

For processing s3BucketName or s3FileKey that has a date based name, use ${date: <dateFormat>} to process based on current date.

Examples:

i. If the current date is 08/01/2014 and today's data is identified by 20140801-file.csv in S3, this can be defined as ${date:yyyyMMdd}-file.csv for processing.

ii. If the current date is 08/01/2014 and the file has multiple days worth of data, identified by 20140801-20140727-file.csv, define it as ${date:yyyyMMdd}-${date:yyyyMMdd-5d}-file.csv

Query Example:

[
  {
    "entityName":"S3 CSV",
    "dsName":"demoS3CSV",
    "queryStr":"select * limit 10",
    "overrideVals":{
      "replaceAll":true
    }
  },
  {
    "entityName":"S3 JSON",
    "dsName":"demoS3JSON",
    "queryStr":"select * where sent > 10000",
    "overrideVals":{
      "replaceAll":true
    }
  }
]

Public Files

For public files on S3 that does not require S3 credentials, see our HTTP File Downloads section