
Normally, when we read a CSV file in R, spaces in the column names are automatically converted to '.':

> df <- read.csv("report.csv")
> str(df)
'data.frame':   598 obs. of  61 variables:
 $ LR.Number   
 $ Vehicle.Number   

However, when we read the same CSV file in SparkR, the spaces remain intact and are not handled implicitly by Spark:

#To read a csv file
df <- read.df(sqlContext, path = "report.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
printSchema(df)

root
 |-- LR Number: string (nullable = true)
 |-- Vehicle Number: string (nullable = true)

Because of this, performing any operation on such a column causes a lot of trouble, and the column has to be referenced with backticks, like this:

head(select(df, df$`LR Number`))

How can I handle this explicitly? Or can SparkR handle it implicitly?

I am using SparkR version 1.5.0.

Hardik Gupta
  • not sure if this will help, as the delimiter determines how the columns are separated. Here the columns are "," separated and each column name contains a space, so I have col1,col2 where col1 is 'name1 name2' and col2 is 'name1 name2' – Hardik Gupta Dec 16 '16 at 11:26
  • in normal R this gets handled automatically; try reading a csv file which has column names with spaces, and R automatically inserts a dot (.) – Hardik Gupta Dec 16 '16 at 11:27
  • Set `header = "true"` and `inferSchema = "false"` to skip the names and have it use built-in ones, or use `selectExpr()`, which supports `col_name AS new_col_name`, as seen in this Python example: http://stackoverflow.com/a/34077809/1457051 (which should be straightforward to extrapolate from; see the sketch after these comments). I'm running `sparklyr` with Spark 2.x when I use Spark, so I'm not keen to test `sparkR` with an older Spark version. You can also use `sql()` to import the CSV with SQL as shown in the "SQL" section of https://github.com/databricks/spark-csv (you can change col names that way in the `CREATE TABLE` call). – hrbrmstr Dec 16 '16 at 12:01
  • `header = "true"` and `inferSchema = "false"` doesn't work – Hardik Gupta Dec 16 '16 at 12:54
  • then you're stuck at the SQL level or potentially upgrading spark & sparkR to see if newer builds pass on the parameters. If you aren't using sparkR for the ML components but just for the "big data" capabilities take a look at Apache Drill. If all else fails, you can load the CSV in via spark SQL outside of R. – hrbrmstr Dec 16 '16 at 13:57
  • upgrading Spark is not an option; I have to live with 1.5.0 (since I am using Cloudera CDH 1.5.0 as well) – Hardik Gupta Dec 16 '16 at 14:15
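
For reference, here is a minimal sketch of the selectExpr() route suggested in the comments, assuming the same df loaded via spark-csv as in the question; the underscored names are illustrative, and note that selectExpr() returns only the columns you list, so all 61 would need to be spelled out:

# Rename on the Spark side with SQL expressions; backticks are required
# because the original column names contain spaces
df2 <- selectExpr(df,
                  "`LR Number` AS LR_Number",
                  "`Vehicle Number` AS Vehicle_Number")
printSchema(df2)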

2 Answers


As a workaround, you could rename the columns with the following piece of pseudo-code:

# Replace spaces in the column names with underscores
colnames_df <- colnames(df)
colnames_df <- gsub(" ", "_", colnames_df)
colnames(df) <- colnames_df
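
If colnames<- is not available for Spark DataFrames in your SparkR build, here is a sketch of the same renaming that stays on the Spark side, using columns() and withColumnRenamed() from the SparkR 1.5 API (no local collect needed):

# Rename every column whose name contains a space, one at a time
for (old_name in columns(df)) {
  new_name <- gsub(" ", "_", old_name)
  if (new_name != old_name) {
    df <- withColumnRenamed(df, old_name, new_name)
  }
}
printSchema(df)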

Another solution is to save the file somewhere with corrected column names and read it back using read.df().

Mohit Bansal

The following worked for me:

# Bring the Spark DataFrame down to a local R data frame
df <- collect(df)

# Replace spaces in the column names with underscores
colnames_df <- colnames(df)
colnames_df <- gsub(" ", "_", colnames_df)
colnames(df) <- colnames_df

# Convert back to a Spark DataFrame with the cleaned names
df <- createDataFrame(sqlContext, df)
printSchema(df)

Here we need to collect the data locally first, which converts the Spark DataFrame into a regular R data frame. I am sceptical whether this is a good solution, as I would rather not call collect(). However, I investigated and found that even to use the ggplot libraries we need to convert the data into a local data frame anyway.
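
For illustration, a minimal sketch of that ggplot round trip; the bar chart over Vehicle_Number is just an example, assuming ggplot2 is installed:

library(ggplot2)

# ggplot2 only understands local data frames, so collect first
local_df <- collect(df)

# Example: number of rows per Vehicle_Number
ggplot(local_df, aes(x = Vehicle_Number)) + geom_bar()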

Hardik Gupta