Spark Scala: How to convert DataFrame[Vector] to DataFrame[f1: Double, …, fn: Double]

I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the lengths of my vectors are arbitrary. I know how to do it for a specific set of 3 features by using



myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")


but not for an arbitrary number of features. Is there an easy way to do this?



Example:



val testDF = sc.parallelize(List(
  Vectors.dense(5D, 6D, 7D),
  Vectors.dense(8D, 9D, 10D),
  Vectors.dense(11D, 12D, 13D)
)).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]


EDIT



I found out how to unpack the column names when creating the dataframe, but I am still having trouble converting a vector to the sequence needed to create the dataframe:



val finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
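
The .toSeq call itself is fine; what fails is calling .toDF on a collection of bare Seq[Double] values, because Spark can only derive a multi-column schema from a Product such as a tuple or case class. One way around that (a sketch, not from the original post, assuming Spark 1.x with sqlContext and the mllib Vector in scope) is to build an explicit schema and map each vector to a Row:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical workaround: one DoubleType field per desired column name
val names = List("f1", "f2", "f3")
val schema = StructType(names.map(n => StructField(n, DoubleType, nullable = false)))

// Map each vector to a Row of its elements, then apply the schema
val rowRDD = testDF.rdd.map { case Row(v: Vector) => Row.fromSeq(v.toArray.toSeq) }
val finalDF = sqlContext.createDataFrame(rowRDD, schema)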

Tags: scala, apache-spark, apache-spark-sql, apache-spark-ml

asked Jun 29 '16 at 21:06 by mt88
edited Jan 20 at 22:47 by user10465355

          2 Answers

14 votes

One possible approach is something like this:



import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.Vector

// Get the size of the vector from the first row
val n = testDF.first.getAs[Vector](0).size

// Simple helper to convert a vector to array<double>
val vecToSeq = udf((v: Vector) => v.toArray)

// Prepare the list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs: _*)


If you know the list of columns upfront, you can simplify this a little:



          val cols: Seq[String] = ???
          val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
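
Put together with the question's testDF and myColumnNames, an end-to-end sketch (the same approach as above, just wiring the pieces together) would be:

val myColumnNames = List("f1", "f2", "f3")
val exprs = myColumnNames.zipWithIndex.map { case (c, i) => $"_tmp".getItem(i).alias(c) }

val finalDF = testDF
  .select(vecToSeq($"scaledFeatures").alias("_tmp"))
  .select(exprs: _*)

finalDF.printSchema()
// expected: three double columns named f1, f2, f3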


For a Python equivalent, see How to split Vector into columns - using PySpark.






answered Jun 29 '16 at 21:41 by zero323, edited May 24 '18 at 23:47
2 votes

An alternate solution that evolved a couple of days ago: import the VectorDisassembler into your project (as long as it is not merged into Spark), then:



import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")

val assembler = new VectorAssembler()
  .setInputCols(Array("val1", "val2"))
  .setOutputCol("vectorCol")

val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
|  0| 1.2| 1.3|[1.2,1.3]|
|  1| 2.2| 2.3|[2.2,2.3]|
|  2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+
*/

val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
  .setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
|  0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
|  1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
|  2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+
*/
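
One caveat with the output above (an observation, not from the original answer): it repeats the val1/val2 names, so unqualified references to those columns become ambiguous. Assuming VectorDisassembler takes its output names from the vector's ML attribute metadata, as the example output suggests, a sketch that avoids the duplicates is to drop the source columns before disassembling:

// Keep only the id and the vector, then disassemble and drop the vector
val slim = output.select("id", "vectorCol")
val result = disassembler.transform(slim).drop("vectorCol")
result.show()
// expected columns: id, val1, val2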





answered Jan 13 '17 at 16:29 by Boern
VectorDisassembler never got into Spark (SPARK-13610). – hi-zir, May 9 '18 at 18:41
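
As a present-day footnote (not from the original thread): Spark 3.0 added a built-in org.apache.spark.ml.functions.vector_to_array, which removes the need for the hand-written UDF in the accepted answer on newer versions. A sketch, assuming an ml-package vector column like vectorCol above:

import org.apache.spark.ml.functions.vector_to_array
import spark.implicits._  // for the $"col" syntax

// vector_to_array yields an array<double> column that can then be split
// with getItem, exactly as in the accepted answer
val withArr = output.select($"id", vector_to_array($"vectorCol").alias("_tmp"))
val exprs = Seq("val1", "val2").zipWithIndex.map { case (c, i) => $"_tmp".getItem(i).alias(c) }
val finalDF = withArr.select(($"id" +: exprs): _*)
// expected columns: id, val1, val2 (as plain doubles)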