[Udemy] Study Notes: Taming Big Data with Apache Spark and Python – Hands On!

Notes from learning Spark on Udemy.

Environment setup

python

  • python3

java

  • JDK and JRE: make sure the install path contains no spaces; the Windows default, Program Files, causes problems
  • JAVA_HOME=C:\Program Files\Java\jdk-15.0.2
  • JAVA_HOME=C:\Program Files\Java\jdk-11.0.10
  • JAVA_HOME=C:\Program Files\Java\jdk1.8.0_281 (java 8)
  • Open the C:\spark\conf folder, rename log4j.properties.template to log4j.properties, and on line 19 change log4j.rootCategory=INFO to log4j.rootCategory=ERROR
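
The logging change can also be scripted. The following is a minimal sketch, not part of the course material: the function name `quiet_spark_logging` is hypothetical, it copies the template instead of renaming it so the original stays available, and it assumes the template contains a line starting with `log4j.rootCategory=INFO`.

```python
import shutil
from pathlib import Path


def quiet_spark_logging(conf_dir):
    """Create log4j.properties from the template and lower the root log level.

    Returns the path of the file that was written.
    """
    conf_dir = Path(conf_dir)
    template = conf_dir / "log4j.properties.template"
    target = conf_dir / "log4j.properties"

    # Copy rather than rename, so the untouched template is kept as a backup.
    shutil.copyfile(template, target)

    # Assumption: the root logger line starts with "log4j.rootCategory=INFO".
    text = target.read_text(encoding="utf-8")
    text = text.replace("log4j.rootCategory=INFO", "log4j.rootCategory=ERROR")
    target.write_text(text, encoding="utf-8")
    return target
```

With the layout used in these notes, it would be called as `quiet_spark_logging(r"C:\spark\conf")`.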

hadoop

  • HADOOP_HOME=c:\winutils
  • Download the tool from https://sungod-spark.s3.amazonaws.com/winutils.exe; it lets the PC believe Hadoop exists in the environment without actually installing Hadoop.
  • mkdir c:\winutils\bin, then put winutils.exe in that folder
  • c:\winutils\bin\winutils.exe chmod 777 \tmp\hive

Environment variables

  • SPARK_HOME: c:\spark
  • JAVA_HOME: C:\Program Files\Java\jdk1.8.0_281
  • HADOOP_HOME: C:\winutils
  • path:
    • %SPARK_HOME%\bin
    • %JAVA_HOME%\bin
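
Before running Spark, it can help to confirm that these variables are actually visible to a new process. The sketch below is only an illustration: the helper names `missing_spark_vars` and `paths_with_spaces` are hypothetical, and the second one simply flags values containing a space, following the note about spaces in the install path from the Java section.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")


def missing_spark_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def paths_with_spaces(env=None):
    """Return the required variables whose value contains a space."""
    env = os.environ if env is None else env
    return {name: env[name] for name in REQUIRED_VARS if " " in env.get(name, "")}


if __name__ == "__main__":
    print("missing:", missing_spark_vars())
    print("contain spaces:", paths_with_spaces())
```

Both helpers accept an optional mapping, so they can be tried against a plain dict without touching the real environment.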

CH6 Installing the MovieLens movie rating dataset

CH7

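Error

  • Cause: the Java version was too new (I had 15 installed; the suggestion was to downgrade to 11, but only reinstalling 8 finally fixed it)
  • mkdir c:\tmp\hive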
  • spark-submit ratings-counter.py

If the Java version is too new, the following error appears:

Traceback (most recent call last):
  File "C:/sw/00-work/08-learning/12-Spark_Python/ratings-counter.py", line 12, in 
...
...
...
: java.lang.IllegalArgumentException: Unsupported class file major version 59
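
The number in this message identifies the Java release that compiled the classes: class file major version 59 corresponds to Java 15, which matches the cause noted above. As a small sketch (the function name is hypothetical), the mapping for Java 5 and newer is a fixed offset of 44:

```python
def java_release_for_class_version(major):
    """Map a class file major version to the Java release that produces it."""
    # Java 5 wrote major version 49, and each later release added one,
    # so the offset of 44 holds for Java 5 and newer.
    if major < 49:
        raise ValueError("this simple offset rule only covers Java 5 and newer")
    return major - 44


print(java_release_for_class_version(59))  # 15
print(java_release_for_class_version(52))  # 8
```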

CH8 What’s new in Spark 3

  • The old RDD-based Spark MLlib API is deprecated; use the DataFrame-based API instead
  • About 17 times faster than Spark 2 (as stated in the course)
  • Python 2 is no longer supported
  • GPU instance support
  • Deeper Kubernetes (k8s) support
  • Binary file support
  • SparkGraph and Cypher support (graphs in the computer-science sense)
  • ACID support in data lakes with Delta Lake

CH9. Introduction to Spark.

  • Spark is a fast and general engine for large-scale data processing
  • Spark is a distributed system: work flows from the cluster manager to the executors
  • Similar to MapReduce, but faster
  • Fewer lines of code in Spark than in MapReduce for the same task
  • Components: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

CH10. The Resilient Distributed Dataset (RDD)

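  • A resilient (fault-tolerant), distributed dataset
  • The RDD is the foundation of Spark: essentially one large dataset object whose attributes and methods let it operate on big data
  • See the reference literature
  • The Spark shell creates a SparkContext (sc) object, which is used to create and run RDDs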
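
To make the idea of an RDD with methods more concrete, here is a minimal sketch of counting rating values with RDD operations. It is an illustration under stated assumptions, not the course's script: the function names are hypothetical, `sc` is an existing SparkContext such as the one the Spark shell provides, and the input is assumed to follow the MovieLens 100k `u.data` layout (user id, movie id, rating, timestamp).

```python
def extract_rating(line):
    """Return the rating field from one line of the ratings file.

    Assumption: fields are whitespace-separated as user id, movie id,
    rating, timestamp.
    """
    return line.split()[2]


def rating_histogram(sc, path):
    """Count how many times each rating value occurs, using RDD operations.

    `sc` is an existing SparkContext; `path` points at the ratings file.
    """
    lines = sc.textFile(path)            # RDD with one element per line of text
    ratings = lines.map(extract_rating)  # transformation: new RDD of rating strings
    counts = ratings.countByValue()      # action: dict-like mapping value -> count
    return sorted(counts.items())
```

In the pyspark shell, where `sc` already exists, this could be called as `rating_histogram(sc, "file:///path/to/u.data")`; the path is a placeholder.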
