Notes from learning Spark on Udemy.
Environment setup
python
- python3
java
- JDK and JRE: make sure the install path contains no spaces; the Windows default, Program Files, causes problems.
- JAVA_HOME values tried, in order: C:\Program Files\Java\jdk-15.0.2, then C:\Program Files\Java\jdk-11.0.10, and finally C:\Program Files\Java\jdk1.8.0_281 (Java 8)
- Open the C:\spark\conf folder, rename log4j.properties.template to log4j.properties, and on line 19 change log4j.rootCategory=INFO to log4j.rootCategory=ERROR
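The log4j change above can also be scripted. The following is a minimal sketch, not part of the course material: the function name `enable_quiet_logging` is hypothetical, and unlike the manual step it copies the template instead of renaming it, so the original file stays in place. It assumes the template contains the text `log4j.rootCategory=INFO`.

```python
from pathlib import Path


def enable_quiet_logging(conf_dir):
    """Create log4j.properties from the template with the root level set to ERROR."""
    conf = Path(conf_dir)
    template = conf / "log4j.properties.template"
    target = conf / "log4j.properties"

    text = template.read_text(encoding="utf-8")
    # Only the root logger level is changed; every other line is copied verbatim.
    text = text.replace("log4j.rootCategory=INFO", "log4j.rootCategory=ERROR")
    target.write_text(text, encoding="utf-8")
    return target
```

With the layout used in these notes, the call would be `enable_quiet_logging(r"C:\spark\conf")`.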
hadoop
- HADOOP_HOME=c:\winutils
- Download winutils.exe from https://sungod-spark.s3.amazonaws.com/winutils.exe. It lets the PC believe Hadoop is present in the environment without actually installing Hadoop.
- mkdir c:\winutils\bin, then put winutils.exe in that folder
- c:\winutils\bin\winutils.exe chmod 777 \tmp\hive
Environment variables
- SPARK_HOME: c:\spark
- JAVA_HOME: C:\Program Files\Java\jdk1.8.0_281
- HADOOP_HOME: C:\winutils
- path:
  - %SPARK_HOME%\bin
  - %JAVA_HOME%\bin
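A quick way to confirm the variables above are in place before running Spark is a small check script. This is a sketch under assumptions: the function name `check_spark_env` is hypothetical, and the path separator defaults to `;` because these notes describe a Windows setup.

```python
import os

REQUIRED_VARS = ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")


def check_spark_env(env=None, sep=";"):
    """Return a list of problems; an empty list means the settings look consistent."""
    env = os.environ if env is None else env
    problems = []

    # Each of the three variables must be set to a non-empty value.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # PATH must contain the bin folder of SPARK_HOME and of JAVA_HOME.
    # Windows paths are case-insensitive, so compare in lower case.
    path_entries = [p.strip().lower().rstrip("\\/") for p in env.get("PATH", "").split(sep)]
    for name in ("SPARK_HOME", "JAVA_HOME"):
        home = env.get(name)
        if home:
            expected = (home.rstrip("\\/") + "\\bin").lower()
            if expected not in path_entries:
                problems.append(f"PATH does not contain %{name}%\\bin")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

Passing a plain dictionary instead of the real environment makes it easy to try the check against a planned configuration first.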
CH6 Installing the MovieLens movie rating dataset
- Download the sample data
- Unzip it and put it under the project folder
CH7
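Error
- Cause: the Java version was too new (I had installed 15; going back to 11 was suggested, but it only worked after I went back to 8)
- mkdir c:\tmp\hive
- spark-submit ratings-counter.py
If the Java version is too new, the following error appears: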
Traceback (most recent call last):
File "C:/sw/00-work/08-learning/12-Spark_Python/ratings-counter.py", line 12, in
...
...
...
: java.lang.IllegalArgumentException: Unsupported class file major version 59
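The number in this message identifies the Java release that compiled the classes: the class file major version is the Java release number plus 44, so 52 is Java 8, 55 is Java 11, and 59 is Java 15. A small sketch that decodes the message (the function name `java_release_from_error` is hypothetical):

```python
import re


def java_release_from_error(message):
    """Extract the Java release from an 'Unsupported class file major version' message."""
    match = re.search(r"Unsupported class file major version (\d+)", message)
    if match is None:
        return None
    # Class file major version = Java release + 44 (for example 52 -> Java 8).
    return int(match.group(1)) - 44


error = ": java.lang.IllegalArgumentException: Unsupported class file major version 59"
print(java_release_from_error(error))  # prints 15
```

This matches the experience in these notes: the error was raised while JDK 15 was installed.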
CH8 What’s new in Spark3
- The old RDD-based Spark MLLib API is deprecated; use the DataFrame-based API instead
- About 17 times faster than Spark 2
- Python 2 is no longer supported
- GPU instance support
- Deeper Kubernetes (k8s) support
- Binary file support
- SparkGraph and Cypher support (graphs in the computer-science sense)
- ACID support in data lakes with Delta Lake
CH9. Introduction to Spark.
- Spark is a fast and general engine for large-scale data processing
- Spark is a distributed system: a cluster manager distributes the work to executors
- Similar to MapReduce, but faster
- The same task takes fewer lines of code in Spark than in MapReduce
- Components: Spark Core, with Spark Streaming, Spark SQL, MLLib, and GraphX built on top of it
CH10. The Resilient Distributed Dataset (RDD)
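- An RDD is a dataset that is resilient (fault-tolerant) and distributed (spread across the cluster)
- The RDD is the foundation of Spark: one very large Dataset object whose attributes and methods let it operate on big data
- See the reference documentation
- The Spark shell creates a SparkContext (sc) object, which is used to create and work with RDDs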
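To make the RDD idea concrete, here is a minimal sketch that counts how often each rating value occurs in the MovieLens data. It is an illustration, not the course's ratings-counter.py. The function names, the application name, and the path `ml-100k/u.data` are assumptions; the file is assumed to hold whitespace-separated lines of user id, item id, rating, and timestamp.

```python
def parse_rating(line):
    """Return the rating field of one line (user id, item id, rating, timestamp)."""
    return line.split()[2]


def count_ratings(path):
    """Count the occurrences of each rating value using an RDD."""
    # Imported here so parse_rating can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("RatingsSketch")
    sc = SparkContext(conf=conf)
    try:
        lines = sc.textFile(path)            # RDD with one element per text line
        ratings = lines.map(parse_rating)    # transformation: evaluated lazily
        return dict(ratings.countByValue())  # action: triggers the computation
    finally:
        sc.stop()
```

In a script run with spark-submit, the SparkContext has to be created as shown; in the Spark shell the existing `sc` object would be used instead. A call such as `count_ratings("ml-100k/u.data")` returns a dictionary that maps each rating string to its count.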