Question:

I'm new to PySpark and I'm trying to use PySpark (ver 2.3.1) on my local computer with Jupyter Notebook. I want to set spark.driver.memory to 9g by doing this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .config("spark.driver.memory", "9g") \
    .getOrCreate()

spark.sparkContext._conf.getAll()  # check the config
```

It's quite weird, because when I look at the documentation, it shows that:

> Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file.

(document here)

But, as you see in the result, getAll() returns entries such as ('spark.ui.showConsoleProgress', 'true') and, for the driver memory, '10g' rather than the '9g' I set. Even when I access the Spark web UI (on port 4040, environment tab), it still shows '10g'. So I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but both seem to lead to a wrong answer. If the document is right, is there a proper way that I can check spark.driver.memory after config?

Answer 1:

I would like to say that the documentation is right.

When you are submitting a Spark job in client mode, you can set the driver memory by using the --driver-memory flag, say:

```
spark-submit --deploy-mode client --driver-memory 12G
```

The quoted note ends with the phrase "or in your default properties file": you can tell Spark in your environment to read the default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where the driver memory can be configured.

You can check the driver memory by using spark.sparkContext.getConf().get("spark.driver.memory"), and spark.sparkContext._conf.getAll() works too for what you have specified:

```python
spark.sparkContext.getConf().get("spark.driver.memory")
# u'12g'  -- which is the 12G of driver memory I have used
```

So, spark.driver.memory can be set in:

- SPARK_CONF_DIR or $SPARK_HOME/conf (Recommended);
- spark-shell, Jupyter Notebook or any other environment where you already initialized Spark (Not Recommended).

Answer 2:

Now, if you go by the line "Instead, please set this through the --driver-memory command line option", the restriction only concerns a driver JVM that has already started. When you create the very first session in a fresh interpreter, you can tell the JVM to instantiate itself with 9g of driver memory by using SparkConf, and Spark is fine with this; in that situation the comment "this config must not be set through the SparkConf directly" does not apply. Hence, if you set it using .config(), Spark accepts the change and overrides the default. It means you can set the driver memory this way, but it is not recommended at RUN TIME:

```python
.config("spark.driver.memory", "9g")  # This will work (Not recommended)
```

Answer 3:

Setting spark.driver.memory through SparkSession.builder.config only works if the driver JVM hasn't been started before.

To prove it, first run the following code against a fresh Python interpreter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.range(10000000).collect()
```

The code throws GC overhead limit exceeded, as 10M rows won't fit into a 512m driver. However, if you try that with 2g of memory (again, with a fresh Python interpreter):

```python
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()  # Spark is fine with this
```

Now, back in the first (512m) interpreter, you'd expect this:

```python
spark.stop()  # to set new configs, you must first stop the running session
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()
```

to run without errors, as your session's driver memory is seemingly set to 2g. However, you again get GC overhead limit exceeded, which means your driver memory is still 512m! The driver memory wasn't updated because the driver JVM was already started when it received the new config. Interestingly, if you read Spark's config with spark.sparkContext._conf.getAll() (or from the Spark UI), it tells you your driver memory is 2g, which is obviously not true.
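For the recommended route in the first answer, the "default properties file" is spark-defaults.conf under $SPARK_HOME/conf (or whatever directory SPARK_CONF_DIR points to). A minimal sketch, assuming a standard Spark layout; the exact path on your machine may differ:

```
# $SPARK_HOME/conf/spark-defaults.conf  (or $SPARK_CONF_DIR/spark-defaults.conf)
spark.driver.memory    12g
```

Because this file is read before the driver JVM launches, it avoids the "JVM has already started" problem entirely.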
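For Jupyter specifically, a commonly used workaround (not taken from the answers above, so treat it as an assumption to verify against your PySpark version) is to set PYSPARK_SUBMIT_ARGS before the first session is created, so that the --driver-memory flag reaches the JVM at launch time:

```python
import os

# Must run before the first SparkSession/SparkContext is created in this
# interpreter; once the driver JVM is up, this has no effect.
# The trailing "pyspark-shell" token is required by PySpark's launcher.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 9g pyspark-shell"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
```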
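As for "a proper way to check": since _conf.getAll() only echoes the configuration object, one way to see what the JVM actually got is to ask the JVM itself for its maximum heap. This is a diagnostic sketch relying on the private _jvm py4j handle, not a public API:

```python
# maxMemory() reflects the JVM's -Xmx, which Spark derives from
# spark.driver.memory; expect it to be slightly below the configured value.
heap_bytes = spark.sparkContext._jvm.java.lang.Runtime.getRuntime().maxMemory()
print("actual JVM max heap (GiB):", heap_bytes / float(1024 ** 3))
print("configured value:", spark.sparkContext.getConf().get("spark.driver.memory"))
```

If the configured value and the actual heap disagree by a wide margin, the config was set after the driver JVM had already started.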