{"id":1278,"date":"2022-06-27T20:50:45","date_gmt":"2022-06-27T12:50:45","guid":{"rendered":"https:\/\/blog.humh.cn\/?p=1278"},"modified":"2022-06-27T21:00:02","modified_gmt":"2022-06-27T13:00:02","slug":"spark-show-%e6%9c%89%e6%95%b0%e6%8d%ae%ef%bc%8c%e4%bd%86count%e7%ab%9f%e7%84%b6%e4%b8%ba0%ef%bc%9f","status":"publish","type":"post","link":"https:\/\/blog.humh.cn\/?p=1278","title":{"rendered":"Spark show() returns data, but count() is 0?"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote\"><p>At work I ran into a production issue: a job was missing data for one day. While analyzing the raw data, I found that calling show() on the DataFrame returned rows, yet count() returned 0.<\/p><\/blockquote>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"scala\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">val df = spark.read.parquet(\"\/tmp\/test_data\")\ndf.show(10) \/\/ rows are displayed\nprintln(df.count()) \/\/ prints 0<\/pre>\n\n\n\n<p>With the code above, show() displays data, but count() returns 0.<\/p>\n\n\n\n<p>The code alone gives little indication of what is wrong with the data. I also ran printSchema, and it printed the schema normally.<\/p>\n\n\n\n<p>So I turned to YARN to look for the cause. From the job's Tracking URL in the YARN UI, I opened Completed Jobs, found the task that ran count(), and opened its Logs stderr, 
where the error showed up right away.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">22\/06\/27 17:27:33 WARN FileScanRDD: Skipped the rest of the content in the corrupted file: path: hdfs:\/\/HDFS001\/tmp\/test_data\/part-00019-d38910e3-f5d5-4f87-a820-a2eb680f9501-c000.snappy.parquet, range: 0-3466308, partition values: [empty row]\njava.lang.RuntimeException: Found duplicate field(s) \"etaG\": [etaG, etag] in case-insensitive mode<\/pre>\n\n\n\n<p>The error <code>Found duplicate field(s) \"etaG\": [etaG, etag] in case-insensitive mode<\/code> occurs because Spark is case-insensitive by default. When reading from the built-in data sources Parquet, ORC, Avro and JSON (for example while merging schemas), Spark checks that no column names are duplicated at the same level (top level or nested). If such duplicates exist, it throws this exception.<\/p>\n\n\n\n<p>This data contains two fields that are identical when case is ignored, \"etag\" and \"etaG\", so resolving the schema failed. How do you make Spark case-sensitive during computation? Through the <strong>spark.sql.caseSensitive<\/strong> setting. It defaults to <strong>false<\/strong>, meaning case-insensitive; setting it to true makes Spark case-sensitive. Because my Spark job reads its configuration from the conf directory under <strong>SPARK_HOME<\/strong> on the machine, I added 
<strong>spark.sql.caseSensitive: true<\/strong> to that conf. If you control the configuration in code, you can instead use either of the following lines:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"scala\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ SQL approach\nspark.sql(\"set spark.sql.caseSensitive=true\")\n\/\/ Configuration approach\nspark.sqlContext.setConf(\"spark.sql.caseSensitive\", \"true\")<\/pre>\n\n\n\n<p>After rerunning the job, the error disappeared and the data was correct.<\/p>\n\n\n\n<p>You may wonder why this error lets <strong>show<\/strong> work while <strong>count<\/strong> returns 0, and why the job does not fail. The most relevant piece of information is this line: <code>22\/06\/27 17:27:33 WARN FileScanRDD: Skipped the rest of the content in the corrupted file: path: hdfs:\/\/HDFS001\/tmp\/test_data\/part-00019-d38910e3-f5d5-4f87-a820-a2eb680f9501-c000.snappy.parquet, range: 0-3466308, partition values: [empty row]<\/code>.<\/p>\n\n\n\n<p>This message is logged when Spark is set to skip \"corrupted\" files, as it was in this environment (in stock Apache Spark the relevant settings, listed below, default to false): Spark skips the file and records it as a WARN 
message in the executor logs. Here, the files whose fields differ only by case could not be resolved and were therefore treated as \"corrupted\", so <strong>count()<\/strong> ignored those data files. Why <strong>show()<\/strong> does not skip them is still unknown and needs further investigation.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"scala\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import org.apache.spark.sql.functions.count\nspark.read.parquet(\"\/tmp\/test_data\").agg(count(\"type\")).show()<\/pre>\n\n\n\n<p>Testing further, the approach above also produces normal output.<\/p>\n\n\n\n<p>How Spark handles \"corrupted\" files is controlled by these two settings.<br>RDD: <strong>spark.files.ignoreCorruptFiles<\/strong><br>DataFrame: <strong>spark.sql.files.ignoreCorruptFiles<\/strong><\/p>\n\n\n\n<p>If \"corrupted\" files should not be skipped, set both properties to <strong>false<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h4>References<\/h4>\n\n\n\n<ul><li>The .schema() API behaves incorrectly for nested schemas that have column duplicates in case-insensitive mode: <a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-32431\">https:\/\/issues.apache.org\/jira\/browse\/SPARK-32431<\/a><\/li><li>Spark SQL Upgrading Guide: <a rel=\"noreferrer noopener\" 
href=\"https:\/\/spark.apache.org\/docs\/2.4.2\/sql-migration-guide-upgrade.html\" target=\"_blank\">https:\/\/spark.apache.org\/docs\/2.4.2\/sql-migration-guide-upgrade.html<\/a><\/li><li>Spark Configuration Properties: <a rel=\"noreferrer noopener\" href=\"https:\/\/jaceklaskowski.gitbooks.io\/mastering-spark-sql\/content\/spark-sql-properties.html\" target=\"_blank\">https:\/\/jaceklaskowski.gitbooks.io\/mastering-spark-sql\/content\/spark-sql-properties.html<\/a><\/li><li>Spark &#8211; ignoring corrupted files: <a rel=\"noreferrer noopener\" href=\"https:\/\/stackoverflow.com\/questions\/53541593\/spark-ignoring-corrupted-files\" target=\"_blank\">https:\/\/stackoverflow.com\/questions\/53541593\/spark-ignoring-corrupted-files<\/a><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>At work I ran into a production issue: a job was missing data for one day. While analyzing the raw data, I found that calling show() on the DataFrame returned 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1286,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[129],"tags":[131,130],"_links":{"self":[{"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/posts\/1278"}],"collection":[{"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1278"}],"version-history":[{"count":3,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/posts\/1278\/revisions"}],"predecessor-version":[{"id":1290,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/posts\/1278\/revisions\/1290"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=\/wp\/v2\/media\/1286"}],"wp:attachment":[{"href":"https:\/\/blog.humh.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1278"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1278"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.humh.cn\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1278"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}