阿毛
Troubleshooting a Cartesian Product Error in Hive

While running a Hive SQL query today, I ran into a strange join error. Here is a summary of the troubleshooting.

The problematic SQL:

-- 2019-05-03
-- Among authors of videos published on or before 2019-05-03: current VIP count and ratio, historical VIP count and ratio
-- Old videos (author ids)
with
old_video_user as (
    select distinct user_id from `dw`.`dim_video` where published_at <= '2019-05-03'
),

-- Yesterday's user VIP level table: quickly filters out current VIP users (the one-day staleness is acceptable here)
vip_user as (
    select old_video_user.user_id
    from (select distinct user_id from parquet.`/idl/user_vip_level` where vip != 0) vip
    join old_video_user on vip.user_id = old_video_user.user_id
),

-- Not currently a VIP, but was one in the past
old_vip_user as (
    select no_vip_now.user_id
    from (select * from old_video_user where user_id not in (select user_id from vip_user)) no_vip_now
    join (select user_id from `dw`.`dim_user` where vip_expired_at < now()) old_vip
        on no_vip_now.user_id = old_vip.user_id
)

select all_count.count as `总用户数`,
       vip_count.count as `当前vip用户数`,
       round(vip_count.count / all_count.count, 3) as `当前vip用户占总用户占比`,
       old_vip_count.count as `历史vip用户数`,
       round(old_vip_count.count / all_count.count, 3) as `历史vip用户占总用户占比`,
       (all_count.count - vip_count.count - old_vip_count.count) as `普通用户数`,
       round((all_count.count - vip_count.count - old_vip_count.count) / all_count.count, 3) as `普通用户占总用户占比`
from (select count(user_id) as count, 1 as id from old_video_user) all_count,
     (select count(user_id) as count, 1 as id from vip_user) vip_count,
     (select count(user_id) as count, 1 as id from old_vip_user) old_vip_count

Execution result:

java.sql.SQLException: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join Inner
:- Aggregate [count(user_id#300) AS count#264L]
:  +- Aggregate [user_id#300], [user_id#300]
:     +- Project [user_id#300]
:        +- Filter (isnotnull(published_at#323) && (published_at#323 <= 2019-05-03))
:           +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet
+- Aggregate [count(user_id#300) AS count#266L]
   +- Project [user_id#300]
      +- Join Inner, (user_id#378 = user_id#300)
         :- Aggregate [user_id#378], [user_id#378]
         :  +- Project [user_id#378]
         :     +- Filter ((isnotnull(vip#379) && NOT (cast(vip#379 as int) = 0)) && isnotnull(user_id#378))
         :        +- Relation[user_id#378,vip#379,date#380] parquet
         +- Aggregate [user_id#300], [user_id#300]
            +- Project [user_id#300]
               +- Filter ((isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) && isnotnull(user_id#300))
                  +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet
and
Aggregate [count(user_id#300) AS count#268L]
+- Project [user_id#300]
   +- Join Inner, (user_id#300 = user_id#381)
      :- Aggregate [user_id#300], [user_id#300]
      :  +- Project [user_id#300]
      :     +- Filter ((isnotnull(published_at#323) && ((published_at#323 <= 2019-05-03) && NOT user_id#300 IN (list#277 []))) && isnotnull(user_id#300))
      :        :  +- Project [user_id#300 AS user_id#300#467]
      :        :     +- Join Inner, (user_id#378 = user_id#300)
      :        :        :- Aggregate [user_id#378], [user_id#378]
      :        :        :  +- Project [user_id#378]
      :        :        :     +- Filter ((isnotnull(vip#379) && NOT (cast(vip#379 as int) = 0)) && isnotnull(user_id#378))
      :        :        :        +- Relation[user_id#378,vip#379,date#380] parquet
      :        :        +- Aggregate [user_id#300], [user_id#300]
      :        :           +- Project [user_id#300]
      :        :              +- Filter ((isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) && isnotnull(user_id#300))
      :        :                 +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet
      :        +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet
      +- Project [user_id#381]
         +- Filter ((isnotnull(vip_expired_at#393) && (vip_expired_at#393 < 2021-07-19 16:56:54.135)) && (NOT user_id#381 IN (list#277 []) && isnotnull(user_id#381)))
            :  +- Project [user_id#300 AS user_id#300#467]
            :     +- Join Inner, (user_id#378 = user_id#300)
            :        :- Aggregate [user_id#378], [user_id#378]
            :        :  +- Project [user_id#378]
            :        :     +- Filter ((isnotnull(vip#379) && NOT (cast(vip#379 as int) = 0)) && isnotnull(user_id#378))
            :        :        +- Relation[user_id#378,vip#379,date#380] parquet
            :        +- Aggregate [user_id#300], [user_id#300]
            :           +- Project [user_id#300]
            :              +- Filter ((isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) && isnotnull(user_id#300))
            :                 +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet
            +- Relation[user_id#381,name#382,vip#383,icon#384,udid#385,age_zone#386,signature#387,created_at#388,level#389,platform#390,auth_token#391,updated_at#392,vip_expired_at#393,gender#394,first_udid#395,last_on#396,is_wechat#397,version#398,app_version#399,os#400,os_version#401,device_name#402,from_src#403,original_src#404,... 13 more fields] parquet
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
	at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
	at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
	at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:736)
	at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:819)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:632)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

The error message points at a join problem: an implicit cartesian product. When I first saw it, together with "Join condition is missing or trivial", I went through every `join ... on` in the SQL and confirmed they were all correct, so I was stumped. Only later did I notice the multi-table FROM in the final SELECT.

My original reasoning was: count each table first, then query the single-row count tables together. Even as a cartesian product that is just 1*1*1 = 1 row, so it should do no harm.

So I googled the error message, and things became clear. The Spark instance behind our ThriftServer is not configured to allow cartesian products, and the last statement above reads from three tables with no WHERE condition, which would output the raw cartesian product directly. A comma-separated FROM is treated as an inner join (this seems to differ from how MySQL handles it?? TODO: verify), and by default Spark SQL refuses an inner join without an ON condition, presumably because an unintended cartesian product almost always means a wrong result.

That is why the error message suggests either using CROSS JOIN syntax, or enabling implicit cartesian products via spark.sql.crossJoin.enabled=true. CROSS JOIN is precisely the join type that outputs the cartesian product of two tables, so one option is to rewrite the multi-table FROM as a chain of CROSS JOINs. Another is to enable the configuration flag, for example by adding spark.sql.crossJoin.enabled=true to the ThriftServer's Spark configuration file; with that in place, re-running the SQL above works, as the engine treats the comma join as a cross join automatically. Beyond the two options the error message offers, you can also simply add a WHERE condition (after all, the message complains that the join condition is missing), so here I gave each count subquery the same constant column and equated them in WHERE. The final statement then becomes:

select all_count.count as `总用户数`,
       vip_count.count as `当前vip用户数`,
       round(vip_count.count / all_count.count, 3) as `当前vip用户占总用户占比`,
       old_vip_count.count as `历史vip用户数`,
       round(old_vip_count.count / all_count.count, 3) as `历史vip用户占总用户占比`,
       (all_count.count - vip_count.count - old_vip_count.count) as `普通用户数`,
       round((all_count.count - vip_count.count - old_vip_count.count) / all_count.count, 3) as `普通用户占总用户占比`
from (select count(user_id) as count, 1 as id from old_video_user) all_count,
     (select count(user_id) as count, 1 as id from vip_user) vip_count,
     (select count(user_id) as count, 1 as id from old_vip_user) old_vip_count
where all_count.id = vip_count.id
  and all_count.id = old_vip_count.id

Running the full query again succeeded.
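For reference, the CROSS JOIN variant the error message suggests would look roughly like this. This is a sketch reusing the same CTEs as above; the ratio columns are elided here for brevity:

```sql
-- Explicit CROSS JOIN chain instead of a comma-separated FROM.
-- Each subquery yields exactly one row, so the product is still one row.
select all_count.count     as `总用户数`,
       vip_count.count     as `当前vip用户数`,
       old_vip_count.count as `历史vip用户数`
from (select count(user_id) as count from old_video_user) all_count
cross join (select count(user_id) as count from vip_user) vip_count
cross join (select count(user_id) as count from old_vip_user) old_vip_count
```

With CROSS JOIN the intent to produce a cartesian product is stated explicitly, so no configuration change is needed.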


In general, directly outputting a cartesian product is not recommended; in my case it happened to be safe because each table is known to contain exactly one row.
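If you cannot edit the ThriftServer's Spark configuration file, the flag can usually be set per session instead (assuming the server permits runtime SET commands), which avoids changing behavior for every other query on the server:

```sql
-- Allow implicit cartesian products for the current session only,
-- rather than changing the server-wide Spark configuration.
SET spark.sql.crossJoin.enabled=true;
```

Note that, as far as I know, spark.sql.crossJoin.enabled defaults to true from Spark 3.0 onward, so this error mainly affects Spark 2.x deployments.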

2021-07-19