模块六-数据合并与连接——33. merge 合并(下)
33. merge 合并下1. 概述本章继续讲解merge的高级用法包括合并指示器、重复键处理、基于索引的合并等高级技巧。importpandasaspdimportnumpyasnp# 创建示例数据np.random.seed(42)2. 合并指示器indicator参数可以添加一列显示每行数据的来源。# 创建演示数据df_leftpd.DataFrame({key:[A,B,C,D],value:[1,2,3,4]})df_rightpd.DataFrame({key:[B,C,D,E],value:[5,6,7,8]})print(左表:)print(df_left)print(\n右表:)print(df_right)# 添加合并指示器resultpd.merge(df_left,df_right,onkey,howouter,indicatorTrue)print(\n带合并指示器的结果:)print(result)# 自定义指示器列名resultpd.merge(df_left,df_right,onkey,howouter,indicator来源)print(\n自定义列名:)print(result)3. 处理重复键3.1 左表有重复键# 左表有重复键df_left_duppd.DataFrame({key:[A,A,B,B,C],value_left:[1,2,3,4,5]})df_rightpd.DataFrame({key:[A,B,C],value_right:[10,20,30]})print(左表有重复键:)print(df_left_dup)print(\n右表:)print(df_right)# 内连接会产生笛卡尔积resultpd.merge(df_left_dup,df_right,onkey,howinner)print(\n内连接结果重复键产生多行:)print(result)3.2 两表都有重复键# 两表都有重复键df_left_dup2pd.DataFrame({key:[A,A,B],value_left:[1,2,3]})df_right_dup2pd.DataFrame({key:[A,A,B],value_right:[10,20,30]})print(左表:)print(df_left_dup2)print(\n右表:)print(df_right_dup2)# 内连接产生笛卡尔积resultpd.merge(df_left_dup2,df_right_dup2,onkey,howinner)print(\n内连接结果:)print(result)4. 基于索引的合并4.1 使用索引作为连接键# 创建以索引为键的数据df1pd.DataFrame({姓名:[张三,李四,王五],年龄:[25,30,28]},index[101,102,103])df2pd.DataFrame({部门:[技术部,销售部,市场部],工资:[8000,12000,10000]},index[101,102,104])print(df1:)print(df1)print(\ndf2:)print(df2)# 使用 left_index 和 right_indexresultpd.merge(df1,df2,left_indexTrue,right_indexTrue,howouter)print(\n基于索引合并:)print(result)4.2 混合使用列和索引# 左表用列右表用索引df_leftpd.DataFrame({员工ID:[101,102,103,104],姓名:[张三,李四,王五,赵六],部门ID:[1,2,1,3]})df_rightpd.DataFrame({部门名称:[技术部,销售部,市场部,人事部]},index[1,2,3,4])print(左表:)print(df_left)print(\n右表:)print(df_right)# 左表用部门ID列右表用索引resultpd.merge(df_left,df_right,left_on部门ID,right_indexTrue,howleft)print(\n混合合并:)print(result)5. 验证完整性validate参数用于检查合并关系的完整性。# 一对一关系print(一对一关系验证:)df1_onepd.DataFrame({key:[1,2,3],value1:[A,B,C]})df2_onepd.DataFrame({key:[1,2,3],value2:[X,Y,Z]})resultpd.merge(df1_one,df2_one,onkey,validateone_to_one)print(一对一合并成功)print(result)# 一对多关系print(\n一对多关系验证:)df1_one_to_manypd.DataFrame({key:[1,1,2],value1:[A1,A2,B]})df2_one_to_manypd.DataFrame({key:[1,2],value2:[X,Y]})resultpd.merge(df1_one_to_many,df2_one_to_many,onkey,validatemany_to_one)print(一对多合并成功)print(result)6. 完整示例多表关联分析# 创建三张表print(*60)print(多表关联分析)print(*60)# 订单表orderspd.DataFrame({订单号:[O001,O002,O003,O004,O005,O006],客户ID:[1,2,1,3,2,4],产品ID:[101,102,101,103,102,101],数量:[2,1,3,1,2,1],订单日期:pd.date_range(2024-01-01,periods6)})# 客户表customerspd.DataFrame({客户ID:[1,2,3,5],客户姓名:[张三,李四,王五,赵六],城市:[北京,上海,广州,深圳],等级:[黄金,铂金,黄金,普通]})# 产品表productspd.DataFrame({产品ID:[101,102,103,104],产品名称:[手机,电脑,平板,耳机],单价:[5000,8000,3000,500]})print(订单表:)print(orders.head())print(\n客户表:)print(customers)print(\n产品表:)print(products)# 1. 订单关联客户print(\n1. 订单客户:)order_customerpd.merge(orders,customers,on客户ID,howleft)print(order_customer.head())# 2. 再关联产品print(\n2. 订单客户产品:)order_fullpd.merge(order_customer,products,on产品ID,howleft)print(order_full.head())# 3. 计算订单金额order_full[金额]order_full[数量]*order_full[单价]print(\n3. 订单金额:)print(order_full[[订单号,产品名称,数量,单价,金额]])# 4. 客户消费统计print(\n4. 客户消费统计:)customer_statsorder_full.groupby(客户姓名).agg({金额:[sum,mean,count]}).round(2)print(customer_stats)# 5. 产品销售统计print(\n5. 产品销售统计:)product_statsorder_full.groupby(产品名称).agg({数量:sum,金额:sum})print(product_stats)# 6. 使用 indicator 追踪数据来源print(\n6. 带指示器的合并:)order_with_sourcepd.merge(orders,customers,on客户ID,howouter,indicator客户匹配)print(order_with_source[[订单号,客户ID,客户姓名,客户匹配]])7. merge 参数完整列表参数说明示例left左表df1right右表df2how连接方式inner,left,right,outeron连接键列名key或[key1,key2]left_on左表连接键left_keyright_on右表连接键right_keyleft_index使用左表索引Trueright_index使用右表索引Truesuffixes重复列后缀(_x, _y)indicator添加来源列True或自定义名称validate验证关系one_to_one,one_to_many,many_to_one8. 总结高级功能参数示例合并指示器indicatorTruepd.merge(df1, df2, onkey, indicatorTrue)基于索引合并left_indexTrue, right_indexTruepd.merge(df1, df2, left_indexTrue, right_indexTrue)混合合并left_oncol, right_indexTruepd.merge(df1, df2, left_onkey, right_indexTrue)关系验证validateone_to_onepd.merge(df1, df2, onkey, validateone_to_one)自定义后缀suffixes(_左, _右)pd.merge(df1, df2, onkey, suffixes(_左, _右))