因为爬取的wikipedia数据不规范,写入MySQL数据库时没能定义好Primary Key和字符类型,后来发现主要是有几个小问题:
- 地区人口一栏中有不少非数字(县级市比如保定市),城市一栏因为这个原因也是有重复值(后来删掉了县级市,不需要)
- 香港澳门城镇人口一栏缺失值,城市一栏有中括号
NULL值更新成0
mysql> UPDATE cities_population
-> SET
-> urban_population=0
-> WHERE urban_population IS null;
Query OK, 4 rows affected (0.00 sec)
Rows matched: 4 Changed: 4 Warnings: 0
删除县级市
mysql> DELETE FROM cities_population WHERE class='县级市';
Query OK, 772 rows affected (0.01 sec)
删除个别城市名称后的一个字符
mysql> UPDATE cities_population
-> SET city=CONCAT(LEFT(city, CHAR_LENGTH(city) -1), '')
-> WHERE city NOT LIKE '%市' AND city NOT LIKE '香港%' AND city NOT LIKE '澳门%';
Query OK, 12 rows affected (0.00 sec)
Rows matched: 12 Changed: 12 Warnings: 0
可以看到香港澳门城市名称里的中括号
mysql> SELECT * FROM cities_population WHERE city LIKE '%[%';
+-------+----------+-----------+------------+------------------+------------------+------------------+-------------+
| index | province | city | class | total_population | admin_population | urban_population | report_time |
+-------+----------+-----------+------------+------------------+------------------+------------------+-------------+
| 378 | 香港 | 香港 [2] | 特别行政区 | 7071576 | 7071576 | NULL | 2011-06-30 |
| 379 | 澳门 | 澳门 [3] | 特别行政区 | 552503 | 552503 | NULL | 2011-08-12 |
| 468 | 香港 | 香港 [5] | 特别行政区 | 6708389 | 6708389 | NULL | 2001-03-14 |
| 469 | 澳门 | 澳门 [6] | 特别行政区 | 435235 | 435235 | NULL | 2001-08-23 |
+-------+----------+-----------+------------+------------------+------------------+------------------+-------------+
4 rows in set (0.00 sec)
去掉中括号:
mysql> UPDATE cities_population
-> SET
-> city=province
-> WHERE city LIKE '%[%';
Query OK, 4 rows affected (0.00 sec)
Rows matched: 4 Changed: 4 Warnings: 0
检查城市名称重复次数:
mysql> SELECT COUNT(*) as repetitions,city
-> FROM cities_population
-> GROUP BY city
-> HAVING repetitions > 1;
mysql> SELECT * FROM cities_population
-> WHERE 城市 LIKE '%香港%';
+------+-----------+------------+-----------------+-------------------+-----------------+------------+
| 省市 | 城市 | 层级 | 地区 人口 [1] | 市辖区 人口 [1] | 城镇 人口 [1] | 普查年月日 |
+------+-----------+------------+-----------------+-------------------+-----------------+------------+
| 香港 | 香港 [2] | 特别行政区 | 7071576 | 7071576 | NULL | 2011-06-30 |
+------+-----------+------------+-----------------+-------------------+-----------------+------------+
可以看到县级市:
mysql> SELECT * FROM cities_population LIMIT 10;
+------+----------+--------+-----------------+-------------------+-----------------+------------+
| 省市 | 城市 | 层级 | 地区 人口 [1] | 市辖区 人口 [1] | 城镇 人口 [1] | 普查年月日 |
+------+----------+--------+-----------------+-------------------+-----------------+------------+
| 北京 | 北京市 | 直辖市 | 19612368 | 18827262 | 16446857 | 2010-11-01 |
| 天津 | 天津市 | 直辖市 | 12938693 | 11090783 | 9562255 | 2010-11-01 |
| 河北 | 石家庄市 | 地级市 | 10163788 | 2834942 | 2770344 | 2010-11-01 |
| 河北 | 唐山市 | 地级市 | 7577289 | 3187171 | 2128191 | 2010-11-01 |
| 河北 | 秦皇岛市 | 地级市 | 2987605 | 1029670 | 967877 | 2010-11-01 |
| 河北 | 邯郸市 | 地级市 | 9174683 | 1445338 | 1316674 | 2010-11-01 |
| 河北 | 邢台市 | 地级市 | 7104103 | 670154 | 668765 | 2010-11-01 |
| 河北 | 保定市 | 地级市 | 11194382 | 1138521 | 1038195 | 2010-11-01 |
| 河北 | 定州市 | 县级市 | 保定市 | 1165182 | 482121 | 2010-11-01 |
| 河北 | 张家口市 | 地级市 | 4345485 | 1060605 | 924628 | 2010-11-01 |
+------+----------+--------+-----------------+-------------------+-----------------+------------+
可以看到数据类型问题,即使命令行先创建也不行(可能是因为脚本里写入SQL我用了replace)
mysql> DESC cities_population;
+------------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+------------+------+-----+---------+-------+
| province | text | YES | | NULL | |
| city | text | YES | | NULL | |
| class | text | YES | | NULL | |
| total_population | text | YES | | NULL | |
| admin_population | bigint(20) | YES | | NULL | |
| urban_population | text | YES | | NULL | |
| report_time | text | YES | | NULL | |
+------------------+------------+------+-----+---------+-------+
处理香港澳门城市名称问题:
mysql> UPDATE cities_population
-> SET city=CONCAT(LEFT(city, CHAR_LENGTH(city) -1), '')
-> WHERE city NOT LIKE '%市' AND city NOT LIKE '香港%' AND city NOT LIKE '澳门%';
Query OK, 12 rows affected (0.00 sec)
Rows matched: 12 Changed: 12 Warnings: 0
修改字符类型:
mysql> DESC cities_population;
+------------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-------------+------+-----+---------+-------+
| index | bigint(20) | YES | MUL | NULL | |
| province | text | YES | | NULL | |
| city | varchar(20) | NO | | NULL | |
| class | text | YES | | NULL | |
| total_population | text | YES | | NULL | |
| admin_population | bigint(20) | YES | | NULL | |
| urban_population | text | YES | | NULL | |
| report_time | text | YES | | NULL | |
+------------------+-------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
mysql> ALTER TABLE cities_population
-> MODIFY province VARCHAR(20) NOT NULL,
-> MODIFY class VARCHAR(20) NOT NULL,
-> MODIFY total_population FLOAT NOT NULL,
-> MODIFY urban_population FLOAT NOT NULL,
-> MODIFY report_time DATE;
Query OK, 878 rows affected, 2 warnings (0.03 sec)
Records: 878 Duplicates: 0 Warnings: 2
mysql> DESC cities_population;
+------------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-------------+------+-----+---------+-------+
| index | bigint(20) | YES | MUL | NULL | |
| province | varchar(20) | NO | | NULL | |
| city | varchar(20) | NO | | NULL | |
| class | varchar(20) | NO | | NULL | |
| total_population | float | NO | | NULL | |
| admin_population | bigint(20) | YES | | NULL | |
| urban_population | float | NO | | NULL | |
| report_time | date | YES | | NULL | |
+------------------+-------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
mysql> CREATE TABLE tmp SELECT DISTINCT * FROM cities_population;
Query OK, 691 rows affected (0.02 sec)
Records: 691 Duplicates: 0 Warnings: 0
mysql> DROP TABLE cities_population;
Query OK, 0 rows affected (0.01 sec)
mysql> ALTER TABLE tmp RENAME TO cities_population;
Query OK, 0 rows affected (0.01 sec)
这样就删除了重复数据:
mysql> ALTER TABLE cities_population
-> ADD PRIMARY KEY(city,report_time);
Query OK, 691 rows affected (0.02 sec)
Records: 691 Duplicates: 0 Warnings: 0
mysql> DESC cities_population;
+------------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-------------+------+-----+---------+-------+
| index | bigint(20) | YES | | NULL | |
| province | varchar(20) | NO | | NULL | |
| city | varchar(20) | NO | PRI | NULL | |
| class | varchar(20) | NO | | NULL | |
| total_population | float | NO | | NULL | |
| admin_population | bigint(20) | YES | | NULL | |
| urban_population | float | NO | | NULL | |
| report_time | date | NO | PRI | NULL | |
+------------------+-------------+------+-----+---------+-------+
8 rows in set (0.00 sec)