Data Cleaning and Preparation
In data analysis and modeling, a large share of the time (80% or more) is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is simply the wrong format for a particular task.
This chapter covers missing data, duplicate data, string manipulation, and a few other analytical data transformations.
Handling Missing Data
For numeric data, pandas represents missing values with the floating-point value NaN (Not a Number).
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]: string_data
Out[4]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]: string_data.isnull()
Out[5]:
0    False
1    False
2     True
3    False
dtype: bool

In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]:
0     True
1    False
2     True
3    False
dtype: bool

NA Handling Methods
Missing data is referred to as NA (not available); Python's built-in None is also treated as NA.
| Argument | Description |
| --- | --- |
| dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
| fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
| isnull | Return boolean values indicating which values are missing/NA. |
| notnull | Negation of isnull. |
In [8]: from numpy import nan as NA

In [9]: data = pd.Series([1, NA, 3.5, NA, 7])

In [10]: data.dropna()
Out[10]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [11]: data[data.notnull()]
Out[11]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                      [NA, NA, NA], [NA, 6.5, 3.]])

In [13]: cleaned = data.dropna()

In [14]: data
Out[14]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [15]: cleaned
Out[15]:
     0    1    2
0  1.0  6.5  3.0

In [16]: data.dropna(how='all')
Out[16]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [17]: data[4] = NA

In [18]: data
Out[18]:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [19]: data.dropna(axis='columns', how='all')
Out[19]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

With how='all', a row (or column) is dropped only when all of its values are NaN. The thresh argument keeps only rows that have at least that many non-NA values.
In [21]: df = pd.DataFrame(np.random.randn(7, 3))

In [22]: df.iloc[:4, 1] = NA

In [23]: df.iloc[:2, 2] = NA

In [24]: df
Out[24]:
          0         1         2
0 -0.843340       NaN       NaN
1 -1.305941       NaN       NaN
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [25]: df.dropna()
Out[25]:
          0         1         2
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [26]: df.dropna(thresh=2)
Out[26]:
          0         1         2
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058
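As a quick check of the thresh semantics, here is a minimal sketch with made-up values: thresh=2 keeps only the rows that contain at least two non-NA entries.

```python
# Minimal sketch (made-up values): thresh=N keeps only rows with at least N non-NA values.
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 6.0],
                      'c': [7.0, 8.0, 9.0]})
print(frame.dropna(thresh=2))  # the middle row has only one non-NA value and is dropped
```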
fillna fills in missing values. You can fill column by column, fill with the value from the previous row (forward fill), fill with the mean, and so on.
In [27]: df.fillna(0)
Out[27]:
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]:
          0         1         2
0 -0.843340  0.500000  0.000000
1 -1.305941  0.500000  0.000000
2  1.026378  0.500000  2.176567
3  0.048885  0.500000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [29]: _ = df.fillna(0, inplace=True)

In [30]: df
Out[30]:
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [31]: df = pd.DataFrame(np.random.randn(6, 3))

In [32]: df.iloc[2:, 1] = NA

In [33]: df.iloc[4:, 2] = NA

In [34]: df
Out[34]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679       NaN  1.272047
3  0.293133       NaN  0.273399
4  0.235116       NaN       NaN
5  1.365186       NaN       NaN

In [35]: df.fillna(method='ffill')
Out[35]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116  0.977842  0.273399
5  1.365186  0.977842  0.273399

In [36]: df.fillna(method='ffill', limit=2)
Out[36]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116       NaN  0.273399
5  1.365186       NaN  0.273399

In [37]: data = pd.Series([1., NA, 3.5, NA, 7])

In [38]: data.fillna(data.mean())
Out[38]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
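The mean fill above is shown on a Series; a DataFrame can be filled column by column in the same way by passing the per-column means, since fillna aligns a Series argument on the column labels. A minimal sketch with made-up data:

```python
# Minimal sketch (made-up data): fill each column's NaN with that column's own mean.
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [4.0, 5.0, np.nan]})
print(frame.fillna(frame.mean()))  # 'a' NaN -> 2.0, 'b' NaN -> 4.5
```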
| Argument | Description |
| --- | --- |
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation; by default 'ffill' if function called with no other arguments |
| axis | Axis to fill on; default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |
Data Transformation
Removing Duplicates
In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                      'k2': [1, 1, 2, 3, 3, 4, 4]})

In [40]: data
Out[40]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [41]: data.duplicated()
Out[41]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]: data.drop_duplicates()
Out[42]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

In [43]: data['v1'] = range(7)

In [44]: data.drop_duplicates(['k1'])
Out[44]:
    k1  k2  v1
0  one   1   0
1  two   1   1

In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Transforming Data Using a Function or Mapping
import numpy as np
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[5]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

lowercased = data['food'].str.lower()
lowercased
Out[8]:
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal'] = lowercased.map(meat_to_animal)
data
Out[10]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
Replacing Values
In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [5]: data
Out[5]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [6]: data.replace(-999, np.nan)
Out[6]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]: data.replace([-999, -1000], np.nan)
Out[7]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
Renaming Indexes and Columns
In [2]: import pandas as pd

In [3]: import numpy as np

In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....:                     index=['Ohio', 'Colorado', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])

In [11]: data
Out[11]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [5]: data.replace(4, 40)
Out[5]:
          one  two  three  four
Ohio        0    1      2     3
Colorado   40    5      6     7
New York    8    9     10    11

In [12]: transform = lambda x: x[:4].upper()

In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [14]: data
Out[14]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [15]: data.index = data.index.map(transform)

In [16]: data
Out[16]:
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]:
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

In [18]: data.rename(index={'OHIO': 'INDIANA'},
   ....:             columns={'three': 'peekaboo'})
Out[18]:
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [20]: data
Out[20]:
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
Discretization and Binning
The full discussion of this topic is omitted for now.
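In its place, a minimal sketch of pd.cut, the pandas function for binning continuous values; the ages and bin edges below are made up for illustration:

```python
# Minimal sketch (made-up data): discretize continuous values into bins with pd.cut.
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]   # bin edges -> intervals (18, 25], (25, 35], ...
cats = pd.cut(ages, bins)      # Categorical labelling each age with its interval
print(pd.Series(cats).value_counts())  # how many ages fall into each bin
```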
String Manipulation
In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   ...:         'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [8]: data = pd.Series(data)

In [9]: data
Out[9]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [10]: data.isnull()
Out[10]:
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

In [11]: data.str.contains('gmail')
Out[11]:
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [13]: data.str.findall(pattern, flags=re.IGNORECASE)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-085c16e4dbfe> in <module>()
----> 1 data.str.findall(pattern, flags=re.IGNORECASE)

NameError: name 're' is not defined

In [14]: import re

In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [17]: matches
Out[17]:
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

In [18]: matches.str.get(1)
Out[18]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [19]: matches.str[0]
Out[19]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [20]: data.str[:5]
Out[20]:
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
| Method | Description |
| --- | --- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of pattern |
| extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
| endswith | Equivalent to x.endswith(pattern) for each element |
| startswith | Equivalent to x.startswith(pattern) for each element |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element (retrieve i-th element) |
| isalnum | Equivalent to built-in str.isalnum |
| isalpha | Equivalent to built-in str.isalpha |
| isdecimal | Equivalent to built-in str.isdecimal |
| isdigit | Equivalent to built-in str.isdigit |
| islower | Equivalent to built-in str.islower |
| isnumeric | Equivalent to built-in str.isnumeric |
| isupper | Equivalent to built-in str.isupper |
| join | Join strings in each element of the Series with passed separator |
| len | Compute length of each string |
| lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
| match | Use re.match with the passed regular expression on each element, returning matched groups as list |
| pad | Add whitespace to left, right, or both sides of strings |
| center | Equivalent to pad(side='both') |
| repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
| replace | Replace occurrences of pattern/regex with some other string |
| slice | Slice each string in the Series |
| split | Split strings on delimiter or regular expression |
| strip | Trim whitespace from both sides, including newlines |
| rstrip | Trim whitespace on right side |
| lstrip | Trim whitespace on left side |
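The extract method listed in the table above is not shown in the session; here is a minimal sketch, reusing the same email pattern and made-up addresses, of how it returns one column per regex capture group:

```python
# Minimal sketch: str.extract applies a regex with groups and returns a DataFrame,
# one column per capture group (NaN rows where there is no match).
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
print(data.str.extract(pattern, flags=re.IGNORECASE))
```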